
Scott Chamberlin
5/25/2025
7 mins
Scaling Up MoE: Deep Dive into Energy Efficiency (Llama 4, Mixtral and Others)
This is the second post in a series about evaluating the energy efficiency of mixture of experts models (MoE).
The first post here examined the point at which the overhead of selecting and averaging experts overcame the efficiency gains of having experts. That raised the question of how those toy models compare against current state-of-the-art dense and MoE models.
Framing the problem.
In part 1 I created toy models so I could directly compare numbers of tunable parameters. Per a suggestion on LinkedIn, I'm going to cover a few different models here. We'll start by comparing the MoE model Mixtral 8x7b (Q4_0 quantization) with Llama 2 13b (Q4_0 quantization) and Gemma 3 12b (Q4_K_M quantization). Then we'll compare the Llama 3.3 70b (Q4_K_M quantization) dense model with the Llama 4 Scout 109B (Q4_K_M quantization) MoE model to see what we find.
The first three models were picked because they all have close to the same number of active parameters (12B for Mixtral and Gemma, 13B for Llama 2), though Gemma uses a more advanced quantization scheme.
For the Llama comparison I picked these two because they both come from Meta and likely share some lineage, but also because they are close enough in total trainable parameters (70B for Llama 3.3 and 109B for Llama 4 Scout) that I could experiment with some new normalizations I've been thinking about.
Llama 4 Scout has 109B total parameters and 16 experts, with 17B active parameters per forward pass, so it should be much faster and more energy efficient: 17B parameters per pass versus 70B. The totals of trainable parameters are closer than other test options we could come up with (109B versus 70B). We are also well above the threshold discussed in part 1, so we should easily overcome the overhead of selecting and averaging experts.
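As a rough back-of-the-envelope check, here is what that implies (parameter counts are from the model specs above; the assumption that per-token energy scales with the parameters touched per pass is a simplification that ignores the router overhead from part 1):

```python
# Naive expectation: if per-token energy scaled only with the parameters
# touched per forward pass, Scout's 17B active parameters would be roughly
# 4x cheaper per token than Llama 3.3's 70B dense parameters.
# This deliberately ignores expert-selection overhead and other differences.
dense_params_per_pass = 70e9   # Llama 3.3 70b: every parameter is active
moe_active_per_pass = 17e9     # Llama 4 Scout: 17B of 109B active per pass

print(f"Naive per-token compute ratio: {dense_params_per_pass / moe_active_per_pass:.1f}x")  # ~4.1x
```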
Evaluation
Let's look at the first set of comparisons, which were generated against our text prompt test suite.
Model | Total Time | Total Watt Seconds | Total Tokens | Tokens per Watt Second | Tokens per Second | Watt Seconds per Response |
---|---|---|---|---|---|---|
llama 2 13b | 0 days 00:02:18.03 | 39505.79 | 15084 | 0.38 | 109.28 | 1012.97 |
gemma 3 12b | 0 days 00:06:11.8 | 96855.51 | 26091 | 0.27 | 70.17 | 2483.47 |
mixtral 8x7b | 0 days 00:02:39.39 | 45220.38 | 16937 | 0.37 | 106.26 | 1159.5 |
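For clarity, the derived columns in these tables are simple ratios of the measured totals. Here is a minimal sketch of that arithmetic; the per-response column implies a suite of roughly 39 prompts, which is inferred from the table rather than stated explicitly:

```python
# Derive the table's per-token and per-response columns from the raw totals.
def derived_metrics(total_watt_seconds: float, total_tokens: int,
                    total_seconds: float, num_responses: int) -> dict:
    return {
        "tokens_per_watt_second": total_tokens / total_watt_seconds,
        "tokens_per_second": total_tokens / total_seconds,
        "watt_seconds_per_response": total_watt_seconds / num_responses,
    }

# mixtral 8x7b row: 2 min 39.39 s total, ~39 prompts in the suite (inferred)
print(derived_metrics(45220.38, 16937, 159.39, 39))
# ≈ 0.37 tokens/Ws, 106.26 tokens/s, 1159.5 Ws per response, matching the row above
```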
First of all, we see a trend that will continue: on all metrics, Mixtral and Llama 2 are roughly on par with each other at this size. This is reasonable since they have close to the same parameter counts, are roughly the same generation, and use the same quantization.
If we add in Gemma, which has the exact same number of active parameters as Mixtral, we see some differences worth noting. First, the newer model (Gemma) outputs more tokens (even though it isn't explicitly a chain-of-thought model), something we are seeing across most model updates. And while the dense Gemma 3 uses the more advanced Q4_K_M quantization, it is still less energy efficient across all stats than Mixtral, even when normalizing for output token length. Certainly, this is an opportunity to normalize against quality scores to decide whether that tradeoff is worthwhile; I'm working on some quality normalization capabilities for my next post, so stay tuned. But in terms of energy efficiency, Mixtral and Llama 2, while older models, definitely perform better than the newer Gemma.
Let's take a look at the more recent Llama models to explore MoE gains further. As expected, the Llama 4 Scout model is much more energy efficient than the dense model.
Model | Total Time | Total Watt Seconds | Total Tokens | Tokens per Watt Second | Tokens per Second | Watt Seconds per Response |
---|---|---|---|---|---|---|
llama 3.3 70b | 0 days 00:12:09.38 | 216192.35 | 16597 | 0.07 | 22.75 | 5543.39 |
llama 4 Scout | 0 days 00:03:56.69 | 64751.78 | 27092 | 0.41 | 114.45 | 1660.30 |
We can see significant energy savings across all the metrics, with tokens per watt-second showing 5.8x higher energy efficiency for Scout (MoE) versus Llama 3.3 (dense). This is in spite of the Llama 4 model having roughly 1.5x more trainable parameters. I generally prefer normalizing across token counts, since that captures the core energy cost regardless of how verbose a model is. We also see that Scout is quite a bit more verbose on the same prompt set than Llama 3.3, a trend we've also seen with chain-of-thought models. If we instead normalize across responses, Scout is 3.3x more energy efficient than Llama 3.3 per average response, still a large advantage given its longer responses.
What can we learn if we normalize against different parameter properties?
I have been curious whether just knowing the number of active parameters of new models as they are released could give us much insight into their energy efficiency. This is a good opportunity to evaluate that. First, let's see how it looks between dense and sparse architectures.
Model | Watt Seconds per Response per Trainable Parameter | Watt Seconds per Response per Active Parameter |
---|---|---|
llama 3.3 70b | 7.91E-08 | 7.91E-08 |
llama 4 Scout | 1.52E-08 | 9.76E-08 |
Normalizing against trainable parameters, Scout is again 5.2x more energy efficient than Llama 3.3, but normalizing against active parameters it's only about 80% as energy efficient as Llama 3.3. This is the first normalization where the dense model comes out ahead, and we get conflicting results depending on which value we normalize against. Hmm, maybe not so helpful?
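For reference, these values are just the per-response numbers from the previous table divided by each model's parameter counts; a quick sketch of that arithmetic:

```python
# Normalize watt-seconds per response by trainable and by active parameters.
models = {
    # name: (watt_seconds_per_response, trainable_params, active_params)
    "llama 3.3 70b": (5543.39, 70e9, 70e9),   # dense: trainable == active
    "llama 4 Scout": (1660.30, 109e9, 17e9),
}
for name, (ws_per_resp, trainable, active) in models.items():
    print(f"{name}: {ws_per_resp / trainable:.2e} per trainable, "
          f"{ws_per_resp / active:.2e} per active")
# ≈7.9e-08 / 7.9e-08 for the dense model and ≈1.5e-08 / 9.8e-08 for Scout,
# matching the table above within rounding.
```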
Given what we learned about MoE overhead in part 1, my hypothesis is that if we control for output tokens and active parameters, MoE should be less energy efficient than dense models.
Let's see:
Model | Watt Seconds per Token per Active Parameter |
---|---|
llama 3.3 70b | 1.86E-10 |
llama 4 Scout | 1.41E-10 |
I guess my hypothesis didn't hold. When normalized against output tokens and active parameters, Scout is still about 1.3x more energy efficient than Llama 3.3. There is a good technical overview of the core Llama 4 implementation here, and it's likely that the router and other performance optimizations minimize the impact of expert selection relative to the rest of the computation.
OK, what if we just compare sparse models against each other (keeping in mind that we aren't using the same quantization here)?
Model | Total Tokens | Watt Seconds per Response per Active Parameter | Watt Seconds per Token per Active Parameter |
---|---|---|---|
llama 4 Scout | 27092 | 9.77E-08 | 1.41E-10 |
mixtral 8x7b | 16937 | 9.66E-08 | 2.22E-10 |
Unexpectedly, the per-response normalization is very similar between the two, but when normalized for output tokens the Mixtral response is roughly 1.6x as expensive (similar energy per response for fewer output tokens). There might be some value in normalizing this way as long as you only compare similar architectures, but I think more examples are necessary to build confidence in making assumptions based on these high-level model attributes.
Given the conflicting results when normalizing across different architectures and the lack of clear takeaways from this analysis, I'm not confident we can easily estimate future models' energy efficiency from their high-level parameter counts (trainable or active). We probably need to look into the specific layer implementations to draw additional conclusions.
Takeaways
Unlike the toy models in part 1, at production-scale state-of-the-art sizes the overhead of MoE expert selection is insignificant relative to the performance gains from reduced computation per forward pass.
Llama 4 Scout shows great per-token efficiency compared against Mixtral while offering many more active parameters (17B versus 12B), though due to its longer outputs it falls a bit behind in per-response energy efficiency. Compared to dense models, Llama 4 Scout is surprisingly efficient even when producing more verbose outputs.
Our analysis shows that normalizing against active or trainable parameters does not provide meaningful insight into efficiency, or support better business decisions, when comparing vastly different model architectures. Estimations may be possible within closely related architectures, but in the long term, advancements in specific layer optimizations are making scaling behavior diverge from simple parameter-count assumptions. As model architectures evolve, real-world benchmarking remains the only reliable method for understanding energy efficiency trends.
Notes
Tests were conducted on fixed sets of prompts against an Nvidia A100 80GB using Ollama models, all quantized and served in the same format. Tests were hosted on Crusoe Cloud. All energy values are GPU-only, with a batch size of one. Internally we've migrated to doing most of our tests with vLLM, but for this post I chose Ollama due to challenges getting comparable quantized models installed and running easily.
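For readers who want to reproduce a similar GPU-only measurement, here is a minimal sketch using pynvml and the Ollama HTTP API. It is illustrative only, not necessarily the exact harness used for the numbers above; the model tag and prompt are placeholders.

```python
# Sketch: measure GPU-only energy for a single Ollama generation request.
# Assumes a local Ollama server and an NVML-capable GPU (Volta or newer).
import requests
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def energy_mj() -> int:
    # Cumulative GPU energy in millijoules since the driver was loaded.
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(gpu)

def run_prompt(model: str, prompt: str) -> dict:
    start = energy_mj()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    watt_seconds = (energy_mj() - start) / 1000.0  # mJ -> J (= watt-seconds)
    tokens = resp.get("eval_count", 0)              # output tokens reported by Ollama
    return {
        "watt_seconds": watt_seconds,
        "tokens": tokens,
        "tokens_per_watt_second": tokens / watt_seconds if watt_seconds else 0.0,
    }

# Placeholder model tag and prompt for illustration.
print(run_prompt("mixtral:8x7b", "Explain mixture-of-experts routing in two sentences."))
```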
Neuralwatt is building the energy efficiency layer for AI factories. If you want to learn more about turning energy into revenue efficiently, please get in touch.