Mixture-of-Experts (MoE)
An architecture in which each token activates only a subset of the model's parameters: a learned router picks a few "expert" feed-forward networks per token. Mixtral 8×7B, for example, has ~47B total parameters but only ~13B active per token, so it runs at roughly the compute cost of a 13B dense model while delivering quality somewhere between a 13B and a 47B dense model.
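
A minimal sketch of the idea in PyTorch, assuming a top-k router over independent feed-forward experts (the class name, sizes, and the 8-expert / top-2 configuration here are illustrative, not Mixtral's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE feed-forward layer with top-k routing (illustrative)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the top-k experts run for each token; the others stay idle,
        # which is why "active" parameters are a fraction of total parameters.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 512)                        # 4 tokens
print(MoELayer()(x).shape)                     # torch.Size([4, 512])
```

With top_k=2 out of 8 experts, each token touches only the router plus 2 of the 8 expert networks per layer, which is the source of the total-vs-active parameter gap described above.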