
Routing Strategies for Mixture-of-Experts on Edge Silicon

Status: Research

Mechanism

This line of work studies expert-selection policies that respect on-die memory layout and activation-tensor traffic patterns on edge silicon. Datacenter-scale mixture-of-experts (MoE) routing assumes uniform interconnect bandwidth and ample HBM, so the standard top-k softmax gate can treat every expert as equally cheap to activate. Edge silicon does not satisfy that assumption. Its memory hierarchy is tiered (on-die SRAM, then LPDDR, then NAND) with order-of-magnitude bandwidth disparities between tiers, and at any given step only a small subset of experts is resident in the fastest tier. The router should therefore know which tier each expert currently resides in and price activation accordingly. In this framing, routing is itself an optimization problem: minimize cumulative activation transfer subject to a quality target, where the constraints are determined by the device, not by the model architect.
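The idea of pricing activation by residency can be illustrated with a minimal sketch. It is not the mechanism under study, only one simple way to make a top-k gate tier-aware: subtract a tier-dependent price from each gate logit before selection, which can be read as a penalty-style relaxation of the constrained problem described above. The function name `residency_aware_top_k`, the `price` weight, and the numbers in `TIER_COST` are all hypothetical. Only the ordering of the tiers (SRAM cheaper than LPDDR, LPDDR cheaper than NAND) comes from the text.

```python
import math

# Hypothetical per-tier activation prices in arbitrary units (not measured).
# Only the ordering sram < lpddr < nand is taken from the text.
TIER_COST = {"sram": 0.0, "lpddr": 1.0, "nand": 10.0}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def residency_aware_top_k(logits, residency, k=2, price=1.0):
    """Select k experts by gate logit minus a tier-dependent price.

    logits:    raw gate scores, one per expert
    residency: tier name per expert ("sram", "lpddr", or "nand")
    price:     hypothetical trade-off weight; price=0 recovers plain top-k
    Returns (chosen expert indices, mixing weights over the chosen experts).
    """
    if len(logits) != len(residency):
        raise ValueError("logits and residency must have the same length")
    # Penalize experts that would have to be fetched from a slower tier.
    adjusted = [l - price * TIER_COST[t] for l, t in zip(logits, residency)]
    order = sorted(range(len(logits)), key=lambda i: adjusted[i], reverse=True)
    chosen = order[:k]
    # Design choice (an assumption): mixing weights use the original logits,
    # so the price changes which experts run but not how they are blended.
    weights = softmax([logits[i] for i in chosen])
    return chosen, weights


if __name__ == "__main__":
    logits = [2.0, 1.9, 0.5, 1.0]
    residency = ["nand", "sram", "sram", "lpddr"]
    print(residency_aware_top_k(logits, residency, k=2, price=0.0))
    print(residency_aware_top_k(logits, residency, k=2, price=1.0))
```

With `price=0.0` the gate behaves like a standard top-k gate and selects expert 0 even though it sits in NAND. With `price=1.0` it prefers the two SRAM-resident experts. A real policy would also need to account for the quality target, which this sketch does not model.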

Why this matters

  • Datacenter MoE routing strategies do not transfer to edge silicon without modification. When on-die SRAM holds only one or two experts, activation-tensor traffic dominates inference latency, and the standard gate has no representation of that constraint.
  • Adjacent filed work (K-Pool LoRA at US Prov. 64/060,315) establishes the gradient-detached frozen-encoder routing intuition on a fixed-size adapter pool. The same intuition extends to MoE-style multi-expert layers, where the router must remain stable across the inference distribution and cannot drift with the language-modeling gradient at every step.
  • This is a research line, not a filed asset. Expected public output is on the order of a workshop preprint within the next 12 to 18 months, contingent on empirical validation of the proposed mechanism against published MoE-LoRA baselines.

Status and what's next

Active research. A filing decision is deferred until the empirical work clears the prior-art landscape: MILE, MINGLE, KeepLoRA, HiMoLE, GainLoRA, and MixLoRA all explore adjacent surfaces, and a non-trivial subset of the routing-policy claim space is already crowded. Honest disclosure: this work has not been benchmarked against a held-out edge-silicon test bed, and the lab has not yet published results on this specific surface.