Introduction

We propose Chain-of-Experts (CoE), which fundamentally changes how sparse Large Language Models (LLMs) process information by introducing sequential communication between intra-layer experts in Mixture-of-Experts (MoE) models.

In standard Mixture-of-Experts (MoE) models, experts process tokens independently and in parallel, and the models carry high memory requirements. CoE instead introduces an iterative mechanism that lets experts "communicate": each expert processes tokens on top of the outputs produced by other experts.
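To illustrate the idea, here is a minimal PyTorch-style sketch of such an intra-layer iterative scheme. The class name `ChainOfExpertsLayer`, the per-iteration routers, and all hyperparameters are assumptions made for this sketch, not the released implementation; see the code link below for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChainOfExpertsLayer(nn.Module):
    """Sketch of a CoE-style layer: experts are applied over several sequential
    iterations, so experts in iteration t see the output of iteration t-1
    rather than only the raw layer input (illustrative only)."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2, num_iters=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # One router per iteration (an assumption made for this sketch).
        self.routers = nn.ModuleList([nn.Linear(d_model, num_experts) for _ in range(num_iters)])
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        h = x
        for router in self.routers:
            weights, idx = router(h).topk(self.top_k, dim=-1)  # route on the *current* state h
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(h)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., k] == e                    # tokens sent to expert e in slot k
                    if mask.any():
                        out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(h[mask])
            h = h + out  # the next iteration's experts build on this combined output
        return h


# Example usage
layer = ChainOfExpertsLayer()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)  # torch.Size([2, 16, 512])
```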

Experiments show that CoE significantly outperforms previous MoE models in multiple respects. These advantages amount to a "free lunch" effect, enabling more efficient scaling of LLMs.

Code: https://github.com/ZihanWang314/coe

English Blog: Chain-of-Experts: Unlocking the Communication Power of MoEs

Chinese Blog: Chain-of-Experts: 释放MoE专家的沟通潜能 (Unlocking the Communication Potential of MoE Experts)

Chain-of-Experts: Unlocking the Communication Power of MoE models

Large Language Models (LLMs) continue to push the boundaries of what artificial intelligence can achieve, but efficiently scaling these models remains a major challenge. Mixture-of-Experts (MoE) models have emerged as a promising way to address this challenge: by activating only a subset of parameters for each token, they can in principle scale more efficiently. However, MoE models have the following limitations:

  1. Independent token processing: MoE models typically process tokens in parallel and independently, with limited communication between experts (see the sketch after this list).
  2. Low memory efficiency: because of their sparse activation pattern, MoE models carry a large total parameter count and require substantial GPU memory, even though only a fraction of parameters is active per token.
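To make limitation 1 concrete, below is a comparable sketch of a conventional top-k MoE layer. As before, `VanillaMoELayer` and its hyperparameters are illustrative assumptions rather than any particular model's implementation: each token is routed once, every selected expert operates only on the original layer input, and all expert parameters must stay resident in GPU memory even though only `top_k` experts are active per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VanillaMoELayer(nn.Module):
    """Sketch of a conventional top-k MoE layer: one routing pass, each
    selected expert sees only the raw layer input, and all expert
    parameters stay in memory even though only top_k are active per token."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    # Each expert sees only x; no expert sees another expert's output.
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return x + out
```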

Chain-of-Experts (CoE) Introduction