01/06 2026
Large language models (LLMs) apply the same amount of computation to every token. This uniform treatment squanders resources on locally predictable segments while under-allocating to the semantically crucial transitions.
The ByteDance Seed team has unveiled the Dynamic Large Concept Model (DLCM), a hierarchical language modeling framework that discovers semantic boundaries from latent representations and shifts reasoning computation from individual tokens to a compressed, more efficient concept space.
DLCM identifies concepts of varying lengths end to end, without relying on predefined linguistic units, and its hierarchical compression strategy fundamentally changes how the model scales.
The team has also introduced the first compressive sensing scaling law. It decouples token-level capacity, concept-level reasoning capacity, and compression ratio, enabling a principled allocation of computational resources under a fixed FLOPs (floating-point operations) budget.
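The sketch below does not reproduce the law's functional form; it only illustrates the accounting such a law decouples, using the common 2 × parameters × positions forward-FLOPs approximation and splitting compute between token-level modules that see every position and a concept-level backbone that sees only the compressed sequence. All names and example numbers are assumptions, not values from the paper.

```python
def dlcm_forward_flops(n_token_params, n_concept_params, seq_len, compression_ratio):
    """Rough forward-pass FLOPs for a hierarchical token/concept model.

    Token-level modules (encoder and decoder) process all seq_len positions,
    while the concept-level backbone processes only seq_len / compression_ratio
    positions. Uses the common 2 * parameters * positions approximation.
    """
    token_level = 2 * n_token_params * seq_len
    concept_level = 2 * n_concept_params * (seq_len / compression_ratio)
    return token_level + concept_level

# Illustration: with 8x compression, a backbone about 8x larger than the
# token-level modules costs roughly the same per sequence as those modules,
# so capacity can be shifted to the backbone without raising the FLOPs budget.
budget = dlcm_forward_flops(n_token_params=1e9, n_concept_params=8e9,
                            seq_len=4096, compression_ratio=8)
```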
To train this heterogeneous architecture stably, Seed further developed a decoupled μP parameterization. The technique supports zero-shot hyperparameter transfer across model widths and compression mechanisms, and allows roughly one-third of the inference computation to be redirected to a higher-capacity reasoning backbone.
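The decoupled variant itself is not spelled out here; the sketch below shows only the standard μP-style rule it builds on, in which matrix-like (hidden) weights have their Adam learning rate scaled by base_width / width so that a rate tuned on a narrow proxy model transfers zero-shot to a wider one. The function, its grouping heuristic, and the widths are illustrative assumptions rather than the paper's recipe.

```python
def mup_param_groups(named_params, base_width, width, base_lr):
    """Group parameters for Adam with a muP-style learning-rate rule.

    Matrix-like (hidden) weights receive base_lr * base_width / width, so a
    learning rate tuned at base_width transfers to a wider model; vector-like
    parameters (biases, norms, embeddings) keep the base learning rate.
    """
    hidden, vector = [], []
    for name, param in named_params:
        is_matrix = param.ndim >= 2 and "embed" not in name  # crude heuristic
        (hidden if is_matrix else vector).append(param)
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": vector, "lr": base_lr},
    ]

# Usage with a hypothetical model, transferring a rate tuned at width 256:
# optimizer = torch.optim.AdamW(mup_param_groups(model.named_parameters(), 256, 2048, 1e-3))
```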
When compared under the same inference FLOPs, an average performance boost of +2.69% is observed across 12 zero-shot benchmarks.
DLCM processes token sequences in four stages: a token-level encoder, boundary detection that compresses token spans into concepts, concept-level reasoning over the compressed sequence, and a decoder that maps the reasoned concepts back to token-level predictions.
The researchers deliberately separated the discrete segmentation decisions from the language-modeling loss to prevent interference during optimization. This design trades fully end-to-end discrete optimization for training stability and controllable compression, which is essential for large-scale deployment.
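One plausible way such decoupling can be realized is sketched below, assuming a sigmoid boundary predictor and a simple rate regularizer (an illustration, not necessarily the paper's exact mechanism): the hard boundary decisions are non-differentiable and therefore invisible to the language-modeling loss, while the predictor is trained only through a separate compression-rate term.

```python
import torch

def decoupled_segmentation(boundary_logits, target_rate):
    """Make hard boundary decisions while keeping the predictor's loss separate.

    boundary_logits: (T,) raw scores from a boundary predictor
    target_rate:     desired fraction of tokens that open a new concept (e.g. 1/8)
    """
    probs = torch.sigmoid(boundary_logits)
    # Hard decisions used to build concepts downstream; thresholding is
    # non-differentiable, so the language-modeling loss never reaches the predictor.
    boundaries = probs > 0.5
    # The predictor is trained only through this regularizer, which pulls the
    # average boundary rate toward the compression target.
    rate_loss = (probs.mean() - target_rate) ** 2
    return boundaries, rate_loss
```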
The decoder reconstructs token-level predictions by attending to the reasoned concepts, using two main components: concept smoothing and causal cross-attention.
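This is not the paper's exact decoder; the sketch below only illustrates the causal cross-attention component, assuming that a token may attend to a concept once that concept's last token lies at or before its own position. Tensor names and the masking rule are illustrative.

```python
import torch

def causal_cross_attention(token_q, concept_kv, concept_end_pos, token_pos):
    """Let each token attend only to concepts already complete at its position.

    token_q:         (T, d) decoder-side token queries
    concept_kv:      (C, d) reasoned concept states (keys and values shared for brevity)
    concept_end_pos: (C,)   position of the last token covered by each concept
    token_pos:       (T,)   absolute position of each token
    """
    scores = token_q @ concept_kv.T / token_q.shape[-1] ** 0.5   # (T, C)
    visible = concept_end_pos[None, :] <= token_pos[:, None]     # causal mask over concepts
    scores = scores.masked_fill(~visible, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)  # tokens before the first complete concept attend to nothing
    return attn @ concept_kv
```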
The team also carried out an independent kernel analysis of variable-length FlashAttention (FlashAttention varlen), which yielded three significant insights.
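For readers unfamiliar with the kernel, the sketch below shows a typical invocation of variable-length FlashAttention, assuming the flash-attn 2.x Python interface (flash_attn_varlen_func): per-document lengths become cumulative offsets so that packed sequences never attend across their boundaries. The wrapper function and its dummy inputs are illustrative.

```python
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x interface

def packed_causal_attention(per_doc_lens, nheads=8, headdim=64, device="cuda"):
    """Causal attention over documents packed into one sequence, with no
    attention across document boundaries (random q/k/v for illustration)."""
    cu = [0]
    for n in per_doc_lens:
        cu.append(cu[-1] + n)                      # cumulative offsets [0, l0, l0+l1, ...]
    cu_seqlens = torch.tensor(cu, dtype=torch.int32, device=device)
    total = cu[-1]
    q = torch.randn(total, nheads, headdim, dtype=torch.bfloat16, device=device)
    k, v = torch.randn_like(q), torch.randn_like(q)
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max(per_doc_lens), max_seqlen_k=max(per_doc_lens),
        causal=True,
    )  # (total, nheads, headdim)
```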
The results show that DLCM reaches an average accuracy of 43.92%, a 2.69-point improvement over the baseline's 41.23%. These gains are not uniformly distributed across tasks, however, revealing a clear split between reasoning-dominated benchmarks and those that depend on fine-grained token-level alignment.
Performance consistently, and often significantly, improves on benchmarks that emphasize multi-step reasoning, hypothesis selection, and implicit commonsense reasoning.
DLCM focuses computation on regions of structural importance by compressing locally predictable segments and allocating the majority of the model's capacity to the high-dimensional concept backbone.
The encoder-compression-decoder design inevitably reduces token-level granularity within concepts, potentially obscuring the micro-level distinctions some tasks require. This dip is localized rather than widespread: boundary tokens are modeled with greater precision, while tokens in the interior of a concept may trade some fine-grained accuracy for better global coherence.
On knowledge-intensive and multilingual benchmarks, DLCM's structural optimization is geared towards reasoning over content with non-uniform information density rather than towards uniform, memorization-based retrieval.
Experimental findings also support the research team's fundamental design principle: Redirecting computation from redundant token-level processing to dense concept-level reasoning significantly boosts effective capacity without a corresponding increase in inference costs.
In ablation experiments, the team compared two boundary-prediction mechanisms for sequence compression: a learned neural predictor with compression-rate regularization and a rule-based predictor based on cosine similarity.
The learned predictor proved noticeably unstable: after initially compressing to roughly 2,000 tokens, the compressed length crept back up and eventually settled around 4,300 tokens as the model gradually learned to compress less over time. The rule-based predictor, by contrast, was far more stable, quickly converging to about 2,000 tokens and holding that level throughout training.
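A minimal sketch of such a rule-based predictor, assuming boundaries are opened wherever the cosine similarity between adjacent encoder states drops below a threshold (the value 0.5, the function name, and the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cosine_boundaries(hidden, threshold=0.5):
    """Open a new concept wherever consecutive hidden states diverge.

    hidden:    (T, d) token-level hidden states from the encoder
    threshold: hypothetical cutoff; low similarity to the previous state
               means a new concept starts at that token
    Returns a bool mask of shape (T,) that is True where a concept begins.
    """
    sim = F.cosine_similarity(hidden[1:], hidden[:-1], dim=-1)   # (T-1,)
    boundaries = torch.zeros(hidden.shape[0], dtype=torch.bool)
    boundaries[0] = True                  # the first token always opens a concept
    boundaries[1:] = sim < threshold
    return boundaries

# The realized compression follows directly from the mask:
# tokens_per_concept = hidden.shape[0] / boundaries.sum()
```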
Moreover, there are notable variations in compression density across different content types. With an 8x compression target, technical English retains significantly more tokens per concept (10.58) compared to technical Chinese (6.09) or code (6.14).
This confirms that the global regularization mechanism effectively decouples the compression target from strict per-sequence constraints: rather than imposing uniform segment lengths, the model adapts granularity to the inherent characteristics of each content type.
References:
https://arxiv.org/pdf/2512.24617