Full Paper List
Ultra-Sparse Memory Network • arXiv:2411.12364 • Published • 23
arXiv:2409.19606 • Published • 23
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models • arXiv:2411.03884 • Published • 28
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling • arXiv:2501.16975 • Published • 31
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models • arXiv:2502.15499 • Published • 15
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization • arXiv:2503.04598 • Published • 21
Frac-Connections: Fractional Extension of Hyper-Connections • arXiv:2503.14125 • Published • 22
Efficient Pretraining Length Scaling • arXiv:2504.14992 • Published • 20
Scaling Law for Quantization-Aware Training • arXiv:2505.14302 • Published • 76
Stepsize anything: A unified learning rate schedule for budgeted-iteration training • arXiv:2505.24452 • Published • 5
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning • arXiv:2508.18756 • Published • 36