# Model Card for CrystaLLM-pi_base

## Model Details

### Model Description
CrystaLLM-pi_base is an unconditional generative model designed for the generation of valid inorganic crystal structures. It serves as the foundational pre-trained model for the CrystaLLM-pi framework. Based on a GPT-2 decoder-only architecture, it is trained on a large corpus of Crystallographic Information Files (CIFs) to learn the syntax, symmetry, and chemical rules governing crystalline matter.
This model does not accept property conditioning vectors. It generates structures based on text prompts (e.g., chemical composition or space group) or unconditionally (ab-initio generation).
- Developed by: Bone et al. (University College London)
- Model type: Autoregressive Transformer (GPT-2)
- Language(s): CIF (Crystallographic Information File) syntax
- License: MIT
### Model Sources
- Repository: GitHub: CrystaLLM-pi
- Paper: Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)
- Dataset: HuggingFace: c-bone/lematerial_clean
## Uses

### Direct Use
The model is intended for:
- Unconditional Generation: Exploring the general chemical space of stable crystals.
- Composition/Space Group Completion: Generating valid structures given a partial prompt (e.g., a chemical formula); see the prompt sketch after this list.
- Fine-tuning base: Serving as the pre-trained initialization for property-conditional models (such as CrystaLLM-pi_bandgap or CrystaLLM-pi_density).
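
Because every CIF begins with a `data_` block header, prompts can be written as the opening lines of a CIF. The formats below are illustrative sketches; consult the repository for the exact prompt syntax the tokenizer expects.

```python
# Illustrative prompt formats, assuming prompts are partial CIFs as in the
# original CrystaLLM. Check the CrystaLLM-pi repository for exact syntax.

# Ab-initio generation: start from the bare block keyword and let the model
# pick composition, symmetry, and geometry.
prompt_unconditional = "data_"

# Composition completion: fix the cell composition in the block header.
prompt_composition = "data_Na2Cl2\n"

# Space-group completion: additionally pin the symmetry with the
# corresponding CIF tag.
prompt_space_group = (
    "data_Na2Cl2\n"
    "_symmetry_space_group_name_H-M Fm-3m\n"
)
```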
### Out-of-Scope Use
- Property Conditioning: This model cannot be steered by properties like band gap or density. Use the specific fine-tuned variants for those tasks.
- Large Unit Cells: The 1024-token context window limits generation to roughly 20 atoms per unit cell; larger structures cannot be represented.
## Bias, Risks, and Limitations
- Training Distribution: The model reflects the biases present in the LeMaterial dataset. It is most effective at generating structures similar to known stable inorganic compounds.
- Validity: While it learns CIF syntax robustly, it may still generate physically invalid structures (e.g., overlapping atoms) or chemically unstable compositions.
## How to Get Started with the Model
For instructions on how to load and run generation with this model, refer to the `_load_and_generate.py` script in the CrystaLLM-pi GitHub repository.
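
As a minimal illustrative sketch, assuming the checkpoint is published in a `transformers`-compatible format (the repo id below is a hypothetical placeholder):

```python
# Minimal generation sketch. Assumes a transformers-compatible checkpoint;
# the repo id is a hypothetical placeholder -- the supported loading path is
# the repository's _load_and_generate.py script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "c-bone/CrystaLLM-pi_base"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

# Prompt with a composition header (see the prompt formats above).
inputs = tokenizer("data_Na2Cl2\n", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,  # stay well within the 1024-token context
        do_sample=True,      # sampling yields diverse candidate structures
        temperature=0.8,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```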
## Training Details

### Training Data
The model was pre-trained on the LeMaterial dataset (specifically c-bone/lematerial_clean), a large-scale collection of ~4.35 million augmented CIFs derived from major materials databases.
- Source: LeMaterial (via c-bone/lematerial_clean)
- Preprocessing: CIFs are deduplicated, augmented (with symmetry operations), and tokenized (sketched below).
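
The full pipeline is implemented in the repository. As a sketch of the kind of parse-standardize-rewrite step involved (using `pymatgen` purely for illustration):

```python
# Illustrative CIF cleaning step: parse, standardize the setting, and
# re-serialize with explicit symmetry operations. The actual CrystaLLM-pi
# pipeline is in the repository; this is a simplified stand-in.
from pymatgen.core import Structure
from pymatgen.io.cif import CifWriter
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def clean_cif(raw_cif: str, symprec: float = 0.1) -> str:
    structure = Structure.from_str(raw_cif, fmt="cif")
    # Standardizing to a conventional cell makes equivalent entries from
    # different source databases serialize identically, aiding deduplication.
    conventional = SpacegroupAnalyzer(
        structure, symprec=symprec
    ).get_conventional_standard_structure()
    # symprec asks CifWriter to detect and write symmetry operations.
    return str(CifWriter(conventional, symprec=symprec))
```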
### Training Procedure
- Architecture: GPT-2-style decoder, small configuration (~25.9M parameters).
- Objective: Causal Language Modeling (Next-token prediction).
- Loss Function: Cross-entropy with specific weighting for fixed syntax tokens to accelerate learning of the CIF format (sketched below).
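
The exact weighting scheme is given in the paper. As a minimal PyTorch sketch of per-token weighted cross-entropy (the weight value and the mask identifying fixed syntax tokens are illustrative assumptions):

```python
# Sketch of weighted next-token cross-entropy. syntax_weight and the
# syntax_mask construction are illustrative; see the paper for the scheme.
import torch
import torch.nn.functional as F

def weighted_ce(logits, targets, syntax_mask, syntax_weight=2.0):
    """logits: (B, T, V); targets: (B, T) token ids;
    syntax_mask: (B, T) bool, True where the target is a fixed CIF token."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
        reduction="none",
    ).reshape(targets.shape)                  # back to (B, T)
    weights = torch.ones_like(per_token)
    weights[syntax_mask] = syntax_weight      # emphasize fixed syntax tokens
    return (weights * per_token).sum() / weights.sum()
```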
## Evaluation

### Metrics
The model is evaluated based on:
- Validity: The rate at which generated sequences can be parsed as valid CIF files (see the sketch after this list).
- Structural Consistency: Adherence to space group symmetry and reasonable bond lengths.
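
As a sketch of how the validity metric can be computed, using `pymatgen` to attempt parsing each generated sequence (the structural-consistency checks require additional analysis beyond this):

```python
# Validity sketch: fraction of generated sequences that parse as CIF.
from pymatgen.core import Structure

def validity_rate(generated_cifs: list[str]) -> float:
    ok = 0
    for text in generated_cifs:
        try:
            Structure.from_str(text, fmt="cif")  # raises on malformed CIF
            ok += 1
        except Exception:
            pass
    return ok / len(generated_cifs)
```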
### Results
The base model achieves high validity rates and effectively learns to generate chemically plausible structures, serving as a robust foundation for downstream property-conditioning tasks.
## Citation
```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers},
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299},
}
```