# Model Card for CrystaLLM-pi_base

## Model Details

### Model Description
CrystaLLM-pi_base is an unconditional generative model designed for the generation of valid inorganic crystal structures. It serves as the foundational pre-trained model for the CrystaLLM-pi framework. Based on a GPT-2 decoder-only architecture, it is trained on a large corpus of Crystallographic Information Files (CIFs) to learn the syntax, symmetry, and chemical rules governing crystalline matter.
This model does not accept property conditioning vectors. It generates structures based on text prompts (e.g., chemical composition or space group) or unconditionally (ab-initio generation).
- Developed by: Bone et al. (University College London)
- Model type: Autoregressive Transformer (GPT-2)
- Language(s): CIF (Crystallographic Information File) syntax
- License: MIT
### Model Sources
- Repository: GitHub: CrystaLLM-pi
- Paper: Discovery and recovery of crystalline materials with property-conditioned transformers (arXiv:2511.21299)
- Dataset: HuggingFace: c-bone/lematerial_clean
## Uses

### Direct Use
The model is intended for:
- Unconditional Generation: Exploring the general chemical space of stable crystals.
- Composition/Space Group Completion: Generating valid structures given a partial prompt (e.g., a chemical formula); see the prompt sketch after this list.
- Fine-tuning base: Serving as the pre-trained initialization for property-conditional models (such as CrystaLLM-pi_bandgap or CrystaLLM-pi_density).
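
Because every CIF begins with a `data_` block header, prompts can be written as the opening lines of a CIF. The formats below are illustrative sketches; consult the repository for the exact prompt syntax the tokenizer expects.

```python
# Illustrative prompt formats, assuming prompts are partial CIFs as in the
# original CrystaLLM. Check the CrystaLLM-pi repository for exact syntax.

# Ab-initio generation: start from the bare block keyword and let the model
# pick composition, symmetry, and geometry.
prompt_unconditional = "data_"

# Composition completion: fix the cell composition in the block header.
prompt_composition = "data_Na2Cl2\n"

# Space-group completion: additionally pin the symmetry with the
# corresponding CIF tag.
prompt_space_group = (
    "data_Na2Cl2\n"
    "_symmetry_space_group_name_H-M Fm-3m\n"
)
```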
### Out-of-Scope Use
- Property Conditioning: This model cannot be steered by properties like band gap or density. Use the specific fine-tuned variants for those tasks.
- Large Unit Cells: The 1024-token context window limits generation to roughly 20 atoms per unit cell; larger structures cannot be represented.
## Bias, Risks, and Limitations
- Training Distribution: The model reflects the biases present in the LeMaterial dataset. It is most effective at generating structures similar to known stable inorganic compounds.
- Validity: While it learns CIF syntax robustly, it may still generate physically invalid structures (e.g., overlapping atoms) or chemically unstable compositions.
## How to Get Started with the Model
For instructions on how to load and run generation with this model, refer to the `_load_and_generate.py` script in the CrystaLLM-pi GitHub repository.
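
As a minimal illustrative sketch, assuming the checkpoint is published in a `transformers`-compatible format (the repo id below is a hypothetical placeholder):

```python
# Minimal generation sketch. Assumes a transformers-compatible checkpoint;
# the repo id is a hypothetical placeholder -- the supported loading path is
# the repository's _load_and_generate.py script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "c-bone/CrystaLLM-pi_base"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

# Prompt with a composition header (see the prompt formats above).
inputs = tokenizer("data_Na2Cl2\n", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=512,  # stay well within the 1024-token context
        do_sample=True,      # sampling yields diverse candidate structures
        temperature=0.8,
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```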
## Training Details

### Training Data
The model was pre-trained on the LeMaterial dataset (specifically c-bone/lematerial_clean), a large-scale collection of ~4.35 million augmented CIFs derived from major materials databases.
- Source: LeMaterial (via c-bone/lematerial_clean)
- Preprocessing: CIFs are deduplicated, augmented (with symmetry operations), and tokenized (sketched below).
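
The full pipeline is implemented in the repository. As a sketch of the kind of parse-standardize-rewrite step involved (using `pymatgen` purely for illustration):

```python
# Illustrative CIF cleaning step: parse, standardize the setting, and
# re-serialize with explicit symmetry operations. The actual CrystaLLM-pi
# pipeline is in the repository; this is a simplified stand-in.
from pymatgen.core import Structure
from pymatgen.io.cif import CifWriter
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def clean_cif(raw_cif: str, symprec: float = 0.1) -> str:
    structure = Structure.from_str(raw_cif, fmt="cif")
    # Standardizing to a conventional cell makes equivalent entries from
    # different source databases serialize identically, aiding deduplication.
    conventional = SpacegroupAnalyzer(
        structure, symprec=symprec
    ).get_conventional_standard_structure()
    # symprec asks CifWriter to detect and write symmetry operations.
    return str(CifWriter(conventional, symprec=symprec))
```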
### Training Procedure
- Architecture: GPT-2-style decoder, small configuration (~25.9M parameters).
- Objective: Causal Language Modeling (Next-token prediction).
- Loss Function: Cross-entropy with specific weighting for fixed syntax tokens to accelerate learning of the CIF format (sketched below).
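
The exact weighting scheme is given in the paper. As a minimal PyTorch sketch of per-token weighted cross-entropy (the weight value and the mask identifying fixed syntax tokens are illustrative assumptions):

```python
# Sketch of weighted next-token cross-entropy. syntax_weight and the
# syntax_mask construction are illustrative; see the paper for the scheme.
import torch
import torch.nn.functional as F

def weighted_ce(logits, targets, syntax_mask, syntax_weight=2.0):
    """logits: (B, T, V); targets: (B, T) token ids;
    syntax_mask: (B, T) bool, True where the target is a fixed CIF token."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*T, V)
        targets.reshape(-1),                  # (B*T,)
        reduction="none",
    ).reshape(targets.shape)                  # back to (B, T)
    weights = torch.ones_like(per_token)
    weights[syntax_mask] = syntax_weight      # emphasize fixed syntax tokens
    return (weights * per_token).sum() / weights.sum()
```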
## Evaluation

### Metrics
The model is evaluated based on:
- Validity: The rate at which generated sequences can be parsed as valid CIF files (see the sketch after this list).
- Structural Consistency: Adherence to space group symmetry and reasonable bond lengths.
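
As a sketch of how the validity metric can be computed, using `pymatgen` to attempt parsing each generated sequence (the structural-consistency checks require additional analysis beyond this):

```python
# Validity sketch: fraction of generated sequences that parse as CIF.
from pymatgen.core import Structure

def validity_rate(generated_cifs: list[str]) -> float:
    ok = 0
    for text in generated_cifs:
        try:
            Structure.from_str(text, fmt="cif")  # raises on malformed CIF
            ok += 1
        except Exception:
            pass
    return ok / len(generated_cifs)
```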
### Results
The base model achieves high validity rates and effectively learns to generate chemically plausible structures, serving as a robust foundation for downstream property-conditioning tasks.
## Citation
```bibtex
@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers},
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299},
}
```