BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscience-workshop

AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

odegiber authored a paper 3 days ago

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

odegiber authored a paper 3 days ago

Open Machine Translation for Esperanto

israel authored a paper 6 days ago

CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

RTT1

authored a paper 27 days ago

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Paper • 2603.13428 • Published Mar 13 • 21

in bigscience/bloom about 1 month ago

[SPAM] Deleted

#289 opened about 1 month ago by

authored a paper about 1 month ago

RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

Paper • 2603.09723 • Published Mar 10 • 7

authored a paper about 1 month ago

LLM2Vec-Gen: Generative Embeddings from Large Language Models

Paper • 2603.10913 • Published Mar 11 • 44

in bigscience/bloom about 1 month ago

pretokenizer Regex issues?

#278 opened almost 2 years ago by

authored 2 papers about 1 month ago

Agents of Chaos

Paper • 2602.20021 • Published Feb 23 • 35

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Paper • 2602.08964 • Published Feb 9 • 1

in bigscience/bloom about 1 month ago

Test PR

#286 opened about 1 month ago by

Test discussion

#287 opened about 1 month ago by

Test discussion

#288 opened about 1 month ago by

authored a paper about 2 months ago

References Improve LLM Alignment in Non-Verifiable Domains

Paper • 2602.16802 • Published Feb 18 • 2

RTT1

submitted a paper to Daily Papers 2 months ago

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

Paper • 2602.07075 • Published Feb 6 • 19

authored a paper 3 months ago

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Paper • 2601.17277 • Published Jan 24 • 6

authored a paper 3 months ago

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Paper • 2601.09876 • Published Jan 14 • 7

shubhamagarwal92

authored a paper 4 months ago

BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Paper • 2511.10338 • Published Nov 13, 2025

authored a paper 4 months ago

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Paper • 2512.20757 • Published Dec 23, 2025 • 18

in bigscience/bloomz-560m 4 months ago

Fails to load with transformers v4.57+

#14 opened 4 months ago by

authored a paper 4 months ago

Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem

Paper • 2512.03073 • Published Nov 27, 2025 • 7

posted an update 4 months ago

PatchDNA, a DNA foundation model based on Meta's BLT tokenization strategy https://www.biorxiv.org/content/10.1101/2025.11.28.691095v1

authored a paper 5 months ago

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

Paper • 2406.07835 • Published Jun 10, 2024 • 2