Wonho Song
@loubnabnl Thank you for your response, it’s been extremely helpful.
While reviewing the dataset sources, I came across a few questions and would like to ask for clarification:
1. Regarding the pes2o dataset, I couldn’t find it in the shared pretraining collection. Is it referring to the allenai/peS2o dataset?
2. Based on the config, I noticed three datasets: pull-requests, jupyter-scripts, and github-issues. Could you please clarify the sources for each of these?
3. For the kaggle dataset, did you use the HuggingFaceTB/issues-kaggle-notebooks source? I saw that there are two subsets, issues and kaggle — did you use only the kaggle subset?
Thanks again!
@loubnabnl Thanks for sharing this - it's been incredibly helpful for reproduction.
I have one more question: I noticed that the dataset sources and their proportions are listed in the linked document. Did the HF team use the dataset sources as-is, without any preprocessing?
I’m also curious whether the datasets under the dataset_folder: path in S3 could be shared, or whether there are any plans to release them.
Hi HF team,
Thanks again for sharing your great research.
I’m trying to reproduce SmolLM3 and was wondering whether, if your sharing policy allows it, you could share a wandb link for the training run.
Thanks again.
Thanks for sharing.
SmolLM3: smol, multilingual, long-context reasoner