Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
Organization Card
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
BEE-spoke-data/smol_llama-101M-GQA
Text Generation • 0.1B • Updated • 2k • 33
BEE-spoke-data/smol_llama-81M-tied
Text Generation • 81.3M • Updated • 907 • 10
BEE-spoke-data/smol_llama-220M-GQA
Text Generation • 0.2B • Updated • 2.42k • 13
BEE-spoke-data/verysmol_llama-v11-KIx2
Text Generation • 58.1M • Updated • 969 • 4
models 58
BEE-spoke-data/NVIDIA-Nemotron-Parse-v1.2
Image-Text-to-Text • 0.9B • Updated • 77
BEE-spoke-data/neobert-100k-test
Fill-Mask • 0.1B • Updated • 1
BEE-spoke-data/tiny-random-MPNetForMaskedLM
Fill-Mask • 237k • Updated • 2
BEE-spoke-data/bpe-tokenizer-32k-smolNeoX
Updated
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-orig
Updated
BEE-spoke-data/wordpiece-tokenizer-32k-en_code-msp
Updated
BEE-spoke-data/pegasus-x-base-synthsumm_open-16k
Summarization • 0.3B • Updated • 34 • 2
BEE-spoke-data/tFINE-680m-e32-d16-infinity_instruct-L2
Text Generation • 0.7B • Updated • 3
BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan
0.7B • Updated
BEE-spoke-data/tFINE-900m-instruct-orpo
0.9B • Updated • 1
datasets 82
BEE-spoke-data/SurvivorLib-Nanonets-OCR-s
Viewer • Updated • 14.4k • 17 • 2
BEE-spoke-data/SurvivorLib-rolmOCR
Viewer • Updated • 14.6k • 51 • 1
BEE-spoke-data/govdocs1-pdf-source
Viewer • Updated • 235k • 883 • 4
BEE-spoke-data/napierone-pdf-nanonets-s
Viewer • Updated • 9.96k • 9
BEE-spoke-data/napierone-pdf-olmOCR
Viewer • Updated • 19k • 22
BEE-spoke-data/LONGCOT-merged-1M
Viewer • Updated • 1.7M • 39 • 2
BEE-spoke-data/cosmopedia-v2-mincols
Viewer • Updated • 39.1M • 23 • 1
BEE-spoke-data/reddit-title-body-hf
Viewer • Updated • 251M • 137 • 4
BEE-spoke-data/bigpatent-all
Viewer • Updated • 2.43M • 213
BEE-spoke-data/google_wellformed_query-hf
Viewer • Updated • 25.1k • 14