All HF Hub posts
Parveshiiii
posted an update 3 days ago
Post
Just did something I’ve been meaning to try for ages.
In only 3 hours, on 10 billion+ tokens, I trained a custom BPE + tiktoken-style tokenizer using my new library microtok — and it hits the same token efficiency as Qwen3.
Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated.
Turns out it doesn’t have to be.
microtok makes the whole process stupidly simple — literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.
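For intuition, here's the core merge loop that every BPE trainer runs under the hood (including, presumably, the Hugging Face tokenizers backend microtok wraps). This pure-Python sketch is my illustration of the algorithm, not microtok's actual API:

```python
from collections import Counter

def train_bpe(words, num_merges):
    # represent each word as a tuple of symbols; start from characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # apply the chosen merge everywhere in the vocabulary
        new_vocab = Counter()
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=3)
# first merge fuses the most frequent adjacent pair, here ('l', 'o')
```

Real trainers like the one microtok builds on do exactly this, just with byte-level pre-tokenization and heavily optimized pair counting so it scales to billions of tokens.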
If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for.
I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.
Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok
prithivMLmods
posted an update 1 day ago
Post
Flux-Klein-KV-Edit-Consistency demo is now available on Spaces. It preserves character identity and delivers high-quality, realistic results after edits. No special prompts needed: just upload an image, type your prompt, and get the resulting image back blazing fast.
🔥 Demo Space: prithivMLmods/flux-klein-kv-edit-consistency
🤗 Model: black-forest-labs/FLUX.2-klein-9b-kv
🤗 Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
🔗 Gradio Server Mode: https://www.gradio.app/main/guides/server-mode
➔ Built with Headless Gradio, an alternative to using gr.Blocks for creating the frontend and triggering events, powered by FastAPI + Gradio. You can now design the frontend however you want, with continued support for APIs, MCP, and ZeroGPU.
➔ Gradio Server Mode is now available from [email protected].
To learn more, visit the app page or the respective model pages.
Post
Article Highlight: SyntheticGen, Controllable Diffusion for Long-Tail Remote Sensing
🛰️ Why is remote-sensing segmentation still hard—even with strong models?
Because the issue is not only the model… it’s the data.
In real-world datasets like LoveDA, class distributions are highly imbalanced, and the problem is compounded by Urban/Rural domain shifts, where visual characteristics and class frequencies differ significantly. This leads to poor learning for minority classes and weak generalization.
⚖️ The Idea: Make Data Controllable
Instead of treating data augmentation as a random process, SyntheticGen turns it into a controllable pipeline.
👉 What if you could:
Specify which classes you want more of?
Control how much of each class appears?
Generate data that respects domain (Urban/Rural) characteristics?
That’s exactly what SyntheticGen enables.
🧠 How It Works
SyntheticGen introduces a structured generation process:
Layout Generation (Stage A)
A ratio-conditioned discrete diffusion model generates semantic layouts that match user-defined class distributions.
Image Synthesis (Stage B)
A ControlNet-guided Stable Diffusion pipeline converts layouts into realistic remote-sensing imagery.
💡 This separation between semantic control and visual realism is key—it allows both precision and high-quality generation.
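For intuition, the ratio-conditioning signal Stage A consumes is essentially a per-class frequency vector computed over a semantic layout. A minimal sketch (my illustration, not the paper's code):

```python
from collections import Counter

def class_ratios(label_map, num_classes):
    # label_map: 2D list of integer class ids (a semantic layout)
    counts = Counter(c for row in label_map for c in row)
    total = sum(counts.values())
    return [counts.get(c, 0) / total for c in range(num_classes)]

layout = [[0, 0, 1],
          [0, 2, 1]]
ratios = class_ratios(layout, num_classes=3)
# class 0 covers half the pixels, class 1 a third, class 2 a sixth
```

Conditioning the layout diffusion model on a user-specified version of this vector is what lets you dial minority classes up before any image is ever synthesized.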
Why It Matters
Tackles long-tail imbalance directly at the data level
Improves minority-class segmentation performance
Enhances cross-domain generalization (Urban ↔ Rural)
Moves toward data-centric AI, where we design training data—not just models
Recent research shows that diffusion-based synthetic data can significantly improve performance in long-tailed settings by generating high-value samples for rare or difficult cases.
SyntheticGen takes this further by making the process explicitly controllable, not just generative.
📄 Paper
https://arxiv.org/abs/2602.04749
💻 Code & Synthetic Data
https://github.com/Buddhi19/SyntheticGen
unmodeled-tyler
posted an update 2 days ago
Post
Hey Hugging Face!
PRODUCT HUNT LINK: https://www.producthunt.com/products/quanta-intellect?utm_source=other&utm_medium=social
I've been sharing my new AI browser Vessel the last few days and I've gotten some great feedback/interest from a lot of you!
I'm excited to announce that Vessel Browser is now live on Product Hunt! If this is the first you've heard of it, check it out! Vessel is an open source AI browser built specifically for agents on Linux. It's not a fork of an existing browser, and it doesn't assume that the human is the primary operator.
If you've already tried Vessel Browser, feel free to leave a review on Product Hunt sharing what you thought - or if you'd prefer, send me an email directly or reach out on Twitter if you have any questions. I'm perpetually online & happy to chat 😀
I'm committed to building the best open source AI browser out there, and Vessel is only going to improve as time goes on!
Post
Neural Gas is a classical unsupervised learning algorithm for vector quantization and topology learning, introduced in the early 1990s. It maintains a set of prototype vectors that move through the data space and gradually approximate the underlying distribution by ranking samples and adapting all units accordingly.
While the original formulation is algorithmically elegant, most existing implementations remain procedural and non-differentiable, which limits their integration with modern deep learning systems.
This project introduces a **differentiable** implementation of Neural Gas in PyTorch:
https://github.com/francesco-p/ngas-pytorch
The key idea is to reinterpret the update rules in a way that is compatible with autograd, allowing the algorithm to be embedded inside end-to-end trainable pipelines.
This shift enables several directions that are difficult or impossible with standard implementations:
- joint optimization of Neural Gas with neural networks
- inclusion of topology-learning modules inside differentiable models
- gradient-based tuning of algorithm parameters
- hybrid architectures combining representation learning and vector quantization
The repository provides a clean PyTorch implementation and focuses on making the core mechanism usable as a first-class differentiable component, rather than a standalone preprocessing step.
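The core trick can be sketched in a few lines: compute rank-based neighborhood weights from the distances (ranks are non-differentiable, so they act as constants), then let gradients flow through the weighted squared distances. This is a minimal sketch of the idea, not the repository's actual API:

```python
import torch

def neural_gas_loss(x, prototypes, lam=1.0):
    # x: (batch, dim) samples; prototypes: (k, dim) learnable tensor
    d2 = torch.cdist(x, prototypes) ** 2        # squared distances, (batch, k)
    ranks = d2.argsort(dim=1).argsort(dim=1)    # rank of each prototype per sample
    h = torch.exp(-ranks.float() / lam)         # neighborhood weights (no grad path)
    return (h * d2).mean()                      # gradients flow through d2 only

torch.manual_seed(0)
x = torch.randn(64, 2)
protos = torch.randn(8, 2, requires_grad=True)
opt = torch.optim.SGD([protos], lr=0.05)

before = neural_gas_loss(x, protos).item()
for _ in range(50):
    opt.zero_grad()
    loss = neural_gas_loss(x, protos)
    loss.backward()
    opt.step()
after = neural_gas_loss(x, protos).item()   # loss decreases as prototypes adapt
```

Because the loss is an ordinary scalar on the autograd graph, the prototypes can sit inside any larger model and be trained jointly with it.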
In parallel, an interactive playground was built to visualize the behavior of Neural Gas during training and better understand how prototypes adapt to the data distribution:
https://francesco-p.github.io/res/neural-gas/playground.html
The goal is to revisit a well-known algorithm and make it compatible with current machine learning workflows, where differentiability is a central constraint rather than an afterthought.
BibbyResearch
posted an update 3 days ago
Post
🍌 Paper Banana is now live! Create academic illustrations using natural language
We just launched Paper Banana — a tool that lets you generate clean academic illustrations simply by describing them in natural language.
🔗 Try it here: https://trybibby.com/paper-banana
Whether you need diagrams for papers, presentations, or teaching materials, Paper Banana helps you turn ideas into visuals in seconds.
We’d love your feedback:
What did you like?
What features should we add next?
Give it a spin and let us know what you think! 🚀
Dear Huggingface, show this post to all my fellow researchers!
kanaria007
posted an update about 11 hours ago
Post
✅ Article highlight: *Adversarial SI* (art-60-050, v0.1)
TL;DR:
If SI-Core is meant for real deployment, it cannot assume benevolent actors. This article looks at *adversarial SI*: malicious Jumps, malicious RML calls, poisoned Genius Traces, metric gaming, compromised peers, and policy-plane artifacts as attack surfaces.
The core claim is simple: *OBS / ID / MEM / ETH / EVAL / PoLB are not just governance layers — they are also a defensive fabric.*
Read:
kanaria007/agi-structural-intelligence-protocols
Why it matters:
• treats SI-Core invariants as security invariants, not just safety abstractions
• makes abuse structurally expensive through traceability, fail-closed ETH, and scoped capabilities
• reuses *SCover / SCI / CAS* as security and forensics signals
• treats red-teaming as structured experimentation, not ad hoc chaos
What’s inside:
• an SI-native threat taxonomy: malicious Jumps, RML abuse, peer spoofing, metric gaming, policy-plane tampering
• defensive uses of *ID / OBS / MEM / ETH / EVAL / PoLB*
• malicious Genius Traces and how to vet or quarantine them
• *incident response as an SIR-native process*
• federated trust, revocation, quarantine, and graceful degradation
• red-team EvalSurfaces and abuse-resistant PoLB recipes
Key idea:
The goal is not invincibility. It is to make abuse *hard to execute, easy to detect, and easy to learn from* using the same structural language as the rest of SI-Core.
reaperdoesntknow
posted an update about 24 hours ago
Post
We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across 15 models spanning four novel architecture families: Mixture of Attentions (MoA), cross-architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM), total compute cost was $24 on a single AMD EPYC 9454P processor. We introduce seven methodological pillars: (1) FP32 precision preservation, with experiments demonstrating a 5,810× single-operation error and a 23,225× compounding error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architectures designed for geometric correctness (metric-space attention, triangle inequality enforcement, sparse expert routing) naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.
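The compounding FP16 error the abstract quantifies is easy to reproduce in miniature using the stdlib's half-precision `struct` format. This toy accumulation is my illustration of the mechanism, not the paper's experiment or its exact figures:

```python
import struct

def to_fp16(x: float) -> float:
    # round-trip through IEEE 754 half precision via struct's 'e' format
    return struct.unpack('e', struct.pack('e', x))[0]

step = 0.0001
acc16 = 0.0   # emulated FP16 accumulator
acc32 = 0.0   # full-precision accumulator (Python float)
for _ in range(10_000):
    acc16 = to_fp16(acc16 + to_fp16(step))
    acc32 += step

# exact sum is 1.0; the FP16 accumulator stalls far below it once the
# update becomes smaller than half a ULP of the running total
```

The same absorption effect hits small gradient updates during low-precision training, which is why deep networks amplify it with depth.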
prometechinc
posted an update 1 day ago
Post
Cicikuş v4-5B (POFUDUK Edition) is a next-generation compact language model engineered for high-efficiency reasoning, adaptive intelligence, and behavioral coherence. Built on the Gemma 4B IT foundation and enhanced through advanced LoRA optimization and selective layer reconstruction, this model delivers powerful performance without the overhead of massive parameter counts.
🔗 Explore the model: pthinc/pofuduk_cicikus_v4_5B
🧠 Why Cicikuş?
In a world dominated by massive LLMs, Cicikuş takes a different path:
⚡ Fast & Efficient — Designed for edge deployment and low-resource environments
🎯 High Reasoning Accuracy — Strong results across MMLU, GSM8K, HumanEval, and more
🧩 Behavior-Aware Intelligence — Powered by the Behavioral Consciousness Engine (BCE)
🔍 Low Hallucination Rate — ~3% with built-in ethical filtering
🌍 Multilingual Capable — Optimized for English and Turkish