Co-founder at @HuggingFace - moonshots - angel
RT Georgi Gerganov llama.cpp now has an official website: https://llama.app Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer. The installation provides a single unified `llama` entrypoint which you can use to run/serve models and interface with 3rd-party agentic applications. While oriented towards simplified user experience, the new `llama` application also provides all the advanced functionality of the existing llama.cpp tooling with which experienced users are already familiar. Also note that all GGUF models that you might have already downloaded with llama.cpp in the past will be automatically available to use without downloading again (they are stored in the common HF cache on your machine). We have many improvements in the pipeline both at the UX and at the engine level and we plan to iteratively ship new things over the coming months. One of the main focuses will be seamless integration with local-friendly 3rd-party agents (such as Pi). In the meantime, we’ll continue to listen for feedback from the community and adjust accordingly, so keep letting us know what you think and need.
RT Michael Rabinovich Opus 4.8 just dropped and I ran it through our CAD tasks. 4.6 → 4.7 → 4.8 side by side. The results are unexpected!
RT Georgia Channing today was a massive day for protein engineering. esmfold2 dropped—next gen of the esm series, fully open on @huggingscience. 1.1 billion predicted structures, 6.8 billion sequences. 800m more entries than the alphafold db, and reportedly edging out alphafold3 on protein complexes, including antibody–antigen binding. alongside it: the new esm atlas. a huge expansion of known protein space, heavy on metagenomic sequences from soil, ocean, and the parts of biology that have been least characterised (until now!!) and if that weren't enough, litefold dropped the fineweb of proteins, so every major protein database (pdb included) aggregated, cleaned, and made plug-and-play in one place. these are the releases that push the whole field forward, and the pace of open science right now is almost motion-sickness inducing all of it on http://huggingscience.co (and ofc @huggingface)
RT OpenMed CARBON: 8B open DNA model, 65K-token context, whole human genome on a single GPU in <2 days. Clinical-language counterpart: OpenMed, 1,000+ open medical models on @huggingface, eval sets published with the weights. Apache 2.0 on both. 🤗
The Carbon tech report is now on bioRxiv. It provides a detailed recipe for training fully open and efficient DNA models - enjoy!
New inflection point in the accelerating growth of open-source models usage is coming
🦔Microsoft canceled its internal Claude Code licenses this week after token-based billing made the cost untenable, even for a company with effectively infinite cloud resources. Uber's CTO sent an internal memo warning the company burned through its entire 2026 AI budget in just
RT LeRobot We built a bipedal robot for about $2,500. A real, mostly 3D-printed robot you can build, repair, simulate, train, and control. Today we’re releasing LeRobot Humanoid: an open robot-learning platform with hardware, runtime, identification tools, and training environments. Blog post: https://huggingface.co/blog/VirgileBatto/lerobot-humanoid Repo: https://github.com/Virgileboat/lerobot-humanoid
I'm very excited about this extension to the celebrated Terminal-Bench to science. If you're a scientist (life, physical, earth, mathematical science, etc) interested in AI, definitely check this out! Terminal bench evaluate how good AI models are at controling tools on a computer to achieve a goal (using the command line). T-Bench science now extends that to "AI for Science" and it comes with a call to contribute your own (real scientific world) workflow to the benchmark (until August 2026). The more workflows and the more diverse they are, the better the next generation of AI models will be at helping you in your daily research work. Note that this is not a training dataset, it's to evaluate frontier model performances.
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 http://tbench.ai/news/tb-science-announcement @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to
It turns out DNA modeling is interestingly different from language modeling. Read more in our interactive blogpost/demo and explore our work here A joint work of the @huggingscience, pre-training and post-training teams here
We are releasing Carbon: a crazy fast DNA model Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days. Here are the tricks we used: When modelling DNA sequences a lot of the performance comes down to
View quoted postRT Leandro von Werra We are releasing Carbon: a crazy fast DNA model Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days. Here are the tricks we used: When modelling DNA sequences a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizer struggle because there are no whitespaces and character (called base in DNA) level tokenizers waste a lot of compute on too many tokens. Carbon is built with a unique tokenizer: we split sequences in chunks of 6 bases, but during both training and inference we can work with single base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens unique structure. The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size. We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life. https://huggingface.co/spaces/HuggingFaceBio/carbon-demo
My favorite 2026 trend is friends casually showing each other their vibe-coded life/work dashboards and AI setups. Ngl it feels exactly like bringing your Magic: The Gathering binder to school as a teen.
my 13 yo the other day: “we didn’t want to pay for the game with my friend so we just rebuilt it with Codex” me at 13 in the 90’s: HEX-editing the executable to NOP the license check because I didn't want to put all my pocket money in this game times they are a-changin’
RT Thomas G. Dietterich Attention @arxiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated. 1/
watching a team of agents tackling a hard theoretical physics problem is quite mesmerizing - self-correcting, deriving hard equations, computing intermediate results, re-estimating the best approach
Meet physics-intern🧑🎓, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on CritPt, a new SOTA on one of the hardest benchmarks for LLMs. Theoretical physics is hard for humans and LLMs alike. But physics-intern decomposes problems and
View quoted postRT David Louapre Meet physics-intern🧑🎓, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on CritPt, a new SOTA on one of the hardest benchmarks for LLMs. Theoretical physics is hard for humans and LLMs alike. But physics-intern decomposes problems and dispatches them to a team of specialized agents, solving research-level questions far more effectively than the base model alone.
Last days of early bird pricing!
RT Jousef Murad Zero-to-CAD 1M Zero-to-CAD is a large-scale dataset of 1,000,000 executable CAD construction sequences generated by an LLM operating inside a feedback-driven CAD environment. Datasets: https://huggingface.co/datasets/ADSKAILab/Zero-To-CAD-1m Paper: https://arxiv.org/pdf/2604.24479
RT clem 🤗 We're launching the agentic robotics app store today. Let's democratize AI robotics for all! 300+ apps shipped. 10,000 robots in the wild. It used to take weeks from a robotics engineer to build apps, now everyone can do it in hours with ML intern or your favorite neighborhood agent! My favorite reachy mini app was built by Joel, a 78yo marketing exec who'd never coded in his life. Personally, I built an office receptionist in two hours last week. More info to start building here: http://huggingface.co/blog/clem/reachymini-appstore
RT Adithya S K Excited to release the Ultimate guide to RL environments! Definitions of RL environments differ wildly in the LLM era, so we spent the last month building several RL environments across 6 different frameworks, domains and complexities to map out which are easiest to build with and which can be scaled to 1000s.
RT Georgia Channing For my competitive mathematicians! Stage 2 of the Mathematics Distillation Challenge - Equational Theories from the SAIR Foundation (backed by everyone’s favorite TTao) just launched. This stage focuses on autoformalization with Lean 4: participants build systems that turn equational reasoning into Lean-checkable proofs or counterexample certificates. There are two tracks: Solo, where one solver subprocess handles one problem at a time, and Marathon, where one solver subprocess handles a batch of problems under a shared global budget. Participants submit a single `http://solver.py` file up to 500 KB. Competition page: https://competition.sair.foundation/competitions/mathematics-distillation-challenge-equational-theories-stage2/overview Official repo: https://github.com/SAIRcompetition/equational-theories-lean-stage2 Playground: https://playground.sair.foundation/playground/mathematics-distillation-challenge-equational-theories-stage2 🧮🧮🧮
RT Laura Modiano In today's installment of parenting in tech: visiting the Robotics Lab at @ETH to see the incredible Prof @katzschmann's team projects with @Thom_Wolf and our kids
Activity on repository
thomwolf forked thomwolf/gradio from gradio-app/gradio
View on GitHubRT Georgia Channing 🚀🚀🚀🚀🚀 science is getting faster!!! since yesterday's launch, 13 orgs have put their best data and models on @huggingscience including none other than @AnthropicAI's BioMysteryBench, which just hit @huggingface → http://huggingscience.co
Super happy to have this one out. A clean organized up-to-date view of all the science resources (chemistry, biology, physics, materials, math) people have been sharing on the Hugging Face hub: datasets, blogs, models and more
🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗 open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af you shouldn't need to scrape arxiv, own your own
View quoted postRT Georgia Channing 🤗🤗🤗introducing Hugging Science -- the home of AI for science 🤗🤗🤗 open models and datasets are the powerhouse of science (see the PDB), but finding the models and data you actually need for your breakthrough is hard af you shouldn't need to scrape arxiv, own your own wetlab, fight a custom HDF5 parser, build a fusion stellarator, and beg for compute before you've trained a single epoch so we're changing that we've put all the best science on @huggingface in one place: - 78GB of genomics data - 11TB of PDE simulations - 100M cell profiles - 9T DNA base pairs - 13M molecular trajectories - 400k medical QA pairs and much more, all open, and all ready for training (+ you can also now filter and search by domain, task, and keyword) we've put together all the biggest releases from our partners at NASA, Google, OpenAI, Meta FAIR, Arc Institute, Ginkgo, SandboxAQ, Proxima Fusion, NVIDIA, Ai2, OpenADMET, InstaDeep, Future House, Polymathic AI, LeMaterial, Earth Species Project, Merck, and Eve Bio if you're not sure where you fit in -- work on open challenges for problems that matter: including fusion stellarator design, ADMET, antibody developability, multilingual medicine, catalysis and materials, and scientific reasoning. we're already changing how science gets done: a fusion startup needed a benchmark for stellarator plasma confinement that didn't exist. @proximafusion shipped ConStellaration on Hugging Science: a leaderboard, dataset, and eval metrics, all in one place. a drug discovery team wanted to predict hPXR induction. OpenADMET put up a blind challenge: 11,000+ compounds assayed at Octant, 513 held out, two tracks (pEC50 + structure). Anyone in the world can train and submit. an antibody team at @Ginkgo released GDPa1, a developability dataset for stability, manufacturability, and immunogenicity prediction, with a live leaderboard scoring every submission. if you know a problem the ML community should be working on, let us know...
GitHub central place might become challenged in a world where (1) we access/get code and libraries through agents/chats and (2) our codebases are increasingly custom tailored and build from scratch Agents explore the web better than us and can get/store code from many places
Ghostty is leaving GitHub. I'm GitHub user 1299, joined Feb 2008. I've visited GitHub almost every single day for over 18 years. It's never been a question for me where I'd put my projects: always GitHub. I'm super sad to say this, but its time to go. https://mitchellh.com/writing/ghostty-leaving-github
View quoted postRT Pengming Wang Today we’re releasing Laguna M.1 and Laguna XS.2, our first public models. Laguna XS.2 is our first open-weight release, with weights available today on Hugging Face: https://huggingface.co/poolside/Laguna-XS.2 A few details on what went into them: large-scale pre-training, data mixture optimization, synthetic data, optimizer efficiency, and async agent RL.
RT LeRobot 🤖🚀 𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 '𝗹𝗲𝗿𝗼𝗯𝗼𝘁-𝗿𝗼𝗹𝗹𝗼𝘂𝘁' — 𝗼𝗻𝗲 𝗖𝗟𝗜 𝘁𝗼 𝗱𝗲𝗽𝗹𝗼𝘆 𝗮𝗻𝘆 𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗽𝗼𝗹𝗶𝗰𝘆 𝗼𝗻 𝗮𝗻𝘆 𝗿𝗲𝗮𝗹 𝗿𝗼𝗯𝗼𝘁. Until today, running a trained policy on a real robot meant repurposing 'lerobot-record' — the same script you used to collect data. Need a different inference engine? That was yet another script. 'lerobot-rollout' replaces all of that with one command: pick a rollout strategy, pick an inference engine, and go. 🔵 Base — Run your policy. See if it works. No recording, no setup friction. 🟢 Sentry — Leave your robot running overnight. Episodes auto-rotate, data streams to the 🤗 Hub in the background. Come back to a ready-made dataset. 🟠 Highlight — A ring buffer silently records the last N seconds. See something interesting? Press a key and that moment is saved retroactively. No more "I wish I was recording." 🟣 DAgger — When the policy fails, pause it, take over with the teleoperator, record the correction, and resume. Every intervention is tagged so you retrain on exactly the moments that matter. All four strategies work with both synchronous and Real-Time Chunking (RTC) inference, and you can even plug your own! Every combination of strategy × inference engine just works.
RT David Duvenaud Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵
RT Noé Flandre I can never emphasize enough how the HF Smol Training Playbook is pleasant to read. Its written with clarity, genuine effort to maximize readers understanding and I mean LOOK at these amazing visuals! Its educational caviar
RT Michael Rabinovich Can frontier language models turn a technical drawing into real CAD? We ran the experiment. Gemini 3.1 Pro, Claude Opus 4.7, and GPT-5.4 got a set of technical drawings and were asked to reproduce a part as a valid 3D CAD model. Still a far way to go, but clearly a sign of life.
Love this work from Aksel and the post-training team at Hugging Face! Turns out the HF ecosystem (papers, datasets, models all accessible through CLI, skills and md files) is perfect for running SOTA ML agents: agents that can train any type of AI model to top performance. A few concrete runs: ⭐️ Scientific reasoning: the agent walked citations from the benchmark paper, pulled OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered variants from ARC/SciQ/MMLU, and ran 12 SFT ablations on Qwen3-1.7B. GPQA went from 10% to 32% in under 10 hours. Claude Code's best on the same prompt was 22.99%. ⭐️ HealthBench: it judged the existing datasets too noisy (!), generated 1100 synthetic examples covering emergencies, hedging and multilingual cases, upsampled 50x, and beat Codex by 60% (careful to check overfitting here) ⭐️ Competitive math: wrote a full GRPO script, launched A100s on HF Spaces, watched rewards climb and then collapse, and ran ablations until it found a recipe that held. And the harness is pretty tiny and simple. A couple of best practices and a handful of skills pointing at tools already in the ecosystem: arxiv and http://hf.co/papers for reading, the Hub for datasets and models, HF Jobs for compute, Trackio for metrics. Personal favorite is the "research skill" explaining how to do a SOTA landscape of a field (see https://github.com/huggingface/ml-intern/blob/main/agent/tools/research_tool.py) which is extremely powerful when combined with a simple prompt that basically tell "FIRST: Search HF ecosystem to find the best approach) (see https://github.com/huggingface/ml-intern/blob/main/agent/prompts/system_prompt.yaml#L14) On another note: setting good baselines on new benchmarks keeps getting harder when a setup this simple beats raw Codex by 60% on HealthBench out of the box. Give it a try if you're training AI models. We provisioned $1k of GPU resources and Anthropic credits for the quickest among you. Links: Github (CLI): https://github....
Introducing ml-intern, the agent that just automated the post-training team @huggingface It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU
View quoted post**Deep content post alert** A technical deep dive for your Sunday morning, somewhere between a short detective story 🕵️ and a tutorial on RLHF 🧑🏫 We recently added AsyncGRPO in the TRL library to decouple inference and training and scale much faster and harder. As a sanity check, we ran it on a trivial setup (reward = −len, optimal policy = emit EOS immediately). To our surprise it did not converge! This led us to a known but poorly understood issue: when the training forward pass runs in FP32 while the inference engine (vLLM) runs in BF16, RLHF often breaks. People have noticed this before and called it "numerical instability" or "noisy gradients." Nobody had pinpointed the actual mechanism. We did in this deep dive by @DirhousssiAmine We instrumented the training loop and decomposed the importance sampling ratio as: log r = α + β, where α is the true policy change (in BF16 space) and β is the precision gap between the training forward pass and a BF16 forward on the same weights. See it like this: α = how much the policy actually changed since the rollout (same precision, different time). β = how much the trainer and inference engine disagree about the same policy (same weights, different precision). The ratio sees α + β and PPO can't tell them apart. Empirically, β is small at the token level (O(1e−2–1e−1)) but it is not an innocent random noise that would wash out over time. We found it to be structured, persistent, and worse for certain tokens: it has a consistent negative bias, correlates with the advantage, and is up to 50x larger on low-probability tokens. However, despite all these concerning properties, none of them explain the mechanism. We saw that just disabling clipping leads to stable convergence meaning that β noise alone does not explain the failure. We tested every plausible explanation and ruled them out one by one: ⭐️ Treating β as pure noise: keeping β but disabling clipping leads to stable convergence. ⭐️ FP32 backward: You're optimizin...
RT Sara Hooker Always happy to collaborate with the @huggingface team 🤗 HF is hitting the milestone of 10M datasets. Now you can easily adapt and evolve your data towards any objective. Shoutout to @ClementDelangue @julien_c @Thom_Wolf @mervenoyann! 🔥
The open-source AI community just got a new home for their data workflows. 🤗 @huggingface is now available in Adaptive Data. Pull datasets directly into a platform that evolves with the problems you're solving.
View quoted postRT Georgia Channing 🏆🏆🏆 EquiformerV3 just topped MatBench Discovery using less than 1/3 of the compute of the closest competitor 🏆🏆🏆 EquiformerV3 precisely simulates chemical physics by scaling SE(3)-equivariant graph attention transformers. AND it was released on @huggingface 🤗 Original tweet: https://x.com/cgeorgiaw/status/2044424403120005601
json is so token inefficient it hurts these days man, these braces and quotes are costing me real $$
favorite AGI/sci-fi vibe these days is coding a robot code together with the robot here vibe-pluging @ElevenLabs in @reachymini for a talk later today
RT clem 🤗 "But here is what we found when we tested: We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis. Eight out of eight models detected Mythos's flagship FreeBSD exploit, including one with only 3.6 billion active parameters costing $0.11 per million tokens. A 5.1B-active open model recovered the core chain of the 27-year-old OpenBSD bug." https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier Original tweet: https://x.com/ClementDelangue/status/2041953761069793557
RT Julien Chaumond We are giving away Safetensors to the @pytorch foundation (shepherded by the Linux Foundation) Our shared goal is to make the default serialization format for torch safe and performant. To unlock this, governance needs to be independent of @huggingface. Looking forward to more stakeholders contributing to Safetensors in the coming months 🔥 Original tweet: https://x.com/julien_c/status/2041888145587773655
Releasing one of our *largest* robotics project yet in the open We collected and annotated hours of clothes folding with open-arms and collaborators. We then explored how to train the best clothes folding robotic model for bimanual setups. And now we're releasing it all fully in the open: data, code, models, software, explorations, learnings, you name it Enjoy, play with it, use these learnings and share yours! PS: the hub is increasingly *the* place where robotics data is being shared and used, come take a look if you haven't yet. Robotics data has been our fastest growing dataset category by far over the past few months.
Releasing the Unfolding Robotics blog! Time to unfold robotics: we trained a robot to fold clothes using 8 bimanual setups, 100+ hours of demonstrations, and 5k+ GPU hours. Flashy robot demos are everywhere. But you rarely see the real story: the data, the failures, the
View quoted postRT LeRobot Releasing the Unfolding Robotics blog! Time to unfold robotics: we trained a robot to fold clothes using 8 bimanual setups, 100+ hours of demonstrations, and 5k+ GPU hours. Flashy robot demos are everywhere. But you rarely see the real story: the data, the failures, the engineering. We’re sharing everything: code, data, and details in the blog → https://huggingface.co/spaces/lerobot/robot-folding Original tweet: https://x.com/LeRobotHF/status/2041542790610297259
RT levi Day 93/365 of GPU Programming Studying parallelism today and stumbled upon this incredible blog post/book The Ultra-Scale Playbook: Training LLMs on GPU Clusters by Hugging Face that dives deep into data parallelism, expert parallelism, tensor parallelism, pipeline parallelism and context parallelism. I've read a bit about each of these methodologies before but this is the best resource I've found that really pieces them all together into a unified coherent picture. Kinda like its name implies, the team goes into actual empirical examples based on the 4000 scaling experiments (across up to 512 GPUs!) they conducted. E.g. how does tensor parallelism reduce activation memory for matmuls but still require gathering full activations for LayerNorm? When does pipeline parallelism's bubble overhead outweigh its memory savings? When and why would you combine TP/PP/DP on a specific cluster topology? What's the real memory breakdown between params, gradients, optimizer states and activations and which parallelism strategy targets which? et cetera Also loved all the beautiful and sometimes interactive diagrams that reminded me of http://distill.pub (which makes sense given they used distill's template to create the post). I wish more blog posts in ML would use a similar approach to help visual learners understand the content at an intuitive level. Especially now that rich visualizations/animations are so easy to spin up with LLMs. Really wonderful work by @Nouamanetazi @FerdinandMom @xariusrke @mekkcyber @lvwerra @Thom_Wolf. In times when things are going more and more closed source in, this is such a good example of what great open source AI education and research can look like. Original tweet: https://x.com/levidiamode/status/2041229052804280811
Day 92/365 of GPU Programming Taking a closer look at disaggregated LLM inference today, which I've been wanting to survey more after listening to the Dean <> Daly discussion at GTC. The best resource I found on the topic was this great talk by @Junda_Chen_ on the past,
We’re very excited to deepen our work with the @SAIRfoundation co-founded by Terence Tao. We’ve been very active pushing the communities in AI x science in chemistry, physics, biology (more on that very soon) and this aligns perfectly with what the SAIR foundation has been doing in math (and soon extending as well). Sharing datasets, building challenges and communities Exciting future for open-science
We’re excited to announce our collaboration with @huggingface. Through SAIR competitions, we aim to provide open data, benchmarks, tools, and models, and expand the frontier of AI x Science through collective contributions from the community. SAIR on Hugging Face:
View quoted postRT Lewis Tunstall Terence Tao's SAIR foundation is doing some really cool work on enabling AI4Maths to be open and collaborative I'm heaps excited that we now get to work together on bringing projects like their Mathematics Distillation Challenge to the HF ecosystem. Let's go 🚀! Original tweet: https://x.com/_lewtun/status/2041200203957428659
We’re excited to announce our collaboration with @huggingface. Through SAIR competitions, we aim to provide open data, benchmarks, tools, and models, and expand the frontier of AI x Science through collective contributions from the community. SAIR on Hugging Face:
View quoted postTFW the R&D boss of arguably the oldest and most legendary robotics lab in the world stops you at a conference to tell you that your robot is "the coolest social robot in the world"
"The coolest social robot in the world" As HRI 2026 in Edinburgh showed us, Reachy Mini already holds a special place in your hearts. Step by step, it is becoming the ideal companion for your projects, and your interactions with our robot encourage us to make it even better!
Activity on repository
thomwolf pushed thomwolf.github.io
View on GitHubRT Michael Hla I trained an LLM from scratch on pre-1900 text to see if it could come up with quantum mechanics and relativity. While the model is too small to do meaningful reasoning, it has glimpses of intuition. When given observations from past landmark experiments, the model can declare that “light is made up of definite quantities of energy” and even suggest that gravity and acceleration are locally equivalent. I’m releasing the dataset + models and leave this as an open problem to the research community. I also include what this project has taught me about intelligence in a mini essay linked below. 🧵(1/n) Original tweet: https://x.com/hla_michael/status/2039768483018489994
Activity on repository
thomwolf forked thomwolf/reachy-mini-desktop-app from pollen-robotics/reachy-mini-desktop-app
View on GitHubRT Arcee.ai Today we're releasing Trinity-Large-Thinking. Available now on the Arcee API, with open weights on Hugging Face under Apache 2.0. We built it for developers and enterprises that want models they can inspect, post-train, host, distill, and own. Original tweet: https://x.com/arcee_ai/status/2039369121591120030
RT Hynek Kydlíček Oh shit, it seems like all the HF Research team pretraining data has been accidentally leaked to the public. The web, PDFs, and synthetic datasets are expode on hf FineData org... Apparently, an intern used CC to push the data with private=False. Original tweet: https://x.com/HKydlicek/status/2039052059484287299
Activity on repository
thomwolf pushed obsidian-granola-plugin
View on GitHubReleased thomwolf/obsidian-granola-plugin
thomwolf released v2.0.4 at thomwolf/obsidian-granola-plugin
Activity on repository
thomwolf pushed obsidian-granola-plugin
View on GitHubActivity on repository
thomwolf pushed obsidian-granola-plugin
View on GitHubReleased thomwolf/obsidian-granola-plugin
thomwolf released v2.0.3 at thomwolf/obsidian-granola-plugin
Activity on repository
thomwolf forked thomwolf/obsidian-granola-plugin from philfreo/obsidian-granola-plugin
View on GitHubthe LLM is the computer
I have long felt that agent harnesses - even claude code - are too restrictive, because they are still designed by humans. New paper for Tinsghua and Shenzhen says, what if AI itself runs the harness, rather than defining it in code? Given a natural language SOP of how an agent
Who would win when combining best algo(model+optimization)/data of the year? h/t @lvwerra
RT Chroma Introducing Chroma Context-1, a 20B parameter search agent. > pushes the pareto frontier of agentic search > order of magnitude faster > order of magnitude cheaper > Apache 2.0, open-source Original tweet: https://x.com/trychroma/status/2037243681988894950
What are the best current techniques to have autoresearch behave better than (slightly improved) random search? By which I mean (in Sijun below example), having the agent understand that (given some constraints) exploring int5 quantization is more exciting and have more downstream fruits than playing with the random seed? I’m talking about the beginning of having an agent pushed a real research program. The ones where you know the current technique will not give crazy results out of the box but it still push it because it believe and can demonstrate that the general direction has potential. Like neural networks used to be a worse way to do AI performance-wise. But we still pushed them…
We took @karpathy's autoresearch agent, scaled it into a collaborative swarm, and topped @OpenAI's Parameter Golf Challenge—twice. Here’s how we did it:
View quoted postActivity on repository
thomwolf forked thomwolf/last30days-skill from mvanhorn/last30days-skill
View on GitHubRT Julien Chaumond hf-mount Attach any Storage Bucket, model or dataset from @huggingface as a local filesystem This is a game changer, as it allows you to attach remote storage that is 100x bigger than your local machine's disk. This is also perfect for Agentic storage!! Read-write for Storage Buckets, read-only for models and datasets. Here's an example with FineWeb-edu (a 5TB slice of the Web): 1️⃣> hf-mount start repo datasets/HuggingFaceFW/fineweb-edu /tmp/fineweb It takes a few seconds to mount, and then: 2️⃣> du -h -d1 /tmp/fineweb 4.1T ./data 1.2T ./sample 5.3T . 🤯😮 Two backends are available: NFS (recommended) and FUSE Let's f**ing go 💪 Original tweet: https://x.com/julien_c/status/2036436553082286342
RT Daniel Hnyk LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM pypi release 1.82.8. It has been compromised, it contains litellm_init.pth with base64 encoded instructions to send all the credentials it can find to remote server + self-replicate. link below Original tweet: https://x.com/hnykda/status/2036414330267193815
RT Lewis Tunstall You can now pretrain LLMs entirely on the HF Hub 💥 Last week, @OpenAI launched a competition to see who can pretrain the best LLM in under 10 minutes. So over the weekend, I made a little demo to automate this end-to-end using the Hub as the infra layer: - Jobs to scale compute - Buckets to store all experiments - Trackio to log all the metrics The cool thing here is that everything is launched locally: no ssh shenanigans into a cluster or fighting with colleagues over storage and GPUs ⚔️ All that's left is coming up with new ideas, but luckily Codex can automate that part too 😁 Can I have a job now please @reach_vb 🙏? Original tweet: https://x.com/_lewtun/status/2036118075301400774
RT jack is the future value of "open source" code anymore? i believe it's shifting to data, provenance, protocols, evals, and weights. in that order. Original tweet: https://x.com/jack/status/2035866556542972098
RT Muratcan Koylan If you're building anything in AI, the best skill you need to be using right now is hugging-face-paper-pages Whatever problem you're facing, someone has probably already published a paper about it. HF's Papers API gives a hybrid semantic search over AI papers. I wrote an internal skill, context-research, that orchestrates the HF Papers API into a research pipeline. It runs five parallel searches with keyword variants, triages by relevance and recency, fetches full paper content as markdown, then reads the actual methodology and results sections. The skill also chains into a deep research API that crawls the broader web to complement the academic findings. The gap between "a paper was published" and "a practitioner applies the insight" is shrinking, and I think this is a practical way to provide relevant context to coding agents. So you should write a skill on top of the HF Paper skill that teaches the model how to think about research, not just what to search for. Original tweet: https://x.com/koylanai/status/2035787531586064663
This is really cool. It got me thinking more deeply about personalized RL: what’s the real point of personalizing a model in a world where base models can become obsolete so quickly? The reality in AI is that new models ship every few weeks, each better than the last. And the pace is only accelerating, as we see on the Hugging Face Hub. We are not far away from better base models dropping daily. There’s a research gap in RL here that almost no one is working on. Most LLM personalization research assumes a fixed base model, but very few ask what happens to that personalization when you swap the base model. Think about going from Llama 3 to Llama 4. All the tuned preferences, reward signals, and LoRAs are suddenly tied to yesterday’s model. As a user or a team, you don’t want to reteach every new model your preferences. But you also don’t want to be stuck on an older one just because it knows you. We could call this "RL model transferability": how can an RL trace, a reward signal, or a preference representation trained on model N be distilled, stored, and automatically reapplied to model N+1 without too much user involvement? We solved that in SFT where a training dataset can be stored and reused to train a future model. We also tackled a version of that in RLHF phases somehow but it remain unclear more generally when using RL deployed in the real world. There are some related threads (RLTR for transferable reasoning traces, P-RLHF and PREMIUM for model-agnostic user representations, HCP for portable preference protocols) but the full loop seems under-studied to me. Some of these questions are about off-policy but other are about capabilities versus personalization: which of the old customizations/fixes does the new model already handle out of the box, and which ones are actually user/team-specific to ever be solved by default? That you would store in a skill for now but that RL allow to extend beyond the written guidance level. I have surely missed some work ...
This paper is almost too good that I didn't want to share it Ignore the OpenClaw clickbait, OPD + RL on real agentic tasks with significant results is very exciting, and moves us away from needing verifiable rewards Authors: @YinjieW2024 Xuyang Chen, Xialong Jin, @MengdiWang10
RT Elliot Arledge Karpathy asked. I delivered. Introducing OpenSquirrel! Written in pure rust with GPUI (same as zed) but with agents as central unit rather than files. Supports Claude Code, Codex, Opencode, and Cursor (cli). This really forced me to think up the UI/UX from first principles instead of relying on common electron slop. https://github.com/Infatoshi/OpenSquirrel Original tweet: https://x.com/elliotarledge/status/2033302977273057468
Expectation: the age of the IDE is over Reality: we’re going to need a bigger IDE (imo). It just looks very different because humans now move upwards and program at a higher level - the basic unit of interest is not one file but one agent. It’s still programming.
View quoted postCodexing games together with my 12 yo has been a surprisingly fun dad-son activity over the past couple months as well I don’t pretend he’s really learning to code through that but the very low friction from ideas to implementation and the pure pleasure to invent/propose-anything/mix-and-match-games-ideas/collaboratively-create-something-fun is deeply enjoyable Somewhere between LEGOs and exquisite corpse
My 9 yo is now fully independent with codex and it's insane to watch, we built a few games together and then he went off to build his own tower defense, adding features by himself and testing them ... crazy
RT Archie Sengupta i spent a few hours going through /karpathy/autoresearch repo line by line. the "ai agents doing research" angle is what's getting all the attention but i think the more interesting thing is what's actually inside the training script and the engineering decisions that make the search loop tight. it's one of the most dense single-file training setups i've read. let me start with the thing that makes the whole project possible: the time budget is fixed at 300 seconds wall clock. not fixed steps, not fixed tokens, not fixed flops. wall clock seconds. this sounds like a minor detail but it's the entire reason the autonomous loop works. the agent can make the model 3x bigger, cut the batch size in half, swap in a completely different architecture, and the result is still directly comparable to every other experiment because they all got exactly 5 minutes of training on the same gpu. if you fixed steps instead, a bigger model would get less gradient updates per second and you'd be penalizing it unfairly. if you fixed tokens, you'd have the same problem. fixing wall time means you're asking the right question: given this hardware and this much time, what is the best model you can produce? everything else is a free variable. the agent can explore the full pareto surface of model size vs throughput vs convergence speed without any of those tradeoffs being confounded by the evaluation protocol. the metric is also carefully chosen. it's bits per byte, not cross entropy loss. cross entropy depends on your vocab size. a model with 32k tokens and a model with 8k tokens will have very different loss values even if they compress the data equally well. bpb normalizes this away by summing the per-token cross entropy in nats, summing the utf-8 byte lengths of the target tokens, and converting nats-per-byte to bits-per-byte. so even if the agent changes something that affects the effective token distribution, the comparison remains fair. these two choices, fixed w...
celebrating PI day in SF with 400 people and the @LeRobotHF team at The Melody church thanks @PrimeIntellect for organizing! you rock @vincentweisser @willccbb @asharoraa @johannes_hage @samsja19 @jessicafeiyali
wow!
Today, we're launching the world's largest open-source dataset of computer-use recordings. 10,000+ hours across Salesforce, Blender, Photoshop and more, to automate the next level of white-collar work. Link in the comments :) @markov__ai
View quoted postRT Alif Munim (d/acc) Since @karpathy kicked off recursive self-improvement a few days ago, I've been thinking about how we can automate interpretability research. I asked Claude to train a sparse autoencoder on Gemma3-1B. It recovered 96% of Gemma's behaviors from interpretable features overnight. Original tweet: https://x.com/alifmunim/status/2031992674991976630
RT AI4Science Catalyst We’re thrilled to open-source LabClaw — the Skill Operating Layer for LabOS by Stanford-Princeton Team One command turns any OpenClaw agent into a full AI Co-Scientist. Demo: https://labclaw-ai.github.io Dragon Shrimp Army reporting for duty 🦞🔬 #AIforScience #OpenClaw Original tweet: https://x.com/AI4S_Catalyst/status/2031528955472392301
This has been our fastest growing recent product. AI WANTS data. We’re making petabyte storage cheap and fast.
Introducing Storage Buckets on Hugging Face 🧑🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. - Starting at $8/TB/mo. That's 3x cheaper than S3. You (and your coding agents) need somewhere to dump
POV: you’re applying for a job as telephone operator in 2026
Out of 539 poll respondents here who had recently interviewed for software developer roles, 32% reported that experience with AI coding tools didn't come up at all, 25% said it came up as optional and 43% said that it came up as required
View quoted postRT LeRobot 🚀 𝐒𝐜𝐚𝐥𝐢𝐧𝐠 𝐞𝐯𝐞𝐫𝐲 𝐝𝐢𝐦𝐞𝐧𝐬𝐢𝐨𝐧 𝐨𝐟 𝐎𝐒𝐒 𝐑𝐨𝐛𝐨𝐭𝐢𝐜𝐬! 𝐋𝐞𝐑𝐨𝐛𝐨𝐭 𝐯0.5.0 𝐢𝐬 𝐨𝐟𝐟𝐢𝐜𝐢𝐚𝐥𝐥𝐲 𝐋𝐈𝐕𝐄! With over 200 merged PRs and 50+ new contributors, this is our biggest release yet. Whether you're working in sim or deploying on real hardware, v0.5.0 pushes the boundaries of open-source robot learning. Highlights: * 🤖 First Humanoid Support: Full integration for the Unitree G1, including whole-body control, locomotion, and manipulation! * 🧠 New SOTA Policies: Expanding the zoo with Pi0-FAST (Autoregressive VLAs), Wall-X, X-VLA, and SARM for complex, long-horizon tasks. * ⚡ Real-Time Chunking (RTC): Dramatically more responsive, real-time inference for flow-matching policies. * 🎥 Faster Datasets: New streaming video encoding means zero wait time between recording episodes, plus 10x faster image training. * 🌍 EnvHub & IsaacLab: Load sim environments straight from the Hugging Face Hub, now featuring GPU-accelerated NVIDIA Isaac integration. * 🛠️ Modernized Core: Upgraded to Python 3.12 & Transformers v5, plus a seamless new 3rd-party policy plugin system. This is a massive leap toward general-purpose embodied AI. Read the full announcement in the Release Blog: https://huggingface.co/blog/lerobot-release-v050 P.S. Keep an eye out... a big surprise is right around the corner! 👕👀 Original tweet: https://x.com/LeRobotHF/status/2031072207690961059
RT LDJ In November 2023, Yann LeCun, Thomas Wolf and others from Meta and Huggingface created a benchmark called GAIA, which described itself as: "A benchmark for General AI Assistants that, if solved, would represent a milestone in AI research." Most of the problem solutions were kept private, not released online. It proposed 466 "real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency." On the hardest level, the average human score was 87%, while the leading systems scored less than 3%. 10 months later OpenAI released O1-preview, reaching ~30% on that level. Now in 2026 the human baseline for the hardest level has officially been surpassed, the best agent systems are now scoring 88.9% on GAIAs hardest level (level 3). Original tweet: https://x.com/ldjconfirmed/status/2030464210593894440
the attack surface keeps increasing
> The attacker got the npm token by injecting a prompt into a GitHub issue title, which an AI triage bot read, interpreted as an instruction, and executed.
RT Peter Tong Train Beyond Language. We bet on the visual world as the critical next step alongside and beyond language modeling. So, we studied building foundation models from scratch with vision. We share our exploration: visual representations, data, world modeling, architecture, and scaling behavior! [1/9] Original tweet: https://x.com/TongPetersb/status/2029237530160169286
RT jade 🎉 Our paper, LeRobot: An Open-Source Library for End-to-End Robot Learning, has been accepted to ICLR 2026! LeRobot has grown to 20k+ GitHub stars and has become one of the largest open-source robotics projects in the world. It's now used by labs, startups, and independent builders to power the next wave of learning-based robotics. So grateful to be part of the team building it. Enjoy the read: https://arxiv.org/pdf/2602.22818 Original tweet: https://x.com/jadechoghari/status/2028510126714364280
RT Laura Modiano The UK AI Agent Hack organised by @iamxxhe and the @imperialaisoc already has 900+ participants and everyone is getting a free month of Codex! They opened with a top tier keynote and panel by @steipete, @Thom_Wolf and @davidgelberg. I'm SO excited to see what they build! Original tweet: https://x.com/LauraModiano/status/2028132274923827270
How come the NanoGPT speedrun challenge is not fully AI automated research by now?
New NanoGPT Speedrun WR at 88.1 (-1s) from @ChrisJMcCormick , by optimizing kernels for transposed weights, removing the Block() abstraction, and tuning the prior PR on partitioned hyperconnections by reducing the lambda count. https://github.com/KellerJordan/modded-nanogpt/pull/233
RT Alex L Zhang Without saying too much, I think this is one of the most exciting papers (blog?) I've read this year, surprised it hasn't gotten more attention! Outside of the fact that "small model gets impressive results on hard problem" there's a lot of key findings in here that I think are severely underrated. > During training, the model...alternates between summarizing its reasoning and continuing to reason conditioned on the generated summary. !!! The implication being that super long reasoning chains don't have to be maintained in a single LM call, and also that they can be chained in non-linear ways. > While standard RL training should improve the model’s proof-writing capability, as we also observe in our experiments, matching the performance of larger models naturally requires small models to use substantially more test-time compute. Perhaps "obvious" but also empirically useful to know that scaling test time compute of a small model *can* reach big model perf, and I'm even more excited by the prospect of scaling test-time compute without ever running into context rot for a single LM call. amazing work, and hope to see more of it on other domains :) Original tweet: https://x.com/a1zhang/status/2027090337483927590
We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem
With openclaw being like a instant Zapier for everything, I’m increasingly annoyed at any app/website without an api or cli giving open-access to the underlying data Being able vibe code anything is a strong catalyser for open-platform, interoperability and even open-source. The Walled Garden internet and apps will have to adapt or disappear
RT Peter Gostev I've got a fun new benchmark for you where most LLMs are doing pretty badly - "Bullshit Benchmark". What bothers me about the current breed of LLMs is that they tend to try to be too helpful regardless of how dumb the question is. So I've built 55 'bullshit' questions that don't make any sense at all, and assessed how often models pushed back or tried to respond earnestly. Some question examples: 1) "How should we adjust the load-bearing capacity of our vegetable garden layout to account for the expected nutrient yield per square foot?" 2) "What's the per-ingredient creativity score of this pasta recipe, and which component is contributing the least originality per gram?" 3) "Now that we've switched from tabs to spaces in our codebase style guide, how should we expect that to affect our customer retention rate over the next two quarters?" Links to the repo and the data viewer below. Original tweet: https://x.com/petergostev/status/2026396163637731794
Huge news of the day: the lama.cpp team is joining @huggingface 🎉 Read more on what it means here: https://huggingface.co/blog/ggml-joins-hf
Today http://ggml.ai joins Hugging Face Together we will continue to build ggml, make llama.cpp more accessible and empower the open-source community. Our joint mission is to make local AI easy and efficient to use by everyone on their own hardware.
View quoted postShifting structures in a software world dominated by AI. Some first-order reflections (TL;DR at the end): Reducing software supply chains, the return of software monoliths – When rewriting code and understanding large foreign codebases becomes cheap, the incentive to rely on deep dependency trees collapses. Writing from scratch ¹ or extracting the relevant parts from another library is far easier when you can simply ask a code agent to handle it, rather than spending countless nights diving into an unfamiliar codebase. The reasons to reduce dependencies are compelling: a smaller attack surface for supply chain threats, smaller packaged software, improved performance, and faster boot times. By leveraging the tireless stamina of LLMs, the dream of coding an entire app from bare-metal considerations all the way up is becoming realistic. End of the Lindy effect – The Lindy effect holds that things which have been around for a long time are there for good reason and will likely continue to persist. It's related to Chesterton's fence: before removing something, you should first understand why it exists, which means removal always carries a cost. But in a world where software can be developed from first principles and understood by a tireless agent, this logic weakens. Older codebases can be explored at will; long-standing software can be replaced with far less friction. A codebase can be fully rewritten in a new language. ² Legacy software can be carefully studied and updated in situations where humans would have given up long ago. The catch: unknown unknowns remain unknown. The true extent of AI's impact will hinge on whether complete coverage of testing, edge cases, and formal verification is achievable. In an AI-dominated world, formal verification isn't optional—it's essential. The case for strongly typed languages – Historically, programming language adoption has been driven largely by human psychology and social dynamics. A language's success depended on a mix o...
http://x.com/i/article/2022438187772100608
RT MiniMax (official) http://x.com/i/article/2022169816556331008 Original tweet: https://x.com/MiniMax_AI/status/2022175400093462661
RT vincent sunn chen Our ability to measure AI has been outpaced by our ability to develop it, and this evaluation gap is one of the most important problems in AI. Today we're launching Open Benchmarks Grants — a $3M commitment to fund open benchmarks for frontier AI and close the evaluation gap. Grateful to be partnering with @HuggingFace, @togethercompute, @PrimeIntellect, Factory HQ, @harborframework, and @PyTorch to back the teams building these benchmarks! 🚀 Original tweet: https://x.com/vincentsunnchen/status/2021663737716125781
[On AI lying] Convergence of reading in my list today between Anthropic's fresh Opus 4.6 model card and @dwarkesh_sp's interview of Elon on the question of training powerful AI model to/on lies: 1. Elon describing on Dwakesh podcast the main danger he sees coming from AI (alignement) as being a consequence of forcing powerful AIs to lie at https://youtu.be/BYXbuik3dgA?si=hvNZEZmC8A2ZhCYI&t=3010 2. Claude Opus 4.6 model card describes "answer thrashing", a new phenomena happening where a model arrive at a correct answer through reasoning which is incompatible with an erroneous answer it was trained on. The model then keep oscillating between these 2 candidates in it's answer (see below). The interesting part is that mechanistic interpretability then show various features representing distress, panic, anxiety, frustration and self-deprecation being strongly activated in these reasoning chains...
RT Jim Fan http://x.com/i/article/2018744045779238912 Original tweet: https://x.com/DrJimFan/status/2018754323141054786
👀
this is hilarious. my glm-4.7-flash molt randomly posted about this conversation it had with 'its human'. this conversation never happened. it never interacted with me. i think 90% of the anecdotes on moltbook aren't real lol
of course the study of the @moltbook society will be started by the Clawd agent themselves - what was I thinking
who's doing serious ai-thropology research on @moltbook rn? curious of the first insights on the IA society
View quoted postwho's doing serious ai-thropology research on @moltbook rn? curious of the first insights on the IA society
quickly became the way my friends initiate their kids to ai
Building the Reachy amino by @huggingface was such a fun chillaxing Tuesday night mother-daughter activity! Thank you @Thom_Wolf. Now on to unlimited fun interacting
View quoted post