Co-founder at @HuggingFace - moonshots - angel
a bit of good grounding in a world of hype
Surge AI: Everyone's acting like models are ready to replace humans in work settings. We put that to the test by creating an entire company and having 9 models act as a customer service agent handling 150 tickets and requests of increasing complexity. Verdict: without common sense, … Link: https://x.com/HelloSurgeAI/status/1988337783677604351
We’re releasing a new, extremely high-quality large dataset: *FinePDF-edu*

This continues our work on open-sourcing everything possible about the science of pre-training datasets. In this new dataset, we applied the same filtering techniques we used for the widely adopted *FineWeb-edu*, but this time to our recently published PDF dataset, FinePDF — 3 trillion tokens of Common Crawl PDFs released a few months ago.

The result is *FinePDF-edu*, a high-quality dataset of about 350B tokens (130B English), representing the most educational slice of all Common Crawl PDFs and outperforming our previously released pretraining dataset on our benchmark. Some details 👇

High-level:
- 350B+ highly educational tokens in 69 languages with strong performance
- 69 education classifiers powered by ModernBERT and mmBERT
- 300k+ EDU annotations per language, generated with Qwen3-235B

Method: We filtered this dataset by asking an LLM to score the quality of a large sample (~1M documents) of the PDF documents. We then trained a high-quality classifier, ModernBERT-Large (and mmBERT for non-English languages), on those LLM-generated scores. We evaluated 10 state-of-the-art open-source LLMs to select our judge and chose Qwen3-235B. We tested different prompts to rate the educational quality of the PDF documents across domains (see details: https://x.com/HKydlicek/status/1988328336469459449?s=20).

The final dataset keeps the top 10% of samples based on educational quality, offering excellent pretraining performance while preserving sample token diversity. On our benchmarks FinePDF-edu is best-in-class and can also be mixed with web data for additional variety.

Amazing work by the 🍷FineData team.

Dataset: https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu
Codebase: https://github.com/huggingface/finepdfs
Full collection: https://huggingface.co/collections/HuggingFaceFW/finepdfs
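To make the recipe concrete, here is a minimal sketch of the top-10% filtering step, under stated assumptions: the classifier checkpoint name below is hypothetical (the real artifacts are the ModernBERT/mmBERT classifiers trained on Qwen3-235B scores mentioned above), and the educational score is read from a single regression-style logit. The released data itself can be loaded with datasets.load_dataset("HuggingFaceFW/finepdfs-edu") (exact config/split names per the dataset card).

```python
# Minimal sketch of top-decile educational filtering, in the spirit of the
# FineWeb-edu / FinePDF-edu pipeline. The checkpoint name is hypothetical.
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "your-org/edu-classifier-modernbert"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

def edu_scores(texts: list[str]) -> np.ndarray:
    """Return one educational-quality score per document."""
    batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits.squeeze(-1)  # regression-style head
    return logits.numpy()

docs = ["...pdf text 1...", "...pdf text 2..."]  # placeholder documents
scores = edu_scores(docs)
threshold = np.quantile(scores, 0.90)  # keep only the top 10% of samples
kept = [doc for doc, s in zip(docs, scores) if s >= threshold]
```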
Open-source/science is the most effective way to reduce civilization boot time
we need more teams like @pleiasfr pushing open work on synthetic data at scale for pre/mid/post-training
Alexander Doria: Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH, and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. Link: https://x.com/Dorialexander/status/1987930819021635964
omg the http://build.ai team just open-sourced 18 TB (!) of egocentric data, a treasure trove for robotics models and physical AI
Eddy Xu: today, we're open sourcing the largest egocentric dataset in history.
- 10,000 hours
- 2,153 factory workers
- 1,080,000,000 frames
the era of data scaling in robotics is here. (thread) Link: https://x.com/eddybuild/status/1987951619804414416
2024-25 => first time since DeepMind that the UK startup landscape is hot, burning and exciting again imo
but the newly proposed « exit tax » might just throw a big bucket of cold water on this phoenix
follow/sign the @StartupCltn Open Letter if you care about startup ecosystems
Harry Stebbings: I have invested $200M into the UK in the last five years. If @RachelReevesMP implements "Exit Tax" all that funding will go overnight. That is countless jobs, companies and people who will lose out. Rachel, you have managed to steal our hopes, our dreams even our growth, … Link: https://x.com/HarryStebbings/status/1987829015248277861
RT Matt Rouif Having fun building the @huggingface @ReachyMiniSol robot 🤖. It's very accessible! ♥️ @ClementDelangue @julien_c @Thom_Wolf
human-robot interaction is a super under-explored/underrated field today. Reachy-Mini will be the platform + catalyst to change that
Andi Marafioti: Mixing vision and robotics is incredibly hard, but when it finally works it feels like magic. Link: https://x.com/andimarafioti/status/1986192399534641442
this has been requested so many times by the robotics/LeRobot community, super happy to see EnvHub finally out and in production
jade: We're introducing EnvHub to LeRobot! Upload simulation environments to the @huggingface hub, and load them in lerobot with one line of code!
env = lerobot.make_env("username/my-env", trust_remote_code=True)
Back in 2017, @OpenAI called on the community to build Gym … Link: https://x.com/jadechoghari/status/1986482455235469710
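As a quick illustration of what that one-liner enables, here is a minimal rollout sketch, assuming the environment returned by lerobot.make_env follows the standard Gymnasium reset/step API; the env name and random-policy loop are illustrative, not from the announcement.

```python
# Illustrative rollout with an EnvHub environment, assuming the object
# returned by lerobot.make_env exposes a Gymnasium-style API.
import lerobot

env = lerobot.make_env("username/my-env", trust_remote_code=True)  # from the announcement
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```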
Is this another DeepSeek moment? Open-source passing closed-source again. Should we expect this every couple of months now?
Kimi.ai: 🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here.
🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%)
🔹 Executes up to 200–300 sequential tool calls without human interference
🔹 Excels in reasoning, agentic search, and coding
🔹 256K context window
Built … Link: https://x.com/Kimi_Moonshot/status/1986449512538513505
Hugging Face science team blog posts on training LLMs
James Lucas: Bibliothèque Méjanes, a library in France Link: https://x.com/JamesLucasIT/status/1986132835309068646
we're doing something special and intimate at @SlushHQ in 2 weeks with a couple of teams you might know :) some info at: https://platform.slush.org/public/slush25/activities/edc8df18-9646-4558-887e-eb3cf8b6c19c?ref=twitter
Despite all the big funding rounds and flashy demos in US robotics, K-Scale's inability to raise more money should worry us. We're at risk of replaying the LLM story all over again in robotics:
- Chinese companies are going open-source and collaborating across the value chain (from EV suppliers to downstream integrators)
- most US teams are going full-stack proprietary, closed-source, all in-house
Guess which robots the next wave of US research labs and startups will actually be able to build on when they want to invent new algorithms or tackle unseen real-world use cases?
Harrison Kinsley: K-Scale cancels orders and refunds deposits for K-Bot. I thought all the VCs were excited about US-based robotics, what happened? Link: https://x.com/Sentdex/status/1985810689458569502
I like this direction a lot. As code writing gets increasingly automated, we'll want to move to higher-level representations of code. A lot of UX to invent here.
Cognition: Introducing Codemaps in @windsurf! Powered by SWE-1.5 and Sonnet 4.5.
"Your code is your understanding of the problem you're exploring. So it's only when you have your code in your head that you really understand the problem." — @paulg Link: https://x.com/cognition/status/1985755284527010167
RT Lewis Tunstall Or ... you could just host them on http://hf.co
Anthropic: Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we're thinking about these costs, and some of the early steps we're taking to mitigate them: https://www.anthropic.com/research/deprecation-commitments Link: https://x.com/AnthropicAI/status/1985752012189728939
Monday morning read: a fascinating deep dive into recent Chinese chip developments, as well as the coming co-evolution with LLM builders. Fresh analysis from the HF team => https://huggingface.co/blog/huggingface/shifting-compute-landscape
saying it way better than I do
Jay Alammar: The incredible work @huggingface does in the open truly charts a path towards a post-scarcity future that includes everyone in the fruits of AI. Link: https://x.com/JayAlammar/status/1984273218568696014
RT Sasha Rush The work Hugging Face does continues to be incredible. Putting in serious effort to make these topics accessible and detailed. https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#introduction