24th at the Electrica puzzle challenge | building https://t.co/baTQS2bdia | engineer @huggingface
RT Georgi Gerganov llama.cpp now has an official website: https://llama.app Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer. The installation provides a single unified `llama` entrypoint which you can use to run/serve models and interface with 3rd-party agentic applications. While oriented towards simplified user experience, the new `llama` application also provides all the advanced functionality of the existing llama.cpp tooling with which experienced users are already familiar. Also note that all GGUF models that you might have already downloaded with llama.cpp in the past will be automatically available to use without downloading again (they are stored in the common HF cache on your machine). We have many improvements in the pipeline both at the UX and at the engine level and we plan to iteratively ship new things over the coming months. One of the main focuses will be seamless integration with local-friendly 3rd-party agents (such as Pi). In the meantime, we’ll continue to listen for feedback from the community and adjust accordingly, so keep letting us know what you think and need.
RT clem 🤗 llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀 Qwen3.6-27B dense generation below on A10G: From 25 tok/st to 45 tok/s (+78%)!
RT Georgi Gerganov Highlighting the new WebGPU backend in llama.cpp/ggml The work to bring full-fledged WebGPU support in llama.cpp started about an year and a half ago. It has been lead by @reeselevine and team at USCS. For more information, checkout the interactive blog and paper in the quoted post. Here are 2 excerpts from the paper, summarizing the implemented software architecture.
WebGPU support in llama.cpp is here! Check out our blog post introducing it: https://reeselevine.github.io/llamas-on-the-web/ Run local models in your browser, with GPU acceleration. No data leaves your computer! Thanks to everyone who's made this possible, especially @ggerganov
View quoted postHighlighting the new WebGPU backend in llama.cpp/ggml The work to bring full-fledged WebGPU support in llama.cpp started about an year and a half ago. It has been lead by @reeselevine and team at USCS. For more information, checkout the interactive blog and paper in the quoted post. Here are 2 excerpts from the paper, summarizing the implemented software architecture.
WebGPU support in llama.cpp is here! Check out our blog post introducing it: https://reeselevine.github.io/llamas-on-the-web/ Run local models in your browser, with GPU acceleration. No data leaves your computer! Thanks to everyone who's made this possible, especially @ggerganov
View quoted postRT Reese Levine Re We have an arxiv paper up describing the work in more detail here: https://arxiv.org/abs/2605.20706. Also want to call out that there is even more room for improvement, some recent updates to wllama by @ngxson mean it's even more memory efficient than what we describe in the paper!
RT Julien Chaumond What hardware actually powers open-source AI? Not benchmarks. Not vendor marketing. Real-world community usage. We’re launching @huggingface Hardware: → trending GPUs & CPUs → VRAM distribution → inference hardware trends → what the OSS AI ecosystem really runs on
llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! https://github.com/ggml-org/llama.cpp/pull/22673
RT Victor M Quite excited about llama-eval, a proposed eval tool for llama.cpp. Could be a nice step toward more comparable community evals 🎉 https://github.com/ggml-org/llama.cpp/pull/21152
RT antirez Re @BereznevKi20669 @ggerganov Yes I believe the real llama.cpp revolution is yet to happen at its full scale. As computers will have more RAM and models will improve, and *if* China will continue shipping large strong models with open weights, what will happen will have huge effects.
RT clem 🤗 Local AI is having its moment! Below is the number of new GGUF models created each month over the past 8 months & insights from our HF internal agent (May is partial): - 176,000 total public GGUF models on HF - Two distinct regimes: Oct–Feb averaged ~5.1K new GGUF models/month. Then March–April jumped to ~9.2K/month — nearly double the previous rate. - March was the inflection point (+55% MoM) — likely driven by a wave of new open-weight model releases being quantized to GGUF. - April sustained the momentum at 9.7K, suggesting this isn't a one-off spike but a new baseline. - The GGUF ecosystem is accelerating — the community is quantizing models faster than ever, likely thanks to better tooling (llama.cpp improvements, automated quantization pipelines, and more models supporting GGUF natively). Let's go!
RT Radoslav Gerganov Running Qwen3.5-397B-A17B (4bit quants, 177 GB) on two DGX Sparks using llama.cpp with RPC and RDMA:
RT Julien Chaumond This is where we are right now. And i’m not gonna lie it feels pretty magical 🧚♀️ Qwen3.6 27B running inside of Pi coding agent via Llama.cpp on the MacBook Pro For non-trivial tasks on the @huggingface codebases, this feels very, very close to hitting the latest Opus in Claude Code, or whatever shiny monopolistic closed source API of the day is. In full airplane mode. Most people haven’t realized this yet. If you have, it means you have a huge headstart to what I call the second revolution of AI. Powerful local models for efficiency, security, privacy, sovereignty 🔥
llama-server -hf ggml-org/Qwen3.6-27B-GGUF --spec-default
RT Xuan-Son Nguyen llama.cpp now supports various small OCR models that can run on low-end devices. These models are small enough to run on GPU with 4GB VRAM, and some of them can even run on CPU with decent performance. In this post, I will show you how to use these OCR models with llama.cpp 👇 Original tweet: https://x.com/ngxson/status/2042631708650963344
RT Pierre-Antoine Bannier sam3.cpp - Meta's SAM 3 in pure C++ with @ggerganov's ggml - Supports SAM 3.1, 3, 2.1, 2 and EdgeTAM - FP16, 4-bit quant (EdgeTAM in 15 MB) - Apple Metal GPU, CUDA, CPU - Text-prompted: "peach" → every peach - Single-file C++14 Performance-wise: - 100ms object detection, segmentation - Video object segmentation @ 20FPS on M4 Pro with EdgeTAM https://github.com/PABannier/sam3.cpp Original tweet: https://x.com/el_PA_B/status/2041878732189679874
The example below is using prompt-based speculative decoding. Specifically, ngram hashing is utilized to suggest drafts of up to 64 tokens. The hasher keeps track of ngrams in the observed contexts, so mostly effective for coding tasks. Here is another demo:
Let me demonstrate the true power of llama.cpp: - Running on Mac Studio M2 Ultra (3 years old) - Gemma 4 26B A4B Q8_0 (full quality) - Built-in WebUI (ships with llama.cpp) - MCP support out of the box (web-search, HF, github, etc.) - Prompt speculative decoding The result:
View quoted postGemma 4 is now available in LlamaBarnWith the recent HF cache integration, all models that you have downloaded with llama.cpp are automatically available inside LlamaBarn too (and vice versa)
RT AA got Gemma 4 up and running at 34 tokens per second this is the 26B-A4B model, running on my mac mini m4 with 16GB ram next time i hit my claude session limits i'll have this fast free local AI as a backup :] Original tweet: https://x.com/measure_plan/status/2040069272613834847
i spent the afternoon experimenting with Gemma 4's vision capabilities made an app that uses roboflow RF-DETR for a first pass of object detections and Gemma to summarize the scene in one sentence for fun i asked Gemma to "describe what you see as if you were a medieval bard"
View quoted postSon lead the development on HF/llama.cpp side for adding support for the new Gemma 4 models. As always, he did an outstanding job throughout the collaboration with the Google DeepMind team. Day-0 support is possible thanks to his hard work!
While working on the pre-release support of gemma 4, I was surprised by its capabilities compared to their size. We're tapping on the surface here, there are more and more to discover about gemma 4. I'm excited to see what the community will do with it in the next few days 🚀🚀
View quoted postPro tip - hook your PC and Phone with Tailscale and enjoy fast and private inference on the go. Here is Gemma 4, hosted on Mac Studio, streaming to my iPhone. No 3rd party apps. Same WebUI experience everywhere.
Our partners from @NVIDIA_AI_PC have helped with optimizations and benchmarks of the new Gemma 4 models to guarantee that they run efficiently across the NVIDIA ecosystem Checkout the blog below for more information
Let me demonstrate the true power of llama.cpp: - Running on Mac Studio M2 Ultra (3 years old) - Gemma 4 26B A4B Q8_0 (full quality) - Built-in WebUI (ships with llama.cpp) - MCP support out of the box (web-search, HF, github, etc.) - Prompt speculative decoding The result: 300t/s (realtime video)
Gemma 4 is here! The best open-source model you can run on your machine. Day-0 support in a llama.cpp. Check it out!
RT clem 🤗 So proud to have @ggerganov and @ggml_org part of the @huggingface team. One of the unsung heroes of AI, powering the most widely used open-source runtime for local AI! Original tweet: https://x.com/ClementDelangue/status/2038752860192518294
llama.cpp at 100k stars now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄 Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on
llama.cpp at 100k stars now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄 Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss and yet it feels less and less important to try to make a point. Opinions about viability of local LLMs are strongly polarized, details are overlooked, the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves. One thing is clear though - local LLMs are used more and more. I expect this trend to continue and likely 2026 will end up being one of the most important years for the local AI movement. I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for doing long-context tasks. There wasn't an obvious path towards meaningful agentic applications. The memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released and by now, useful local agentic workflows are a reality. Comparing local vs hosted capabilities at a given moment of time is pointless. To try put things into perspective: - We don't need frontier intelligence to automate searches and sending emails - We don't need trillion parameter models to be able to summarize articles or technical documents - We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage I believe that there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that l...
RT Liu Liu https://releases.drawthings.ai/p/introducing-lightning-draft-interactive Original tweet: https://x.com/liuliu/status/2036507598757962175
RT Victor M Now available on Hugging Face: hf-mount 🧑🚀 The team really cooked, still wrapping my head everything possible but you can do things like: - mount a 5TB dataset as a local folder and query only the parts you need with DuckDB (✅ works) - browse any model repo with ls/cat like it's a USB drive - use a shared read-write bucket as a team drive for ML artifacts - drop the init container that downloads models in your k8s pods - point llama.cpp at a mounted GGUF and run inference (infinite storage??) Original tweet: https://x.com/victormustar/status/2036476453370380416
Released ggerganov/ggwave
ggerganov released ggwave-v0.4.3 at ggerganov/ggwave
RT Julien Chaumond started a crowd-sourced list of all libraries and applications that use the HF hub local cache (~/.cache/huggingface) here: https://huggingface.co/docs/hub/local-cache Please add any missing ones! PRs are welcome Original tweet: https://x.com/julien_c/status/2034244006930948251
With Nemotron 3 Nano 4B in the NVIDIA Nemotron 3 family, llama.cpp users get a compact model for action-taking conversational personas, available across NVIDIA GPU-enabled systems and @NVIDIA_AI_PC
RT Victor M Introducing Storage Buckets on Hugging Face 🧑🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. - Starting at $8/TB/mo. That's 3x cheaper than S3. You (and your coding agents) need somewhere to dump checkpoints, logs, and artifacts. Now they have a home. Original tweet: https://x.com/victormustar/status/2031419482292576725
Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages https://github.com/ggml-org/llama.cpp/discussions/20042
RT LM Studio Introducing LM Link ✨ Connect to remote instances of LM Studio, securely. 🔐 End-to-end encrypted 📡 Load models locally, use them on the go 🖥️ Use local devices, LLM rigs, or cloud VMs Launching in partnership with @Tailscale Try it now: https://link.lmstudio.ai Original tweet: https://x.com/lmstudio/status/2026722042347663779
RT Z.ai GLM-4.7-Flash-GGUF is now the most downloaded model on @UnslothAI. Original tweet: https://x.com/Zai_org/status/2021207517557051627
Activity on ggerganov/hnterm
ggerganov commented on an issue in hnterm
View on GitHubActivity on ggerganov/hnterm
ggerganov commented on an issue in hnterm
View on GitHubRT Xuan-Son Nguyen Qwen3-Coder-Next and Minimax-M2.1 are available on HF inference endpoints with the price of $2.5/hr and $5/hr respectively. With the context fitting supported, you can now utilize the largest context length possible for a given hardware. No more manual tuning -c option! Original tweet: https://x.com/ngxson/status/2020896739222282736
Activity on ggerganov/tmp2
ggerganov opened a pull request in tmp2
View on GitHub