24th at the Electrica puzzle challenge | https://t.co/baTQS2bdia
RT Xuan-Son Nguyen llama.cpp now supports various small OCR models that can run on low-end devices. These models are small enough to run on a GPU with 4 GB of VRAM, and some of them can even run on a CPU with decent performance. In this post, I will show you how to use these OCR models with llama.cpp 👇 Original tweet: https://x.com/ngxson/status/2042631708650963344
RT Pierre-Antoine Bannier sam3.cpp - Meta's SAM 3 in pure C++ with @ggerganov's ggml - Supports SAM 3.1, 3, 2.1, 2 and EdgeTAM - FP16, 4-bit quant (EdgeTAM in 15 MB) - Apple Metal GPU, CUDA, CPU - Text-prompted: "peach" → every peach - Single-file C++14 Performance-wise: - 100ms object detection, segmentation - Video object segmentation @ 20FPS on M4 Pro with EdgeTAM https://github.com/PABannier/sam3.cpp Original tweet: https://x.com/el_PA_B/status/2041878732189679874
The example below uses prompt-based speculative decoding. Specifically, ngram hashing is used to suggest drafts of up to 64 tokens. The hasher keeps track of ngrams in the observed context, so it is mostly effective for coding tasks. Here is another demo:
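Roughly, the ngram drafting described in the previous post can be sketched as follows. This is a toy illustration only, not llama.cpp's actual implementation: remember where each n-gram was first seen in the context, and when the same n-gram reappears at the tail, replay the tokens that followed it as a draft.

```python
class NgramDrafter:
    """Toy prompt-based draft proposer.

    Remembers the position right after the first occurrence of each
    n-gram and replays the continuation as a draft when the same
    n-gram shows up at the tail of the context. A sketch of the idea
    only, not llama.cpp's real data structures.
    """

    def __init__(self, n=3, max_draft=64):
        self.n = n
        self.max_draft = max_draft
        self.table = {}    # n-gram -> position where its continuation starts
        self.context = []  # all observed tokens

    def observe(self, tokens):
        # Record every n-gram and the position of the token that follows it.
        for t in tokens:
            self.context.append(t)
            if len(self.context) >= self.n:
                key = tuple(self.context[-self.n:])
                self.table.setdefault(key, len(self.context))

    def draft(self):
        # If the current tail n-gram was seen before, propose up to
        # max_draft tokens that followed its earlier occurrence.
        if len(self.context) < self.n:
            return []
        key = tuple(self.context[-self.n:])
        pos = self.table.get(key)
        if pos is None or pos >= len(self.context):
            return []
        return self.context[pos:pos + self.max_draft]
```

The drafted tokens are then verified by the target model in a single batch; on code-like inputs, repeated identifiers and boilerplate make such tail matches frequent, which is why the technique shines on coding tasks.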
Gemma 4 is now available in LlamaBarn. With the recent HF cache integration, all models that you have downloaded with llama.cpp are automatically available inside LlamaBarn too (and vice versa).
RT AA got Gemma 4 up and running at 34 tokens per second. this is the 26B-A4B model, running on my mac mini m4 with 16GB ram. next time i hit my claude session limits i'll have this fast, free local AI as a backup :] Original tweet: https://x.com/measure_plan/status/2040069272613834847
i spent the afternoon experimenting with Gemma 4's vision capabilities. made an app that uses roboflow RF-DETR for a first pass of object detections and Gemma to summarize the scene in one sentence. for fun i asked Gemma to "describe what you see as if you were a medieval bard"
Son led the development on the HF/llama.cpp side for adding support for the new Gemma 4 models. As always, he did an outstanding job throughout the collaboration with the Google DeepMind team. Day-0 support is possible thanks to his hard work!
While working on the pre-release support for Gemma 4, I was surprised by its capabilities relative to its size. We're only scratching the surface here; there is much more to discover about Gemma 4. I'm excited to see what the community will do with it in the next few days 🚀🚀
Pro tip - hook up your PC and phone with Tailscale and enjoy fast, private inference on the go. Here is Gemma 4, hosted on a Mac Studio, streaming to my iPhone. No 3rd-party apps. Same WebUI experience everywhere.
Our partners from @NVIDIA_AI_PC have helped with optimizations and benchmarks for the new Gemma 4 models to ensure that they run efficiently across the NVIDIA ecosystem. Check out the blog below for more information.
Let me demonstrate the true power of llama.cpp: - Running on Mac Studio M2 Ultra (3 years old) - Gemma 4 26B A4B Q8_0 (full quality) - Built-in WebUI (ships with llama.cpp) - MCP support out of the box (web-search, HF, github, etc.) - Prompt speculative decoding The result: 300t/s (realtime video)
Gemma 4 is here! The best open-source model you can run on your machine. Day-0 support in llama.cpp. Check it out!
RT clem 🤗 So proud to have @ggerganov and @ggml_org part of the @huggingface team. One of the unsung heroes of AI, powering the most widely used open-source runtime for local AI! Original tweet: https://x.com/ClementDelangue/status/2038752860192518294
llama.cpp at 100k stars

now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄

Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss, and yet it feels less and less important to try to make a point. Opinions about the viability of local LLMs are strongly polarized, details are overlooked, and the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves. One thing is clear though - local LLMs are used more and more. I expect this trend to continue, and 2026 will likely end up being one of the most important years for the local AI movement.

I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for long-context tasks. There wasn't an obvious path towards meaningful agentic applications; the memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released, and by now useful local agentic workflows are a reality.

Comparing local vs hosted capabilities at a given moment in time is pointless. To try to put things into perspective:
- We don't need frontier intelligence to automate searches and send emails
- We don't need trillion-parameter models to summarize articles or technical documents
- We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage

I believe that there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that l...
RT Liu Liu https://releases.drawthings.ai/p/introducing-lightning-draft-interactive Original tweet: https://x.com/liuliu/status/2036507598757962175
RT Victor M Now available on Hugging Face: hf-mount 🧑🚀 The team really cooked, still wrapping my head around everything possible, but you can do things like: - mount a 5TB dataset as a local folder and query only the parts you need with DuckDB (✅ works) - browse any model repo with ls/cat like it's a USB drive - use a shared read-write bucket as a team drive for ML artifacts - drop the init container that downloads models in your k8s pods - point llama.cpp at a mounted GGUF and run inference (infinite storage??) Original tweet: https://x.com/victormustar/status/2036476453370380416
ggerganov released ggwave-v0.4.3 at ggerganov/ggwave
RT Julien Chaumond started a crowd-sourced list of all libraries and applications that use the HF hub local cache (~/.cache/huggingface) here: https://huggingface.co/docs/hub/local-cache Please add any missing ones! PRs are welcome Original tweet: https://x.com/julien_c/status/2034244006930948251
With Nemotron 3 Nano 4B in the NVIDIA Nemotron 3 family, llama.cpp users get a compact model for action-taking conversational personas, available across NVIDIA GPU-enabled systems and @NVIDIA_AI_PC
RT Victor M Introducing Storage Buckets on Hugging Face 🧑🚀 The first new repo type on the Hub in 4 years: S3-like object storage, mutable, non-versioned, built on Xet deduplication. - Starting at $8/TB/mo. That's 3x cheaper than S3. You (and your coding agents) need somewhere to dump checkpoints, logs, and artifacts. Now they have a home. Original tweet: https://x.com/victormustar/status/2031419482292576725
Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages https://github.com/ggml-org/llama.cpp/discussions/20042
RT LM Studio Introducing LM Link ✨ Connect to remote instances of LM Studio, securely. 🔐 End-to-end encrypted 📡 Load models locally, use them on the go 🖥️ Use local devices, LLM rigs, or cloud VMs Launching in partnership with @Tailscale Try it now: https://link.lmstudio.ai Original tweet: https://x.com/lmstudio/status/2026722042347663779
RT Z.ai GLM-4.7-Flash-GGUF is now the most downloaded model on @UnslothAI. Original tweet: https://x.com/Zai_org/status/2021207517557051627
Activity on ggerganov/hnterm
ggerganov commented on an issue in hnterm
RT Xuan-Son Nguyen Qwen3-Coder-Next and Minimax-M2.1 are available on HF inference endpoints at $2.5/hr and $5/hr respectively. With context fitting supported, you can now use the largest context length possible for given hardware. No more manually tuning the -c option! Original tweet: https://x.com/ngxson/status/2020896739222282736
Activity on ggerganov/tmp2
ggerganov opened a pull request in tmp2
Introducing LlamaBarn — a tiny macOS menu bar app for running local LLMs. Open source, built on llama.cpp.
RT Julien Chaumond 5️⃣ Original tweet: https://x.com/julien_c/status/2015804900525932663
RT Xuan-Son Nguyen Hugging Face Inference Endpoints now support deploying GLM-4.7-Flash via llama.cpp, for as cheap as $0.8/hr. Using Q4_K_M and a 24k-token context length - should be enough for most use cases! Original tweet: https://x.com/ngxson/status/2015763148523897097
RT Xuan-Son Nguyen 🦙llama.cpp has supported Anthropic's Messages API for a while now, with streaming, tool calling, and reasoning support. Compatible with Claude Code. See more here: https://huggingface.co/blog/ggml-org/anthropic-messages-api-in-llamacpp Original tweet: https://x.com/ngxson/status/2013307175662223823
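A minimal sketch of what a Messages-style request to a local llama-server could look like. Assumptions: the server listens on localhost:8080, the endpoint path follows Anthropic's `/v1/messages` convention, and the `model` value is a placeholder (llama-server serves whatever model it loaded); see the linked blog post for the authoritative details.

```python
import json
from urllib.request import Request

def build_messages_request(prompt, host="http://localhost:8080"):
    """Build an Anthropic Messages API style POST request for a local
    llama-server. Endpoint path and model name are assumptions; the
    request is only constructed here, not sent."""
    payload = {
        "model": "local",      # placeholder; the server uses its loaded model
        "max_tokens": 256,
        "stream": True,        # streaming is supported per the post
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{host}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )
```

With a server actually running, the request can be sent with `urllib.request.urlopen(build_messages_request("Hello"))`, or pointed at by Claude Code as described in the blog post.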
Excited about this!
LFM2.5-Audio-1.5B > Real-time text-to-speech and ASR > Runs locally on a CPU with llama.cpp > Interleaved speech and text It's super elegant; I'm bullish on local audio models
Recent contributions by NVIDIA engineers and llama.cpp collaborators have resulted in significant performance gains for local AI
Some neat QoL improvements coming to llama.cpp thanks to Johannes Gäßler https://github.com/ggml-org/llama.cpp/discussions/18049
RT Xuan-Son Nguyen Introducing: the new llama-cli 🦙🦙 > Clean looking interface > Multimodal support > Conversation control via commands > Speculative decoding support > Jinja fully supported Original tweet: https://x.com/ngxson/status/1998763208098853332
We joined forces with NVIDIA to unlock high-speed AI inference on RTX AI PCs and DGX Spark using llama.cpp. The latest Ministral-3B models reach 385+ tok/s on @NVIDIA_AI_PC GeForce RTX 5090 systems. Blog: https://developer.nvidia.com/blog/nvidia-accelerated-mistral-3-open-models-deliver-efficiency-accuracy-at-any-scale/
RT Lysandre Transformers v5's first release candidate is out 🔥 The biggest release of my life. It's been five years since the last major (v4). From 20 architectures to 400, 20k daily downloads to 3 million. The release is huge, w/ tokenization (no slow tokenizers!), modeling & processing. Original tweet: https://x.com/LysandreJik/status/1995558230567878975
RT Jeff Geerling Just tried out the new built-in WebUI feature of llama.cpp and it couldn't be easier. Just start llama-server with a host and port, and voila!
Initial M5 Neural Accelerators support in llama.cpp Enjoy faster TTFT in all ggml-based software (requires macOS Tahoe 26) https://github.com/ggml-org/llama.cpp/pull/16634
RT Emanuil Rusev Re @fishright @ggerganov Just pushed a fix for this — this is what first launch is going to look like in the next version.
RT clem 🤗 When you run AI on your device, it is more efficient, less big brother, and free! So it's very cool to see the new llama.cpp UI, a chatgpt-like app that fully runs on your laptop without needing wifi or sending any data to any external API. It supports: - 150,000+ GGUF models - Drop in PDFs, images, or text documents - Branch and edit conversations anytime - Parallel chats and image processing - Math and code rendering - Constrained generation with JSON schema support Well done @ggerganov and team!
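The JSON-schema constrained generation mentioned above can be sketched as a request payload. Per the llama.cpp server docs, the `/completion` endpoint accepts a `json_schema` field that is converted to a grammar internally; treat the exact field names and endpoint here as an assumption, and the schema itself as a made-up example.

```python
import json

# Hypothetical schema for the example: force the model to emit an object
# with a string "title" and an integer "year".
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["title", "year"],
}

# Payload for llama-server's /completion endpoint (field names per the
# server README as an assumption); the server constrains sampling so the
# output conforms to the schema.
payload = {
    "prompt": "Describe the film Alien as JSON.",
    "n_predict": 128,
    "json_schema": schema,
}

body = json.dumps(payload)
```

The same constraint is what the WebUI exposes: instead of hoping the model produces valid JSON, sampling is restricted so that only schema-conforming tokens can be emitted.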
RT yags llama.cpp developers and community came together in a really impressive way to implement the Qwen3-VL models. Check out the PRs; it's so cool to see the collaboration that went into getting this done. Standard formats like GGUF, combined with mainline llama.cpp support, ensure that the models you download will work anywhere you choose to run them. This protects you from getting unwittingly locked into niche providers' custom implementations that won't run outside their platforms. Quoted post from Qwen: 🎉 Qwen3-VL is now available on llama.cpp! Run this powerful vision-language model directly on your personal devices—fully supported on CPU, CUDA, Metal, Vulkan, and other backends. We've also released GGUF weights for all variants—from 2B up to 235B. Download and enjoy! 🚀 🤗 Link: https://x.com/Alibaba_Qwen/status/1984634293004747252