ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & reasoning (https://t.co/5TueQKx2Fk)
The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now we have a technical report with some interesting tidbits. I summarized some of them below:

1. Full attention as an anti-trend?: They tried hybrid sliding-window attention variants (like so many others, e.g., Xiaomi MiMo, Laguna, Gemma 4, Arcee, Olmo 3, etc.). But even though there were efficiency gains, they said that the production-quality tradeoffs were not worth it for M2.

2. Linear and sparse attention deployment issues: They found that linear and sparse attention are attractive on paper because they reduce the cost of long-context attention, but they are harder to make work well in a production agent system. In particular, they found that these efficient attention variants may be more fragile when KV-like state or intermediate memory is stored in lower precision. They also have worse prefix caching support, which matters a lot for coding agents (which reuse a lot of the context).

3. Fine-grained Mixture-of-Experts (MoEs) are useful: Finally a recent MoE ablation study! It's only on the 2B-active parameter scale, but hey, better than nothing. Concretely, they compare a baseline with 32 experts and top-2 routing against a fine-grained setup with 128 experts and top-8 routing. The fine-grained setup improves MATH from 19.6 to 24.1 and HumanEval from 29.7 to 32.5. That's clearly a win for more fine-grained experts (confirming what the DeepSeek MoE paper reported ~2 years ago).

4. Sophisticated agent pipeline: It's probably no surprise, but this paper confirms that training for agent-like behavior on software engineering tasks is now a big component of the training pipeline. They mine GitHub pull requests, build runnable Docker environments, extract task-specific test rewards, etc.

5. Interleaved thinking for context management: Interestingly, they found that removing reasoning blocks from previous turns results in worse performance, especial...
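To make the MoE comparison in point 3 a bit more concrete, here is a minimal sketch of top-k routing for the two configurations (32 experts with top-2 vs. 128 experts with top-8). Note the assumptions: the router logits are random stand-ins, renormalizing the selected weights is just one common convention, and the function names are hypothetical. This is not the routing code from the report. The point is only that both setups activate the same fraction of experts per token, while the fine-grained one can choose from far more expert combinations.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of router logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    # Pick the k experts with the highest router probability and
    # renormalize their weights to sum to 1 (assumed convention).
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def summarize(num_experts, top_k):
    # Fraction of experts active per token and the number of distinct
    # expert subsets the router can select.
    return {
        "active_fraction": top_k / num_experts,
        "expert_combinations": math.comb(num_experts, top_k),
    }


# (num_experts, top_k) pairs as quoted in the post.
configs = {"baseline": (32, 2), "fine_grained": (128, 8)}

rng = random.Random(0)  # random logits stand in for a learned router
for name, (num_experts, top_k) in configs.items():
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
    weights = route_top_k(logits, top_k)
    print(name, summarize(num_experts, top_k), sorted(weights))
```

Both configurations activate 1/16 of the experts per token. Whether the active parameter count also matches depends on the per-expert size, which the summary above does not state.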
Recently, we took time to consolidate all of the work behind M2 and published it here: our M2 paper on arXiv. It’s been just over six months since we first open-sourced M2 on December 23 last year. During that time, a number of our ideas and systems have been broadly adopted by
Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib. With motivation, overview, and GPT-style model reference implementation as standalone example code: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04/09_dsa
It's been *almost* a bit quiet around LLM architecture releases in the past two weeks 😅 An interesting tidbit is the parallel block design. Per the Cmd-A tech report: "equivalent performance but significant improvement in throughput compared to the vanilla transformer block."
Introducing: Cohere Command A+ We’ve created our most powerful LLM yet, optimized it to run on as little hardware as possible, and released it open-source for all.
rasbt released v1.0 at rasbt/reasoning-from-scratch
rasbt released v1.1 at rasbt/macos-pdf-splitter
New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4. I focus on long-context efficiency tweaks like KV sharing, per-layer embeddings, layer-wise attention budgets, compressed attention, and mHC. Link: https://magazine.sebastianraschka.com/p/recent-developments-in-llm-architectures
A little talk on what we can learn from implementing LLM architectures from scratch in Python and PyTorch, and how I approach new open-weight models, compare them against reference implementations, etc.: https://www.youtube.com/watch?v=TXzQ7PGpO6w
Interesting paper. What I like about this is that it is a relatively low-commitment attention modification. I.e., one can use it during most of training, switch back to vanilla attention near the end, and recover roughly the same modeling performance as if full attention had been used the whole time.
Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment? That is the idea behind Lighthouse Attention. The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that