ML/AI research engineer. Ex-stats professor. Author of "Build a Large Language Model From Scratch" (https://t.co/O8LAAMRzzW) & "Build a Reasoning Model (From Scratch)" (https://t.co/5TueQKx2Fk)
RT Sten Rüdiger: I’ve uploaded a new paper on arXiv (co-authored by @rasbt): "MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning". In parameter-efficient fine-tuning, a key question may not just be how low-rank the update is, but *which* subspace we adapt. Original tweet: https://x.com/StenRuediger/status/2041888496927826398
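To make the low-rank framing concrete, here is a minimal NumPy sketch of a LoRA-style update (sizes and rank are illustrative; MiCA's subspace-selection idea itself is not shown here):

```python
import numpy as np

# Pretrained weight matrix (frozen during fine-tuning); sizes are illustrative
d_out, d_in, r = 64, 32, 4  # r is the adapter rank
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))

# LoRA parameterizes the weight update as a rank-r product B @ A
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))  # zero-init, so the adapted model starts identical to the base

delta_W = B @ A
W_adapted = W + delta_W

# The update can never exceed rank r, no matter how training moves A and B
assert np.linalg.matrix_rank(delta_W) <= r
```

The point the tweet raises is that two updates with the same rank r can still live in very different subspaces of the weight space, which is what the paper reportedly targets.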
Strong release! GLM-5.1 is a DeepSeek-V3.2-like architecture (including MLA and DeepSeek Sparse Attention) but with more layers. And the benchmarks look better throughout! Looks like THE flagship open-weight model now.
Introducing GLM-5.1: The Next Level of Open Source
- Top-Tier Performance: #1 in open source and #3 globally across SWE-Bench Pro, Terminal-Bench, and NL2Repo.
- Built for Long-Horizon Tasks: Runs autonomously for 8 hours, refining strategies through thousands of iterations.
Added an RSS feed to the LLM Architecture Gallery so it is a bit easier to keep up with new additions over time: https://sebastianraschka.com/llm-architecture-gallery/
Components of a coding agent: a little write-up on the building blocks behind coding agents, from repo context and tool use to memory and delegation. Link: https://magazine.sebastianraschka.com/p/components-of-a-coding-agent
Flagship open-weight release days are always exciting. Was just reading through the Gemma 4 reports, configs, and code, and here are my takeaways:

Architecture-wise, besides multimodal support, Gemma 4 (31B) looks pretty much unchanged compared to Gemma 3 (27B). Gemma 4 maintains a relatively unique pre- and post-norm setup and remains relatively classic, with a 5:1 hybrid attention mechanism combining sliding-window (local) layers and full-attention (global) layers. The attention mechanism itself is also classic Grouped Query Attention (GQA).

But let’s not be fooled by the lack of architectural changes. Looking at the benchmarks, Gemma 4 is a huge leap from Gemma 3. This is likely due to the training set and recipe.

Interestingly, on the AI Arena Leaderboard, Gemma 4 (31B) ranks similarly to the much larger Qwen3.5-397B-A17B model. But as I discussed in my model evaluation article, arena scores are a bit problematic, as they can be gamed and are biased towards human (style) preference. If we look at some other common benchmarks, which I plotted below, we can see that it’s indeed a very clear leap over Gemma 3 and ranks on par with Qwen3.5 27B.

Note that there is also a Mixture-of-Experts (MoE) Gemma 4 variant that is slightly smaller (27B, with 4 billion parameters active). Its benchmarks are only slightly worse compared to Gemma 4 (31B). I omitted the MoE architecture in the figure below because the figure is already very crowded, but you can find it in my LLM Architecture Gallery.

Anyways, overall, it's a nice and strong model release and a strong contender for local usage. Also, one aspect that should not be underrated is that (it seems) the model is now released with a standard Apache 2.0 open-source license, which has much friendlier usage terms than the custom Gemma 3 license.
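For intuition, a 5:1 local/global hybrid like the one described above can be sketched as a simple per-layer pattern. This is a minimal illustration, not the actual Gemma 4 config (layer count and pattern period are assumptions for the example):

```python
# Sketch of a 5:1 sliding-window (local) to full-attention (global)
# layer pattern, as described for Gemma 3/4. Numbers are illustrative.
N_LAYERS = 48
PATTERN = 6  # every 6th layer uses full (global) attention: 5 local + 1 global

def attention_type(layer_idx: int) -> str:
    """Return 'global' for every PATTERN-th layer, else 'local' (sliding window)."""
    return "global" if (layer_idx + 1) % PATTERN == 0 else "local"

layer_types = [attention_type(i) for i in range(N_LAYERS)]

# The 5:1 ratio holds across the whole stack
assert layer_types.count("local") == 5 * layer_types.count("global")
```

The appeal of this layout is that only the occasional global layer pays full quadratic attention cost, while the sliding-window layers keep KV-cache and compute bounded.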
http://x.com/i/article/2038978163389112321
It’s done. All chapters of Build A Reasoning Model (From Scratch) are now available in early access. The book is currently in production and should be out in the coming months, including a full-color print edition with syntax highlighting. There’s also a preorder up on Amazon.
RT levi: Day 83/365 of GPU Programming. Looking at DeepSeek's Multi-Head Latent Attention today. The last part of the AMD challenge series is to optimize an MLA decode kernel for MI355X, where the absorbed Q and compressed KV cache are given and your task is to do the attention computation.

A resource that really helped internalize what MLA does was @rasbt's incredible visual guide to attention variants in LLMs (luckily he posted that last week!), which covers everything from MHA to GQA to MLA to SWA, et cetera. If there's one place to get a visual intuition for recent attention mechanisms, it's this blog post. @jbhuang0604's video on MQA, GQA, MLA, and DSA was the best conceptual intro I found on the topic; it progressively builds up the ideas from first principles. The Welch Labs analysis of MLA is a great watch as well, with a beautiful visualization of the changes DeepSeek made for MLA.

Tried out a few kernels once I had a basic understanding of MLA, and I think I'm slowly getting more comfortable with at least analyzing kernels. Original tweet: https://x.com/levidiamode/status/2037663231511322831
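As a reference point for the MHA-to-GQA step in that progression, here is a minimal NumPy sketch of Grouped Query Attention, where several query heads share one key/value head. All sizes are illustrative, and this omits RoPE, masking, and batching:

```python
import numpy as np

# Grouped Query Attention (GQA) sketch: n_q_heads query heads share
# n_kv_heads key/value heads. Sizes are illustrative.
n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 10
group_size = n_q_heads // n_kv_heads  # 4 query heads per KV head
rng = np.random.default_rng(0)

q = rng.normal(size=(n_q_heads, seq, d_head))
k = rng.normal(size=(n_kv_heads, seq, d_head))  # far fewer KV heads to cache
v = rng.normal(size=(n_kv_heads, seq, d_head))

outputs = []
for h in range(n_q_heads):
    kv = h // group_size  # map each query head to its shared KV head
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    outputs.append(weights @ v[kv])

out = np.stack(outputs)  # (n_q_heads, seq, d_head)
```

With n_kv_heads == n_q_heads this reduces to classic MHA, and with n_kv_heads == 1 it becomes MQA; MLA goes further by caching a compressed latent instead of full K/V heads.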
Day 82/365 of GPU Programming. Taking a closer look at Mixture of Experts today so I can write better MoE kernels, specifically to optimize an MXFP4 MoE fused kernel for the GPU Mode challenge. I haven't had much prior exposure to MoEs, so I learned lots of new concepts today.
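For readers new to MoE like the author describes, the core routing idea can be sketched in a few lines of NumPy. This is an illustrative top-k router with toy linear "experts", not a fused kernel, and it ignores MXFP4 quantization entirely:

```python
import numpy as np

# Minimal top-k Mixture-of-Experts routing sketch: each token is sent to
# its k highest-scoring experts, and their outputs are gate-weighted.
n_tokens, d_model, n_experts, top_k = 6, 8, 4, 2
rng = np.random.default_rng(0)

x = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy experts

logits = x @ router_w                                 # (n_tokens, n_experts)
topk = np.argsort(logits, axis=-1)[:, -top_k:]        # indices of chosen experts
gates = np.take_along_axis(logits, topk, axis=-1)
gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # renormalized softmax

y = np.zeros_like(x)
for t in range(n_tokens):
    for slot in range(top_k):
        e = topk[t, slot]
        y[t] += gates[t, slot] * (x[t] @ experts[e])  # only k experts run per token
```

The reason MoE kernels are interesting to optimize is visible even here: each token touches only top_k of the n_experts weight matrices, so a fused kernel has to gather tokens per expert rather than run one dense matmul.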
Added lots of improvements to the LLM Architecture Gallery in the last 2 weeks. Imho the coolest one yet: a diff tool many of you were asking for! https://sebastianraschka.com/llm-architecture-gallery/
Doing my tax return just made me think: TurboTax is probably something one could vibecode. But paying $190 for a reliable, worry-free experience still seems pretty reasonable. SaaS is not dead.