Evals evals evals https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa
RT Ethan Mollick Teaching an experimental class for MBAs on “vibefounding,” where the students have four days to come up with and launch a company. More on this eventually, but quick observations: 1) I have taught entrepreneurship for over a decade. Everything they are doing in four days would have taken a semester in previous years, if it could have been done at all. Quality is also far better. 2) Give people tools and training and they can do amazing things. We are using a combination of Claude Code, Gemini, and ChatGPT. The non-coders are all building working products. But also everyone is doing weeks of high-quality work on financials, research, pricing, positioning, and marketing in hours. All the tools are weird to use, even with some training, but they are figuring it out. 3) People with experience in an industry or skill have a huge advantage, as they can build solutions that have built-in markets & which solve known hard problems that seemed impossible. (Always been true, but the barriers to actually doing stuff have fallen.) 4) The hardest thing to get across is that AI doesn’t just do work for you, it also does new kinds of work. The most successful efforts often take advantage of the fact that the AI itself is very smart. How do you bring its analytical, creative, and empathetic abilities to bear on a problem? What do you do with access to a very smart intelligence on demand? I wish I had more frameworks to clearly teach. So many assumptions about how to launch a business have clearly changed. You don’t need to go through the same discovery process if you build a dozen ideas at the same time & get AI feedback. Many, many new possibilities, and the students really see how big a deal this is. Original tweet: https://x.com/emollick/status/2011523783467958585
If you don't have data, you can generate it synthetically to get started with evals. However, prompting an LLM to generate lots of data without structured inputs / dimensions usually results in homogeneous outputs. More info in reply
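A minimal sketch of what "structured inputs / dimensions" can look like in practice: enumerate a few dimensions and make one generation call per combination, instead of asking one prompt for a big batch. The dimensions, model name, and prompt below are illustrative assumptions, not from the original post.

```python
# Dimension-driven synthetic data: one call per combination, so outputs vary
# along axes you chose instead of collapsing into one homogeneous style.
# (Dimensions, prompt, and model are hypothetical examples.)
import itertools
from openai import OpenAI

client = OpenAI()

personas = ["new user", "power user", "frustrated customer"]
scenarios = ["billing question", "bug report", "feature request"]
difficulties = ["simple", "ambiguous", "multi-step"]

queries = []
for persona, scenario, difficulty in itertools.product(personas, scenarios, difficulties):
    prompt = (
        f"Write one realistic support query from a {persona} "
        f"about a {scenario}. Make it {difficulty}. Output only the query."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    queries.append(resp.choices[0].message.content)
```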
RT Zach Mueller Since @xeophon slandered me (dared to say I vaguepost like an OAI employee) Here’s the skill I’m messing around with: https://github.com/199-biotechnologies/claude-deep-research-skill Original tweet: https://x.com/TheZachMueller/status/2011154062000209942
Who has good harness recommendations for tightly coupled research using Claude Code? Got something decently going but curious if there are other examples using agents etc
Different ways of sampling traces for evals. This is where data literacy becomes super important, because nothing beats good old data analysis.
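A rough sketch of what "different ways of sampling" can mean; the column names are hypothetical and should be swapped for your own trace schema.

```python
# Three complementary ways to pull traces for review (schema is hypothetical).
import pandas as pd

traces = pd.read_json("traces.jsonl", lines=True)

# 1) Random sample: unbiased baseline for error analysis.
random_sample = traces.sample(n=50, random_state=42)

# 2) Stratified sample: equal coverage per feature, so rare slices
#    aren't drowned out by the head of the distribution.
stratified = (
    traces.groupby("feature", group_keys=False)
          .apply(lambda g: g.sample(min(len(g), 10), random_state=42))
)

# 3) Outlier-biased sample: unusually long conversations often hide failures.
outliers = traces.nlargest(20, "num_turns")
```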
This Claude Code mobile app is really good, h/t @aronchick for telling me about it. It allows you to keep coding on your phone while connected to the session on your computer at home. The UX is really good in terms of ergonomics on your phone (so it beats terminal emulators IMO). https://app.happy.engineering/
RT Claude Introducing Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code. Original tweet: https://x.com/claudeai/status/2010805682434666759
Verifying an LLM judge involves setting up a classic ML-style test against human labels. Shipping an unverified LLM judge is the fastest way for people to lose trust in your evals (and your work in general).
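A sketch of what that classic ML-style test can look like: compare judge verdicts against held-out human labels and report agreement. The labels below are made-up placeholders.

```python
# Validate an LLM judge against human pass/fail labels (placeholder data).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # human annotations on held-out traces
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # LLM judge on the same traces

tn, fp, fn, tp = confusion_matrix(human_labels, judge_labels).ravel()
print(f"TPR (agrees with human 'pass'): {tp / (tp + fn):.2f}")
print(f"TNR (agrees with human 'fail'): {tn / (tn + fp):.2f}")
print(f"Cohen's kappa: {cohen_kappa_score(human_labels, judge_labels):.2f}")
```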
RT Zach Bruggeman The craft of engineering is rapidly changing. At @tryramp, we built our own background coding agent to move faster. We call it Inspect. It wrote 30% of merged frontend + backend PRs in the past week. It’s powered by @opencode, @modal and @CloudflareDev. It runs fully in the cloud, and starts in seconds, letting every builder work at the speed of thought, no setup required. Today, we’re open sourcing the full blueprint so anybody can build their own Inspect. Just give our spec to your current coding agent, and let it build your new favourite. Original tweet: https://x.com/zachbruggeman/status/2010728444771074493
RT Han cs : while loop = ai : ralph loop Original tweet: https://x.com/HanchungLee/status/2010614471895679456
RT Isaac Flath I went through @HamelHusain's agent threads for an AI Jupyter extension, JupyVibe, that he built agentically, to find the patterns in what he did - Agent: @AmpCode - Research: Librarian - Planning/debug: Oracle - Verifiability: Toolbox - Taste: Hamel https://isaacflath.com/writing/Agentic-Coding-Custom-Jupyter-Exension Original tweet: https://x.com/isaac_flath/status/2010411725456351271
RT Isaac Flath AI Writing Skill #1: Conciseness 👇 https://isaacflath.com/writing/AI-Writing-1-Conciseness Original tweet: https://x.com/isaac_flath/status/2010140932491121112
Wait till you hear about while loops and Ralph Wiggum
Activity on hamelsmu/vibetui
hamelsmu contributed to hamelsmu/vibetui
RT Bryan Bischof fka Dr. Donut Me trying to do a quick task with vibe coding and accidentally vibe coding 5 dev tools... Original tweet: https://x.com/BEBischof/status/2009749468422516838
We discourage Likert scales for LLM judges because they are costly to align and hard to act on. We also find that these scales help annotators dodge difficult decisions. Links in reply
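One common alternative (a sketch, not the exact setup from the linked replies) is a binary pass/fail judge that must write a critique before its verdict, which is easier to align against human labels and to act on. The prompt and model here are illustrative assumptions.

```python
# Binary pass/fail judge with a forced critique (prompt/model are hypothetical).
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a support-bot answer.
Criterion: the answer must resolve the user's question without inventing policy.

Question: {question}
Answer: {answer}

Write a one-sentence critique, then decide.
Respond as JSON: {{"critique": "...", "pass": true}} or {{"critique": "...", "pass": false}}"""

def judge(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return json.loads(resp.choices[0].message.content)
```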
Seriously first encounter with AI to avoid real pain and suffering
One of AI's best uses is to get you out of GH Actions hell > Look at why CI is failing using the gh cli, and figure out why it's failing. Propose a fix.
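For the curious: `gh run list` and `gh run view --log-failed` are the real gh CLI commands behind this workflow. Here is a hedged sketch of how an agent (or you) might script them; the run ID is a placeholder.

```python
# Diagnose a failing CI run via the gh CLI (run ID below is a placeholder).
import subprocess

def sh(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Recent runs with their conclusions, to spot the failure.
print(sh(["gh", "run", "list", "--limit", "5",
          "--json", "databaseId,conclusion,displayTitle"]))

# Only the logs from the failed steps of that run.
print(sh(["gh", "run", "view", "123456789", "--log-failed"])[-2000:])
```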
Activity on hamelsmu/amp-skills
hamelsmu forked hamelsmu/amp-skills from snarktank/amp-skills
RT Omar Khattab “[using embedding search] is why Cursor is the best agent for large codebases” We are still so early… Original tweet: https://x.com/lateinteraction/status/2009438383563788302
This is why Cursor is the best agent for large codebases. You simply cannot beat semantic search as a way for an agent to navigate a large codebase. Imagine trying to find all implementations of a method by searching for strings, versus searching for the semantic meaning behind…
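A toy illustration of the point (the embedding model and code chunks are illustrative choices, not Cursor's actual stack): semantic search finds the retry logic even though the query shares no strings with the code.

```python
# Semantic code search vs. string grep (toy example; model choice is arbitrary).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [  # pretend these were chunked out of a large codebase
    "def retry_with_backoff(fn, max_attempts=5): ...",
    "class PaymentGateway: def charge(self, amount): ...",
    "def render_invoice_pdf(invoice): ...",
]

query = "where do we handle flaky network calls?"
scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]
best = max(range(len(chunks)), key=lambda i: float(scores[i]))
print(chunks[best])  # matches the retry code; grep for 'flaky' finds nothing
```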
This is my reaction anytime I see generic eval metrics. Here is why: https://hamel.dev/blog/posts/evals-faq/#q-should-i-use-ready-to-use-evaluation-metrics
This flashcard is fun. Everyone wants to reach for the blue button 🙃: It doesn't really work because you need to go through the exercise of externalizing your requirements first - and criteria normally shift after looking at traces. These are all very real mistakes.
RT Lenny Rachitsky "This course was a revelation." - EM, Cisco "Wish I had taken it sooner." - Principal PM, Microsoft "This course completely changed how I think about evals." - Sr. Director of AI Engineering, Microsoft "One of the best investments I've made recently" - Sr. Director of Product, Square "This class is mandatory knowledge for anyone building AI systems." - Sr. PM, Google Just a few testimonials from listeners who watched my chat with @HamelHusain and @sh_reya and took their course. Link to the episode and the course in the comments. Their next cohort starts in a few weeks. Original tweet: https://x.com/lennysan/status/2009026506732019749
Vibe coding Jupyter Lab plugins is underrated
RT Isaac Flath I forked @HamelHusain's experiment to test whether someone else could pick it up, or if it only worked for the original coder. I added multi-model support via LiteLLM in about 15 minutes. All the context was already there for AI (@AmpCode tools, threads, etc). This surprised me Original tweet: https://x.com/isaac_flath/status/2008955131769827440
You shouldn't always write an eval. It all comes down to a few things: 1. Did you look at your data first? 2. How much do you anticipate having to iterate on the problem? 3. Can you write a cheap eval (like a code-based assertion)? Link to download all the flashcards in reply
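A sketch of point 3, a "cheap eval" as plain code-based assertions; the specific checks are hypothetical examples of the pattern.

```python
# Code-based assertions: fast, deterministic checks that need no LLM judge.
import json

def _is_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

def eval_response(output: str) -> dict:
    return {
        "is_valid_json": _is_json(output),
        "no_ai_boilerplate": "as an AI" not in output,
        "cites_a_source": "http" in output,
    }

results = eval_response('{"answer": "See http://example.com"}')
assert all(results.values()), [k for k, v in results.items() if not v]
```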
RT Isaac Flath It's amazing to me how persistent the lines of code metric is. Everyone knows it's a pretty useless metric for measuring productivity/quality... Until they want to talk about handcrafted code quality or agentic coding productivity. Original tweet: https://x.com/isaac_flath/status/2008533292862374137
RT Quinn Slack Lots of people are building @AmpCode original features into other agents, like the oracle, the librarian, and thread sharing. https://x.com/tanishqkan/status/2008361057543246047?s=46 https://x.com/badlogicgames/status/2006886256848867704?s=46 https://x.com/moinulmoin/status/2006063601224933394?s=46 Etc. We love to see this and would love to hear feedback from people who do this and who use these transplanted features: what’d you learn in transplanting it to another agent? If you blog about it and share your feedback and criticism, that’d be awesome! The truth is that the current iteration of coding agents is nasty, brutish, and short. It’s like back when there were 15 DVCSes; several, like Git and Hg, were popular, but the workflows were messy, and people were working toward the higher-level platform and workflow (the GitHub in this analogy, but this’ll be much bigger than GitHub). We build Amp so that we and our users are on the front lines and can build/discover that next thing and workflow, the next generation of the stuff congealing in this primordial soup of threads, tasks, tools, skills, commits, subagents, etc. So, for anyone, and especially those who build stuff inspired by Amp and get to know it really well, please share your ideas and criticism back with us so we can congeal this soup faster. (We write down what we’re learning at http://ampcode.com/chronicle. There are some great agent design critiques from @mitsuhiko at https://lucumr.pocoo.org/. Also @mitchellh @badlogicgames @simonw @HamelHusain @geoffreylitt.) Original tweet: https://x.com/sqs/status/2008375213856354423
bruuh, @opencode becomes a beast with oh my opencode; you can get the same harness as @AmpCode, loving it!
My 7-year-old attends Montessori school. I've started making software with Claude Code / Amp (I use them round robin lol) to help her practice concepts she's learning, anytime we need it. Example (made this in 15 min): https://racks-tubes.zara.dev/
RT @levelsio This is how I code now about half the time From my phone with Termius coding on my VPS and writing requests into Claude Code like a goon Original tweet: https://x.com/levelsio/status/2008302286183842222
💯
@HamelHusain Agreed. If you believe nobody is shipping anything with AI then it’s kinda willful ignorance at this point IMO
The reason people are shouting with the volume turned up to 11 is that they are genuinely having constant moments of delight, shipping real things faster, and acting on their ambition. Also, my TL is filled with plenty of people “showing stuff”
I've never encountered a software productivity technology where so many people are shouting about How Great This Stuff Is with the volume turned up to 11 while almost never showing any interesting new work that they built with AI coding. Could people just show stuff? ...
We created flashcards for students in our Evals course, but are giving them away for free! First up, Error Analysis - the most important part of evals. More info in the reply. @sh_reya Download all the flashcards: https://maven.com/parlance-labs/evals/admin/lead-magnets/15420a If you haven't heard of error analysis before, Shreya and I do a live demo here: https://www.youtube.com/watch?v=BsWxPI9UM4c
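For a flavor of the mechanics (the notes below are invented; the live demo linked above shows the real process): read traces, write a short free-form note per failure, then tally recurring failure modes to decide what to fix first.

```python
# The counting step of error analysis (notes are made-up examples).
from collections import Counter

notes = [  # one open-ended note per failing trace, written while reading
    "hallucinated a refund policy",
    "ignored the second question in the message",
    "hallucinated a refund policy",
    "tone too robotic",
    "ignored the second question in the message",
]

for failure_mode, count in Counter(notes).most_common():
    print(f"{count:2d}  {failure_mode}")
```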
Wife (who doesn't code) created this website for her educational podcast with Claude Code in 30 min https://www.wonderwise.science/ CC guided her by: - deploying on Vercel and getting a domain - downloading and analyzing all her content for website copy - making a responsive version - resolving all links to each episode + the RSS feed
My wife has been using AI to create educational songs for our children (she's a Cardiac Electrophysiologist by day). It's already had a positive impact on our children (ages 4 and 7). Her workflow involves Claude, Suno, and NanoBanana. This is mainly a "personal podcast" that…
From a friend > People who will struggle with AI tools aren't the incompetent. It's the people with high ego. You need the humility to be surprised when it overtakes you, without biases, and to make it better
RT Isaac Flath Re I think new stuff will be created that will take over! What's most common today is what AI is best at, because very little has been built with AI in mind. But when things that were developed specifically with AI usage as a key consideration mature, they'll take over. For example, @AirWebFramework (by @pydanny and @audreyfeldroy) is a new web-dev library designed to be usable by both AI and humans. I use it because it has really solid benefits for both AI and human comprehensibility. And it's in alpha and AFAICT I am the only one with commercial apps for it, so it's not ONLY a training-data volume equation; they've figured something out there :) Original tweet: https://x.com/isaac_flath/status/2007908570218512870
This is also why I can't use nbdev anymore. There isn't a good **open source** AI + Jupyter integration I really like (many exist, but they all feel second-class). So I've given up on using it to build software and am using more paved paths
I have flipped from using the libraries/languages I like to using what AI prefers. Swimming upstream is not worth it. For example, I'm a Python developer, but will be using nextjs for web apps - I'll keep using Python for data / ML work. It's also a great opportunity to learn things. There is a huge productivity gain to be had by using the right stack
RT Jake Re Today I handed Claude a document that I've been growing for...years on building an orchestrator/distributed runtime that I had only purely theorized possible. One we've been working towards. It would have taken me probably months to code by hand. Building on 5 years of work and 10 years of experience. Claude wrote all the code in Golang in 4 hours. I'd always actually wanted it in Rust cause I thought it would be easier to express, so I threw it in a loop with a "Rewrite it in Rust and make it as succinct as possible" I went and ate a burrito. I came back and it was done. That's the world we live in now. Original tweet: https://x.com/JustJake/status/2007730898192744751
I 100% did this with @AmpCode. I downloaded all my Amp threads as markdown files into the repo's _threads/ folder https://github.com/AnswerDotAI/ai-jup
I definitely think vibe engineering is a skill. Don't be too quick to conclude "it doesn't work for me" without trying different approaches. It takes a fair amount of skill to direct things carefully and anticipate issues, and also to know when to exit and take over the wheel for bits
Activity on hamelsmu/ai-jup
jph00 added jph00 to hamelsmu/ai-jup
RT Bryan Bischof fka Dr. Donut How did you spend your Learning and Development budget? I get this question every year; a lot. So here's an opinionated guide https://open.substack.com/pub/pseudorandomgenerator/p/how-to-invest-in-learning-and-development?utm_campaign=post-expanded-share&utm_medium=web Original tweet: https://x.com/BEBischof/status/2006471295698149848
RT Han the three most harmful addictions are heroin, carbohydrates, and a monthly salary. - n.n.taleb Original tweet: https://x.com/HanchungLee/status/2006448752438190080
RT Isaac Flath Re @HamelHusain Awesome. It worked really well for me. I tried it for looking at data from one feature and taking notes on what it did poorly. It created really nice widgets for exploring. Super convenient. Now I want to go make more personal tools for fun :D Original tweet: https://x.com/isaac_flath/status/2006443529556709597
Everyone's talking about vibe coding without looking at code. I was skeptical. I decided to give it a shot on a challenging problem and was blown away by what I could accomplish in 8 hours. I'm not skeptical anymore, but I also do NOT think it kills SaaS. Notes:
- I tried to replicate some of my favorite features from the Solve-It platform as a Jupyter extension https://www.fast.ai/posts/2025-11-07-solveit-features.html#tools
- I've tried this many times before, and it didn't work.
- I gave my agent specific testing tools that I packaged as skills. I used AI to write the skills. I found the right testing workflow for this Jupyter extension by having AI peruse lots of other extensions and the Jupyter source code.
- I had the AI write and maintain a large suite of tests the whole time. I think this was important in keeping the AI on track.
- I watched the diffs and the thinking traces as they streamed by. From time to time, I would see something very suspicious like "try ... except: pass" and would stop the AI and tell it to stop this behavior, then trigger a comprehensive code review using AI.
- Most importantly, I don't think this kills SaaS at all. Even if I can create software that replicates some of my favorite features, there is an insanely long tail of paper cuts and features I don't want to manage. The models and capabilities are improving so fast that I don't want to constantly tune everything. So I would rather leave that up to people who are focused on that daily and have good taste, with the knowledge that it has been battle-tested against many users.
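A hedged sketch of how you might automate the "suspicious diff" watch described above, flagging silently swallowed exceptions in the latest changes; the regex is deliberately crude and only catches the single-line form.

```python
# Flag added lines that swallow exceptions silently (illustrative check).
import re
import subprocess

diff = subprocess.run(["git", "diff", "HEAD~1"],
                      capture_output=True, text=True).stdout

suspicious = [
    line for line in diff.splitlines()
    if line.startswith("+") and re.search(r"except[^:]*:\s*pass", line)
]
if suspicious:
    print("Silent exception handling added in this diff:")
    print("\n".join(suspicious))
```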
Vibe engineering feels a lot like machine learning. It's a fuzzy process where tests are regularizers. Instead of watching loss curves, I watch diffs and thinking tokens stream by and then course-correct anything that seems fishy
Activity on hamelsmu/ai-jup
hamelsmu commented on an issue in ai-jup
Activity on hamelsmu/ai-jup
hamelsmu opened an issue in ai-jup
Activity on hamelsmu/ai-jup
hamelsmu opened an issue in ai-jup