Hamel Husain
Bio
Evals evals evals https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa
Platform
Content history
Run the following command and you can see some of what Codex is cooking. TIL they have remote_control too! > codex feature list P.S. it's worth reading the manual
RT Shreya Shankar Looking forward to CHI this week! We have a ✨Best Paper ✨ on a "what-if" analysis tool for RAG. Reach out to chat! I'm interested in: MLOps/LLMOps, data analysis, and better interfaces for human-AI collaboration (and, very soon i'll be recruiting students/postdocs) Original tweet: https://x.com/sh_reya/status/2043436179643830517
RT Omar Khattab As promised, here's a recording of my 30-min keynote and the subsequent Q&A for the inaugural late interaction retrieval (LIR) workshop, cc @bclavie @antoine_chaffin. The talk is admittedly advanced, as it's directed at an expert IR community. But hopefully still broadly useful! Original tweet: https://x.com/lateinteraction/status/2043053506504925588
Lots of people interested in the late Interaction workshop, listening to @lateinteraction's keynote!
RT Anthony Morris ツ btw you can ssh into your Mac mini from Claude code desktop now Original tweet: https://x.com/amorriscode/status/2042733568410161326
@benvargas My PR is super stale and I've been working on higher priority stuff. Let me try to get this out by Friday.
People really need to be reading the prompts underneath off-the-shelf evals so they can learn how pointless they are. This is the "faithfulness score" from RAGAS. Read the Context: why the fuck is the statement factually consistent? If you use this in your "harness", good luck
RT Ben Vargas Not sure when this shipped, but just checked and ssh to mac is supported! Thanks @amorriscode Original tweet: https://x.com/benvargas/status/2042675707625771246
RT Bryan Bischof fka Dr. Donut Sorry I couldn't quite hear you over ALPHA ZONE. But seriously check out my podcast it's called In Practice and it's weird and technical about real AI applications https://www.youtube.com/@theoryvc Original tweet: https://x.com/BEBischof/status/2042379114561282103
the current state of production design in tech is laptop, table, books, wall
RT Thariq you'll need to explicitly prompt Claude Code to use it, but the Monitor Tool is super powerful e.g. "start my dev server and use the MonitorTool to observe for errors" Original tweet: https://x.com/trq212/status/2042335178388103559
Thrilled to announce the Monitor tool which lets Claude create background scripts that wake the agent up when needed. Big token saver and great way to move away from polling in the agent loop Claude can now: * Follow logs for errors * Poll PRs via script * and more!
RT Harrison Chase 🎙️Introducing Max Agency Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production. Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way. A few takeaways from our conversation: - Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big - Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding - You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations Watch the full episode on: - Youtube: https://www.youtube.com/watch?v=Xyh1EqcjGME - Apple Podcasts: https://podcasts.apple.com/us/podcast/how-hex-builds-ai-agents-making-agents-reason-like/id1891551672?i=1000760489140 - Spotify: https://open.spotify.com/episode/1BJlg3SOJrjnaPXFHTNuux?si=bffc89cb4f774617 Original tweet: https://x.com/hwchase17/status/2042279493050740916
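The "small set of handcrafted traps, run with multiple repetitions" idea from the takeaways above could be sketched roughly as follows. This is a minimal illustration, not how Hex does it: `run_traps`, `toy_agent`, the trap dictionaries, and the checks are all hypothetical names made up for the example.

```python
from collections import Counter


def run_traps(agent, traps, repetitions=5):
    """Run each trap several times and return the pass rate per trap name.

    `agent` is any callable mapping a prompt string to an output string;
    each trap is a dict with a "name", a "prompt", and a "check" callable.
    (All of these shapes are assumptions made for this sketch.)
    """
    passes = Counter()
    for trap in traps:
        for _ in range(repetitions):
            output = agent(trap["prompt"])
            if trap["check"](output):
                passes[trap["name"]] += 1
    return {trap["name"]: passes[trap["name"]] / repetitions for trap in traps}


# Toy stand-ins: a deterministic "agent" and two handwritten traps.
def toy_agent(prompt):
    return "SELECT COUNT(*) FROM orders" if "orders" in prompt else "I don't know"


traps = [
    {"name": "counts_orders", "prompt": "How many orders?", "check": lambda out: "COUNT" in out},
    {"name": "knows_revenue", "prompt": "Total revenue?", "check": lambda out: "SUM" in out},
]

rates = run_traps(toy_agent, traps, repetitions=3)
```

With a set this small, every trap whose pass rate is below 1.0 can be read and explained by hand, which is the point of keeping the set small.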
RT Cursor You can now run Cursor on any machine and control it from anywhere. Kick off agents from your phone to run on your devbox. Original tweet: https://x.com/cursor_ai/status/2041912812637966552
RT Mario Zechner People of pi. BIG NEWS. I've sold out. Let me know how you feel about this in the comments below. https://mariozechner.at/posts/2026-04-08-ive-sold-out/ Original tweet: https://x.com/badlogicgames/status/2041808475336941725
RT Chris Tate New Skill: Email Emulation Test magic links, verification codes w/o sending real emails → Send via the Resend SDK → Retrieve emails from a local inbox → Extract codes to complete auth flows → One env var to reroute traffic npx skills add vercel-labs/emulate --skill resend Original tweet: https://x.com/ctatedev/status/2041654204771500547
RT Alexis Gallagher My friend @HamelHusain interviewed me about Sparky. (This was recorded back in February, before Sparky and I went to NVIDIA GTC.) https://youtu.be/LcupCy9loxY?si=Aw5_TC3pYHgAqC1F Original tweet: https://x.com/alexisgallagher/status/2041293277849362708
RT Kyle Kelley It's official, http://nteract.io is back in action. Localfirst Desktop App for interactive computing, notebooks built in Frictionless REPLs for Humans and Agents iframed outputs, interactive widgets Original tweet: https://x.com/KyleRayKelley/status/2041167795921285426
RT Han http://x.com/i/article/2040694045102788609 Original tweet: https://x.com/HanchungLee/status/2040696176383853003
RT dex if you care about coding agents and tasteful software def go watch this talk by @badlogicgames it’s very good https://youtu.be/Dli5slNaJu0?si=Dm6__OAg1dlBx_9u Original tweet: https://x.com/dexhorthy/status/2040068971102408946
Activity on hamelsmu/hamel
hamelsmu closed a pull request in hamel
Activity on hamelsmu/hamel
hamelsmu opened an issue in hamel
Yup
Everything is dead. I'm sick of it. Here's our answer: https://www.rip-grep.com/
There are lots of other categories too. Great work from @BEBischof & @adam__conway. Nice data viz. Amazing Meme Project backed by real data https://www.rip-grep.com/ of all the things that are "dead", ex: > RAG is extremely Dead. And even though it has died 12 times, this time is definitely for real. It’s probably good to avoid this category as an investor and instead focus on Anthropic secondaries. 🤣🤣🤣 Even calls out the top tweets
RT Bryan Bischof fka Dr. Donut Everything is dead. I'm sick of it. Here's our answer: https://www.rip-grep.com/ Original tweet: https://x.com/BEBischof/status/2039360923773632977
RT Tibo Our Codex dashboards are showing increased rate of users hitting rate limits and since we don't fully understand why I have made the cautious decision of resetting the usage limits for all plans. Enjoy. I also wanted to celebrate us finding a pocket of fraudulent accounts that we banned and have helped us regain some compute. The fight against abuse never stops, but it's important to mark the moment and make it a little shared victory. Original tweet: https://x.com/thsottiaux/status/2039248564967424483
This looks really cool
RT Scott Wu Devin Review caught the axios supply chain attack for multiple Cognition customers before the attack was publicly known. These attacks will be 10x more frequent in the age of AI; it is critical that repo maintainers start using AI for defense as well. (showing one example below where Devin Review caught the attack within an hour of its release - text minorly edited for anonymization) Original tweet: https://x.com/ScottWu46/status/2038865428693332094
This seems useful
I built a new plugin! You can now trigger Codex from Claude Code! Use the Codex plugin for Claude Code to delegate tasks to Codex or have Codex review your changes using your ChatGPT subscription. Start by installing the plugin: http://github.com/openai/codex-plugin-cc
RT Claude Computer use is now in Claude Code. Claude can open your apps, click through your UI, and test what it built, right from the CLI. Now in research preview on Pro and Max plans. Original tweet: https://x.com/claudeai/status/2038663014098899416
RT dex when people ask about custom tools vs. letting users bring MCPs, the answer is always "both". Custom tools take work and taste, MCPs give flexibility but will always lead to lower quality results 1) for high-volume tools (e.g. Read/Write/Edit in a coding agent) build these as first-class tools 2) for long tail stuff like 'fetch data from random saas', let users bring MCPs 3) LOOK AT YOUR F****** DATA (thanks @HamelHusain ) 4) The most popular MCPs, turn these into first-class tools in your system 5) repeat until AGI another dope episode with @vaibcode Original tweet: https://x.com/dexhorthy/status/2038648255358394576
RT Bryan Bischof fka Dr. Donut too real Original tweet: https://x.com/BEBischof/status/2038471833729876447
RT Boris Cherny I wanted to share a bunch of my favorite hidden and under-utilized features in Claude Code. I'll focus on the ones I use the most. Here goes. Original tweet: https://x.com/bcherny/status/2038454336355999749
RT Anthony Morris ツ This 6 year old used Claude Code desktop to build a space game. Couldn't be more excited for the future. Original tweet: https://x.com/amorriscode/status/2038384070045151588
When Claude and Codex review each other's work I get very consistent comments from each: > Claude: this is over-engineering. > Codex: this is sloppy. 🤣 I've looked at each, and they are both right ~50% of the time.
RT Charles 🎉 Frye still hiring! http://modal.jobs. Original tweet: https://x.com/charles_irl/status/2037645043981574271
RT Matt Stockton This is really fantastic. I agree with so many of these points made. "Classical" Machine Learning skills are incredibly valuable right now, and they will become even more valuable as folks realize the things @HamelHusain is pointing out here (likely through battlescars acquired from off-the-rails AI products) I'm building a lot of agentic AI systems, and honestly feel like I have super-powers given my more classical MLE background (combined with knowledge of how to use the agent harnesses, etc.) If you are building agentic AI stuff, and don't have the background - that's fine, but you should spend some time learning things. This is a great post to start pointing you in some good directions. Original tweet: https://x.com/mstockton/status/2037573815543206220
RT Marc Hatton Neat artisanal writing from @HamelHusain Is data science dead? No... - Trace reading → EDA - LLM judge validation → Model Eval - Test set building → Experimental Design - Expert labeling → Data Collection - Prod monitoring → Production ML Original tweet: https://x.com/marchattonhere/status/2037433700841889995
Maybe token austerity will force people to make valuable things (ex: not Twitter reply bots)
RT Erik Bernhardsson Re @graceisford Every company in 2030 is a neocloud or neolab or neofirm Original tweet: https://x.com/bernhardsson/status/2037313572296917230
RT Gergely Orosz I explained in today’s @Pragmatic_Eng newsletter: That’s a repo where OpenAI is merging external contributions. Some are made by Claude Code, some with GitHub Copilot, some with Codex. Codex doesn’t add itself as a contributor - on purpose - that’s why. https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for Original tweet: https://x.com/GergelyOrosz/status/2037252214486393065
I was shocked to learn this is true https://github.com/openai/parameter-golf
RT James Cham I read this the slow, traditional way and it was very good! Original tweet: https://x.com/jamescham/status/2037249006901092852
I hand wrote this the slow way. Was a good feeling
RT Thiyagarajan Maruthavanan (Rajan) Every data scientist will rebrand as a harness engineer within 18 months. Original tweet: https://x.com/mtrajan/status/2037214298402152795
RT Bryan Bischof fka Dr. Donut We need you! (To bring your intuition and problem framing to an industry overrun by influencers and trend followers) Original tweet: https://x.com/BEBischof/status/2037186501977776321
http://x.com/i/article/2037041238030114819
RT Bryan Bischof fka Dr. Donut if all press is good press and no news is good news then - f(x)=f(−x) and - f(0)=f(r) ∀ r≥0, so the only invariant is constant and ℝ quotients to a point – everything is good. Original tweet: https://x.com/BEBischof/status/2036917037851959445
RT Bryan Bischof fka Dr. Donut Hamel was the first talk of the day in my track with a Talk title that we’ve been throwing around for over a year. Original tweet: https://x.com/BEBischof/status/2036594140352487796
@HamelHusain brought the memes and eval hats to PyAI Conf 🐍 In his talk, he walks through five common eval mistakes: generic metrics, unverified judges, poor experimental design, bad data and labels, and automating too much. And his fix to avoid these pitfalls? Let's just
It was fun giving this meme-packed presentation titled “The Revenge of The Data Scientist”💪
RT Lenny Rachitsky Engineering job openings are at the highest levels we’ve seen in over 3 years There are over 67,000 (!!!) eng openings at tech companies globally right now, with 26,000 just in the U.S. We don’t know if there would have been more open roles if not for AI or if AI is actually leading to more open roles, but since the start of this year, the increase in open eng roles is accelerating even more. Original tweet: https://x.com/lennysan/status/2036535460726767793
STATE OF THE PRODUCT JOB MARKET IN EARLY 2026 In spite of the headlines about layoffs and AI taking jobs, we’re actually seeing a lot of promising signs in tech hiring, and some interesting new trends: 1. PM openings are at the highest levels we’ve seen in over three years 2. AI
Someone sent me a coding challenge that requires you to build an LLM judge that produces a 1-5 score on correctness, clarity, neutrality - together!
re: LiteLLM exploit - if you like to re-write all your software from scratch "NIH" today is your redemption
RT Simon Willison Thankfully the LiteLLM package has now been marked as "quarantined" on PyPI so attempting to install the compromised update via pip et al shouldn't work Original tweet: https://x.com/simonw/status/2036451896970584167
LiteLLM HAS BEEN COMPROMISED, DO NOT UPDATE. We just discovered that LiteLLM PyPI release 1.82.8 has been compromised: it contains litellm_init.pth with base64-encoded instructions to send all the credentials it can find to a remote server and self-replicate. link below
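For context on why a `.pth` file matters here: Python's standard `site` module executes any line in a `.pth` file that starts with `import` followed by a space or tab, every time the interpreter starts. A rough sketch of a scan for such lines is below. The function name `find_executable_pth_lines` is made up for this example, and this is a generic check rather than a detector for this specific exploit. Some legitimate packages also use the same mechanism, so hits need manual review.

```python
from pathlib import Path


def find_executable_pth_lines(directory):
    """Return (file name, line) pairs for .pth lines that Python would execute.

    The standard `site` module executes any .pth line that starts with
    "import" followed by a space or tab; other lines are treated as paths.
    """
    hits = []
    for pth in sorted(Path(directory).glob("*.pth")):
        text = pth.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            if line.startswith(("import ", "import\t")):
                hits.append((pth.name, line))
    return hits
```

Calling it on each directory returned by `site.getsitepackages()` lists every line that runs at interpreter startup in the current environment.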
RT Doug Turnbull Cheat at Search Essentials, coming back :) Free "Retrieval 101" course. Don't know what BM25 is? Or embedding based retrieval? Or how to spell NDCG? This is the class for you :) Three part series, links below Original tweet: https://x.com/softwaredoug/status/2036235386851074183
re: Software without APIs is going to die. I am already using the Claude Chrome extension to interact with internal APIs of web applications to do things through agents. Claude is really good at reverse engineering internal APIs (b/c it has access to the dev console) and performing tasks programmatically. And ofc I just document this in a skill
I have been using this - it's a very nice addition. Still not quite as nice of a UX as Claw, mainly because it doesn't chat with you in the foreground. For example, if you ask it to do something, it often doesn't say anything back until it's done doing that thing (so you wonder if it has died or not), which is not ideal, especially for larger coding tasks. The long tail of UX really matters I think
We just released Claude Code channels, which allows you to control your Claude Code session through select MCPs, starting with Telegram and Discord. Use this to message Claude Code directly from your phone.
RT Shreya Shankar Fun article on plugging together auto research-style search loops with qualitative coding-style evaluators. I am very optimistic about this approach on non-verifiable (ie subjective) tasks Original tweet: https://x.com/sh_reya/status/2035407816488550881
RT George from 🕹prodmgmt.world http://x.com/i/article/2034580623201824768 Original tweet: https://x.com/nurijanian/status/2035257434365976671
RT Randy Olson We've opened up the Tufte Test as a free, limited-use public API endpoint. Send any chart URL, get back a pass/fail verdict and specific feedback on what failed. No account required. Full details and a copyable example at the bottom of the article: https://www.goodeyelabs.com/insights/the-tufte-test Original tweet: https://x.com/randal_olson/status/2035138282280165830
This week, I encoded Edward Tufte's data visualization principles into an API. Then I let an AI agent try to pass it. I gave @ManusAI a CSV of women's bachelor's degree percentages across STEM fields (1970-2011) and one prompt: visualize this data. It produced a standard chart.
RT Randy Olson This week, I encoded Edward Tufte's data visualization principles into an API. Then I let an AI agent try to pass it. I gave @ManusAI a CSV of women's bachelor's degree percentages across STEM fields (1970-2011) and one prompt: visualize this data. It produced a standard chart. Correct data, readable axes, nothing wrong. But a legend box instead of direct labels. No annotations calling out the rise and fall of women in Computer Science. Default colors. This is what every AI agent produces right now. So I pointed it at the Tufte Test, a quality standard I built in Truesight that checks charts against seven of Tufte's core principles. The API came back: fail on direct labeling and integrated annotations. Five other criteria passed. A quality standard gives an agent something a vague prompt never can: a precise list of exactly what to fix. Manus revised on its own. Legend box became direct endpoint labels. A subtitle surfaced the key insight. An annotation marked the Computer Science peak at 37.1% in 1983. Two prompts total from me. Everything else was autonomous. Any AI agent that can call an API could do this. What matters is the pattern: encode expert judgment once, deploy it as an API, and every AI agent in your stack builds against it. Your taste becomes infrastructure at scale instead of manual review. The Tufte Test is available as a template in Truesight if you want to try it on your own charts. Full writeup + demo video: https://www.goodeyelabs.com/insights/the-tufte-test Original tweet: https://x.com/randal_olson/status/2034978267397313021
RT Bryan Bischof fka Dr. Donut Ai influencers be like Original tweet: https://x.com/BEBischof/status/2034825827016425807
It really is like being an ML engineer - how much compute to spend? - when outputs are stochastic, how to measure & test? - how do I run experiments? - looking at data to form better hypotheses ML engineers are so back
an increasingly large part of the job of an engineer is deciding how much compute to spend on a problem
RT Bryan Bischof fka Dr. Donut Original tweet: https://x.com/BEBischof/status/2034708135022325921
Your parallel agents needed scalable test coverage yesterday Introducing Offload: a Rust CLI that spreads your test suite across 200+ @Modal sandboxes, freeing your CPU to keep your agents shipping. On our Playwright suite, it took a 12 min run to 2, at $0.08 a run
RT Simon Willison Thoughts on OpenAI acquiring Astral and uv/ruff/ty https://simonwillison.net/2026/Mar/19/openai-acquiring-astral/ Original tweet: https://x.com/simonw/status/2034672725088997879
They pulled off the impossible: made a real human video look 100% AI generated 🤣
RT Felix Rieseberg By popular demand, Dispatch can now launch Claude Code sessions. Ask it to build, make, or improve something! To use it, update your Claude desktop app and make sure you have Code enabled. Original tweet: https://x.com/felixrieseberg/status/2034381385134399913
The highest-leverage thing you can do to de-slopify AI writing is to delete at least half of it. Seriously: for any email, post, etc., try to delete 50%
Excited to try this
Cursor, Codex and Claude Code are all single-player. Your whole team builds alone and no one knows what anyone else decided. But building product is a team sport. AI should be too. The conversations, decisions, specs and builds. All of it, together, with your whole team.
TIL Google Colab has an MCP https://developers.googleblog.com/announcing-the-colab-mcp-server-connect-any-ai-agent-to-google-colab/ (came out yesterday but for some reason I missed it)