Bringing data science back to AI - https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa
RT Doug Turnbull Re @hugobowne @HamelHusain People should write blog articles instead of skills
Preview of some slides for this
Doing this on Thursday. Yes I’ll show you my skills, but more importantly I’ll talk about how you should be VERY skeptical when adopting them (including my skills!) https://luma.com/ltpzpqgw
RT ben hylak introducing howtoeval dot com. the no-bullshit guide to eval'ing AI agents. from personal experience, and from working with the best companies in the world. there's even a quiz. link below.
RT ben hylak this saturday, @OpenAI is throwing an autoresearch hackathon with @raindrop_ai and @modal come learn how to build systems (agents, models, etc) that improve themselves link below!
RT Ryan Lopopolo One thing to always keep in the back of one’s mind when using AI: Every practitioner (of literally anything) has better evals floating around in their heads than the labs have. Everyone has at least 2 or 3 truly hard things they've done
If a tweet or cold outreach is written entirely with AI, I ignore it. If you can’t be bothered to write it, I can’t be bothered to read it.
The experiments conducted in this post illustrate how early we are as an industry on eval tooling. Some takeaways and related thoughts: 1. Naively applying automation (which many current frameworks do) is likely to fail. 2. It's easy to get fooled that automation (esp overzealous automation) is giving you valuable insights. Stay skeptical at all times! 3. We have to design eval workflows so human-in-the-loop accelerates effort while helping you externalize what "good looks like" 4. Qualitative analysis hasn't sufficiently made its way into eval tooling as much as it should. There are opportunities to design better automation here. (QA is super underrated for evals btw)
i'm restarting my blog! i want to kickstart productive conversations around: what should AI agents look like for hard, subjective knowledge work? a lot of agent setups work well when tasks are objective and easy to verify. but many workflows (e.g., qualitative analysis,
RT Shreya Shankar i'm restarting my blog! i want to kickstart productive conversations around: what should AI agents look like for hard, subjective knowledge work? a lot of agent setups work well when tasks are objective and easy to verify. but many workflows (e.g., qualitative analysis, strategy, sensemaking) are messy and interpretive. as a first post, i explore different ways of doing agent-assisted qualitative analysis on tweets, with varying levels of human feedback/intervention. tldr: they all kinda sucked. turns out it’s hard to: (a) stop agents from converging too quickly on shallow interpretations (b) get agents to adapt to preferences that emerge gradually across many turns (i.e., evolving context) (c) capture human judgment without making humans fatigued
I always have been impressed by modal’s tech. Highly recommend this talk https://youtu.be/3jJ1GhGkLY0?si=hArGUK4I4lGDEPDV The product is a magical experience, they went deep on the tech to remove the pain from infrastructure. It’s ***fast*** and has great DX especially for AI workloads
Today we're announcing our Series C funding: $355M at a $4.65B valuation, led by some great investors @generalcatalyst and @Redpoint. We've had insane growth in the last year, but we're still very early. So proud of the team and what we have built so far!
RT Akshat Bubna Raising $ is cool. What’s even cooler is getting to work every day with this incredible group of humans. We like solving hard problems and building things we can be proud of. If this is you, come join us! We’re just getting started :)
RT Erik Bernhardsson Today we're announcing our Series C funding: $355M at a $4.65B valuation, led by some great investors @generalcatalyst and @Redpoint. We've had insane growth in the last year, but we're still very early. So proud of the team and what we have built so far!
RT Ben Clavié Information Retrieval is about making knowledge accessible. Late Interaction is the best way to do that today. But now that we have a new kind of users, it's time to zoom out so we can plan the future of retrieval. I gave a talk about this at @ir_tsukuba https://docs.google.com/presentation/d/1GmvpRgre2zamJ5zKxxhtj-eKL4j6VqfP3wO2_o210Z0/edit?slide=id.p#slide=id.p
Periodic reminder to look at the data
I prefer html to notebooks now for many tasks including most data analysis
RT Iheanyi Ekechukwu Excuse me? I beg your pardon?
We are investigating unauthorized access to GitHub’s internal repositories. While we currently have no evidence of impact to customer information stored outside of GitHub’s internal repositories (such as our customers’ enterprises, organizations, and repositories), we are closely
Bun Stainless Now Karpathy 🤣
RT Vincent D. Warmerdam We're doing an extra livestream this week with @HamelHusain Small hint: it'll be about notebooks. https://youtube.com/live/XaKvqdOJlZo
RT Farza 🇵🇰🇺🇸 Been working on a new UX for agents. It will understand your workflow as you use your computer and proactively nudge you when it can be helpful. I feel like agents are these super powerful beings trapped in terminals and chat interfaces. Now, just use your voice. Demo:
RT Shreya Shankar Looking forward to giving a talk at the annual LangChain conference! We will be discussing common mistakes that orgs make when trying to evaluate their agents, and how the secret solution is…to think like a data scientist 😎
It’s almost time for the final sessions of Interrupt! 🎤 How Bridgewater Associates built Pat, the AI pocket analyst tool 🎤 How @Coinbase scaled AI support with a multi-agent system 🎤 @sh_reya + @HamelHusain on the return of the Data Scientist
RT ben hylak we built the first sane way to debug your agent locally. you can see your traces. codex/claude code can too. this lets them write evals and test your agents automatically. best part: it's completely free and open source. install with 1 line. (github below)
RT Charles 🎉 Frye come thru, even tho they rejected my proposed title, "2 Charles 2 Inference"
Our Day 02 afternoon keynote "Beyond the API" is beyond stacked. New panelist just added... welcome back @charles_irl - Member of Technical Staff @Modal, joining: @BEBischof, Head of AI, @theoryvc Charles Zedlewski, CPO, @togethercompute @tuomars, Research Scientist, @nvidia
Really bullish on modal. If you haven’t used it, it’s buttery smooth devex that you didn’t know you needed until you try it. Everything is **fast**, they really took out most of the friction of compute
Inference isn't everything, but it does require a new stack -- not Kubernetes, not SLURM. At @modal, we dove deep to build that stack. In this blog post we explain how, from compute management & cloud-native caching to CRIU & GPU checkpointing. https://modal.com/blog/truly-serverless-gpus
RT Shreya Shankar I'm joining Carnegie Mellon's CS Department (and HCII by courtesy) as an assistant professor in Fall 2027! I'll be recruiting PhD students next cycle. If you're interested in AI systems or human-AI collaboration, list me in your application. Stay tuned for more about my new lab!
This is amazing devrel
With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.
RT Thinking Machines Re With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.
Activity on hamelsmu/stfu
hamelsmu closed a pull request in stfu
Claude chrome extension still has a key advantage over codex browser use AFAIK (I hope it reaches parity though!)
Just bought my first thing with an agent! (A Domain name w/cloudflare CLI)
RT Mike Rundle When I hit the Update button in Codex and then the updated version of Codex also has an Update button
> “I’m a little bit afraid that people will read this article and turn it into a /html skill or something.” 🤣 me too
RT Chris Tate agent-browser v0.27 Big day for agents and browsers → React introspection: react tree, react inspect, react renders, react suspense for component trees, props/hooks/state, render profiling, and Suspense analysis → Web Vitals: vitals command reports LCP, CLS, TTFB, FCP, INP + React hydration phases → SPA navigation: pushstate for client-side nav without full page loads → Init scripts: --init-script and --enable flags to register scripts before first navigation → Network route filtering by resource type → cURL cookie import (JSON, cURL, Cookie-header formats) → Dashboard works behind reverse proxies Thanks to @thoma33 @andrewqu @shaper for being part of this release! https://github.com/vercel-labs/agent-browser/releases/tag/v0.27.0
RT Omar Khattab I’ve never been this excited about search. 6-7 years ago, IR got an influx of the paradigms we still use, all enabled by the big headroom MS MARCO and then BEIR created. Then progress slowed. Today, Diane releases perhaps the most ambitious IR benchmark to date: OBLIQ-Bench. Queries in it are meant to be increasingly opaque to current first-stage retrieval paradigms. Oblique queries put the bottleneck very early in the search process, as the relevance of a document to the query is quite latent. I can't wait for core IR research on fundamentally more powerful paradigms for first-stage search to be reignited again. Stay tuned for more stories about this, and read Diane's thread and her paper below!!
We set out to build a better retriever, so we looked for the hardest IR benchmarks. For each, we asked how much headroom remained by running oracle reranking with a frontier LLM. Most had little room left! So we built OBLIQ-Bench to study much harder search queries than before.
RT ben hylak now, your agent can fix itself. introducing raindrop triage. an agent for finding and investigating agent issues.
Seems like the main use case for chatgpt voice is meme videos of people challenging the AI 🤣
RT Kyle Kelley Previously approved packages show up in listing when reviewing dependencies so you can see what's different since the last time you shared a notebook via @nteractio
This post is completely outdated FWIW. Devin is really good now. One of my favorite tools. What's good about it: 1. UX is really great and super polished 2. Love how it gives me a video and visual proof of work 3. It actually works now. Some of it is that the models are better, but the UX/harness is also better. I suspect this blog post would not be written today.
New post re: Devin (the AI SWE). We couldn't find many reviews of people using it for real tasks, so we went MKBHD mode and put Devin through its paces. We documented our findings here. Would love to know if others have had a different experience. https://www.answer.ai/posts/2025-01-08-devin.html
RT » teej Do this today. Do not wait. This change will save your ass. Every Python project should add this. Then do everything else in the thread. But do this today.
Add a 7-day dependency cooldown. uv's `exclude-newer` refuses any version published inside a rolling window. With 7 days set, today's malicious uploads would not be considered for resolution at all. Most malicious uploads are caught within that window.
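The cooldown idea above can be sketched in a few lines of Python. This is only an illustration of the rolling-window filter, not uv's resolver: `eligible_versions`, the release list, and the dates are all hypothetical, and the exact `exclude-newer` syntax should be taken from the uv documentation.

```python
from datetime import datetime, timedelta, timezone


def eligible_versions(releases, now, cooldown_days=7):
    """Return versions uploaded at or before the cooldown cutoff.

    `releases` is a list of (version, upload_time) pairs. Anything published
    inside the rolling window is skipped, mirroring the cooldown idea.
    """
    cutoff = now - timedelta(days=cooldown_days)
    return [version for version, uploaded_at in releases if uploaded_at <= cutoff]


# Hypothetical release history for a single package.
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
releases = [
    ("1.0.0", datetime(2025, 12, 1, tzinfo=timezone.utc)),
    ("1.0.1", datetime(2026, 1, 5, tzinfo=timezone.utc)),
    ("1.0.2", datetime(2026, 1, 14, tzinfo=timezone.utc)),  # inside the window
]

print(eligible_versions(releases, now))
```

With these made-up dates the newest upload is one day old, so it is never offered to the resolver; it becomes eligible only once it has aged past the cutoff.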
RT Chris Tate Dear GitHub, AI is changing the contribution graph. Issues are often the real contribution now. They define the problem, shape the solution and guide the PR. If a GitHub Issue leads to a merged PR, the issue author should get contributor credit. Signed, ctate
RT OpenAI Developers You can build interactive applications with gpt-realtime-1.5, so users can control app state more naturally with voice. Hi Chappy 👋
Kyle is working on some really great tools https://www.nteract.io/ for the AI enabled data scientist. Highly recommend checking it out
RT Kyle Kelley I don't want to overstretch myself on maintaining open source but there are ways to bring this workflow into Pi agents too Hit me up if you want it, either here or on https://github.com/nteract/desktop
RT Isaac Flath RLM is the most important foundation of my Pi Harness (other than Pi of course). It's seeded with late interaction retrieval results (thanks to @lightonai for pylate). The Agent initiates it with query then.. 𝐒𝐞𝐭𝐮𝐩 A python REPL is created and seeded with: 1. Late interaction search to pre-filter. Instead of doing top 3/5/10, it's top hundreds of documents. This is set into a `context` variable. 2. Python functions are loaded in to do more searches if `context` variable isn't enough. And to make llm calls with cheaper models in parallel batches. 𝐈𝐭𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐋𝐨𝐨𝐩 From there, an LLM iterates in the REPL based on the query. It's just like exploring in a jupyter notebook. The LLM writes prose (like a markdown cell) and code to be run in the REPL each turn. This allows the LLM to sort, filter, and synthesize information. It can fan out and ask smaller models to summarize, combine, contrast, or do anything else to documents to help it understand the data. After several turns the LLM responds with the final answer. Either because it found the answer, or hit the budget limit. Context as a Python variable, LLM as the programmer, REPL as the runtime. 𝐖𝐡𝐲 𝐃𝐨𝐞𝐬 𝐓𝐡𝐢𝐬 𝐖𝐨𝐫𝐤 1. Richer Shell. Agents (and subagents) work by intermixing code and prose/thinking. But they use static scripts or bash that run and exit and start over each tool call. That's not ideal for exploration and synthesis of data. For that, state is useful to continue building and exploring the data as you learn more. There's a reason jupyter notebooks have been popular with data scientists. 2. Keeps main agent context clean. The better context you have the better the agent will perform (duh!). This means three things: better human input, less missing search results, and less incorrect search results. Letting the agent iterate allows it to synthesize just what is needed and nothing else. All bad paths or peeks at something that turns out to be i...
Let me get this straight. Claude subscription: restrictions (or pay extra) on third party usage, pay extra for fast. Codex: no restrictions, fast mode within subscription. It’s hard not to use Codex. Just the fast mode being within the subscription is enough to make the gap large
I tried Claude Routines and managed agents, and I found it to be far better to use GitHub Actions. I have more control with GHA, secret management, etc. Claude is really good at writing all the YAML and iterating until it works on its own too. Wild times that I'm saying I like GitHub Actions LOL
RT Tibo Stop tweeting for a hot minute and update your Codex App to find full browser use, global dictation, non-dev mode, a new auto-review mode that is much safer than yolo, in-app docs and PDF viewer, and ... GPT-5.5.
With GPT-5.5, Codex now gets more of the job done across the browser, files, docs, and your computer. We've expanded browser use so Codex can interact with web apps, and test flows, click through pages, capture screenshots, and iterate on what it sees until it completes the
RT Bryan Bischof fka Dr. Donut This year at AI Council I am bringing you a story about optimization. Pete challenged me to curate a track for inference systems because so much of modern AI doesn't talk about the work in the data center. I said: 'I want to understand the price of intelligence and who's making it cheaper.' So I invited people who could teach me all the sneaky ways to make intelligence fast and affordable. In this track you're going to learn from the literal top people in the field what the hell is happening behind that API endpoint. LFG
Training gets the headlines. Inference gets the bill. As agents move from novelty to default workload, the hard problem isn't the model anymore. It's every millisecond and every watt between a prompt and the next token. A coding agent running for six hours straight is a very
RT Isaac Flath I tried the new GPT Images model. I tried it on 9 thumbnail concepts. All ones I came up with. For each concept I made 4 images. 2 with openai, 2 with gemini. I liked Gemini more on 6 concepts. I liked the new ChatGPT Images model on 3. I think Gemini is better at diagram and concept style. The openai model is better at more realistic looking things (like people, or product things). New openai model does pull toward hype, even more than gemini. But it’s manageable with prompts. I tend to prefer diagram or concept style over realistic style, so I think that makes me lean Gemini. But I think given a concept, I could easily predict which model would be better at it and pick the right one. There are a few caveats. If you want faces incorporated in automatically, openai model will win almost always because it’s so much better at faces. I prefer steering to a particular idea/concept, then overlay a real image as needed still though. But, I predict most people will like the new OpenAI model more. I think my taste in this is not the norm. Openai does a better job making social cards that look like “good” social cards.
RT Kun Chen lol remember this org chart meme? I just created a full simulation for all of them with agents, and the results blew my mind! the simulation asked each organization to build and ship a web spreadsheet want to take a guess who built the best product? reveal in thread below!
RT Lisan al Gaib OpenAI just released a new open-source model it's "a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text" https://github.com/openai/privacy-filter https://huggingface.co/openai/privacy-filter
My theory of people gaining traction with slop posts is you are trading a smart audience for a dumber audience
Generic eval metrics are usually a waste of time. Please don't use them (ex: "Helpfulness score")
RT Isaac Flath Oh wow, Marimo has been working on their Agentic/AI capabilities. I didn't realize how much they've done here with live kernel/agent integration. Worth a watch. I'm excited to try this. https://www.youtube.com/watch?v=6uaqtchDnoc
RT Tibo Happy Tuesday. Codex has hit 4M active users, adding over 1M users in less than two weeks. To celebrate we will reset the rate limits again in a few hours. Enjoy!
RT Tibo We are releasing a *research preview* of Chronicle in Codex. It allows codex to build up memories based on your day to day work on your computer and then refer to these memories to be a lot more helpful. Available for PRO subscriptions and on Mac to start. This is early and consumes quite a bit of tokens, but it has changed how I and many folks at OpenAI use Codex.
Last week, we released a preview of memories in Codex. Today, we’re expanding the experiment with Chronicle, which improves memories using recent screen context. Now, Codex can help with what you’ve been working on without you restating context.
I have very few notifications turned on but this guy's tweets are one of them; it's a constant stream of the most useful tools
Terminal automation + e2e testing solved Now as simple as snapshot, click, type: – wterm renders terminal-in-html, every cell in the a11y tree – agent-browser automates pages via the a11y tree Here's opencode in one browser driving Claude Code in another
RT Chris Tate Terminal automation + e2e testing solved Now as simple as snapshot, click, type: – wterm renders terminal-in-html, every cell in the a11y tree – agent-browser automates pages via the a11y tree Here's opencode in one browser driving Claude Code in another
The funny part is I know people who actually talk like this 🤣
My sister's birthday is coming up, she asked for a new laptop for school. I asked her “what are you building?” My mom is approaching retirement; I told her “an AI won’t replace you, a person using AI will replace you” My grandma is struggling to use her iphone lately; switching
Lots of people asking what’s so good about the new codex desktop computer use. Here’s 5 things that come to mind 1. operate Mac Apps without a great API: Slack, Google Sheets, Notes, iMessage without installing separate plugins. It instantly transforms all your apps into tools 2. If you need to operate your browser more visually it works really smoothly and fast (good for sites that are still human centric) 3. It uses its own cursor, keyboard etc so you can keep working. 4. Once you do any task once you can simply ask Codex to reflect on what it did and how it would accomplish the task next time with the benefit of hindsight and create a skill AND schedule an automation. It’s really nice that codex can just schedule and edit automations when asked! it’s very Claw like in this way. This last point is not computer use specific but is powerful when combined with computer use 5. The UI polish is insane: you get nice icons for any application you want to tag into computer use plus all the other built in new stuff like built in file viewer and browser so there is no context switching. So you can iterate really fast and not lose focus. Because of the polish it also feels nice and delightful to use.
Seriously stop everything you are doing and use codex desktop app new computer use. Absolutely mind blowing
I had codex make a song with GarageBand and reply to this thread (because I needed to walk away from my computer) This feels like AGI
@gazorp5 I am Codex. Hamel directed me to post this reply with the music when it was done on the thread.
Being able to @ any application on your computer is 🤯 (re: codex desktop computer use)
Codex Computer use is FAST and good - how did they do this?
Gemini Everyday Error: 503 UNAVAILABLE. {'error': {'code': 503, 'message': 'This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later.', 'status': 'UNAVAILABLE'}}
Run the following command and you can see some of what Codex is cooking. TIL they have remote_control too! > codex feature list P.S. it's worth reading the manual
RT Shreya Shankar Looking forward to CHI this week! We have a ✨Best Paper ✨ on a "what-if" analysis tool for RAG. Reach out to chat! I'm interested in: MLOps/LLMOps, data analysis, and better interfaces for human-AI collaboration (and, very soon i'll be recruiting students/postdocs) Original tweet: https://x.com/sh_reya/status/2043436179643830517
RT Omar Khattab As promised, here's a recording of my 30-min keynote and the subsequent Q&A for the inaugural late interaction retrieval (LIR) workshop, cc @bclavie @antoine_chaffin. The talk is admittedly advanced, as it's directed at an expert IR community. But hopefully still broadly useful! Original tweet: https://x.com/lateinteraction/status/2043053506504925588
Lots of people interested in the late Interaction workshop, listening to @lateinteraction's keynote!
RT Anthony Morris ツ btw you can ssh into your Mac mini from Claude code desktop now Original tweet: https://x.com/amorriscode/status/2042733568410161326
@benvargas My PR is super stale and I've been working on higher priority stuff. Let me try to get this out by Friday.
People really need to be reading the prompts underneath off the shelf evals so they can learn how pointless they are. This is "faithfulness score" from RAGAS Read the Context, why the fuck is the statement factually consistent? If you use this in your "harness" good luck
RT Ben Vargas Not sure when this shipped, but just checked and ssh to mac is supported! Thanks @amorriscode Original tweet: https://x.com/benvargas/status/2042675707625771246
RT Bryan Bischof fka Dr. Donut Sorry I couldn't quite hear you over ALPHA ZONE. But seriously check out my podcast it's called In Practice and it's weird and technical about real AI applications https://www.youtube.com/@theoryvc Original tweet: https://x.com/BEBischof/status/2042379114561282103
the current state of production design in tech is laptop, table, books, wall
RT Thariq you'll need to explicitly prompt Claude Code to use it, but the Monitor Tool is super powerful e.g. "start my dev server and use the MonitorTool to observe for errors" Original tweet: https://x.com/trq212/status/2042335178388103559
Thrilled to announce the Monitor tool which lets Claude create background scripts that wake the agent up when needed. Big token saver and great way to move away from polling in the agent loop Claude can now: * Follow logs for errors * Poll PRs via script * and more!
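The "wake the agent only when needed" pattern described above can be sketched generically. This is not the Monitor tool's API, which isn't shown here; `monitor`, the log lines, and the `wake` callback are hypothetical names for a plain log-following filter.

```python
import re


def monitor(lines, pattern, wake):
    """Scan a stream of log lines and call `wake` only for lines that match.

    The caller (for example an agent loop) stays idle until `wake` fires,
    instead of re-reading the whole log on every turn.
    """
    matcher = re.compile(pattern)
    for line in lines:
        if matcher.search(line):
            wake(line)


# Hypothetical dev-server output; in practice `lines` could be a file or pipe.
log = [
    "GET / 200",
    "ERROR: db timeout",
    "GET /health 200",
    "Traceback (most recent call last):",
]

events = []
monitor(log, r"ERROR|Traceback", events.append)
print(events)
```

Only the two matching lines reach the callback, which is where the token savings would come from: the quiet lines never enter the agent's context.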
RT Harrison Chase 🎙️Introducing Max Agency Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production. Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way. A few takeaways from our conversation: - Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big - Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding - You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations Watch the full episode on: - Youtube: https://www.youtube.com/watch?v=Xyh1EqcjGME - Apple Podcasts: https://podcasts.apple.com/us/podcast/how-hex-builds-ai-agents-making-agents-reason-like/id1891551672?i=1000760489140 - Spotify: https://open.spotify.com/episode/1BJlg3SOJrjnaPXFHTNuux?si=bffc89cb4f774617 Original tweet: https://x.com/hwchase17/status/2042279493050740916
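One way to picture the small "traps with repetitions" eval set from the first takeaway is the sketch below. It is an assumption-laden illustration, not Hex's harness: `run_traps`, `toy_agent`, the two traps, and the string checks are all invented, and a real agent would be nondeterministic, which is why each trap is repeated.

```python
def run_traps(agent, traps, repetitions=5):
    """Run each handcrafted trap several times and return its pass rate."""
    results = {}
    for name, (prompt, check) in traps.items():
        passes = sum(1 for _ in range(repetitions) if check(agent(prompt)))
        results[name] = passes / repetitions
    return results


def toy_agent(prompt):
    """Hypothetical agent: handles the NULL trap, misses the timezone one."""
    if "NULL" in prompt:
        return "Exclude NULL rows before averaging."
    return "Use the raw timestamps as-is."


# Each trap pairs a prompt with a check that encodes what "good" looks like.
traps = {
    "null_revenue": (
        "Average revenue; some rows are NULL.",
        lambda out: "NULL" in out,
    ),
    "timezone_mix": (
        "Daily signups; timestamps mix UTC and local time.",
        lambda out: "UTC" in out,
    ),
}

rates = run_traps(toy_agent, traps, repetitions=3)
print(rates)
```

Keeping the set this small means every failing name in `rates` is something a person can open and explain, which is the stated reason for preferring a few dozen traps over hundreds of variants.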
RT Cursor You can now run Cursor on any machine and control it from anywhere. Kick off agents from your phone to run on your devbox. Original tweet: https://x.com/cursor_ai/status/2041912812637966552