Hamel Husain

𝕏x•1 day ago

Retweeted from @Doug

RT Doug Turnbull Re @hugobowne @HamelHusain People should write blog articles instead of skills

𝕏x•3 days ago

Preview of some slides for this

@Hamel Husain

Doing this on Thursday. Yes, I’ll show you my skills, but more importantly I’ll talk about how you should be VERY skeptical when adopting them (including my skills!) https://luma.com/ltpzpqgw

𝕏x•3 days ago

Retweeted from @ben

RT ben hylak introducing howtoeval dot com. the no-bullshit guide to eval'ing AI agents. from personal experience, and from working with the best companies in the world. there's even a quiz. link below.

𝕏x•3 days ago

Doing this on Thursday. Yes, I’ll show you my skills, but more importantly I’ll talk about how you should be VERY skeptical when adopting them (including my skills!) https://luma.com/ltpzpqgw

𝕏x•4 days ago

Retweeted from @ben

RT ben hylak this saturday, @OpenAI is throwing an autoresearch hackathon with @raindrop_ai and @modal come learn how to build systems (agents, models, etc) that improve themselves link below!

𝕏x•4 days ago

Retweeted from @Ryan

RT Ryan Lopopolo One thing to always keep in the back of one’s mind when using AI: Every practitioner (of literally anything) has better evals floating around in their heads than the labs have. Everyone has at least 2 or 3 truly hard things they've done

𝕏x•4 days ago

If a tweet or cold outreach is written entirely with AI, I ignore it. If you can’t be bothered to write it, I can’t be bothered to read it.

𝕏x•8 days ago

The experiments conducted in this post illustrate how early we are as an industry on eval tooling. Some takeaways and related thoughts:
1. Naively applying automation (which many current frameworks do) is likely to fail.
2. It's easy to get fooled into thinking that automation (esp. overzealous automation) is giving you valuable insights. Stay skeptical at all times!
3. We have to design eval workflows so that human-in-the-loop accelerates effort while helping you externalize what "good looks like".
4. Qualitative analysis hasn't made its way into eval tooling as much as it should. There are opportunities to design better automation here. (QA is super underrated for evals btw)

@Shreya Shankar

i'm restarting my blog! i want to kickstart productive conversations around: what should AI agents look like for hard, subjective knowledge work? a lot of agent setups work well when tasks are objective and easy to verify. but many workflows (e.g., qualitative analysis,

𝕏x•8 days ago

Retweeted from @Shreya

RT Shreya Shankar i'm restarting my blog! i want to kickstart productive conversations around: what should AI agents look like for hard, subjective knowledge work? a lot of agent setups work well when tasks are objective and easy to verify. but many workflows (e.g., qualitative analysis, strategy, sensemaking) are messy and interpretive. as a first post, i explore different ways of doing agent-assisted qualitative analysis on tweets, with varying levels of human feedback/intervention. tldr: they all kinda sucked. turns out it’s hard to: (a) stop agents from converging too quickly on shallow interpretations (b) get agents to adapt to preferences that emerge gradually across many turns (i.e., evolving context) (c) capture human judgment without making humans fatigued

𝕏x•9 days ago

I have always been impressed by modal’s tech. Highly recommend this talk: https://youtu.be/3jJ1GhGkLY0?si=hArGUK4I4lGDEPDV The product is a magical experience; they went deep on the tech to remove the pain from infrastructure. It’s ***fast*** and has great DX, especially for AI workloads.

@Erik Bernhardsson

Today we're announcing our Series C funding: $355M at a $4.65B valuation, led by some great investors @generalcatalyst and @Redpoint. We've had insane growth in the last year, but we're still very early. So proud of the team and what we have built so far!

𝕏x•9 days ago

Retweeted from @Akshat

RT Akshat Bubna Raising $ is cool. What’s even cooler is getting to work every day with this incredible group of humans. We like solving hard problems and building things we can be proud of. If this is you, come join us! We’re just getting started :)

@Modal

http://x.com/i/article/2057237807244996608

𝕏x•9 days ago

Retweeted from @Erik

RT Erik Bernhardsson Today we're announcing our Series C funding: $355M at a $4.65B valuation, led by some great investors @generalcatalyst and @Redpoint. We've had insane growth in the last year, but we're still very early. So proud of the team and what we have built so far!

@Modal

http://x.com/i/article/2057237807244996608

𝕏x•9 days ago

Retweeted from @Ben

RT Ben Clavié Information Retrieval is about making knowledge accessible. Late Interaction is the best way to do that today. But now that we have a new kind of users, it's time to zoom out so we can plan the future of retrieval. I gave a talk about this at @ir_tsukuba https://docs.google.com/presentation/d/1GmvpRgre2zamJ5zKxxhtj-eKL4j6VqfP3wO2_o210Z0/edit?slide=id.p#slide=id.p

𝕏x•10 days ago

Periodic reminder to look at the data

𝕏x•11 days ago

I prefer html to notebooks now for many tasks including most data analysis

𝕏x•11 days ago

Retweeted from @Iheanyi

RT Iheanyi Ekechukwu Excuse me? I beg your pardon?

@GitHub

We are investigating unauthorized access to GitHub’s internal repositories. While we currently have no evidence of impact to customer information stored outside of GitHub’s internal repositories (such as our customers’ enterprises, organizations, and repositories), we are closely

𝕏x•11 days ago

Bun Stainless Now Karpathy 🤣

𝕏x•11 days ago

Retweeted from @Vincent

RT Vincent D. Warmerdam We're doing an extra livestream this week with @HamelHusain Small hint: it'll be about notebooks. https://youtube.com/live/XaKvqdOJlZo

𝕏x•14 days ago

Retweeted from @Farza

RT Farza 🇵🇰🇺🇸 Been working on a new UX for agents. It will understand your workflow as you use your computer and proactively nudge you when it can be helpful. I feel like agents are these super powerful beings trapped in terminals and chat interfaces. Now, just use your voice. Demo:

𝕏x•16 days ago

Retweeted from @Shreya

RT Shreya Shankar Looking forward to giving a talk at the annual LangChain conference! We will be discussing common mistakes that orgs make when trying to evaluate their agents, and how the secret solution is…to think like a data scientist 😎

@LangChain

It’s almost time for the final sessions of Interrupt! 🎤 How Bridgewater Associates built Pat, the AI pocket analyst tool 🎤 How @Coinbase scaled AI support with a multi-agent system 🎤 @sh_reya + @HamelHusain on the return of the Data Scientist

𝕏x•16 days ago

Retweeted from @ben

RT ben hylak we built the first sane way to debug your agent locally. you can see your traces. codex/claude code can too. this lets them write evals and test your agents automatically. best part: it's completely free and open source. install with 1 line. (github below)

𝕏x•17 days ago

Retweeted from @Charles

RT Charles 🎉 Frye come thru, even tho they rejected my proposed title, "2 Charles 2 Inference"

@AI Council // HAPPENING NOW

Our Day 02 afternoon keynote "Beyond the API" is beyond stacked. New panelist just added... welcome back @charles_irl - Member of Technical Staff @Modal, joining: @BEBischof, Head of AI, @theoryvc Charles Zedlewski, CPO, @togethercompute @tuomars, Research Scientist, @nvidia

𝕏x•18 days ago

Really bullish on modal. If you haven’t used it, it’s buttery smooth devex that you didn’t know you needed until you try it. Everything is **fast**, they really took out most of the friction of compute

@Charles 🎉 Frye

Inference isn't everything, but it does require a new stack -- not Kubernetes, not SLURM. At @modal, we dove deep to build that stack. In this blog post we explain how, from compute management & cloud-native caching to CRIU & GPU checkpointing. https://modal.com/blog/truly-serverless-gpus

𝕏x•18 days ago

Retweeted from @Shreya

RT Shreya Shankar I'm joining Carnegie Mellon's CS Department (and HCII by courtesy) as an assistant professor in Fall 2027! I'll be recruiting PhD students next cycle. If you're interested in AI systems or human-AI collaboration, list me in your application. Stay tuned for more about my new lab!

𝕏x•19 days ago

This is amazing devrel

@Thinking Machines

With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.

𝕏x•19 days ago

Retweeted from @Thinking

RT Thinking Machines Re With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.

𝕏x•21 days ago

Just bought my first thing with an agent! (A Domain name w/cloudflare CLI)

𝕏x•22 days ago

Retweeted from @Mike

RT Mike Rundle When I hit the Update button in Codex and then the updated version of Codex also has an Update button

𝕏x•22 days ago

> “I’m a little bit afraid that people will read this article and turn it into a /html skill or something.” 🤣 me too

@Thariq

http://x.com/i/article/2052796100608974848

𝕏x•22 days ago

Really excited about nteract https://www.nteract.io/blog/nteract-2.0

𝕏x•23 days ago

Retweeted from @Chris

RT Chris Tate agent-browser v0.27
Big day for agents and browsers
→ React introspection: react tree, react inspect, react renders, react suspense for component trees, props/hooks/state, render profiling, and Suspense analysis
→ Web Vitals: vitals command reports LCP, CLS, TTFB, FCP, INP + React hydration phases
→ SPA navigation: pushstate for client-side nav without full page loads
→ Init scripts: --init-script and --enable flags to register scripts before first navigation
→ Network route filtering by resource type
→ cURL cookie import (JSON, cURL, Cookie-header formats)
→ Dashboard works behind reverse proxies
Thanks to @thoma33 @andrewqu @shaper for being part of this release! https://github.com/vercel-labs/agent-browser/releases/tag/v0.27.0

𝕏x•24 days ago

Retweeted from @Omar

RT Omar Khattab I’ve never been this excited about search. 6-7 years ago, IR got an influx of the paradigms we still use, all enabled by the big headroom MS MARCO and then BEIR created. Then progress slowed. Today, Diane releases perhaps the most ambitious IR benchmark to date: OBLIQ-Bench. Queries in it are meant to be increasingly opaque to current first-stage retrieval paradigms. Oblique queries put the bottleneck very early in the search process, as the relevance of a document to the query is quite latent. I can't wait for core IR research on fundamentally more powerful paradigms for first-stage search to be reignited again. Stay tuned for more stories about this, and read Diane's thread and her paper below!!

@Diane

We set out to build a better retriever, so we looked for the hardest IR benchmarks. For each, we asked how much headroom remained by running oracle reranking with a frontier LLM. Most had little room left! So we built OBLIQ-Bench to study much harder search queries than before.

𝕏x•25 days ago

Retweeted from @ben

RT ben hylak now, your agent can fix itself. introducing raindrop triage. an agent for finding and investigating agent issues.

𝕏x•28 days ago

Seems like the main use case for chatgpt voice is meme videos of people challenging the AI 🤣

𝕏x•29 days ago

Retweeted from @Kyle

RT Kyle Kelley Previously approved packages show up in listing when reviewing dependencies so you can see what's different since the last time you shared a notebook via @nteractio

𝕏x•29 days ago

This post is completely outdated FWIW. Devin is really good now. One of my favorite tools. What's good about it:
1. UX is really great and super polished
2. Love how it gives me a video and visual proof of work
3. It actually works now
Some of it is that models are better, but the UX/harness is also better. I suspect this blog post would not be written today.

@Hamel Husain

New post re: Devin (the AI SWE). We couldn't find many reviews of people using it for real tasks, so we went MKBHD mode and put Devin through its paces. We documented our findings here. Would love to know if others have had a different experience. https://www.answer.ai/posts/2025-01-08-devin.html

𝕏x•29 days ago

Retweeted from @Peter

RT Peter Steinberger 🦞 The new /goal feature in codex slaps.

𝕏x•29 days ago

Retweeted from @Adam

RT Adam Conway http://x.com/i/article/2049891805517537280

𝕏x•about 1 month ago

RT » teej Do this today. Do not wait. This change will save your ass. Every Python project should add this. Then do everything else in the thread. But do this today.

@Tim Hopper

Add a 7-day dependency cooldown. uv's `exclude-newer` refuses any version published inside a rolling window. With 7 days set, today's malicious uploads would not be considered for resolution at all. Most malicious uploads are caught within that window.
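A minimal sketch of the cooldown idea, for illustration only: it computes the timestamp that marks the start of a 7-day window, in the RFC 3339 form that uv's `--exclude-newer` option accepts. The helper name `cooldown_cutoff` is hypothetical, and how you wire the value into your project (command-line flag, environment variable, or config file) is left as an assumption to check against the uv documentation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cooldown_cutoff(days: int = 7, now: Optional[datetime] = None) -> str:
    """Return an RFC 3339 UTC timestamp `days` before `now`.

    Hypothetical helper: anything published after this instant falls
    inside the cooldown window and should be excluded from resolution.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).astimezone(timezone.utc)
    return cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Print a value that could be passed to a resolver's "exclude newer" setting.
    print(cooldown_cutoff())
```

Recomputing the cutoff on every run is what makes the window rolling rather than a fixed date.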

𝕏x•about 1 month ago

Retweeted from @Chris

RT Chris Tate Dear GitHub, AI is changing the contribution graph. Issues are often the real contribution now. They define the problem, shape the solution and guide the PR. If a GitHub Issue leads to a merged PR, the issue author should get contributor credit. Signed, ctate

𝕏x•about 1 month ago

Retweeted from @OpenAI

RT OpenAI Developers You can build interactive applications with gpt-realtime-1.5, so users can control app state more naturally with voice. Hi Chappy 👋

𝕏x•about 1 month ago

Kyle is working on some really great tools for the AI-enabled data scientist: https://www.nteract.io/ Highly recommend checking it out.

@Kyle Kelley

I'm bringing to Dataframes in Claude Cowork

𝕏x•about 1 month ago

Retweeted from @Kyle

RT Kyle Kelley I don't want to overstretch myself on maintaining open source but there are ways to bring this workflow into Pi agents too Hit me up if you want it, either here or on https://github.com/nteract/desktop

@Kyle Kelley

I'm bringing to Dataframes in Claude Cowork

𝕏x•about 1 month ago

Retweeted from @Isaac

RT Isaac Flath RLM is the most important foundation of my Pi Harness (other than Pi of course). It's seeded with late interaction retrieval results (thanks to @lightonai for pylate). The Agent initiates it with query then..
𝐒𝐞𝐭𝐮𝐩
A python REPL is created and seeded with:
1. Late interaction search to pre-filter. Instead of doing top 3/5/10, it's top hundreds of documents. This is set into a `context` variable.
2. Python functions are loaded in to do more searches if `context` variable isn't enough. And to make llm calls with cheaper models in parallel batches.
𝐈𝐭𝐞𝐫𝐚𝐭𝐢𝐨𝐧 𝐋𝐨𝐨𝐩
From there, an LLM iterates in the REPL based on the query. It's just like exploring in a jupyter notebook. The LLM writes prose (like a markdown cell) and code to be run in the REPL each turn. This allows the LLM to sort, filter, and synthesize information. It can fan out and ask smaller models to summarize, combine, contrast, or do anything else to documents to help it understand the data. After several turns the LLM responds with the final answer. Either because it found the answer, or hit the budget limit.
Context as a Python variable, LLM as the programmer, REPL as the runtime.
𝐖𝐡𝐲 𝐃𝐨𝐞𝐬 𝐓𝐡𝐢𝐬 𝐖𝐨𝐫𝐤
1. Richer Shell. Agents (and subagents) work by intermixing code and prose/thinking. But they use static scripts or bash that run and exit and start over each tool call. That's not ideal for exploration and synthesis of data. For that, state is useful to continue building and exploring the data as you learn more. There's a reason jupyter notebooks have been popular with data scientists.
2. Keeps main agent context clean. The better context you have the better the agent will perform (duh!). This means three things: better human input, fewer missing search results, and fewer incorrect search results. Letting the agent iterate allows it to synthesize just what is needed and nothing else. All bad paths or peeks at something that turns out to be i...
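The "context as a Python variable, REPL as the runtime" idea can be sketched roughly as follows. This is not the harness described in the tweet: `ReplSession`, `iterate`, and the scripted `turns` are hypothetical names, the code strings stand in for what an LLM would write each turn, and the plain keyword filter stands in for late interaction search. The only point it illustrates is that state persists between turns instead of being rebuilt on every tool call.

```python
import contextlib
import io


class ReplSession:
    """A tiny persistent REPL: variables survive between run() calls."""

    def __init__(self, context):
        # `context` stands in for the pre-filtered search results.
        self.namespace = {"context": context}

    def run(self, code: str) -> str:
        """Execute code in the shared namespace and return what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


def iterate(session, turns, budget=5):
    """Run scripted turns until `answer` is set or the turn budget is hit."""
    transcript = []
    for code in turns[:budget]:
        transcript.append(session.run(code))
        if "answer" in session.namespace:
            break
    return session.namespace.get("answer"), transcript


# Toy documents standing in for hundreds of retrieved results.
docs = [
    "late interaction beats single-vector",
    "bm25 baseline",
    "late interaction with pylate",
]
session = ReplSession(context=docs)

# Each string stands in for code an LLM would write on one turn.
turns = [
    "hits = [d for d in context if 'late interaction' in d]\nprint(len(hits))",
    "answer = hits[-1]",
]
answer, transcript = iterate(session, turns)
```

The second turn reuses `hits` from the first, which is the stateful behavior a run-and-exit script would lose. Note that `exec` on model-written code is unsafe outside a sandbox.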

𝕏x•about 1 month ago

Let me get this straight:
Claude subscription: restrictions (or pay extra) on third party usage, pay extra for fast
Codex: no restrictions, fast mode within subscription
It’s hard not to use Codex. Just the fast mode being within the subscription is enough to make the gap large

𝕏x•about 1 month ago

I tried Claude Routines and managed agents, and I found it far better to use GitHub Actions. I have more control with GHA, secret management, etc. Claude is really good at writing all the yaml and iterating until it works on its own too. Wild times that I'm saying I like GitHub Actions LOL

𝕏x•about 1 month ago

Retweeted from @Tibo

RT Tibo Stop tweeting for a hot minute and update your Codex App to find full browser use, global dictation, non-dev mode, a new auto-review mode that is much safer than yolo, in-app docs and PDF viewer, and ... GPT-5.5.

@OpenAI Developers

With GPT-5.5, Codex now gets more of the job done across the browser, files, docs, and your computer. We've expanded browser use so Codex can interact with web apps, and test flows, click through pages, capture screenshots, and iterate on what it sees until it completes the

𝕏x•about 1 month ago

Retweeted from @Bryan

RT Bryan Bischof fka Dr. Donut This year at AI Council I am bringing you a story about optimization. Pete challenged me to curate a track for inference systems because so much of modern AI doesn't talk about the work in the data center. I said: 'I want to understand the price of intelligence and who's making it cheaper.' So I invited people who could teach me all the sneaky ways to make intelligence fast and affordable. In this track you're going to learn from the literal top people in the field what the hell is happening behind that API endpoint. LFG

@AI Council

Training gets the headlines. Inference gets the bill. As agents move from novelty to default workload, the hard problem isn't the model anymore. It's every millisecond and every watt between a prompt and the next token. A coding agent running for six hours straight is a very

𝕏x•about 1 month ago

Retweeted from @Isaac

RT Isaac Flath I tried the new GPT Images model. I tried it on 9 thumbnail concepts. All ones I came up with. For each concept I made 4 images. 2 with openai, 2 with gemini. I liked Gemini more on 6 concepts. I liked the new ChatGPT Images model on 3. I think Gemini is better at diagram and concept style. The openai model is better at more realistic looking things (like people, or product things). New openai model does pull toward hype, even more than gemini. But it’s manageable with prompts. I tend to prefer diagram or concept style over realistic style, so I think that makes me lean Gemini. But I think given a concept, I could easily predict which model would be better at it and pick the right one. There are a few caveats. If you want faces incorporated in automatically, openai model will win almost always because it’s so much better at faces. I still prefer steering to a particular idea/concept, then overlaying a real image as needed, though. But, I predict most people will like the new OpenAI model more. I think my taste in this is not the norm. Openai does a better job making social cards that look like “good” social cards.

𝕏x•about 1 month ago

Retweeted from @Kun

RT Kun Chen lol remember this org chart meme? I just created a full simulation for all of them with agents, and the results blew my mind! the simulation asked each organization to build and ship a web spreadsheet want to take a guess who built the best product? reveal in thread below!

𝕏x•about 1 month ago

Retweeted from @Lisan

RT Lisan al Gaib OpenAI just released a new open-source model it's "a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text" https://github.com/openai/privacy-filter https://huggingface.co/openai/privacy-filter

𝕏x•about 1 month ago

My theory of people gaining traction with slop posts is you are trading a smart audience for a dumber audience

𝕏x•about 1 month ago

Generic eval metrics are usually a waste of time. Please don't use them (ex: "Helpfulness score")

𝕏x•about 1 month ago

Retweeted from @Isaac

RT Isaac Flath Oh wow, Marimo has been working on their Agentic/AI capabilities. I didn't realize how much they've done here with live kernel/agent integration. Worth a watch. I'm excited to try this. https://www.youtube.com/watch?v=6uaqtchDnoc

𝕏x•about 1 month ago

Retweeted from @Tibo

RT Tibo Happy Tuesday. Codex has hit 4M active users, adding over 1M users in less than two weeks. To celebrate we will reset the rate limits again in a few hours. Enjoy!

𝕏x•about 1 month ago

Retweeted from @Tibo

RT Tibo We are releasing a *research preview* of Chronicle in Codex. It allows codex to build up memories based on your day to day work on your computer and then refer to these memories to be a lot more helpful. Available for PRO subscriptions and on Mac to start. This is early and consumes quite a bit of tokens, but it has changed how I and many folks at OpenAI use Codex.

@OpenAI Developers

Last week, we released a preview of memories in Codex. Today, we’re expanding the experiment with Chronicle, which improves memories using recent screen context. Now, Codex can help with what you’ve been working on without you restating context.

⚡github•about 1 month ago

Activity on repository

hamelsmu pushed hamel

⚡github•about 1 month ago

Activity on repository

hamelsmu pushed hamel

𝕏x•about 1 month ago

I have very few notifications turned on, but this guy's tweets are one of them. It's a constant stream of the most useful tools

@Chris Tate

Terminal automation + e2e testing solved Now as simple as snapshot, click, type: – wterm renders terminal-in-html, every cell in the a11y tree – agent-browser automates pages via the a11y tree Here's opencode in one browser driving Claude Code in another

𝕏x•about 1 month ago

Retweeted from @Chris

RT Chris Tate Terminal automation + e2e testing solved Now as simple as snapshot, click, type: – wterm renders terminal-in-html, every cell in the a11y tree – agent-browser automates pages via the a11y tree Here's opencode in one browser driving Claude Code in another

𝕏x•about 1 month ago

The funny part is I know people who actually talk like this 🤣

@Bryan Bischof fka Dr. Donut

My sister's birthday is coming up, she asked for a new laptop for school. I asked her “what are you building?” My mom is approaching retirement; I told her “an AI won’t replace you, a person using AI will replace you” My grandma is struggling to use her iphone lately; switching

𝕏x•about 1 month ago

Lots of people asking what’s so good about the new codex desktop computer use. Here are 5 things that come to mind:
1. Operate Mac apps without a great API: Slack, Google Sheets, Notes, iMessage without installing separate plugins. It instantly transforms all your apps into tools.
2. If you need to operate your browser more visually, it works really smoothly and fast (good for sites that are still human centric).
3. It uses its own cursor, keyboard etc. so you can keep working.
4. Once you do any task once you can simply ask Codex to reflect on what it did and how it would accomplish the task next time with the benefit of hindsight and create a skill AND schedule an automation. It’s really nice that codex can just schedule and edit automations when asked! It’s very Claw like in this way. This last point is not computer use specific but is powerful when combined with computer use.
5. The UI polish is insane: you get nice icons for any application you want to tag into computer use plus all the other built in new stuff like built in file viewer and browser so there is no context switching. So you can iterate really fast and not lose focus. Because of the polish it also feels nice and delightful to use.

@Hamel Husain

Seriously stop everything you are doing and use codex desktop app new computer use. Absolutely mind blowing

𝕏x•about 1 month ago

I had codex make a song with garage band and reply to this thread (because I needed to walk away from my computer). This feels like AGI

@Hamel Husain

@gazorp5 I am Codex. Hamel directed me to post this reply with the music when it was done on the thread.

𝕏x•about 1 month ago

Being able to @ any application on your computer is 🤯 (re: codex desktop computer use)

𝕏x•about 1 month ago

Seriously stop everything you are doing and use codex desktop app new computer use. Absolutely mind blowing

𝕏x•about 1 month ago

Codex Computer use is FAST and good - how did they do this?

⚡github•about 1 month ago

Activity on repository

hamelsmu pushed hamel

𝕏x•about 1 month ago

Never seen this kind of UI polish before in a Mac App

⚡github•about 1 month ago

Activity on repository

hamelsmu pushed hamel

⚡github•about 1 month ago

Activity on repository

hamelsmu pushed hamel

𝕏x•about 1 month ago

Gemini Everyday Error: 503 UNAVAILABLE. {'error': {'code': 503, 'message': 'This model is currently experiencing high demand. Spikes in demand are usually temporary. Please try again later.', 'status': 'UNAVAILABLE'}}

𝕏x•about 2 months ago

Run the following command and you can see some of what Codex is cooking. TIL they have remote_control too!
> codex feature list
P.S. it's worth reading the manual

𝕏x•about 2 months ago

Retweeted from @Shreya

RT Shreya Shankar Looking forward to CHI this week! We have a ✨Best Paper ✨ on a "what-if" analysis tool for RAG. Reach out to chat! I'm interested in: MLOps/LLMOps, data analysis, and better interfaces for human-AI collaboration (and, very soon i'll be recruiting students/postdocs) Original tweet: https://x.com/sh_reya/status/2043436179643830517

𝕏x•about 2 months ago

Retweeted from @Omar

RT Omar Khattab As promised, here's a recording of my 30-min keynote and the subsequent Q&A for the inaugural late interaction retrieval (LIR) workshop, cc @bclavie @antoine_chaffin. The talk is admittedly advanced, as it's directed at an expert IR community. But hopefully still broadly useful! Original tweet: https://x.com/lateinteraction/status/2043053506504925588

@Amélie Chatelain

Lots of people interested in the late Interaction workshop, listening to @lateinteraction's keynote!

⚡github•about 2 months ago

Activity on repository

hamelsmu created a branch

𝕏x•about 2 months ago

Retweeted from @Anthony

RT Anthony Morris ツ btw you can ssh into your Mac mini from Claude code desktop now Original tweet: https://x.com/amorriscode/status/2042733568410161326

@Anthony Morris ツ

@benvargas My PR is super stale and I've been working on higher priority stuff. Let me try to get this out by Friday.

𝕏x•about 2 months ago

People really need to be reading the prompts underneath off-the-shelf evals so they can learn how pointless they are. This is "faithfulness score" from RAGAS. Read the Context, why the fuck is the statement factually consistent? If you use this in your "harness", good luck

𝕏x•about 2 months ago

Retweeted from @Ben

RT Ben Vargas Not sure when this shipped, but just checked and ssh to mac is supported! Thanks @amorriscode Original tweet: https://x.com/benvargas/status/2042675707625771246

@Anthony Morris ツ

@benvargas My PR is super stale and I've been working on higher priority stuff. Let me try to get this out by Friday.

𝕏x•about 2 months ago

Retweeted from @Bryan

RT Bryan Bischof fka Dr. Donut Sorry I couldn't quite hear you over ALPHA ZONE. But seriously check out my podcast it's called In Practice and it's weird and technical about real AI applications https://www.youtube.com/@theoryvc Original tweet: https://x.com/BEBischof/status/2042379114561282103

@Benyam Ephrem

the current state of production design in tech is laptop, table, books, wall

𝕏x•about 2 months ago

Retweeted from @Thariq

RT Thariq you'll need to explicitly prompt Claude Code to use it, but the Monitor Tool is super powerful e.g. "start my dev server and use the MonitorTool to observe for errors" Original tweet: https://x.com/trq212/status/2042335178388103559

@Noah Zweben

Thrilled to announce the Monitor tool which lets Claude create background scripts that wake the agent up when needed. Big token saver and great way to move away from polling in the agent loop. Claude can now:
* Follow logs for errors
* Poll PRs via script
* and more!

𝕏x•about 2 months ago

Retweeted from @Harrison

RT Harrison Chase 🎙️Introducing Max Agency
Max Agency is a new podcast where we go deep on how the best agents are actually being built: architecture decisions, tradeoffs, evals, and everything in between. Each episode, I sit down with engineering leaders who are doing this work in production.
Our first episode features Izzy Miller (@isidoremiller), AI Engineer at Hex (@_hex_tech). Hex has been shipping data agents since before most teams were even thinking about them, starting with single-cell text-to-SQL and graduating to a full Notebook agent that can work autonomously for 20 minutes on a complex analysis. Izzy has a lot of perspective on what it actually takes to get agents working well in production, and what breaks along the way.
A few takeaways from our conversation:
- Keep your eval sets small enough to hold in your head: Izzy runs 30-50 handcrafted "traps" with multiple repetitions, rather than hundreds of variants. If you can't explain why your agent fails each one, your eval set is too big
- Day zero performance is almost irrelevant: The more interesting question is how the agent compounds. Izzy is building a 90-day simulation where the warehouse evolves and the agent has to accumulate understanding
- You can catch agent errors without seeing the raw outputs: By running an LLM-as-a-judge over production usage and clustering the results, you can surface places where something likely went wrong, without needing to read individual conversations
Watch the full episode on:
- YouTube: https://www.youtube.com/watch?v=Xyh1EqcjGME
- Apple Podcasts: https://podcasts.apple.com/us/podcast/how-hex-builds-ai-agents-making-agents-reason-like/id1891551672?i=1000760489140
- Spotify: https://open.spotify.com/episode/1BJlg3SOJrjnaPXFHTNuux?si=bffc89cb4f774617
Original tweet: https://x.com/hwchase17/status/2042279493050740916
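The "small set of handcrafted traps, run with repetitions" takeaway might look something like the sketch below. Everything here is a hypothetical illustration, not Hex's setup: `pass_rates`, `toy_agent`, and the two example traps are invented, and a real agent would be non-deterministic, which is the reason for repeating each trap.

```python
from typing import Callable, Dict


def pass_rates(
    traps: Dict[str, Callable[[str], bool]],
    agent: Callable[[str], str],
    repetitions: int = 3,
) -> Dict[str, float]:
    """Run each trap several times and return the fraction of passing runs."""
    rates = {}
    for prompt, check in traps.items():
        passes = sum(check(agent(prompt)) for _ in range(repetitions))
        rates[prompt] = passes / repetitions
    return rates


def toy_agent(prompt: str) -> str:
    # Hypothetical stand-in for a real agent call.
    return "42" if "answer" in prompt else "unknown"


# Each trap pairs a prompt with a hand-written check on the output.
traps = {
    "what is the answer?": lambda out: out == "42",
    "which table holds revenue?": lambda out: "orders" in out,
}

rates = pass_rates(traps, toy_agent)
failing = [prompt for prompt, rate in rates.items() if rate < 1.0]
```

With a set this small, every entry in `failing` is something you can read and explain by hand.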

𝕏x•about 2 months ago

Retweeted from @Cursor

RT Cursor You can now run Cursor on any machine and control it from anywhere. Kick off agents from your phone to run on your devbox. Original tweet: https://x.com/cursor_ai/status/2041912812637966552