Evals evals evals https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa
RT Charles 🎉 Frye calling a model with a non-commercial license "open weights" is bullshit. Original tweet: https://x.com/charles_irl/status/1994593258287374613
Gemini .. .. .. .. … … … This is a test to see if Logan sees this 🤣🤣🤣 I swear I could email just myself about Gemini and Logan would figure out how to reply. Absolutely epic devrel
Has anyone made good writing evals for themselves?
As much as I love Gemini, G3 seems to have regressed in terms of writing for me, and it's less steerable. Maybe this has something to do with the focus on coding. What's worse, if I attach a file to a Gem (aka Project), it often cannot see the file and hallucinates what I attached. This doesn't happen with the API, so it appears to be a buggy product surface area.
RT Shreya Shankar we are not holding back in the evals book. if 10+ people have made the same mistake, we are going to tell you Original tweet: https://x.com/sh_reya/status/1994159429269230073
I used the design plugin + Opus 4.5 to upgrade my business page https://parlance-labs.com/ Absolutely blown away. Before vs. After
Opus 4.5 is our best model yet for design & vision. Here are some of my favorite UIs we made with Claude Code's frontend-design plugin.
A GTM strategy of rage bait gets views but destroys trust.
I have been rage baited too many times today 😭. Time to turn this off
I challenge those who build better retrieval to lead with "We build a really great retrieval system that beats XYZ baseline" rather than "RAG is dead" Leading with the latter is insta red flag
💯Isaac Flath: @HamelHusain The only thing worse than promoting generic metrics, is promoting agents that auto-optimize your system based on those generic metrics Link: https://x.com/isaac_flath/status/1992642545005027650
Nothing triggers me more than when eval tools promote generic metrics (e.g. Affirmation, Brevity, Levenshtein) as a way to make "evals easy". In reality, this is extremely poor data literacy sold as "best practices", in the same way that sugary cereal is marketed as healthy. The only thing generic metrics do is waste your time and burn tremendous engineering cycles by having you chase vanity metrics. What works is looking at your data and defining metrics tailored to failure modes you actually observe. When an eval tool promotes these front and center, I run in the other direction.

BTW I have no idea WTF an affirmation score even means. This is something I saw in an IRL advertisement. Stuffing a dashboard with a bunch of random metrics is a guaranteed way to waste everyone's time. Don't do it.

I wrote more about this here: https://hamel.dev/blog/posts/evals-faq/#q-should-i-use-ready-to-use-evaluation-metrics

Also yes, this is a subtweet 🤣
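A minimal sketch of what a failure-mode-specific metric looks like in practice, as opposed to a generic score. Everything here is hypothetical for illustration: the `[doc:id]` citation format, the trace data, and the assumed failure mode (a RAG-style assistant citing a document that was never retrieved) are stand-ins for whatever you actually observe in your own data.

```python
# Hypothetical sketch: a binary check for one observed failure mode
# (citing a doc that was never retrieved), instead of a generic metric.
import re

def cites_unretrieved_source(response: str, retrieved_ids: set[str]) -> bool:
    """True if the response cites a doc id that was not in the retrieved set."""
    cited = set(re.findall(r"\[doc:(\w+)\]", response))
    return bool(cited - retrieved_ids)

# Toy traces standing in for logged production data.
traces = [
    {"response": "Per [doc:a1], refunds take 5 days.", "retrieved": {"a1"}},
    {"response": "See [doc:z9] for pricing.", "retrieved": {"a1", "b2"}},
]

failures = [t for t in traces
            if cites_unretrieved_source(t["response"], t["retrieved"])]
print(f"citation failure rate: {len(failures)}/{len(traces)}")
# → citation failure rate: 1/2
```

The point of the sketch: the metric is binary, cheap to compute, and maps one-to-one onto a failure you saw in real traces, so a change in its value is directly actionable, unlike a drifting "affirmation score".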
I'm in a coffee shop talking to my computer out loud 🤣🤣🤣🤣
I love @amp b/c it's the Omakase experience of coding agents. They maniacally try all models, select the best model for the task, and tune the prompts for you. Most importantly, the devex is incredibly fun. My notes: https://hamel.dev/notes/coding-agents/amp.html BTW I have no financial interest in amp at all!
👀Aamir Shakir: we just made Claude Code - use 53% fewer tokens - respond 48% faster - give 3.2x better responses just by giving it a better grep Link: https://x.com/aaxsh18/status/1991626308611371387
How-to articles about AI SEO can be distilled into: "Write good shit that people find valuable" 🤷♂️
I’m in SF for the next few days 😀
Love it. The Simon honeypot.

Simon Willison: Since this question shows up so often that it qualifies as an FAQ, here's my definitive answer to "What happens if AI labs train for pelicans riding bicycles?" https://simonwillison.net/2025/Nov/13/training-for-pelicans-riding-bicycles/ Link: https://x.com/simonw/status/1989001665526264169
RT Simon Willison AI automated replies on here are annoying, but the ones that ask follow-up questions are next level rude because, if taken at face value, they act as time vampires
RT Vik Paruchuri The Datalab API can now extract redlines and comments into clean markdown! This is great for analyzing legal documents with LLMs.
RT Hamel Husain Lots of things make sense when you realize how similar these activities are (when done well)
This eval talk features some of my favorite people all in one go. It discusses evals from many perspectives: - How to look at data - Human/Computer interface design - Metrics - Tools - etc @eugeneyan , @sh_reya , @BEBischof , @hwchase17 , etc 🔥 https://www.youtube.com/watch?si=P9EmuJXw0kzLsdIu&v=SnbGD677_u0
> New AI coding / “OS for AI” / etc announced claiming “We are different” > Opens page > Says “Join waitlist” > Close it, move on, forget about it I can’t remember any great software behind a waitlist in recent times
RT Omar Khattab This is an extremely exciting initiative on Saturday. I'm especially excited about the fact that they're creating evals for their task! Bummer it's only in SF. Folks who are there should give this a shot!! I was asked to share the event, but I wanted to find a moment to write some thoughts: I get that the caricature of the tools below is just a caricature, but one must add that building a good system and building a hackathon system call for opposite tradeoffs. A good system is maintainable and portable into the future, even as the underlying technology shifts. It revolves around separation of concerns. A good tool thus prevents you from premature hand-engineering, even though the bitter lesson tells you that hand-fitting *will* in fact help you in the short term. In other words, if my goal is to build a throwaway artifact in a few hours, I will shamelessly consider low-level tricks and duct tape. The only way they can fail me is if I'm not a very good prompt engineer for some reason or if there are just too many settings and models to handle at once by hand. All that said, I'm super excited about the resources that this will produce. We need better public evals and my understanding is that this could produce one.

Hamel Husain: This is going to be a 🌶️ event. Battle Royale of all the tribes - @DSPyOSS - Just optimize bro - @LangChainAI - A framework is all you need - @fastdotai - everything is a notebook (solveit) - Python vs. Typescript - etc. This is in two weeks! Links in reply Link: https://x.com/HamelHusain/status/1983950540431355999
RT Greg Kamradt Hamel and Bryan are the type of group that if you do good work (regardless of your track record or lack of) you will get rewarded with more cool work and opportunities Great chance for a data analyst to jump on a quick project and over deliver

Hamel Husain: . @BEBischof and I are looking for a strong Data Analyst who will have a super fun, special role in this Hackathon to be "THE human baseline" Group DM us if interested. Link: https://x.com/HamelHusain/status/1986994437311091016
. @BEBischof and I are looking for a strong Data Analyst who will have a super fun, special role in this Hackathon to be "THE human baseline" Group DM us if interested.

Hamel Husain: 📢New development, we have prizes for this event (which is almost full) - {$10k, $5k, $1k} in OpenAI credits - NVIDIA mystery GPUs - Swag TLDR; this is an IRL Kaggle competition re: build agents to answer questions over structured & unstructured data https://luma.com/7vg7u3mf?utm_source=hh Link: https://x.com/HamelHusain/status/1986896087307985292
RT Shreya Shankar This is important information

» teej: O'Reilly Wine Pairings Book: Evals for AI Engineers by @sh_reya and @HamelHusain Wine: Balgownie 2023 Gold Label Shiraz Link: https://x.com/teej_m/status/1986929594843500574
RT » teej O'Reilly Wine Pairings Book: Evals for AI Engineers by @sh_reya and @HamelHusain Wine: Balgownie 2023 Gold Label Shiraz

Hamel Husain: 👀 Animals have been assigned. Scheduled to print fall 2026! We have iterated on this with over 3k students (and continue to do so). We give our students access to the full draft as part of our evals course (link in bio). Link: https://x.com/HamelHusain/status/1986918458286862480
📢New development, we have prizes for this event (which is almost full) - {$10k, $5k, $1k} in OpenAI credits - NVIDIA mystery GPUs - Swag TLDR; this is an IRL Kaggle competition re: build agents to answer questions over structured & unstructured data https://luma.com/7vg7u3mf?utm_source=hh
RT Shreya Shankar Finally got some time to code in this busy semester & am building DocScraper for the DocETL stack. People struggle to discover documents to analyze. Surprisingly AI is pretty bad at writing (i) code to scrape *high-quality* docs, and (ii) custom UI to visualize the docs
Come heckle me in ~10 minutes here

abhishek: join here: https://www.youtube.com/watch?v=1VBeE5BfhPM Link: https://x.com/abhi1thakur/status/1986734266689224997
I guess we gonna have Sydney Sweeney memes for the next 48 hours
RT abhishek today, i will be talking with hamel husain about evaluations. this is going to be so good. don't miss it!
Activity on repository: hamelsmu made this repository public

RT Chris Albon Your goal isn’t to code. Your goal is to build. If EDM music makes you build faster, listen to EDM. If AI makes you build faster, use AI. Being a great coder + AI will make you build 1000x faster than not being able to code + AI.