Evals evals evals https://t.co/Zrmp6LRd9c About Me: https://t.co/P6WyeKkyTa
Claude renovated my GitHub homepage for me by automatically setting up a CRON that pulls in my latest blog posts, and found images and other details to make things a bit nicer :)
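A setup like this is typically a scheduled GitHub Actions workflow on the profile repository. A minimal sketch, assuming a hypothetical `scripts/update_posts.py` that rewrites the README from the blog's RSS feed (the workflow name, script path, and schedule are all illustrative, not the actual setup):

```yaml
name: Update blog posts
on:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC
  workflow_dispatch:       # allow manual runs

jobs:
  update-readme:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4
      - name: Pull latest posts into README
        run: python scripts/update_posts.py   # hypothetical script: fetch feed, rewrite README.md
      - name: Commit changes
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add README.md
          git commit -m "Update blog posts" || echo "No changes"
          git push
```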
🌶️ I don’t think people should hire others to do AI work unless they are using AI heavily in their own daily workflows. “I use ChatGPT once in a while” doesn’t count. Without this:
- You’ll make bad decisions and have too many blind spots
- It’s a sign you aren’t really invested in AI and are just trying to check a box

I think it’s totally fine to hire experts if you are in this position, as long as you also upskill at the same time. But delegating AI work while continuing to stay at arm’s length from AI? You are ngmi.
Offline tests and A/B testing are both important and complement each other. Use both. Just don't confuse one for the other.
Codex Desktop app is the first interface that has taken me away from the terminal. It's that good
RT Ryan Carson Here’s how I use Agentation + queuing messages in Codex to crank through frontend work much faster Original tweet: https://x.com/ryancarson/status/2027481166496874876
RT Shreya Shankar We are designing an interface for interpretable, steerable LLM-powered document ranking and clustering! Help us by taking part in our user studies 😊 Original tweet: https://x.com/sh_reya/status/2027397815056712080
Do you use LLMs as judges or evaluators? We’re running a 40-min study on LLM evaluators to score, rank, or triage hundreds of items (e.g., model outputs, essays, resumes, tickets, ...). Your insights will help shape steerable AI evaluators. Interested? Signup link in 🧵👇🏽
View quoted post

I love Gemini but they gotta do something about:
- Oppressive quota limits, and the model not being available half the time
- Confusing-as-hell billing
- Their highest tier is $250, $50 more than others, with opaque usage limits (what are they?)

Seems like enterprise >> individual developers ???
RT Sakana AI We’re excited to introduce Doc-to-LoRA and Text-to-LoRA, two related research projects exploring how to make LLM customization faster and more accessible. https://pub.sakana.ai/doc-to-lora/

By training a hypernetwork to generate LoRA adapters on the fly, these methods allow models to instantly internalize new information or adapt to new tasks. Biological systems naturally rely on two key cognitive abilities: durable long-term memory to store facts, and rapid adaptation to handle new tasks given limited sensory cues. While modern LLMs are highly capable, they still lack this flexibility. Traditionally, adding long-term memory or adapting an LLM to a specific downstream task requires an expensive and time-consuming model update, such as fine-tuning or context distillation, or relies on memory-intensive long prompts.

To bypass these limitations, our work focuses on the concept of cost amortization. We pay the meta-training cost once to train a hypernetwork capable of producing task- or document-specific LoRAs on demand. This turns what used to be a heavy engineering pipeline into a single, inexpensive forward pass. Instead of performing per-task optimization, the hypernetwork meta-learns update rules to instantly modify an LLM given a new task description or a long document.

In our experiments, Text-to-LoRA successfully specializes models to unseen tasks using just a natural language description. Building on this, Doc-to-LoRA is able to internalize factual documents. On a needle-in-a-haystack task, Doc-to-LoRA achieves near-perfect accuracy on instances five times longer than the base model's context window. It can even generalize to transfer visual information from a vision-language model into a text-only LLM, allowing it to classify images purely through internalized weights. Importantly, both methods run with sub-second latency, enabling rapid experimentation while avoiding the overhead of traditional model updates. This approach is a step towards lowering...
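The core idea above can be sketched in a few lines: a hypernetwork maps a task or document embedding to the low-rank factors of a LoRA update, so adaptation becomes a single forward pass (W' = W + BA) instead of a fine-tuning run. This is a toy illustration only, not Sakana AI's actual architecture; all shapes and the linear hypernetwork are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, d_embed = 16, 4, 8

# Frozen base weight of some target layer in the LLM.
W_base = rng.standard_normal((d_model, d_model)) * 0.1

# Hypothetical hypernetwork: one linear map per LoRA factor.
H_A = rng.standard_normal((d_embed, rank * d_model)) * 0.01
H_B = rng.standard_normal((d_embed, d_model * rank)) * 0.01

def generate_lora(task_embedding):
    """One forward pass: task/document embedding -> LoRA factors (A, B)."""
    A = (task_embedding @ H_A).reshape(rank, d_model)   # (r, d)
    B = (task_embedding @ H_B).reshape(d_model, rank)   # (d, r)
    return A, B

def adapted_forward(x, task_embedding, alpha=1.0):
    """Apply the frozen layer plus the generated low-rank update."""
    A, B = generate_lora(task_embedding)
    W_adapted = W_base + alpha * (B @ A)                # W' = W + BA
    return x @ W_adapted.T

x = rng.standard_normal(d_model)
task = rng.standard_normal(d_embed)   # stand-in for an encoded task description
y = adapted_forward(x, task)
```

In real Text-to-LoRA the embedding comes from encoding a natural-language task description, and the hypernetwork is meta-trained across many tasks so the generated adapters transfer to unseen ones; the point is that nothing here requires per-task gradient updates.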
Activity on repository
hamelsmu pushed research-council
View on GitHub
Activity on hamelsmu/research-council
hamelsmu opened a pull request in research-council
View on GitHub
re: skills rot. In many cases, you are better off pointing your coding agent at the GitHub repo of a skill as a reference for a one-off task, rather than installing it.
RT Nick This is an excellent tool! I highly recommend giving it a try if you find yourself using multiple models in research. Original tweet: https://x.com/renegadesilicon/status/2027144909036347711
I keep doing deep research tasks across frontier models. I wanted to automate it cheaply by using my existing coding agent subscriptions so I created this https://github.com/hamelsmu/research-council
View quoted post
Really great of METR to have the humility to correct its previous study finding that AI coding "slowed people down". I hope those who are still skeptical about AI coding can suspend their disbelief, for their own benefit.
Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.
RT Alexis Gallagher I am thrilled and honored that Sparky and I were selected winners for NVIDIA GTC Golden Ticket. Here's how he received the news. Original tweet: https://x.com/alexisgallagher/status/2027129913736876201
Congratulations to our #NVIDIAGTC Golden Ticket winners 🎉: @alexisgallagher Brandon I. Hans B. Julia S. Lluís D. Marco D. Tarique S. You’re headed to GTC! We’ll be reaching out soon with next steps to claim your prize. Thank you to our partners for collaborating with NVIDIA
RT Eleanor Berger Re @HamelHusain Same. Complete surprise to myself, but I now use TS and Go more than Python, rely on whatever is the most popular library or framework, everything typed ... the complete opposite of how I worked for years. Now more than ever it's crucial to keep your identity separate from the tech. Original tweet: https://x.com/intellectronica/status/2027113896406847930
I keep doing deep research tasks across frontier models. I wanted to automate it cheaply by using my existing coding agent subscriptions so I created this https://github.com/hamelsmu/research-council
RT Mario Zechner stop 👏 anthropomorphizing 👏 tensors 👏 Original tweet: https://x.com/badlogicgames/status/2026783542936551547
In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
View quoted post

RT ben today, we're introducing self diagnostics: the first-ever way for agents to proactively self-report issues they encounter. welcome to the future of agent observability. Original tweet: https://x.com/benhylak/status/2026712861666587086
RT Noah Zweben Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow. Original tweet: https://x.com/noahzweben/status/2026371260805271615
New lab by JJ Allaire focused on Evals 👀 Excited that more data folks are getting into this! https://meridianlabs.ai/
RT swyx Big news today if you're into coding evals: SWE-Bench Verified is dead!! https://x.com/latentspacepod/status/2026027529039990985

i'm not sure if @HamelHusain is tired of me tagging him, but it turns out @OpenAI really did look back at their own 2024 work, and when you 1) look at the CoT and 2) look at the evals, they realized that at LEAST 16.4% of SWE-Bench Verified should technically be unsolvable...

... and also that ALL frontier models, including OpenAI's own, are capable of solving them by sheer contamination (including being able to recite verbatim the entire SWE-Bench problem setup and solution, just by being given the Task ID alone (!!!!)).

Heroic work from the OAI Evals team, and imo an important highlight of the fragility and messiness of evals work in general. OpenAI spent the money to do 3 independent reviews of each problem in 2024, and AT LEAST SIXTEEN PERCENT of these were still egregiously problematic (as shown in screenshots). In this 2026 audit they then did 6 independent reviews from software engineers, with ADDITIONAL positive-finding verification from a separate team, in order to arrive at today's conclusion.

If this happens to SWE-Bench Verified... what else is hiding in other benchmarks out there? Original tweet: https://x.com/swyx/status/2026029120040137066
🆕 The End of SWE-Bench Verified (2024-2026) https://latent.space/p/swe-bench-dead Today @OpenAIDevs is announcing the voluntary deprecation of SWE-Bench Verified! We're releasing a podcast + analysis in today's post. Saturation of SWE-Bench has been a community hot topic for over a year -
RT jason liu I've recently joined @openai to work with @romainhuet on @OpenAIDevs. Now is the year of dogged pursuits.

But back in 2021 I thought my technical career was over. I had chronic hand pain in both my hands and could barely tie my shoes, let alone use my phone or write code. I spent a few years not thinking about what it meant for the value of my labor to go to zero, but about not being able to produce any labor at all… I gave up bjj. Pottery. Tech. Etc.

Then, that one company that solved dota and hide and seek released chatgpt and whisper, and all of a sudden, with dictation and some determination, I could write essays, build things, and make a living from twitter, meeting great people like @eugeneyalt @dmdohan @humford @GEVS94 for my reintegration into the tech world after so many years away.

From Canada, I advised companies for free until I had to ask them to pay me. I charged companies until I figured out pricing, and asked for enough that I became an investor as well. I started a consulting business and a course business, learning alongside @HamelHusain and @vig_xyz.

But through that time I learned a lot about running a business and felt like I'd stopped learning about everything else. I realized last summer that I wanted to wrap things up and go somewhere and just get involved and be at the center of it all. Original tweet: https://x.com/jxnlco/status/2025986108006568043
Activity on repository
hamelsmu pushed claude-review-loop
View on GitHub
Activity on hamelsmu/claude-review-loop
hamelsmu closed an issue in claude-review-loop
View on GitHub
Activity on hamelsmu/claude-review-loop
hamelsmu contributed to hamelsmu/claude-review-loop
View on GitHub
Activity on hamelsmu/claude-review-loop
hamelsmu commented on an issue in claude-review-loop
View on GitHub