Sunday, June 28, 2026

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills https://ift.tt/qgayukA

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills Skills for Claude Code and Codex are hard to test. By hard, I mean there's no standard way to do it. You evaluate the skill once on something, and it looks like it works. You publish it. Then a new super model is released (GLM 5.2 anyone?), the skill quietly breaks in some cases, and you won't find out until your users complain.

I faced the same problem, so I built something lightweight to stop doing that: Caliper. It's a local, lightweight harness that runs a skill k times in isolated environments and gives you a pass@k score (how many of those k runs succeeded). With a non-deterministic technology, you can't just say "it worked once". You need to answer how many times it passed out of k.

You define success in a YAML spec. I picked YAML to keep a schema while staying readable for a human. You use an LLM judge, a Python assertion, or both.

Here's a simple evaluation example with a JSON extraction. You write this in a YAML file:

    tasks:
      - name: Extracts action items as clean JSON
        prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
        expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
        assert: |
          import json
          items = json.load(open("/tmp/actions.json"))
          assert isinstance(items, list)
          assert all({"owner","task","due"} <= i.keys() for i in items)

Then you run it with the CLI:

    caliper run extract-actions.eval.yaml --k 5 --baseline

What's cool about the --baseline flag is that it re-runs everything without the skill, so you can see whether the skill is doing the work or the base agent was going to pass anyway:

    ID      Task                           k(5)  pass@k
    task-1  Extracts action items as JSON  5/5   100% PASS

    With skill  100%
    No skill    60%
    Delta       +40%

Most models know how to get the JSON right most of the time (JSON extraction was already solved two years ago). But that's exactly it: "most of the time" is the bug.
That delta shows how much the skill actually helped. (It's sometimes 0%, sometimes -100%!)

I also created two skills so you can get started right away in your favorite harness, e.g. Claude Code, Codex or Pi:

- evaluate-skill: run and manage evals without leaving your workflow
- grill-skill: reads your SKILL.md, interviews you about what "good" looks like, writes a 3-task spec (happy path, edge case, adversarial), and runs it

You can install the skills with this command:

    npx skills@latest add edonadei/caliper

For now, I support claude-code, codex, pi, claude-api, and openai-api. You can run the agent and the judge as separate backends, so you can run a skill on one and judge with another.

GitHub: https://github.com/edonadei/caliper
PyPI: https://pypi.org/project/caliper-eval/

Of course, it's a first step. I think the autorater layer can be vastly improved, and there's room for more handholding when creating and iterating on evaluation specs and for supporting more harnesses. Why not also include this layer in a bigger self-improvement system? If you're also building agentic evaluations, I'm genuinely interested to hear how you are handling that.

https://github.com/edonadei/caliper June 28, 2026 at 11:12PM
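The scoring described in the Caliper post (run a task k times, report the share of passing runs, then compare against a no-skill baseline) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Caliper's implementation: `pass_rate`, `skill_delta`, and `format_report` are hypothetical names, and `run_once` stands in for whatever actually launches the agent and applies the judge or assertion.

```python
from typing import Callable


def pass_rate(run_once: Callable[[], bool], k: int) -> float:
    """Run one trial k times and return the fraction of trials that passed."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    passes = sum(1 for _ in range(k) if run_once())
    return passes / k


def skill_delta(with_skill: float, without_skill: float) -> float:
    """Difference in pass rate attributable to the skill (can be negative)."""
    return with_skill - without_skill


def format_report(with_skill: float, without_skill: float) -> str:
    """Render a summary shaped like the one quoted in the post (hypothetical layout)."""
    delta = skill_delta(with_skill, without_skill)
    return (
        f"With skill  {with_skill:.0%}\n"
        f"No skill    {without_skill:.0%}\n"
        f"Delta       {delta:+.0%}"
    )
```

Calling `format_report(1.0, 0.6)` reproduces the three summary lines from the example above. A delta of zero would mean the base agent was going to pass anyway.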

Saturday, June 27, 2026

Show HN: Starglyphs - A constellation puzzle game based on Euler paths https://ift.tt/9jCuPHo

Show HN: Starglyphs - A constellation puzzle game based on Euler paths I am a big Dragon Age fan and sunk hundreds of hours into Inquisition. It had this minigame called astrariums where you had to solve these shapes based on constellation guides by tracing stars. I'm a hobby game dev and wondered if I could procedurally generate these puzzles so they were always solvable. Turns out you can, so I built a space puzzle game around it with a colorful aesthetic. I released it in web form here but I'm currently working on getting it on Steam and mobile. https://starglyphs.com June 28, 2026 at 03:20AM
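For readers curious why Euler paths make "always solvable" possible: a connected undirected graph can be traced using every edge exactly once if and only if it has zero or two vertices of odd degree. The sketch below checks that condition. It is not taken from the game, and `has_euler_path` is a hypothetical helper name; a generator could, for example, discard candidate constellations that fail the check.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def has_euler_path(edges: Iterable[Tuple[Hashable, Hashable]]) -> bool:
    """Return True if every edge can be traced exactly once in a single stroke."""
    adjacency = defaultdict(list)
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    if not adjacency:
        return True  # nothing to trace

    # All stars that have at least one line must be reachable from each other.
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    if len(seen) != len(adjacency):
        return False

    # Euler's condition: zero or two vertices of odd degree.
    odd = sum(1 for neighbours in adjacency.values() if len(neighbours) % 2 == 1)
    return odd in (0, 2)
```

A triangle passes the check, while a three-pointed star (one center joined to three tips) does not, because all four of its vertices have odd degree.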

Show HN: Adrafinil – keep a lid-closed Mac awake only while agents work https://ift.tt/uYUrhoj

Show HN: Adrafinil – keep a lid-closed Mac awake only while agents work A month ago there was a wave of posts and tweets about engineers walking around cafes and parks with their MacBooks propped half-open, as fully closing the lid forces sleep that stops their AI agents. Some people made snarky comments about using tmux or Amphetamine, and some defended their choice with “but I only need it sometimes, and forgetting to disable Amphetamine and finding my laptop discharged in my bag is worse.” This is a solution to this problem. Unlike caffeinate, it will prevent your MacBook from sleeping even with the lid closed, with no external power or display, using pmset disablesleep 1. Unlike other sleep-preventing apps, Adrafinil only activates when there’s an agent actively doing something. It detects agent activity through hooks it installs into Claude Code, Codex, and others. To reassure you it’s working, the app shows the active status in the menu bar, and it plays a chime when you close the lid. Once the agent is done, Adrafinil detects it and lets the laptop go to sleep by setting pmset disablesleep back to 0. It will also let it sleep in case of overheating. And if you want to manually toggle it, you can install an optional MCP and tell your agent to keep the MacBook awake for a specific time. It has four binaries, one of which is a root helper exposing a single setSleepBlocked call. All the logic and policy live in the unprivileged parts. They’re all notarized, and the app is fully open source (MIT). https://ift.tt/6YeD5Em June 28, 2026 at 02:04AM
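The policy described above (block sleep only while at least one agent is active, release it when the last one finishes) can be sketched as a small state holder. This is an assumption-laden illustration, not Adrafinil's code: `SleepGate` and its method names are hypothetical, and the injected `set_sleep_blocked` callback merely stands in for the root helper's single call, which per the post toggles `pmset disablesleep` between 1 and 0.

```python
from typing import Callable, Set


class SleepGate:
    """Tracks active agents and toggles the sleep block only on transitions."""

    def __init__(self, set_sleep_blocked: Callable[[bool], None]) -> None:
        # In a real setup this callback would reach a privileged helper;
        # here it is injected so the policy itself stays unprivileged.
        self._set_sleep_blocked = set_sleep_blocked
        self._active: Set[str] = set()

    def agent_started(self, agent_id: str) -> None:
        was_idle = not self._active
        self._active.add(agent_id)
        if was_idle:
            self._set_sleep_blocked(True)  # first active agent: block sleep

    def agent_finished(self, agent_id: str) -> None:
        if agent_id not in self._active:
            return  # unknown or already finished agent: nothing to do
        self._active.discard(agent_id)
        if not self._active:
            self._set_sleep_blocked(False)  # last agent done: allow sleep
```

Keeping the callback this narrow mirrors the design the post describes, where the privileged part exposes one call and all policy lives elsewhere.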

Show HN: Wind particles on Mapbox from a single EXIF JPEG https://ift.tt/xidvVYu

Show HN: Wind particles on Mapbox from a single EXIF JPEG https://ift.tt/tqXv0HP June 27, 2026 at 11:46PM

Show HN: A Living Neural Web in HTML5 Canvas https://ift.tt/gTivKnV

Show HN: A Living Neural Web in HTML5 Canvas https://techoreon.github.io/verpad/canvas-playground.html June 27, 2026 at 10:05PM

Friday, June 26, 2026

Show HN: TBD, a Mac-native CLI-forward coding agent multiplexer https://ift.tt/2I3TKB7

Show HN: TBD, a Mac-native CLI-forward coding agent multiplexer Inspired by Conductor, dmux, claude-squad, agent-deck, and Git Tower.

## What makes it different:
(Aside from the GUI.) A core tenet: everything a user can do manually must be exposed via the CLI for agents and automation. Best paired with something that lets agents in different worktrees talk to each other (e.g. https://ift.tt/HTjYahr ).

## Background:
I used and loved Conductor for months starting around January, but hit some persistent issues that made me realize that a core tool I'm actively using for most of my waking hours sits too close to my skin to produce itches that I can't scratch myself.

After realizing I needed to switch to something hackable, I went through a few week-ish long trials of dmux, claude-squad, and agent-deck. They were all great, but I then realized I really didn't want to memorize keyboard shortcuts, and since I've managed to put off learning how to drive tmux for over a decade, I didn't want to end that streak XD

So TBD happened in March. In the months since, it's gotten stable enough that a few former and current colleagues have switched to using it as their daily driver as well. It's been kind of like a fun little clubhouse we contribute to.

The architecture is a daemon that handles the bulk of state management and the actual work, with CLI and GUI clients as its two interfaces. Users go through the GUI; LLMs and scripts go through the CLI.
It works best for Claude Code (our shared daily driver), but two of us also use Codex on the side, so there's some basic support there as well.

The only way to run it is to clone and build from source, partially b/c I imagine the main appeal is for people who need to hack on the thing they're using (but also b/c I didn't want to shell out for an Apple dev license).

I think it's now a good enough starting point for similarly minded folks to use as a base to fork and build their own variants, tailored to their own workflows. https://ift.tt/Pmz4Fkp June 26, 2026 at 10:29PM

Show HN: Mantis, A self-hosted LLM gateway https://ift.tt/9uEkyB0

Show HN: Mantis, A self-hosted LLM gateway Hey HNers - Riz here. I got together with a few guys and we built an LLM gateway. It's designed for small teams working on early-stage products, and can be deployed to AWS using a single command (i.e. `mantis deploy`). It's self-hosted, and is designed to belong to you. https://ift.tt/tLYE9eq June 27, 2026 at 12:45AM

Show HN: Puzzle with Strangers. A free multiplayer jigsaw https://ift.tt/es4avwt

Show HN: Puzzle with Strangers. A free multiplayer jigsaw I built this over the last few days. A handful of friends and I are well and truly hooked. I recently went to a (for lack of a better word) social/collaborative performance at an art gallery in Berlin, where a group of artists filled a huge industrial hall with 10x10cm wooden cubes for people to build structures with. It was beautiful how universal the concept of playing with wooden blocks is and how ephemeral the structures were; people of all ages were drawn back into childlike play. The question of which games need zero explanation stuck with me, so I built an anonymous multiplayer jigsaw. We've already spent hours in there, and now you're invited as well. Hope you enjoy. https://ift.tt/okCpys9 June 26, 2026 at 10:17PM

Thursday, June 25, 2026

Show HN: I created a Scrabble-like word game with simple rules and fun combos https://ift.tt/GayFv3W

Show HN: I created a Scrabble-like word game with simple rules and fun combos When I was in school, my teacher used to play this game with our class. You take turns adding one letter and try to make a word. Later, I tried searching for this game but couldn't find an exact match anywhere. The closest was Scrabble, but it was too complicated. So I decided to build my own. I made some modifications to make the game more challenging and fun. Back then, we would start with a blank board and also score two-letter words. Here, the board is prefilled with random letters so each game plays differently, and two-letter words don't score. The best thing I added was the combos. If your letter makes two or more words, you get a multiplier for each subsequent word, so the challenge becomes finding a way to score more combos. Initially, I wanted to assign values to each letter like Scrabble, but after running multiple AI-to-AI experiments, I concluded that having flat values per letter increases variance in the game and also reduces the first-turn advantage to 0. I still added a weighted game mode if you would like to give that a try as well. I also added daily puzzles where you get 5 boards and need to find the spot and letter that score the most. You can share the Wordle-like result with your friends. You can play directly on the web at https://ift.tt/1RAp0La or download it for free from the App Store at https://ift.tt/gPnUha1 https://letterphile.com June 26, 2026 at 03:37AM

Show HN: Every Team Is Building the Same Cache https://ift.tt/zr2pVnw

Show HN: Every Team Is Building the Same Cache https://ift.tt/bOijAHJ June 26, 2026 at 03:10AM

Show HN: Full featured language that compiles to binary https://ift.tt/pKmvXM2

Show HN: Full featured language that compiles to binary Features: 1. Self-hosting compiler 2. C99 backend 3. Built-in dependency injection / IoC 4. Typed business-rule features like decision tables 5. Native binaries + WASM 6. Real app built with it: eXstream https://ift.tt/ERkhTPy June 26, 2026 at 12:45AM

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion https://ift.tt/IBxm5cE

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion https://ift.tt/8hStksB June 25, 2026 at 09:34PM

Wednesday, June 24, 2026

Show HN: Dspyer – self-correcting, optimizable LLM steps for DSPy and LangGraph https://ift.tt/KpYFV74

Show HN: Dspyer – self-correcting, optimizable LLM steps for DSPy and LangGraph https://ift.tt/jwiQgJO June 25, 2026 at 02:38AM

Show HN: LookAway, a Mac break reminder that knows when not to interrupt https://ift.tt/swgeKXU

Show HN: LookAway, a Mac break reminder that knows when not to interrupt Hello, I'm Kushagra, the indie developer behind LookAway. (I've posted about it before, but it has received quite a lot of updates since then, so I am posting it again.) LookAway is a native break reminder for macOS that doesn't interrupt. I built it because I work from home and spend a lot of time in front of my screens. It's very easy for me to get lost in the flow, and I can end up sitting for hours. Because of this, I started having issues like eye strain and back pain by the end of the day. The solution was simply taking enough breaks throughout the day. But remembering to take breaks was difficult, especially when I was in the flow. I tried some reminder apps, but they always interrupted me at the worst moments, so I ended up not using them. LookAway is designed not to interrupt. It gives enough of a heads-up before a break so that you're not caught off guard. It's also context-aware: it automatically pauses when you join a meeting, start watching a video, record your screen, and much more. It even waits for you to finish typing or dictating when a break is due. One thing worth mentioning is the free iOS counterpart, LookAway Mirror. When your Mac goes on a break, your iOS devices can mirror the same break so you don't end up scrolling on your phone during the Mac break. I've spent a lot of time making LookAway the least annoying break reminder app, and I would love to know your thoughts. It's a native Swift app, so it doesn't use many resources (150MB RAM and <1% CPU when idle). It's available to download from the website (lookaway.com), Setapp, and the App Store. Thank you! https://lookaway.com June 24, 2026 at 06:59PM

Tuesday, June 23, 2026

Show HN: The Cascade Graph – An interactive map of AI and energy constraints https://ift.tt/wJGe0Ed

Show HN: The Cascade Graph – An interactive map of AI and energy constraints Hello, I wanted to share with you all an interactive map of the economic and physical constraints on the AI buildout. It covers macro drivers, industrial chokepoints, and where those show up in markets. I've added 393 nodes and 562 edges to capture other supply and physics constraints as well. There's no sign-up and no paywall; it's all free. Please let me know what you think! https://ift.tt/4CfYr19 June 23, 2026 at 08:52PM

Show HN: Wordit – Change One Letter, Keep the Chain Going https://ift.tt/hu9sB16

Show HN: Wordit – Change One Letter, Keep the Chain Going Hi everyone, I got this idea for a game where, starting from a four-letter word, you go as deep as you can into your vocabulary, changing only one letter per word: bear -> beer -> peer... Each correct word gives you 1 point. Each incorrect word costs you one life; you start with 3. https://ift.tt/0qdxLFf June 24, 2026 at 12:27AM
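The Wordit rules as stated (change exactly one letter, 1 point per valid word, 3 lives) can be sketched roughly as follows. This is a guess at the mechanics, not the game's code: `is_valid_step` and `play` are hypothetical names, the word list is a stand-in, and whether repeated words are allowed is not covered by the post, so this sketch does not restrict them.

```python
from typing import Iterable, Set, Tuple


def is_valid_step(previous: str, candidate: str, dictionary: Set[str]) -> bool:
    """True if candidate is a known word differing from previous in exactly one position."""
    if len(previous) != len(candidate):
        return False
    differences = sum(1 for a, b in zip(previous, candidate) if a != b)
    return differences == 1 and candidate in dictionary


def play(start: str, guesses: Iterable[str], dictionary: Set[str], lives: int = 3) -> Tuple[int, int]:
    """Replay a sequence of guesses and return (score, remaining lives)."""
    score = 0
    current = start
    for guess in guesses:
        if lives == 0:
            break  # out of lives: the chain ends
        if is_valid_step(current, guess, dictionary):
            score += 1
            current = guess  # the chain continues from the new word
        else:
            lives -= 1
    return score, lives
```

With a word list containing bear, beer, and peer, the chain from the post scores 2 points, and a non-word guess afterwards costs one life.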

Monday, June 22, 2026

Show HN: I scanned every YC Spring 2026 startup for what AI crawlers see https://ift.tt/XPyvnFL

Show HN: I scanned every YC Spring 2026 startup for what AI crawlers see I used 'potatometer.com' to scan and analyze the SEO / GEO / AEO technical setup of all 197 YC Spring 2026 startups. I scanned the URL each startup lists in YC's directory. Most are readable by AI crawlers. Most don't tell a crawler what they are. Read more in the blog post linked above. https://ift.tt/U5lhE0q June 23, 2026 at 08:10AM

Show HN: Durable Agent Sessions API (Preview) https://ift.tt/HhVZfFd

Show HN: Durable Agent Sessions API (Preview) https://ift.tt/0hdVuFL June 23, 2026 at 07:07AM

Show HN: Kitcat 2.0 – A Matplotlib back end for terminal plotting https://ift.tt/873ML5y

Show HN: Kitcat 2.0 – A Matplotlib back end for terminal plotting https://ift.tt/OWo1wsm June 22, 2026 at 11:00PM
