Daniel King

Continuous Development still has prerequisites

2026-05-20T00:00:00+00:00

There’s a pattern showing up in how teams talk about agentic coding. The agent writes code fast, faster than any of us could alone, so the next move feels obvious: ship it fast too. Out come the words that used to mark a mature engineering org. Trunk-based development. Feature flags. Continuous delivery, many releases a day. The reasoning goes that if generation is no longer the bottleneck, the release process shouldn’t be either.

I understand the pull. But most teams reaching for that machinery haven’t built the things that make it safe. They’re trying to adopt the destination without the road that gets you there.

Continuous Development is a set of practices that lean on each other, and they were never free. Jez Humble and Dave Farley wrote a whole book about the discipline that has to sit under a fast release. The DORA research that followed, led by Nicole Forsgren, makes the empirical case: the strongest teams release more often and break things less, because the underlying practices make speed and stability climb together rather than trade off.

Trunk-based development assumes a green main and small, frequent, reviewed commits. It falls apart the moment people start landing large unreviewed changes, and agents make large changes trivially easy to produce. Without that discipline, trunk-based dev is just everyone breaking main faster.

Feature flags assume a lifecycle, the kind Pete Hodgson spells out: someone owns each flag, release flags get retired once the feature beds in, dead ones get removed. The failure mode I see most often is subtler than forgetting to clean up: too many long-running features in flight at once. Each live flag forks the system’s behaviour, and flags combine. Three half-finished features means eight possible states, every one a configuration someone has to keep working and verify before anything ships. That compatibility burden flows downstream and drags on the pipeline, the opposite of the speed the flags were supposed to buy. Skip the discipline altogether and flags stop being a safety mechanism and turn into a second debt pile layered on the first.
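The "three flags, eight states" arithmetic is just 2^n, and a minimal sketch makes the growth easy to see. The flag names below are hypothetical, chosen only for illustration.

```python
from itertools import product

def flag_states(flags):
    """Return every on/off combination for the given flag names."""
    return [
        dict(zip(flags, combo))
        for combo in product([False, True], repeat=len(flags))
    ]

# Hypothetical flag names, not taken from any real system.
flags = ["new_checkout", "search_v2", "bulk_export"]
states = flag_states(flags)

print(len(states))  # 2 ** 3 combinations
for state in states:
    enabled = [name for name, on in state.items() if on]
    print(enabled or ["(all off)"])
```

Each extra live flag doubles the list, which is why retiring release flags promptly matters more than it first appears.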

Fast releases assume observability and a quick rollback. Ship many times a day with neither and you’ve optimised for shipping bugs faster while detecting them slower.

And all of it assumes a test suite you actually trust. If you can’t tell whether a change is correct without a human reading it carefully, you don’t have continuous anything. You have a queue of changes waiting on the expensive step.

Here’s the part I find genuinely interesting. The same agents driving the rush can build these foundations fast. Ask Claude Code to add structured logging, wire up flag cleanup, or write the rollback runbook, and it does. The capability that makes people want to skip the discipline is the same one that makes it cheap to acquire.

Testing is the clearest example, and it comes with a catch. Agents are very good at covering code that already exists, and they sit well with the test pyramid. Unit tests give the best return, quick to generate and cheap to run. Integration tests come next, still strong. User-facing tests are where they struggle, because driving a real UI is slow and the assertions are brittle. A suite that leans heavily on UI tests is the worst case twice over: the layer the agent helps least with, and the slow, flaky one a continuous pipeline can least afford. Lean on the top of the pyramid and CD will fight you, agent or not.

But only if you ask. The agent builds what you point it at, and most people are pointing it at features. The foundation work doesn’t happen unless someone decides it matters and frames the questions that lead there. That’s a culture problem, not a tooling one. A team that values a green main and a trustworthy suite will use the agent to protect both. A team that only counts features shipped will use the same agent to pile features onto a foundation that’s quietly cracking.

This reframes what experience is for. You no longer need twenty years behind you to stand up a CD pipeline; the agent wires up the mechanics for someone who has never built one. What twenty years buys is different: knowing which questions to ask before you start, and checking the first-principle assumptions the setup quietly rests on. The experienced engineer spots the blank spaces and the assumptions that don’t hold, the “we’ll add rollback later,” the test that asserts nothing, the flag no one will ever turn off. Those are cheap to fix now and genuinely expensive to discover in production. That instinct is the part you still have to bring. Ask the agent the right question and it will reason about any of them as well as anyone. It just won’t raise what you didn’t think to ask.

So the order matters more than ever, not less. Build the hygiene first, or build it alongside, but build it. The agent will happily help you go fast in either direction, including the wrong one.

I trained a sprite model with agents. The data was the bottleneck.

2026-05-06T00:00:00+00:00

I just published pixel-llm, a small autoregressive transformer that generates 32x32 pixel art sprites of reef sea creatures. About 2.9 million parameters, a 64-colour palette, runs on consumer hardware. Built end to end through agent sessions, with me steering rather than typing.

The output is sub-par. I am sharing it anyway, because the way it failed taught me something I did not expect.

The setup was narrow on purpose. I picked sea creatures because the visual vocabulary is constrained: a few zones (shallows, twilight, midnight, abyss, hadal) and a few categories (reef fish, grazer, coral, jellyfish, cephalopod, plus an abyssal catch-all). A small, well-defined domain felt like the right shape for a small model. Six categories, five zones, thirty cells in the grid. Tractable on paper.

The model itself fell out fast. Agents wrote the transformer, the KV-cache inference loop, the sprite breeding via partial completion, and the post-process palette-aware shader. That last piece is the strongest output. The model produces flat colour-indexed sprites and a separate procedural shader applies directional light and ambient occlusion, staying inside the 64-colour palette by walking pre-computed luminance ramps.
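To make the ramp-walking idea concrete, here is a rough sketch of how a shader can light a colour-indexed sprite without leaving the palette. It is not the pixel-llm implementation: the ramps, the transparent index, and the edge-based lighting rule are all assumptions for illustration, and ambient occlusion is left out.

```python
TRANSPARENT = 0  # assumption: palette index 0 means "no pixel"

# Hypothetical pre-computed ramps: palette indices ordered dark -> light.
RAMPS = [
    [1, 2, 3, 4],  # e.g. a blue ramp
    [5, 6, 7, 8],  # e.g. an orange ramp
]

# Reverse lookup: palette index -> (its ramp, position within that ramp).
RAMP_POS = {idx: (ramp, pos) for ramp in RAMPS for pos, idx in enumerate(ramp)}

def shift(index, steps):
    """Move `steps` along the colour's ramp, clamped to the ramp ends."""
    if index not in RAMP_POS:
        return index
    ramp, pos = RAMP_POS[index]
    return ramp[max(0, min(len(ramp) - 1, pos + steps))]

def shade(sprite):
    """Light from the top-left: lighten edges facing it, darken the opposite ones."""
    h, w = len(sprite), len(sprite[0])

    def at(y, x):
        return sprite[y][x] if 0 <= y < h and 0 <= x < w else TRANSPARENT

    out = [row[:] for row in sprite]
    for y in range(h):
        for x in range(w):
            if sprite[y][x] == TRANSPARENT:
                continue
            lit = at(y - 1, x) == TRANSPARENT or at(y, x - 1) == TRANSPARENT
            shadowed = at(y + 1, x) == TRANSPARENT or at(y, x + 1) == TRANSPARENT
            out[y][x] = shift(sprite[y][x], int(lit) - int(shadowed))
    return out

flat = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
for row in shade(flat):
    print(row)
```

Because every output value is read back out of a ramp, the shaded sprite can only contain colours that were already in the palette.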

Where the categories worked, you can see what I was after. Where they did not, you can see that too: two of the six categories (cephalopod and one abyssal column) never converged. Pure noise, regardless of sampling temperature.

I iterated the training data four times. A procedural synthetic generator. Wikimedia Commons photographs, downloaded and palette-quantised. Sprite sheet extraction from OpenGameArt. A mixed corpus stitched together from all three. The validation loss kept going down. The samples for those two categories kept looking wrong. The other four held up well enough to look at.

That is the part I want to flag. Loss is not taste. The agentic loop has a fast, local correctness signal for the code: does it run, does the loss go down, does it not crash. It does not have a corresponding signal for the data. Whether a corpus is the right shape for a problem is a slow, aesthetic judgment that arrives after a training run, after staring at sample grids, after a cycle measured in hours rather than seconds. Agents cannot close that loop on their own yet.

So the work split cleanly. The model code, training scaffold, sampler, breeder, and shader were straightforward agent output. The data choices were the part where I had to keep showing up.

This connects back to something I wrote about in April. When agents take over execution, the premium activity is the layer above. For a coding agent that layer is verification. For a research-flavoured agent loop, it is data curation: deciding what the model should see, recognising when the existing corpus is wrong, and recognising when the iteration has hit its ceiling.

Knowing when to stop is itself the call. After the fourth dataset I judged that the agentic loop had run out of useful moves for this architecture. The next step would not be more data, it would be a different model shape. I called time, wrote the README honestly, and shipped.

The repo is up at github.com/danfking/pixel-llm with the sample images and a fuller writeup. The interesting thing in there is not the trained model. It is the trail.

Anyone can cook (and the kitchen is opening up too)

2026-05-05T00:00:00+00:00

At a quarterly meeting in our group last week, senior leadership made the expectation clear: everyone should be using agentic tooling to build apps that solve problems they actually have. Not just engineers. Everyone.

The Ratatouille line keeps surfacing for me, and for my tech lead, who reached for it independently. “Anyone can cook,” Gusteau said. Anton Ego refined it later: “a great cook can come from anywhere.” More are growing into great cooks than I’d have guessed, and a great one can show up from anywhere, including, in the film, a rat.

Agentic tooling is doing the same thing for code. The barrier to writing something that runs has dropped sharply. Someone in operations or marketing can describe what they want and watch a working prototype appear. That part is exciting and worth taking seriously.

The same tooling opens the rest of the stack to people who don’t ship code. Someone who only saw the front end can now ask grounded questions about source, security, and pipeline. The kitchen is becoming visible from the dining room.

But code is one part of a meal. A restaurant kitchen has consistent supply, food safety, plating, and prep timing. Software at scale needs the equivalent: architecture that won’t fold under load, security review, repeatable deployment, observability, requirements that don’t shift under you.

Not every meal goes on the same menu. A marketing team’s prototype doesn’t need the architecture, security, and operations of customer-facing enterprise software. If it’s a prototype, there’s no end to own. If it’s meant to be more, build for ownership from day one. The same tooling can help you reason about its security and deployment, with the production kitchen alongside when stakes are real. A half-cooked plate handed off later leaves the rest of the work with the production kitchen. When more people own their plates from the start, the gap between their work and the production kitchen’s narrows.

Tooling investment is uneven. Most energy goes to prompt-to-code; less to prompt-to-security-review or prompt-to-deployment. The kitchen is becoming legible faster than it’s becoming agentic.

I’m watching teams cross-skill vertically. Developers picking up bits adjacent to code that AI now puts in reach. People in business roles asking grounded questions about parts of the stack they previously only saw the front of, and acting on more of their own ideas.

A great cook can come from anywhere. A working restaurant takes a team that reads the whole kitchen and shares the plating more than it used to.

Show HN by the Numbers: 188,000 Posts, 14 Years of Data, and What Actually Predicts GitHub Stars

2026-04-23T00:00:00+00:00

Does it actually matter when you post your Show HN? And does a front-page run translate into GitHub stars? I scraped 188,085 Show HN posts and cross-referenced the top 500 with their GitHub star histories to find out.

TL;DR

The median Show HN scores 2 points. If you hit 50, you’re in the top 6%.

Best posting time: Monday 00:00 UTC (Sunday 7pm Eastern), with a 10.8% chance of scoring 50+.

Each HN upvote converts to roughly 1.4 GitHub stars within 48 hours.

The half-life of a Show HN is 24 hours. After 48h, 92% of the star impact is over.

Show HN volume has nearly tripled since 2019 (28,000 posts in 2025). Your post now competes with ~200 others per day.

HN score and GitHub stars correlate at r = 0.29. Significant, but your HN score explains only 8% of the variance in stars.

Comments don’t predict stars (r = 0.10). Discussion doesn’t mean conversion.

The dataset

Every Show HN post from 2012 to April 2026, pulled from the HN Algolia API. 188,085 posts total, of which 51,338 (27%) link to a GitHub repo. For the GitHub correlation analysis, I fetched stargazer timestamps for 491 of the top 500 repos by HN score (all scoring 258+), using GitHub’s star-with-timestamps API.

Some caveats upfront. The Algolia API records final scores, not time-series, so I can’t tell you how long a post sat on the front page. The stargazer API caps at 1,000 stars per repo in my sampling window, which means the 48h star counts for very popular repos are underestimates. The dataset is biased toward high-scoring posts for the star analysis, since I couldn’t practically fetch star histories for all 51k repos.

With those limitations acknowledged, the patterns are clear enough to be useful.

Show HN is booming (and getting noisier)

The most striking trend in the data isn’t about timing at all. Show HN submissions have exploded. From 2012 to 2019, the platform saw a steady ~10,000 Show HN posts per year. COVID lockdowns pushed this to 15,000 in 2020. Then came ChatGPT.

Starting in late 2022, submissions began climbing steadily, and 2025 hit 28,302 posts. That’s nearly a 3x increase from the pre-COVID baseline. Whether this is because more people are building things (thanks to AI-assisted development) or because more people are treating Show HN as a launch channel is hard to say. Probably both.

The practical implication: your Show HN is now competing with roughly 200 other Show HN posts on any given day, up from about 30 a decade ago. The signal-to-noise ratio has changed dramatically.

What “normal” looks like

The median is 2, the 90th percentile is 24, and the 99th percentile is 263.

Before talking about what works, it helps to calibrate expectations. The median Show HN post scores 2 points. The mean is 13.5, dragged up by the long right tail.

If your Show HN gets 5 points, you’re already above the median (though still below the mean). If it hits 50, you’re in the top 6%. And if you crack 250, you’re in roughly the top 1% of all Show HN posts ever submitted.

Most Show HN posts simply don’t gain traction. That’s not necessarily a reflection of quality. The new/rising page is a crowded, fast-moving queue, and a post can easily get buried in minutes if it doesn’t catch an early upvote or two.

When to post: the heatmaps

This is the question everyone asks, and I have both the expected answer and some surprises.

Where the scores are

Mean scores show a pattern that differs from post volume. The highest mean scores cluster around 12:00 to 15:00 UTC (7-10am Eastern), which is just before the main wave of competition hits. Sunday morning UTC also performs well. Fridays at 12-15 UTC (mean 18.0) and Sundays at 16-19 UTC (17.3) are the best blocks.

Your actual odds

The best individual slot is Monday 00:00 UTC (10.8%). The worst is Thursday 06:00 UTC (2.6%).

The most actionable view: what percentage of posts at each timeslot score 50 or higher? (Recent data only, 2021 onwards.)

The best slot, by a significant margin, is Monday at 00:00 UTC (Sunday 7pm Eastern), where 10.8% of posts reach 50+. The worst slots are mid-week during the early UTC morning, particularly Thursday at 06:00 UTC (2.6%). This makes intuitive sense: you’re posting when the audience hasn’t yet arrived for the day.
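For anyone wanting to reproduce this kind of table, a minimal sketch of the per-slot success rate follows. It assumes each post is a (Unix timestamp, final score) pair; the three sample posts are made up and are not from the dataset.

```python
from collections import defaultdict
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def success_rate_by_slot(posts, threshold=50):
    """Map (weekday, hour) in UTC to the share of posts scoring >= threshold."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for created_at, score in posts:
        t = datetime.fromtimestamp(created_at, tz=timezone.utc)
        slot = (WEEKDAYS[t.weekday()], t.hour)
        totals[slot] += 1
        if score >= threshold:
            hits[slot] += 1
    return {slot: hits[slot] / totals[slot] for slot in totals}

def ts(year, month, day, hour):
    """Helper: Unix timestamp for a UTC date and hour."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())

# Made-up posts, for illustration only.
sample = [
    (ts(2024, 1, 1, 0), 120),  # a Monday, 00:00 UTC
    (ts(2024, 1, 8, 0), 3),    # the following Monday, 00:00 UTC
    (ts(2024, 1, 4, 6), 1),    # a Thursday, 06:00 UTC
]
rates = success_rate_by_slot(sample)
print(rates)
```

Grouping by UTC rather than local time keeps the slots comparable across daylight-saving changes.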

Note that the 4-hour blocks above smooth out the granular peaks. The individual best hours are Monday 00:00 (10.8%), Sunday 02:00 (9.8%), and Saturday 19:00 (9.2%).

Score vs. competition

The gap between competition (post volume) and score is where the opportunity sits. The widest gap, where scores are high relative to competition, is around 00:00 to 01:00 UTC (US evening). The narrowest gap is around 15:00 UTC (US late morning), where both volume and scores peak together.

The takeaway: posting right at the start of the US workday (early morning Eastern) catches the audience as they’re arriving, before the day’s competition has accumulated.

The real question: does HN performance predict GitHub stars?

This is the analysis nobody else has done, and the answer is more nuanced than I expected.

The correlation

Overall correlation: r = 0.285 (p = 1.2e-10). Comments vs stars: r = 0.102 (p = 2.4e-02).

The correlation between HN score and GitHub stars gained in 48 hours is r = 0.29 (p < 0.001). Statistically significant, but modest. A higher HN score does predict more stars, but it explains only about 8% of the variance.
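The "8% of the variance" figure is simply r squared, and the significance follows from the usual t-statistic for a Pearson correlation. The sketch below assumes all 491 repos entered the correlation, which the text does not state explicitly.

```python
import math

def explained_variance(r):
    """Share of variance explained by a linear relationship with correlation r."""
    return r * r

def t_statistic(r, n):
    """t-statistic for testing a Pearson r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

n = 491             # assumption: every fetched repo is in the sample
r_score = 0.285     # HN score vs 48h stars
r_comments = 0.102  # HN comments vs 48h stars

print(f"score:    r^2 = {explained_variance(r_score):.3f}, t = {t_statistic(r_score, n):.2f}")
print(f"comments: r^2 = {explained_variance(r_comments):.3f}, t = {t_statistic(r_comments, n):.2f}")
```

On those assumptions the score explains about 8% of the variance and the comment count about 1%, which is why a correlation can be clearly non-zero and still be a weak predictor.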

What’s interesting is the diminishing conversion rate. Posts scoring 258-350 average 1.77 stars per HN point. Posts scoring 700+ average only 0.79 stars per point. The relationship is sublinear: doubling your HN score does not double your GitHub impact.

Comments are an even weaker predictor (r = 0.10). A lively comment section on HN doesn’t mean people are heading to your repo to star it.

The conversion rate by category

Each HN upvote translates to roughly 1.4 GitHub stars in the 48-hour window (median across all repos in the sample). AI/ML projects and CLI tools convert slightly better, perhaps because HN’s audience skews toward power users who star tools they might actually use. But the difference is small. The category of your project matters far less than simply getting upvotes.

The half-life of a Show HN

This is, to me, the most important finding in the entire analysis.

Day 1 spike: 1,188x baseline. After 48h, ~92% of star-getting is over.

The median successful Show HN project (scoring 258+) goes from 0.4 stars per day before posting to 509 stars on Day 1. That’s roughly a 1,200x spike. By Day 2, it’s dropped to 40. By Days 3 through 7, it averages 9 per day. By Days 8 to 30, it’s back to zero.

The half-life of a Show HN bump is almost exactly 24 hours. After 48 hours, 92% of the star-getting is over. After a week, it’s done.
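The 92% figure can be checked from the daily medians quoted above. This is back-of-envelope arithmetic: it treats the medians as if they added up to one typical launch, which is an approximation.

```python
# Daily median stars quoted in the text for posts scoring 258+.
daily = {1: 509, 2: 40}
daily.update({day: 9 for day in range(3, 8)})   # Days 3-7 average 9 per day
daily.update({day: 0 for day in range(8, 31)})  # Days 8-30 are back to zero

total_30d = sum(daily.values())
first_48h = daily[1] + daily[2]

print(f"30-day total:   {total_30d}")
print(f"first 48 hours: {first_48h} ({first_48h / total_30d:.1%})")
print(f"day 1 alone:    {daily[1]} ({daily[1] / total_30d:.1%})")
```

Day 1 on its own accounts for well over four fifths of the 30-day total, which is the basis for treating the launch as a single-day event.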

This has a practical consequence that I think is underappreciated. A Show HN launch is not a growth strategy. It’s a pulse. You get one day of intense attention, and then it’s over. If your project doesn’t have a growth flywheel beyond HN (SEO, word of mouth, integrations, a community), the stars you gain on Day 1 are essentially all you’re going to get from this channel.

Some of the most-starred repos in the sample (lazygit at 76k, pocketbase at 57k, nocodb at 62k) gained relatively few stars from their Show HN despite scoring 500+. Why? Because they already had growth engines running. The Show HN was a blip in their overall trajectory, not the main driver.

What this means for your launch

If you’re planning a Show HN, here’s what the data actually supports:

1. Post on Sunday evening or Monday morning (US time). The success rate is highest at Monday 00:00 UTC (Sunday 7pm Eastern). Sunday posts generally face less competition and have more time to accumulate upvotes before the Monday rush.

2. Expect roughly 1.4 GitHub stars per HN upvote. If you hit 100 points, plan for ~100 to 150 stars. If you hit 500, maybe 500 to 700. This is a useful mental model for setting realistic expectations.

3. Your window is 24 hours, not 48. The vast majority of the star impact happens on Day 1. Make sure your repo’s README, demo, and documentation are polished before you post, not after.

4. Comments don’t predict stars. A spirited comment section is fun, but it doesn’t correlate with people actually visiting your repo. Don’t mistake engagement for conversion.

5. The HN channel is getting noisier. With 200+ Show HN posts per day in 2026 (up from 30 a decade ago), the base rate for success keeps dropping. Treat Show HN as one launch channel among several, not your entire go-to-market strategy.

Methodology notes

Data source: HN Algolia API (hn.algolia.com/api/v1), queried with the show_hn,story tag filter. Paginated by month with recursive time-window splitting to stay within the API’s 1,000-result limit per query.
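The recursive time-window splitting can be sketched as follows. This is an illustration of the technique, not the original script: `count_hits` is a hypothetical callable that returns how many posts fall in a half-open [start, end) window of Unix timestamps. In practice it would wrap a request to the search API; here it is faked with an in-memory list so the example runs offline.

```python
from bisect import bisect_left

def split_windows(start, end, count_hits, limit=1000):
    """Return [start, end) windows that each hold at most `limit` results."""
    if count_hits(start, end) <= limit or end - start <= 1:
        return [(start, end)]
    mid = (start + end) // 2
    return (split_windows(start, mid, count_hits, limit)
            + split_windows(mid, end, count_hits, limit))

# Fake data: one post per second for 5,000 seconds (illustration only).
fake_timestamps = list(range(5000))

def fake_count(start, end):
    """Count fake posts with start <= timestamp < end."""
    return bisect_left(fake_timestamps, end) - bisect_left(fake_timestamps, start)

windows = split_windows(0, 5000, fake_count)
print(len(windows), windows[:2])
```

Halving by time rather than by count means busy periods get split more finely than quiet ones, so the number of requests adapts to posting volume.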

GitHub data: Star timestamps fetched via the GitHub API’s application/vnd.github.v3.star+json accept header for 491 of the top 500 repos by HN score (all scoring 258+). For repos with more than 1,000 stars, binary search was used to find the page window around the HN post date, then surrounding pages were fetched. This means the 48h star count is accurate for repos with fewer than ~1,000 stars in that window, and a lower bound for larger repos.
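A minimal sketch of the page-level binary search, under two assumptions: stargazer pages come back in chronological order, and `first_star_on_page` is a hypothetical helper that fetches one page and returns the timestamp of its first star. The fake below stands in for the real API call.

```python
def find_page(target, last_page, first_star_on_page):
    """Return the last 1-indexed page whose first star is at or before `target`.

    Falls back to page 1 when the target predates every star.
    """
    lo, hi = 1, last_page
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if first_star_on_page(mid) <= target:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Fake repo: 1,000 stars at timestamps 0..999, 100 per page (illustration only).
PER_PAGE = 100

def fake_first_star(page):
    return (page - 1) * PER_PAGE

page = find_page(target=455, last_page=10, first_star_on_page=fake_first_star)
print(page)  # the page holding timestamps 400..499
```

Each probe costs one request, so a repo with thousands of pages needs only a dozen or so lookups before the surrounding pages are fetched.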

Time period: January 2012 to April 2026 for the full dataset. Timing analysis (success rate heatmap) uses 2021 onwards for relevance.

Code: Analysis scripts are Python, using pandas and scipy. Charts generated with kuva, a scientific plotting CLI by James Ferguson. Available on request.

I’m Dan King, and I built this analysis after posting my own Show HN for Burnish, an MCP protocol renderer. It scored 1 point. No traction, no front page, no star bump. I’m part of the 29%. That’s what motivated the analysis: I wanted to know whether I’d done something wrong, or whether this is just what the distribution looks like. Turns out, it’s mostly the distribution.

Verification is the expensive thing now

2026-04-23T00:00:00+00:00

Martin Fowler’s latest fragments post collects several ideas about how AI is reshaping software development. The one that stuck with me is Ajey Gore’s argument: as coding agents take over execution, verification becomes the premium activity.

Gore puts it bluntly. Instead of ten engineers building, you might have three engineers plus seven people defining acceptance criteria and designing tests. The bottleneck moves from “can we write the code?” to “do we know whether the code is right?”

This matches what I see daily. I run multiple Claude Code sessions in parallel, each producing working code at a pace I couldn’t match alone. The hard part is never the generation. The hard part is knowing whether what came out actually does what I intended, handles the edges I care about, and doesn’t quietly break something else. And it’s not just me who needs to know. My team members need to look at that same output and reach the same confidence, often without the context I had when I prompted it.

The cultural shift Gore describes is the part most teams will struggle with. Your Monday standup changes. Instead of “what did we ship?” the question becomes “what did we validate?” Instead of tracking output, you’re tracking whether the output was right. That reframes what it means to be productive. An engineer who catches a subtle misalignment in generated code before it ships has done more valuable work than one who prompted three features into existence without checking them.

This connects to something else Fowler highlights in the same post: Margaret-Anne Storey’s concept of “intent debt,” where the goals guiding a system are poorly documented or maintained. If you can’t clearly articulate what the system should do, you can’t verify that it does it. Intent debt was always a problem, but it was partially hidden when the same person writing the code also held the intent in their head. When an agent writes the code, that implicit knowledge gap becomes a concrete failure mode.

I think the teams that figure out verification workflows early will have a real advantage. Not just automated tests (though those matter), but the whole practice of clearly stating intent, reviewing output critically, and building confidence that what shipped is what was meant.

Small productivity hack that changed how I work with Claude Code

2026-04-22T00:00:00+00:00

I typically have half a dozen Claude Code sessions running at once, spread across different terminals and monitors, some hidden behind other windows. The visual “done” indicator is easy to miss when you’re not looking at the right terminal.

About a month ago I added a global hook that plays a short chime whenever any session finishes a task. Two minutes to configure. Can’t live without it now.

The difference is about flow. Before, I’d either stare at a terminal waiting, or context-switch and then keep interrupting myself to check which session was ready. Now the chime pulls me back at exactly the right moment. I stay in whatever I’m doing until I hear it, then go find the session that needs me. It’s kept me in a flow state in a way I genuinely didn’t expect from something so simple.

How to set it up

Drop this into your ~/.claude/settings.json:

{
  "hooks": {
    "Notification": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "powershell.exe -NoProfile -Command \"(New-Object Media.SoundPlayer 'C:\\Windows\\Media\\chimes.wav').PlaySync()\""
          }
        ]
      }
    ]
  }
}

That’s it. Windows has the sound file built in. For Mac/Linux, swap the command for afplay or paplay with a sound file of your choice.

The empty matcher means it fires on every notification, regardless of which project or session triggered it. Claude Code sends a notification whenever a session finishes and is waiting for input, which is exactly the moment you want to know about.
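If the same settings file is shared across machines, one option is to point the hook at a small script that picks the player for the current platform. The sketch below is an assumption-heavy illustration: the macOS and Linux sound paths are common defaults that may not exist on every system, and the script name is up to you.

```python
import platform
import subprocess

def chime_command(system):
    """Return the argv list that plays a short sound on the given platform."""
    if system == "Darwin":
        # Assumed path: a stock macOS system sound.
        return ["afplay", "/System/Library/Sounds/Glass.aiff"]
    if system == "Linux":
        # Assumed path: provided by the freedesktop sound theme, if installed.
        return ["paplay", "/usr/share/sounds/freedesktop/stereo/complete.oga"]
    if system == "Windows":
        return [
            "powershell.exe", "-NoProfile", "-Command",
            "(New-Object Media.SoundPlayer 'C:\\Windows\\Media\\chimes.wav').PlaySync()",
        ]
    raise ValueError(f"no chime configured for {system!r}")

def play():
    """Play the chime, staying silent if the player or platform is missing."""
    try:
        subprocess.run(chime_command(platform.system()), check=False)
    except (FileNotFoundError, ValueError):
        pass
```

Saved as a script with a call to `play()` at the bottom, this can replace the PowerShell one-liner in the hook's `command` field, for example as `python3` followed by the script's path.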