I find it interesting that folks are so focused on cost for AI models. Human time spent redirecting AI coding agents towards better strategies and reviewing their work remains dramatically more expensive than the token cost of AI coding, for anything other than hobby work (where you're not paying for the human labor). $200/month is an expensive hobby, but it's negligible as a business expense; Salesforce licenses cost far more.
The key question is how well a given model does the work, which is a lot harder to measure. But I think token costs are still an order of magnitude below the point where a US-based developer using AI for coding should be asking questions about price; at current price points, the cost/benefit question is dominated by what makes the best use of your limited time as an engineer.
We already shipped 3 things this year built using Claude. The biggest one was porting two native apps into one React Native app, which was originally estimated to be a 6-7 month project for a 9 FTE team and ended up being a 2-month project with 2 people. To me, the economic value of a Claude subscription used right is in the range of 10-40k EUR, depending on the type of work and the developer driving it. If Anthropic jacked the prices 100x today, I'd still buy the licenses for my guys.
Edit: ok, if they charged 20k per month per seat I'd also start benchmarking the alternatives and local models, but for my business case, running a 700M budget, Claude brings disproportionate benefits, not just in time saved in developer costs, but also faster shipping times, reduced friction between various product and business teams, and so on. For the first time we generally say 'yes' to whatever frivolities our product teams come up with, and that's a nice feeling.
Who's going to review that output for accuracy? We'll leave performance and security as unnecessary luxuries in this day and age.
In my experience, even Claude 4.6's output can't be trusted blindly: it'll write flawed code, then write tests that test that same flawed code, giving a false sense of confidence and accomplishment, only for the flaws to be revealed on closer inspection later.
Additionally, it's an age-old fact that code is always easier to write than to read and understand (that was true even prior to AI, and even if you were the original author yourself), so I'm not so sure this much generative output from probabilistic models will be so flawless that nobody needs to read and understand that code.
I am not sure how others are doing this, but here is our process:
- meaningful test coverage
- internal software architecture was explicitly baked into the prompts, and we try not to go wild with vibing; rather, we spec it well and keep Claude on a short leash
- each feature built was followed by a round of refactoring (with Claude, but with the oversight of an opinionated human). We spend 50% of the time building and 50% refactoring, at least; sometimes it feels like 30/70. Code quality matters to us, as those codebases are large, and not doing this leads to a very noticeable drop in Claude's perceived 'intelligence'.
- performance tests as per usual - designed by our infra engineers, not vibed
- static code analysis, and a hierarchical system of guardrails (a small claude.md plus lots of files referenced there for various purposes; see the sketch after this list). I'm not quite fond of how that works; Claude has always been very keen to ignore instructions and go its own way (see: "short leash, refactor often").
- pentests with regular human beings
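One thing that helps with the "Claude ignores instructions" problem is treating the guardrail hierarchy itself as something you can test. Here's a minimal sketch of that idea; the file name and the markdown-link convention are assumptions about a setup like ours, not a standard:

```ts
// check-guardrails.ts -- hypothetical sketch: verify that every doc the
// top-level CLAUDE.md links to actually exists, so the guardrail hierarchy
// never silently rots. Run it in CI next to the linters.
import { existsSync, readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";

const root = "CLAUDE.md";
const text = readFileSync(root, "utf8");

// Collect relative markdown links like [testing rules](docs/testing.md).
const refs = [...text.matchAll(/\]\(([^)]+\.md)\)/g)].map((m) => m[1]);

const missing = refs.filter((p) => !existsSync(resolve(dirname(root), p)));
if (missing.length > 0) {
  console.error("Guardrail files referenced but missing:", missing);
  process.exit(1);
}
console.log(`All ${refs.length} referenced guardrail files are present.`);
```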
The one project I mentioned, 2 months for a complete rewrite, was about a week of working on the code and almost 2 months spent on reviews and tests, and of course some of that time was wasted as we were doing this for the first time on such a large codebase. The rewritten app has been doing fine in production for a while now.
I can only compare the outputs to the quality of the outputs of our regular engineering teams. It compares fine vs. good dev teams, IMHO.
The part about refactoring is very interesting and reassuring. I sometimes think I'm holding it wrong when I end up refactoring most of the agent's code towards our "opinionated" style, even after laying it out in md files. Thank you very much for this insight.
Thanks! In our limited experience, Claude does not focus that much on guardrails and code quality when building a feature, but can be pretty focused on code quality and architecture when asked to do just that. So: a few hours to iterate on a feature, a few hours to refactor. Rinse and repeat.
Very nice insight; that's where the value is. Even with a lot of time spent refactoring, testing, and reviewing, the code-writing phase is compressed so much that it's still worth using an imperfect LLM. Even with humans we have all those post-phases, so a good structure around the code generation leads to a lot of gains.
It depends on industries and what’s being developed for sure
I don't want to defend LLM-written code, but this is true regardless of whether code is written by a person or a machine. There are engineers who will put in the time to learn and optimize their code for performance and focus on security, and there are others who won't. That has nothing to do with AI writing code. There is a reason why most software is so buggy and all software has identified security vulnerabilities, regardless of who wrote it.
I remember how website security was before frameworks like Django and RoR added default security features. I think we will see something similar with coding agents: they will just run skills/checks/MCPs/... that have performance, security, resource management, and so on built in.
I have done this myself. For all apps I build I have linters, static code analyzers, etc running at the end of each session. It's the cheapest default, run in a very strict mode. Cleans up most of the obvious stuff almost for free.
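As a rough illustration, the session-end pass can be as simple as a script that runs each check and fails loudly on the first problem. This is a minimal sketch assuming a Node/TypeScript project; the tool list is an example, not a prescription:

```ts
// post-session-checks.ts -- a minimal sketch of "run linters and static
// analyzers at the end of each session". Assumes a Node/TypeScript project;
// swap in whatever tools your stack uses.
import { execSync } from "node:child_process";

const checks = [
  "npx eslint . --max-warnings 0", // strict mode: any warning fails the run
  "npx tsc --noEmit",              // type-check only, produce no build output
];

for (const cmd of checks) {
  console.log(`$ ${cmd}`);
  try {
    execSync(cmd, { stdio: "inherit" }); // throws on a non-zero exit code
  } catch {
    console.error(`Post-session check failed: ${cmd}`);
    process.exit(1);
  }
}
console.log("All post-session checks passed.");
```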
> For all apps I build I have linters, static code analyzers, etc running at the end of each session.
I think this is critically underrated. At least in the TypeScript world, linters are seen as kind of a joke (oh, you used tabs instead of spaces), but they can definitely prevent real bugs if you spend some time, even vibe coding, on some basic code-smell rules (exhaustive deps in React hooks is one such thing).
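For concreteness, here's roughly what that rule looks like in an ESLint flat config. A sketch only: it assumes eslint-plugin-react-hooks is installed, and your plugin set will differ:

```ts
// eslint.config.mjs -- sketch of "basic code-smell rules" beyond style nits.
import reactHooks from "eslint-plugin-react-hooks";

export default [
  {
    files: ["**/*.{ts,tsx}"],
    plugins: { "react-hooks": reactHooks },
    rules: {
      // Hooks must be called unconditionally, at the top level of a component.
      "react-hooks/rules-of-hooks": "error",
      // The code-smell rule mentioned above: flag stale or missing hook deps.
      "react-hooks/exhaustive-deps": "error",
    },
  },
];
```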
Well it's all tradeoffs, right? 6 months for 9 FTEs is 54 man-months. 2 months for 2 FTEs is 4 man-months. Even if one FTE spent two extra months perusing every line of code and reviewing, that's still 6 man-months, resulting in almost 10x speed.
Let's say you don't review. Those two extra months probably turn into four extra months of finding bugs and stuff. Still 8 man-months vs 54.
Of course this is all assuming that the original estimates were correct. IME building stuff using AI in greenfield projects is gold. But using AI in brownfield projects is only useful if you primarily use AI to chat to your codebase and to make specific scoped changes, and not actually make large changes.
I do greenfield work in fluid dynamics and Claude doesn't help: I need to be able to justify each line of my code (the physics part), and Claude doesn't help with that.
On the UI side Claude helps a lot, so for me I'd say I get a 25% productivity increase. I work like this: I put the main architecture of the code in place by hand, to get a "feel" for it. Once that is done, I ask Claude to make incremental changes and review them. Very often, Claude does an OK job.
What I have a hard time with is getting Claude to automatically understand my class architectures: more often than not, it tries to guess information about objects in the app by querying the GUI instead of the data model. Odd.
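To make the pattern concrete, here's a toy contrast in TypeScript (all names here are invented for illustration, not from any real codebase):

```ts
// Hypothetical sketch of the anti-pattern described above. Simulation,
// Particle, and listWidget are invented names for illustration only.
type Particle = { x: number; y: number };

class Simulation {
  particles: Particle[] = [{ x: 0, y: 0 }, { x: 1, y: 2 }];
}

// Stand-in for a GUI list that merely mirrors the simulation state.
const listWidget = { rowCount: () => 2 };

const sim = new Simulation();

// What the agent tends to generate: derive state from the GUI mirror.
const fromGui = listWidget.rowCount();

// What the class architecture intends: ask the data model directly.
const fromModel = sim.particles.length;

console.log(fromGui === fromModel); // true today, but only by coincidence
```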
Your estimate of "6-7 month project for a 9 FTE team" was probably waaay off. I mean, what is this mobile app? Without even seeing your app, I would say 2 months TOPS with 2 devs. So, the "AI" version is really not that much better, and probably even worse.
You copied two human-coded native apps into a vibe-coded React Native app? If the vibe coding is so good, why wouldn't you keep the native apps and vibe code on top of them, instead of spending a bunch of money to reach feature parity with a worse version?
That’s exactly what I’ve become. A monkey typing prompts and pressing Enter to confirm plans and actions.
Obviously I am exaggerating but my days shifted from figuring out issues and coming up with solutions to explaining the issue to Claude and supervising the work.
Two things worry me:
1. Current models were mostly trained on human work. Do we have enough training material being created now for the models to keep progressing, or will they be training on other models' output? That cannot end well.
2. I’ve started as a Junior Engineer and had opportunity to learn and become Senior. The job market for junior is really bad cause businesses plan just for couple quarters ahead. They replaced juniors with AI. Who’s gonna replace seniors? And don’t say AI ;)
Since Anthropic has capacity problems I'm pretty sure they're limiting the $20/month guys to serve the $200/month business plans. I'm afraid coding will increasingly become pay-to-play. Luckily there is good competition.
This is hard to say definitively. The new Nvidia Vera Rubin chips are 35-50x more efficient on a FLOPS/megawatt basis. TPUs, ASICs, and AMD chips are making similar, if less dramatic, strides.
So a service run at a loss now could be high margin on new chips in a year. We also don't really know that they are losing money on the $200/month subscriptions, just that they are compute constrained.
If prices do increase, it might be because of a supply crunch rather than unit economics.
Given the massive costs of training, R&D, and infrastructure build-out, in addition to the fact that both Anthropic and OpenAI are burning money as quickly as they can raise it, the safe bet is on costs going up.
Honestly, some of this info is quite hard to parse. I think the efficiency gain is ~35x at the system level but ~10x at the hardware level. I think this is due to Nvidia bringing in Groq in addition to chip improvements.
Seems like the real costs and numbers are very well hidden right now. It's all private companies, and how much anything costs and whether anything is profitable is secret info.
That's like saying driving for Uber is profitable if you only take into consideration gas mileage but ignore car maintenance, payments, insurance, and all the other costs associated with owning a car.
Not sure which exact model you're talking about, but I've run the 30B and the 3.5 32B models and both can get some things done and can waste tons of time getting some things completely wrong.
They're fun to mess around with to figure out what they can and can't do, but they're certainly not tools I can count on the way I count on Codex.
Only small businesses and startups pay $200/month, most medium+ sized companies will have an enterprise plan and pay by token usage to access the security, privacy, and compliance guarantees that their legal and security teams require.
Also, I think the $200/mo plan is subsidized by VC money and is likely hemorrhaging money for Anthropic, so it's not really meaningful to reason around that.
It seems far from clear at this point what the dollar value of agentic coding tools is if measured objectively in terms of value delivered.
IF they can be shown to be multiplying developer productivity (completing more projects on time, without reduction in quality and associated costs) by some significant amount then they are providing value at current cost, but it's not at all clear whether that is in fact the case, especially since most of the claims of productivity are anecdotal and/or based on things like LOC generated rather than delivered functionality.
Meta's "token usage leaderboard" shows how far some companies are from measuring anything meaningful! It'd be exactly like some company in the .com era measuring employee's "productivity" by how many bytes they'd downloaded from the internet each day (even if that was just a cat video). "Woo hoo, we're out-internetting you! Our internet bill is enormous!" (then proceeds to fire the guy coding, and gives a bonus to the one downloading cat videos).
There have been some studies/polls done indicating that some very high percentage (90%?) of corporate AI projects are failing. Why is this? Are they ill-conceived, and or ill-executed? Is it the quality of what's being produced that is causing these projects to be abandoned and/or considered as a failure?
There have also been some separate studies indicating programmer productivity to be reduced, not increased, by the use of AI coding tools, which is easy to understand. The developer struggles with the tool and its fallibilities, eventually gets it to generate something that works, then closes his JIRA story with an "AI coded" tag (which shows up on the boss's dashboard, and is all that he sees). Was this an AI productivity success story? To the boss, perhaps, but not if the developer admits that it would have just been faster to do it the old way, by hand or cut-n-paste from Stack Overflow.
Yeah completely agree. Even out of my own pocket I'd be willing to spend ~1k a month for the current AI, as compared to not having any AI at all. And I bet I could convince an employer to drop 5k a month on it for me. The consumer surplus atm is insane.
I mean, my openclaw instance was billing $200 a day for Opus after they banned using the Max subscription. I think a fair amount of that was not useful use of Opus, so routing is the bigger problem. But that sort of adds up, you know! At $1/hr, I loved Openclaw. At $15/hour, it's less competitive.