This is why you have PR departments. Being on top of the HN front page, news sites, etc matters a lot. Even if you can't be the first, it's important to dilute the attention as much as possible to reduce the limelight your competitors get.
"Prep the next three point releases now, but don't release any until I say so. None needs to be noticably better or even different, just has to have a higher number." -CEO of AI companies
I think this means that GPT5 is better - you can't launch a worse model after the competitor supersedes you - you have to show that you're in the lead even if its just for a day.
Not sure that this is true. Are there a lot of people waiting anxiously to adopt the next model on the day of release and expecting some huge work advantage?
My coworkers/partners and I haven’t stopped talking about it for weeks. I’m one of them I guess, but we’ll see. The ARC graph I saw, if accurate, is really incredible.
In my experience it take weeks if not months to coordinate a release, from testing to documentation to drafting press releases in multiple languages to benchmarks and website updates.
I’m old and I’ve been in this industry most of my life. I have never once seen or heard of all of that work being done and the company just waiting on competitors before pulling the trigger.
Eu auto brands colluded for years to synchronize new tech into their model lines. Could it be the AI SaaS sector is showing its first steps towards "maturity"? /s
Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, spinning in circles. OpenAI is better, but still falls short of Claude's performance. Claude also gives back 400's from its API if you CTRL-C in the middle though, so that's annoying.
Economics is important. Best bang for the buck seems to be OpenAI ChatGPT 4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.
Get a subscription and use claude code - that's how you get actual reasonable economics out of it. I use claude code all day on the max subscription and maybe twice in the last two weeks have I actually hit usage limits.
Is there any documentation on what the max sub usage limit is? A coworker tried it and was booted off Opus within just a couple hours due to "high usage". I haven't made the jump since I expect my $3k/mo on API would just instantly fly by a $200/mo sub and then I'd just be back on API again, but if it could carve out $1k-2k of costs for a little bit of time managing sub(s) it might be worth it.
It's not documented - that's the whole point. They can scale it back and forth opaquely, letting the high volume users get more usage whenever the low-volume users aren't using it much. If it's explicit and transparent, you don't get the benefit of that, since it would be gamed by unscrupulous power users.
Also there's a cli argument that lets you specify the model. try `claude --help`.
I find the token/credit restrictions on Opus to be near useless even when using Claude Code. I only ever switch to it so get another model's take on the issue. Five minutes of use and I have hit the limit.
We have the $200 plans for work and despite only using Opus, we rarely hit the limits. CCUsage suggests the same via API would have been ~$2000 over the last month (we work 5 hours a day, 4 days a week, almost always with Claude).
Is it considerably more cost effective than cline+sonnet api calls with caching and diff edits?
Same context length and throughput limits?
Anecdotally I find gpt4.1 (and mini) were pretty good at those agentic programming tasks but the lack of token caching made the costs blow up with long context.
I'm on the basic $20/mo sub and only ran into token cap limitations in the first few days of using Claude Code (now 2-3 weeks in) before I started being more aggressive about clearing the context. Long contexts will eat up tokens caps quickly when you are having extended back-and-forth conversations with the model. Otherwise, it's been effectively "unlimited" for my own use.
YMMV I'm using the $100/mo max subscription and I hit the limit during a focused coding session where I'm giving it prompts non-stop.
Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.
That's neat. According to the tool I'm consuming ~300m tokens per day coding with a (retail?) cost of ~125$/day. The output of the model is definitely worth $100/mo to me.
There are a lot of fraudsters out there who will happily create thousands of accounts with valid CCs that will fail on first actual charge.[0]
I wouldn't be surprised if asking for a phone number lowers the fraud rate enough to compensate for the added friction.
[0] Incidentally, this is also why many AI API providers ask for your money upfront (buy credits) unless you're big enough and/or have existing relationship with them.
Well, it's expensive compared to other models. But it's often much cheaper than human labor.
E.g. if need a self-contained script to do some data processing, for example, Opus can often do that in one shot. 500 line Python script would cost around $1, and as long as it's not tricky it just works - you don't need back-and-forth.
I don't think it's possible to employ any human to make 500 line Python script for $1 (unless it's a free intern or a student), let alone do it in one minute.
Of course, if you use LLM interactively, for many small tasks, Opus might be too expensive, and you probably want a faster model anyway. Really depends on how you use it.
(You can do quite a lot in file-at-once mode. E.g. Gemini 2.5 Flash could write 35 KB of code of a full ML experiment in Python - self-contained with data loading, model setup training, evaluation, all in one file, pretty much on the first try.)
In every price comparison I make. Claude (API) always comes out cheapest if you manage to keep most of your context cached. 90% price reduction for input is crazy.
My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.
I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certain things while using Sonnet for others?
I don't doubt Opus is technically superior, but it's not practically superior for me.
It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.
For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.
I've been having a great time with Windsurf's "Planning" feature. Have a nice discussion with Cascade (Claude) all about what it is that neerds to happen - sometimes a very long conversation including test code. Then when everything is very clear, make it happen. Then test and debug the results with all that context. Pretty nice.
In Zed I switch the AI panel to ask mode and chat with the agent about different approaches and have it draft patches. Then when I think there's a design worth trying, switch to Write mode and have it implement that change + run the tests and diagnostics to verify the code at least compiles, tests pass and follows our style guides. Finally a line by line review + review of the test coverage (in terms of interface surface area) before submitting a PR for another human review.
After watching a few videos trying to understand how people were using LLMs and getting useful results I found that even making a simpler version of the fancy planning mode in the LLM IDEs via the instructions.md produced hugely better productivity gains.
I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start and it is like a day and night difference in how useful it becomes. It also starts to feel like programming again where you can read through various files and instead of thinking in your head, you write out your thoughts. You end up getting confirmation or push back on errors that you can clean up.
Reading through a sort of wrong sort of right implementation spread across various files after every prompt just really sucked.
I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.
That's essentially what I do, but that doesn't (and cannot) entirely solve the problem.
A major part of software engineering is identifying and resolving issues during implementation. Plans are a good outline of what needs to be done, but they're always incomplete and inaccurate.
Every time that Sonnet is acting like it has brain damage (which is once or twice a day), I switch to Opus and it seems to sort things out pretty fast. This is unscientific anicdata though, and it could just be that switching models (any model) would have worked.
This seems like a case of reversion to the mean. When one model is performing below average, changing anything (like switching to another model) is likely to improve it by random chance...
This is a great use case for sub-agents IMO. By default, sub-agents use sonnet. You can have opus orchestrate the various agents and get (close to) the best of both worlds.
AFAIK subagents inherit the default model since v1.0.64. At least that's the case for me with the Claude Code SDK — not providing a specific model makes subagents use claude-opus-4-1-20250805.
In this case I don't think the controller needs to be the smartest model. I use sonnet as the main driver and pass the heavy thinking (via zen mcp) onto Gemini pro for example, but I could use openai or opus or all of them via OpenRouter.
Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.
Amp (ampcode.com) uses Sonnet as its main model and has GPT o3 as a special purpose tool / subagent. It can call into that when it needs particularly advanced reasoning.
Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).
Is there a way to get persistent sub-agents? I'd love to have a bunch of YAML files in my repository, one for each sub-agent, and have those automatically used across all Claude Code instances I have on multiple machines (I dev on laptop and desktop), or across the team.
In my experience the best use for subagents is saving context.
Example: you need to review some code to see if it has proper test coverage.
If you use the "main" context, it'll waste tokens on reading the codebase and running tests to see coverage results.
But if you launch an agent (a subprocess pretty much), it can use a "disposable" context to do that and only return with the relevant data - which bits of the code need more tests.
Now you can either use the main context to implement the tests or if you're feeling really fancy launch another sub-agent to do it.
I have suspected for a long time that hosted models load shed by diverting some requests to lesser models or running more quantized versions under high load.
> yet the general consensus and my own experience seem to be that Sonnet is much much better
Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.
I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.
I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.
Opus seems better to me on long tasks that require iterative problem solving and keeping track of the context of what we have already tried. I usually switch to it for any kind of complicated troubleshooting etc.
I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.
Same. I'm on the $200 plan and I find Opus "better", but Sonnet is more straight forward. Sonnet is, to me, a "don't let it think" model. It does great if you give it concrete and small goals. Anything vague or broad and it starts thinking and it's a problem.
Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not good enough to me. But it can be good enough to convince you that it can do the job.. so i dunno, i almost dislike it in this regard. I find Sonnet just easier to predict in this regard.
Could i use Opus like i do Sonnet? Yes definitely, and generally i do. But then i don't really see much difference since i'm hand-holding so much.
I use both. Sonnet is faster and more cost efficient. It's great for coding. Where Opus is noticeably better is in analysis. It surpasses Sonnet for debugging, finding patterns in data, creativity and analysis in general. It doesn't make a lot of sense to use Opus exclusively unless you're on a max20 plan and not hitting limits. Using Opus for design and troubleshooting and Sonnet for everything else is a good way to go.
Im on the Max plan and generally Opus seems to do better work than Sonnet. However, that’s only when they allow me to use Opus. The usage limits, even on the max plan, are a joke. Yesterday I hit the limits within MINUTES of starting my work day.
I'm using ccusage to get the number, I think it just looks at your history and calculates based on tokens vs API pricing. So I think it wouldn't account for caching.
But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.
Claude Code definitely reports cached tokens, and I think CCusage does too, so it wouldn’t make sense for the calculation to be based on full pricing when they have the cached values.
Is this on x5? Because ever since they booted all the freeloaders I’ve not once seen the “you are approaching usage limits” message. Anyway, the “you are approaching usage limits” message shows up when you are over 50% of your tokens for that timeframe, so it’s not sure useful.
If I'm using cursor then sonnet is better, but in claude code Opus 4 is at least 3x better than Sonnet. As with most things these days, I think a lot of it comes down to prompting.
This is interesting. I do use Cursor with almost exclusively Sonnet and thinking mode turned on. I wonder if what Cursor does under the hood (like their indexing) somehow empowers Sonnet more. I do not have much experience with using Claude Code.
Opus really shines for completing long-running tasks with no supervision. But if you are using Claude Code interactively and actively steering it yourself, Sonnet is good enough and is faster.
I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.
With aggressive Claude Code use I didn't find Sonnet better than Opus but I did find it faster while consuming far fewer tokens. Once I switched to the $100 Max plan and configured CC to exclusively use Sonnet I haven't run into a plan token limit even once. When I saw this announcement my first thing was to CMD-F and see when Sonnet 4.1 was coming out, because I don't really care about Opus outside of interactive deep research usage.
My opinion of Opus is that it takes the correct action 19/20 times, where Sonnet takes the correct action 18/20 times. It’s not strictly necessary to use Opus, but if you have the subscription already it’s just a pure win.
I've found with limited context provided in your prompt, opus is just awful compared to even gpt-4.1, but once I give it even just a little bit more of an explanation, it jumps leagues ahead.
> If you believe the benchmarks are reflective of reality anyways.
That's a big "if." But yeah, I can't tell a difference subjectively between Opus and Sonnet, other than maybe a sort of placebo effect. I'm more careful to write quality prompts when using Opus, because I don't want to waste the 5x more expensive tokens.
Just more ancedata, but I entirely agree. I can't say that I am happy with Sonnet's output at any point, really, but it still occasionally works, whereas Opus has been a dumpster fire every single time.
The finest of AI, probably using electricity/water for 100s of homes can not even beat a very simple children game with millions of texts guides etc. about it.
I've also noticed Sonnet starting to degrade. It's developing some of the behaviours that put me off the competition in the first place. Needless explanations, filler in responses, wanting to put everything in lists, even increased sycophancy.
Major AI companies are not doing nearly enough to address the sycophancy problem.
I get that it's not an easy problem to solve, but how is Anthropic supposed to solve the actual alignment problem if they can't even stop their production LLMs from glazing the user all the time? And OpenAI is somehow even worse.
I feel like this is just related to my projects getting bigger. Claude Code is trying to keep up with my project evolving from 2k lines of code to 100k lines. Of course it’s going to feel worse.
I think it is how our expectations of the latest model change over time.
I expect to be completely blown away by GPT-5 in the first few days and then over time I will figure out the limitations of the model. Then I will be less impressed because you don't know what it can't do at first.
Other than it starting out trying to produce a full and complete web app (or whatever) for my daily yak shaving session instead of the normal "let's talk about and work through this thing" the new Opus 4.1 seems to 'get it' a lot quicker than the old daffy robot did. It asked pertinent questions to understand the system we are working on and accomplished the goal of updating the design document so I don't have to keep explaining details at the start of every chat session. Something, by the way, it always previously failed to do causing me to have to explain stuff each and every time before forward progress could be made.
I do agree it did hit the token limit a lot quicker than before where I could chat for hours without worrying about it.
Either way, still have one last yak to shave for this project so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens then win, win, I suppose.
The article says "We plan to release substantially larger improvements to our models in the coming weeks."
Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.
I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.
i think its probably mostly vibes but that still counts, this is not in the charts
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
And in 52 weeks we've gone 3.5->4.1 with this training improvement, meanwhile the 52 weeks prior to that were Claude -> Claude 3. The absolute jumps per version delta also used to be larger.
I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.
Why is there supposed to be no step between frequently useful and indispensable? Quickly going from nothing to frequently useful (which involved many rapid hops between) was certainly surprising, and that's precisely the lost momentum.
They need to leave some room to release 10 more models. They could crank benchmarks to 100% but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers since they do solve the same problems they are being trained on – no novel or unknown problematic is presented to them.
I am still very early, but output quality wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better in following those needles even if not explicitly guided to compared to Opus 4.
I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.
Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.
This has been the worse Claude day ever. Just fell apart. Not sure if the release is why, but cursing in documents and can not fix a bug after hours of back and forth.
This likely won't move the needle for Opus use over Sonnet while the cost remains the same. Using OpenRouter rankings (https://openrouter.ai/rankings) as a proxy, Sonnet 3.7 and Sonnet 4 combined generates 17x more tokens than Opus 4.
Am I the only one super confused about how to even get started trying out this stuff? Just so I wouldn't be "that critic who doesn't try the stuff he criticizes," I tried GitHub Copilot and was kind of not very impressed. Someone on HN told me Copilot sucks, use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.
Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!
I'm not sure what's complicated about what you're describing? They offer two models and you can pay more for higher usage limits, then you can choose if you want to run it in your browser or in your terminal. Like what else would you expect?
Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?
When people post this stuff, it's like, are you also confused that Nike sells shoes AND shorts AND shirts, and there's different colors and skus for each article of clothing, and sometimes they sell direct to consumer and other times to stores and to universities, and also there's sales and promotions, etc, etc?
It's almost as if companies sell more than one product.
Why is this the top comment on so many threads about tech products?
In this case, they tried something and were told they were doing it wrong, and they know there's more than one way to do it wrong - wrong model, wrong tool using the model, wrong prompting, wrong task that you're trying to use it for.
And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.
On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.
> I think it's pretty different from buying shoes.
Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?
By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.
> Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.
No, shoe shopping is not more complicated than trialing a LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.
With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.
> By comparison I can try 10 different AI services without even needing to stand up for a break
I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.
It sounds like you're generally unfamiliar with using AI to help you at all? Or maybe you're also being disingenuous? It's insanely easy to figure this stuff out, I literally know a dozen people who are not even engineers, have no programming experience, who use these tools. Here's what Claude (the free version at claude.ai) said in response to me saying "i have no idea how to use AI coding assistants, can you succinctly explain to me what i need to do? like, what do i download, run, etc in order to try different models and services, what are the best tools and what do they do?":
Here's a quick guide to get you started with AI coding assistants:
## Quick Start Options (Easiest)
*1. Web-based (Nothing to Download)*
- *Claude.ai* - You're here! I can help with code, debug, explain concepts
- *ChatGPT* - Similar capabilities, different model
- *GitHub Copilot Chat* - Web interface if you have GitHub account
*2. IDE Extensions (Most Popular)*
- *Cursor* - Full VS Code replacement with AI built-in. Download from cursor.com, works out of the box
- *GitHub Copilot* - Install as VS Code/JetBrains extension ($10/month), autocompletes as you type
- *Continue* - Free, open-source VS Code extension, lets you use multiple models
*3. Command Line*
- *Claude Code* - Anthropic's terminal tool for autonomous coding tasks. Install via `npm install -g @anthropic-ai/claude-code`
- *Aider* - Open-source CLI tool that edits files directly
## What They Do
- *Autocomplete tools* (Copilot, Cursor) - Suggest code as you type, finish functions
- *Chat tools* (Claude, ChatGPT) - Explain, debug, design systems, write full programs
- *Autonomous tools* (Claude Code, Aider) - Actually edit your files, make changes across codebases
## My Recommendation to Start
1. Try *Cursor* first - download it, paste in some code, and ask it questions. It's the most beginner-friendly
2. Or just start here in Claude - paste your code and I can help debug, explain, or write new features
3. Once comfortable, try GitHub Copilot for in-line suggestions while coding
The key is just picking one and trying it - you don't need to understand everything upfront!
Ya know, in the over half a century I've been on this planet, choosing a new pair of shoes is so low on my 'life's little annoyances' list that it doesn't even rise above the noise of all the stupid random things which actually do annoy me.
Maybe the problem is I don't take shoes seriously enough? Something to work on...
You also learned about your shoe needs over the course of a lifetime. A caregiver gave you your first pair and you were expected to toddle around at most with them. You outgrew and replaced shoes as a child, were placed into new scenarios requiring different footwear as you grew up, learning and forming opinions about what's appropriate functionally, socially, economically as you went. You learned what stores were good for your needs, what brands were reputable, what styles and fits appealed to you. It took you more than a decade at minimum to achieve that.
If you allow yourself to be a novice and a learner with AI and LLMs and don't expect to start out as a "shoe expert" where you never even think about this in your life and it's not even an annoyance, you'll find that it's the exact same journey.
Is it though? People complain about sore feet and hear they wear the wrong kind of shoes so they go to the store where they have to spend money to find out while trying to navigate between dress shoes, minimal shoes, running shoes, hiking shoes etc etc., they have to know their size, ask for assistance in trying them on...
Because the offerings are not simple. Your Nike example is silly; everyone knows what to do with shoes and shorts and shirts, and why they might want (or not want) to buy those particular items from Nike.
But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model of another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.
I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.
> Why is this the top comment on so many threads about tech products?
Because you overestimate the difference that the representative person understands.
A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and green-blue shoes add 20 mph to your 100 yard dash sprint.
You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.
Also, the green-blue shoes charge per-step, but the blue-green shoes are billed monthly by signing up for BlueGreenPro+ or BlueGreenMax+, each with a hidden step limit but BlueGreenMax+ is the one that gives you access to the Cyan step model which is better; plus the green-blue shoes are only useful when sprinting, but the blue-green shoes can be used in many different events, but only through the Nike blue-green API that only some track&field venues have adopted...
When you walk into a store, you can see and touch all of these products. It's intuitive.
With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!
Claude.ai, ChatGPT, etc. are finished B2C products. They're black boxes, encapsulated experiences. Consumers don't want to pick a model, or know what model they're using; they just want to "talk to AI", and for the system to know which model is best to answer any given question. I would bet that for these companies, if their frontend observes you using the little model override button, that gets instrumented as an "oops" event in their metrics — something they aim to minimize.
What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)
From those B2B landing pages, you can usually click through to pages with details about each of their models.
(Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)
Hey, I'm open to the idea that I'm just stupid. But, if people in your target market (software developers) don't even understand your product line and need a HOWTO+glossary to figure it out, maybe there's also a branding/messaging/onboarding problem?
Eh, this seems like a take that reeks a bit of "everyone is stupid except me".
I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.
The analogy to different SKUs strikes me also inaccurate. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.
It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)
Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!
Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.
"Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.
> It's Claude, Claude, and Claude. Which ones code for you?
Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.
It's a bit worse than a branding problem honestly, since there's legitimate overlap between products, because ultimately they're different expressions of the same underlying LLMs.
I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.
- The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it for example. You normally wouldn't use this except that if you're using some Copilot-type IDE integration where the IDE is doing the work of talking to the model for you and integrating it into your developer experience. In that case you provide API key and the IDE does the heavy lifting.
- The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go." is for example a thing it can totally do. I find it useful for diagnosing issues interactively.
- "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.
My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, where API keys will... just keep going).
FWIW it's probably because a lot of us have been following along and trying these things from the start so the nuances seem more obvious but also I feel that some folks feel your question is a bit "stupid", like "why are you suddenly interested in the frontier of these models? where were you for the last 2 years?"
And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for in whatever toolchain your company uses. Then 2-3 years after the PC race began heating up, asking "Hey I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs but I don't really understand why I'd choose an Intel over a Motorolla chipset or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC and some of these chips claim to have different addressing modes that are better?"
Also when it comes to API integrations, I find some better than others. Copilot has been pretty crummy for me but Zed's Agent Mode seems to be almost as good as Claude Code. I agree with the general take that Claude Code is a good default place to start.
Claude Code running in a terminal can connect to your IDE so you can review its proposed changes there. I’ve found this to be a nice drop in way to try it out without having to change your core workflow and tools too much. Check out the /ide command for details.
If anything, Anthropic has the product lineup that makes the most sense. Higher numbers mean better model. Haiku < Sonnet < Opus which translates to length/size. Free < Pro < Max.
Contrast to something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these are newer than one another? How do people remember which of o4 and 4o are which?
Which Nike shoe is best for basketball? The Nike Dunk, Air Force 1, Air Jordan, LeBron 20, LeBron XXI Prime 93, Kobe IX elite, Giannis Freak 7, GT Cut, GT Cut 3, GT Cut 3 Turbo, GT Hustle 3, or the KD18?
At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?
What's the average programmer? Is it someone who likes CLI tools? Or who likes IDE integration? Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.
The environment isn't the only difference, it's not "do you prefer CLI or IDE or Web" because they behave differently. Claude Code and Claude web and Claude through Cursor won't give you identical outputs for the same question.
It's not like running a tool in your IDE or CLI where the only difference is the interface. It would be like if gcc ran from your IDE had faster compile times, but gcc run from the CLI gives better optimizations.
The fact that no one is recommending any baseline to start with proves the point that it's confusing. And we haven't even touched on Sonnet v Opus
> Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.
That's a silly claim to me, we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there to different options?
The other day I had someone judge me for asking this question by dismissively saying "dont say youve still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?
To me that's the silly argument. How many different tools have you ever used? New build system? New linter? How did you know if you wanted to run those on the command line or in your IDE?
And it seems the story you shared sort of proves the point: the web interface worked fine for you and you didn't need to question it until someone was needlessly rude about it.
Because few seem to want to expend the effort to dive in and understand something. Instead they want the details spoonfed to them by marketing or something.
This is like being told to buy Nike shoes. Then when you proudly display your new cleats, they tell you "no, I meant you should by basketball shoes. The cleats are terrible."
Because I think that claude has gone beyond tech niche at this point..
Or maybe that's me, but still whether its through the likes of those vibe coding apps like lovable bolt etc.
at the end of the day, Most people are using the same tool which is claude since its mostly superior in coding (questionable now with oss models, but I still use it through kiro).
People expect this stuff to be simple when in reality its not and there is some frustation I suppose.
You're comparing well understood products that are wildly different to products with code names. Even someone who has never wore a t-shirt will see it on a mannequin and know where it goes.
I'm sorry but I cannot tell what the difference is between sonnet and opus. Unless one is for music...
So in this case you read the docs. Which is, in your analogy, you going to the Nike store and reading up on if a tshirt goes on your upper or lower body.
Surely anyone interested in taking out a Claude subscription knows broadly what they're going to use an LLM for.
It's more like going to the Nike store and asking about the difference between the Vaporfly 3 and the Pegasus 41. I know they're all shoes and therefore go on my feet, but I don't know what the difference is unless one is better for riding horses?
On the contrary, I'm confused about why you're confused.
This is a well-known and documented phenomenon - the paradox of choice.
I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.
I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.
You need Claude Pro or Max. The website subscription also allows you to use the command line tool—the rate limits are shared—and the command line tool includes IDE integration, at least for VSCode.
Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.
Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.
I wouldn't be too prescriptive. I have Pro, and it's fine. I'm not an incredibly heavy user (yet?); I've hit the rate limits a couple times, but not to the point where I'm motivated to spend more.
I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.
What exactly did you try with GitHub copilot? It’s not an LLM itself, just in interface for an LLM. I have copilot in my professional GitHub account and I can choose between chat-gpt and Claude.
Claude Code has two usage modes: pay-per-token or subscription. Both modes are using API under the hood, but with subscription mode you are only paying a fixed amount a month.
Each subscription tier has some undisclosed limits, cheaper plans have lower usage limits.
So I would recommend paying $20 and trying the Claude Code via that subscription.
I’m looking for cursor alternatives after confusing pricing changes. Is Claude code an option? Can be integrated on an editor/ide for similar results?
My use case so far is usually requesting mechanic work I would rather describe than write myself like certain test suites, and sometimes discovery on messy code bases.
If you like an IDE, for example VS Code you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visibile in the IDE immediately.
Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.
VSCode has a pretty good Gemini integration - it can pull up a chat window from the side. I like to discuss design changes and small refactorings ("I added this new rpc call in my protobuf file, can you go ahead and stub out the parts of code I need to get this working in these 5 different places?") and it usually does a pretty darn good job of looking at surrounding idioms in each place and doing what I want. But gemini can be kind of slow here.
But I would recommend just starting using Claude in the browser, talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brain storming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one page web app or a complicated data schema + sql queries - then it can usually do a pretty good job in one place. Then just copy+paste the code and run it out of the browser.
This workflow works well for exploring and understanding new topics and technologies.
Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.
Lets see: We have GitHub, and GitHub Enterprise Server, and a GitHub API. Then there's the command line and a desktop version, and one that is just browser based I guess. Then you have different pricing plans, Free, Team, and Enterprise? How is Enterprise different than GitHub Enterprise Server? It's very easy to find evidence to confirm our bias.
Claude code is actually one of the most straightforward products I've used as far as onboarding goes. You download the tool, and follow the instructions. You can use one of the 3 plans, and everything else is automatic. You can figure out token usage and what models and versions to use and how to use MCP servers and all of that -- there's a lot of power -- but you don't need to do ANY of that to get started trying it out.
You're not being:
> That critic who doesn't try the stuff he criticizes
You're being:
> That critic who is trying to confirm their biases
If you're looking for a coding assistant, get Claude Code, and give it a try. I think you need the Pro plan at a minimum for that ($20/mo; I don't think Free includes Claude Code). Don't do the per-request API pricing as it can get expensive even while just playing around.
Agree that the offering is a bit confusing and it's hard to know where to start.
Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.
Yes. You basically need an LLM to provide guidance on product selection in this brave new world.
It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!
Thanks. With the CLI, can you get Copilot-ish things like tab-completion and inline commands directly in your IDE? Or do you need to copy/paste to and from a terminal? It feels like running a command on the IDE and then copying the output into your IDE is a pretty primitive way to operate.
1) Completely separate in your mind the auto-completion features from the agentic coding features. The auto-completion features are a neat trick but I personally find those to be a bit annoying overall, even if they sometimes hit it completely right. If I'm writing the code, I mostly don't want the LLM autocompletion.
2) Pay the $20 to get a month of Claude Pro access and then install Claude Code. Then, either wait until you have a small task in mind or your stuck on some stupid issue that you've been banging your head on and then open your terminal and fire up Claude Code. Explain to it in plain English what you want it to do. Pretend it's a colleague that you're giving a task to over Slack. And then watch it go. It works directly on your source code. There is no copying and pasting code.
3) Bookmark the Claude website. The next time you'd Google something technical, ask it Claude instead. General questions like "how does one typically implement a flizzle using the floppity-do framework"? "I'm trying to accomplish X, what are my options when using this stack?". General questions like that.
From there you'll start to get it and you'll get better at leverage the tool to do what you want. Then you can branch out the rest of the tool ecosystem.
Interesting about the auto-completion. That was really the only Copilot feature I found to be useful. The idea of writing out an English prompt and telling Copilot what to write sounded (and still sounds) so slow and clunky. By the time I've articulated what I want it to do, I might as well have written the code myself. The auto-completion was at least a major time-saver.
"The card game state is a structure that contains a Deck of cards, represented by a list of type Card, and a list of Players, each containing a Hand which is also a list of type Card, dealt randomly, round-robin from the Deck object." I could have input the data structure and logic myself in the amount of time it took to describe that.
I think you should embrace a bit of ambiguity. Don't treat this like a stupid computer where you have to specify everything in minute detail. Certainly the more detail you give, the better to an extent. But really: Treat it like you're talking to a colleague and give it a shot. You don't have to get it right on the first prompt. You see what it did and you give it further instructions. Autocomplete is the least compelling feature of all of this.
Also, I don't remember what model Copilot uses by default, especially the free version, but the model absolutely makes a difference. That's why I say to spend the $20. That gives you access to Sonnet 4 which is where, imo, these models took a giant leap forward in terms of quality of output.
One analogy I have been thinking about lately is GPUs. You might say "The amount of time it takes me to fill memory with the data I want, copy from RAM to the GPU, let the GPU do it's thing, then copy it back to RAM, I might as well have just done the task on the CPU!"
I hope when I state it that way you start to realize the error in your thinking process. You don't send trivial tasks to the GPU because the overhead is too high.
You have to experiment and gain experience with agent coding. Just imagine that there are tasks where the overhead of explaining what to do and reviewing the output are dwarfed by the actual implementation. You have to calibrate yourself so you can recognize those tasks and offload them to the agent.
There's a sweet spot in terms of generalization. Yes, painstakingly writing out an object definition in English just so that the LLM can write it out in Java is a poor use of time. You want to give it more general tasks.
But not too general, because then it can get lost in the sauce and do something profoundly wrong.
IMO it's worth the effort to know these tools, because once you have a more intuitive sense for the right level of abstraction it really does help.
So not "make this very basic data structure for me based on my specs", and more like "rewrite this sequential logic into parallel batches", which might take some actual effort but also doesn't require the model to make too many decisions by itself.
It's also pretty good at tests, which tends to be very boilerplate-y, and by default that means you skip some cases, do a lot of brain-melting typing, or copy-and-paste liberally (and suffer the consequences when you missed that one search and replace). The model doesn't tire, and it's a simple enough task that the reliability is high. "Generate test cases for this object, making sure to cover edges cases A, B, and C" is a pretty good ROI in terms of your-time-spent vs. results.
Is there any more agent-oriented approach where it just push/pulls a git repo like a normal person would, instead of running it on my machine? I'd like to keep it a bit more isolated and having it push/pull its own branches seems tidier.
> I just want to putz around with something in VSCode for a few hours!
I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.
Why care about pricing and product names and UI until it's a problem?
> Someone on HN told me Copilot sucks, use Claude.
I concur, but I'm also just a dude saying some stuff on HN :)
2. Write a wrapper around Ollama using your favorite language. The idea is that you want to be able to intercept responses coming back from the model.
3. Create a system prompt that tells the model things like "if the user is asking you to create a file, reply in this format:...". Generally to start, you can specify instructions for read file, write file, and execute file
4. In your wrapper, when you send the input chat prompt, and get the model response back, you look for those formats, and make the wrapper actually execute the action. For example if the model replies back with the format to read file, you read the file from your wrapper code and send it back to the model.
Every coding assistant is basically this under the hood with just a lot more fluff and their own IDE integration.
The benefit of doing your own is that you can customize it to your own needs, and when you direct a model with more precision even the small models perform very well with much faster speed.
OP is asking for where to get started with Claude for coding. They're confused. They just want to mess around with it in VSCode. And you start talking about Ollama, PAT, coding your own wrapper, composing a system prompt etc.!?
OP is trying to get LLMs to assist with coding. Implying that coding is something he is capable of, and coding your own wrapper is a great way to get familiarity with these systems.
Download Cursor and try it through that, IMO that's currently the most polished experience especially since you can change models on the fly. For more advanced usecases, CLI is better but for getting your feet wet I think Cursor is the best choice.
Thanks. Too bad you need to switch editors to go that path. I assume the Cursor monthly plans are not the same as the Claude monthly plans and you can't use one for the other if you want to experiment...
You just described all of your options in detail - what's the problem? Pick one. Seems like you've got a very thorough grasp on how to get started trying the stuff out, but it requires you to choose how you want to do that.
Github Copilot and Claude code are not exactly competitors.
Github Copilot is autocomplete, highly useful if you use VS Code, but if you are using e.g. Jetbrains then you have other options. Copilot comes with a bunch of other stuff that I rarely use.
Claude code is project-wide editing, from the CLI.
They complement each other well.
As far as I'm concerned the utility of the AI-focused editors has been diminished by the existence of Claude code, though not entirely made redundant.
This isn't correct. GitHub Copilot now totally competes with Claude Code. You can have it create an entire app for you in "Agent" mode if you're feeling brave. In fact, seeing as Copilot is built directly into Visual Studio when you download it, I guess they have a one-up.
Copilot isn't locked to a specific LLM, though. You can select the model from a panel, but I don't think you can plug in your own right now, and the ones you can select might not be SOTA because of that.
I didn't mean it doesn't attempt to compete, I mean it doesn't actually compete. Claude code for agents, Copilot for autocomplete (depending on your editor/IDE).
For single-line autocomplete, which is how I use it, pretty much anything will do the job. I use Copilot only because it integrates well with VS Code. I find the other features to be inferior.
I use Copilot for the same reason (it's already there in Visual Studio). But I think we're talking about different things -- did you try Agent mode in Copilot? (the naming of all these things is getting confusing)
Sonnet 4 in copilot agent mode has been doing great work for me lately. Especially once you realise that at least 50% of the work is done before you get to copilot, as architectural and product specs and implementations plans.
Ehhh... I wouldn't use it for anything important right now. It often screws up by truncating code files then asking itself "where did all those functions go?" and having to rewrite them from scratch.
When it works, it's great though. I've used it to vibe-code some nice little desktop apps to automate things I needed and it produced way more polished UI than I would have spent the time doing, and the code is pretty much how I would have written it myself. I just set it going and go do some other task for 10 mins and come back to see what changes it made.
Opencode https://github.com/sst/opencode provides a CC like interface for copilot. It's a slightly worse tool, but since copilot with Claude 4 is super cheap, I ended up preferring it over CC. Almost no limits, cheaper, you can use all the Copilot models, GH is not training on your data.
honestly - copilot free mode; and just play with the agentic stuff can give you a good idea. Attach it to Roo and you'll get a good idea. Realize that if you paid to use a better model; you'd get better results as free doesn't have a ton of premium tokens.
All the tools, copilot,claude, gemini in vscode are all completely worthless unless in Agent Mode. I have no idea why none of these tools dont default to Agent mode.
Is there any tool like Claude Code that can go into the same "automatic feedback and coding loop" (I don't know if it has an official name) but compatible with using different LLMs.
I've used Aider for a while, and I kind of liked if, but it felt like it needed way more manual work, and I also want to use different models, probably locally hosted. Haven't used Aider in 2 or 3 months, so I don't know if it already has evolved in that way...
edit: in the other hand, the automatic feedback loop means it sometimes go very crazy and the API costs skyrocket easily. But maybe that's another reason to run it locally.
o3 and o3-pro are just so good. Sonnet goes off the deep end too often and Opus, in my experience, is not as strong at reasoning compared to OpenAI, despite the higher costs. Rarely do we see a worse, more expensive product win - but competition is good and I’m rooting for Anthropic nonetheless!
OpenAI also has Flex processing[1] for o3. I've spent most of my time with Gemini 2.5, but lately been trying out a ton of o3 as it seems to work quite well and I get really cheap tokens (~95% of my agentic tokens are cached which is 75% discount and flex mode adds 50% for $0.25 / million input tokens)
I've made my own fork of Codex that always uses flex, or you can route agents through litellm and make it add the service_tier parameter. I haven't really seen native support for it anywhere.
o3 feels pretty good to me as well but o3-pro has consistently one shotted problems other LLMs got stuck on.
I'm talking multiple tries of claude 4 opus, Gemini 2.5 pro, o3 etc resulting in sometimes hundreds of lines of code.
Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one or two line change and truly fixing the root cause.
o3-pro level LLMs at reduced cost and increased speed will already be amazing..
Probably referring to it's tendency to over-complicate things to the point you have to step in and be like "WTF are you even talking about... Wouldn't it be a lot simpler to just use the original, well planned out design?"
If they release before GPT-5, they don't have to compare to GPT-5 in their benchmarks. It's a big PR win to be able to plausibly claim that your model is the best coding model at the time of release.
Could it be nobody wanted to be first and overshadowed, nor the only one left out - and it cascaded after the first announcement? My first hunch, though, was that it had been agreed upon. Game theory I think tells us that releasing same day in the pattern ABC BCA CAB etc would be lowest risk and highest average gain?
The improved Opus isn’t about achieving significantly better peak performance for me. It’s not about pushing the high end of the spectrum. Instead, it’s about consistently delivering better average results - structuring outputs more effectively, self-correcting mistakes more reliably, and becoming a trustworthy workhorse for everyday tasks.
It's interesting that Anthropic maintains current prices for prior state of the art models when doing a new release. Why offer a model with worse performance for the same price? What incentives are they trying to create?
One obvious explanation is that pricing is strongly related to the price to them, and that their only incentive is for people to use an expensive model of they really need it.
I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because costs us less to run."
I'm guessing it's mostly for legacy reasons. When 3.7 came out many people were not happy with it and went back to 3.5; I guess supporting older models for a while makes sense.
Claude Code has honestly made me at least 10x more productive. I’ve burned through about 3 billion tokens and have been consistently merging 5+ PRs a day, tackling tons of tech debt, improving GitHub Actions, and making crazy progress on product work
only 10x? I'm at least 100x as productive. I only type at a measly 100wpm, whereas Claude can output 100+ tokens a second
I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase.. now I can add 100 lines in one prompt
If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible
So no human reads the actual code that you push to production? Are you not worried about security risks, spaghetti code, and other issues? Or does Claude magically make all of those concerns go away?
This is only the beginning. I can see myself having 100 Claude tasks running concurrently - the only problem is edits clash between files. I'm working on having Claude solve this by giving each instance its own repo to work with, then I ask the final Claude to mash it all together as best it can
What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity
Now to be fair and a bit more realistic it's not actually 10000x because it takes longer to push the PR because the file sizes are so big. Let's call it 9800x. That's still a sizable improvement
I also have this feeling that I'm 2-10x more productive. But isn't it curious how a lot of devs feel this way, but no devs that I know have the experience that any of their colleagues have become 2-10x more productive?
<raises hand> Our automated test folks were chronically behind, struggling to keep up with feature development. I got the two assigned to the team that was the most behind set up with Claude Code. Six weeks later they are fully caught up, expanding coverage, and integrating AI code review into our build pipeline.
It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.
Sometimes 10x can mean that I start things that I would have never started before, knowing it would take a long time. Or that I can have any of the agentic stuff "explore" libs, stacks and frameworks I wanted to look at, but had no time. Or distill some vague docs and blog posts to find common use cases for tech x. And so on.
It's not always a literal 10x time for taskA w/ AI vs taskA w/o AI...
What type of work do you do and what type of code do you produce?
Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.
How do you maintain high confidence in the code it generates ?
My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.
You have to review your code for quality and bugs and errors now just as you did last month or last year. Did you never write bugs accidentally before?
We're all bottlenecked on reviewing now. That's a good thing.
There was a greater awareness of exactly what I'd written. By definition, I would not have written those bugs in, as long as I had known edge cases in my mind.
Lapses of judgement and syntax errors happen, but they're easier to spot because you know exactly what you're looking at. When code is written by a model, I have to review it 3 times.
1st to understand the code. 2nd to identify lapses in suspicious areas. 3rd to confirm my suspicions through interactive tests, because the model can use patterns I'm unfamiliar with, and it takes me some googling to confirm if certain patterns used by the model are outright bugs or not. The biggest time sink is fixing an identified bug, because now you're doing it in someone-else's (model's) legacy code rather than a greenfield feature implementation.
It's a big productivity bump. But, if reviewing is the bottleneck, then that upper bounds the productivity gains at ~4x for me. Still incredible technology, but the death of software-engineering that it is claimed to be.
Kind of interesting that we live in an area of AI super advanced, but still make basic UI/UX mistake. The tagline of this blog post shouldn't be "1 min read".
It's not even accurate. I timed myself not reading fast but not slow, took me 3 min 30s. Maybe the images need be OCRed to make the estimation more accurate.
Claude plus failed me today badly compared to chatGPT plus.
I uploaded a web design of mine (jpeg) and asked Claude to create the html/css. Asked GPT to do the same. GPT's code looked the closet to the design I created and uploaded. Just five to ten small tweaks and I was done vs. Claude it would have taken me almost triple the steps.
I actually subscribed to both today (resubscribed to GPT) and going to keep testing which one is the better front-end developer (i am, but got to embrace AI ).
Will the price for 4 go down? I still find Opus completely unusable for the cost/performance, as someone who spends thousands per month on tokens. There's really no noticeable difference from Sonnet, at nearly 10x the price.
On the other hand, they have always exposed their raw chain of thought, so you know exactly what you're paying for, unlike OpenAI who hides it. Similarly they allow an actual thinking budget rather than vague "low, medium, high", again unlike OpenAI. They also allow API access to all their models without draconic send-us-your-personal-data-KYC, once more unlikely OpenAI.
They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that contept.
All three major labs released something within hours of each other. This anime arc is insane.
This is why you have PR departments. Being on top of the HN front page, news sites, etc matters a lot. Even if you can't be the first, it's important to dilute the attention as much as possible to reduce the limelight your competitors get.
"Prep the next three point releases now, but don't release any until I say so. None needs to be noticably better or even different, just has to have a higher number." -CEO of AI companies
How do they know when it's time? Corporate espionage? Or do they just have Next Thing queued up months in advance and ready to go.
Given the GPT5 rumors, August is just getting started.
Given the Gregorian Calendar and the planet's path through its orbit, August is just getting started.
https://tintin.dlazaro.ca/month
This legitimately made me chuckle.
Good one, made my day
No, the rotation of the Earth around its axis did so.
Technically, it Earth's rotation gives us day and night. Doesn't move the calendar, which is through Earth's orbital revolution
Exactly the GP's point (the rotation of the Earth "made their day").
Given the majesty and nobility of HN commenters, augustness is just getting started.
I'm all buckled up and ready for what's effectively GPT 4.6
What a time to be alive
as if they wait competitor first then launch it at the same time to make market decide which one is best
I think this means that GPT5 is better - you can't launch a worse model after the competitor supersedes you - you have to show that you're in the lead even if its just for a day.
Not sure that this is true. Are there a lot of people waiting anxiously to adopt the next model on the day of release and expecting some huge work advantage?
If you’re using an LLM near the limits of what it can do then a small improvement in performance is noticeable.
My coworkers/partners and I haven’t stopped talking about it for weeks. I’m one of them I guess, but we’ll see. The ARC graph I saw, if accurate, is really incredible.
Absolutely.
There's so many leakers in every lab
If only there were more leakers in the FBI or DOJ.
It's a risky game to leak the secrets of the gang that has a legal monopoly on violence.
Slipped in the bathroom and hung himself on the shower curtains. Oh, what a shame.
Sneakier is safer.
They likely sit on releases ready to go.
It's definitely a coincidence
It's not a coincidence or a cartel, it's PR counterprogramming.
Agree 100%
If you look at the past, whenever Google announces something major, OpenAI almost always releases something as well.
People forget realize that OpenAI was started to compete with Google on AI.
Any source or just vibes?
In my experience it take weeks if not months to coordinate a release, from testing to documentation to drafting press releases in multiple languages to benchmarks and website updates.
I’m old and I’ve been in this industry most of my life. I have never once seen or heard of all of that work being done and the company just waiting on competitors before pulling the trigger.
But is it just a coincidence
None of them seem to have published any papers associated with them on how these new models advanced the state-of-the-art though. =^(
china will do that for them
Eu auto brands colluded for years to synchronize new tech into their model lines. Could it be the AI SaaS sector is showing its first steps towards "maturity"? /s
Opus 4(.1) is so expensive[1]. Even Sonnet[2] costs me $5 per hour (basically) using OpenRouter + Codename Goose[3]. The crazy thing is Sonnet 3.5 costs the same thing[4] right now. Gemini Flash is more reasonable[5], but always seems to make the wrong decisions in the end, spinning in circles. OpenAI is better, but still falls short of Claude's performance. Claude also gives back 400's from its API if you CTRL-C in the middle though, so that's annoying.
Economics is important. Best bang for the buck seems to be OpenAI ChatGPT 4.1 mini[6]. Does a decent job, doesn't flood my context window with useless tokens like Claude does, API works every time. Gets me out of bad spots. Can get confused, but I've been able to muddle through with it.
1: https://openrouter.ai/anthropic/claude-opus-4.1
2: https://openrouter.ai/anthropic/claude-sonnet-4
3: https://block.github.io/goose/
4: https://openrouter.ai/anthropic/claude-3.5-sonnet
5: https://openrouter.ai/google/gemini-2.5-flash
6: https://openrouter.ai/openai/gpt-4.1-mini
Get a subscription and use claude code - that's how you get actual reasonable economics out of it. I use claude code all day on the max subscription and maybe twice in the last two weeks have I actually hit usage limits.
Is there any documentation on what the max sub usage limit is? A coworker tried it and was booted off Opus within just a couple hours due to "high usage". I haven't made the jump since I expect my $3k/mo on API would just instantly fly by a $200/mo sub and then I'd just be back on API again, but if it could carve out $1k-2k of costs for a little bit of time managing sub(s) it might be worth it.
It's not documented - that's the whole point. They can scale it back and forth opaquely, letting the high volume users get more usage whenever the low-volume users aren't using it much. If it's explicit and transparent, you don't get the benefit of that, since it would be gamed by unscrupulous power users.
Also there's a cli argument that lets you specify the model. try `claude --help`.
> Get a subscription and use claude code
I find the token/credit restrictions on Opus to be near useless even when using Claude Code. I only ever switch to it so get another model's take on the issue. Five minutes of use and I have hit the limit.
It seems for Opus the Max plan is almost always needed for being useful
Is it a max subscription?
We have the $200 plans for work and despite only using Opus, we rarely hit the limits. CCUsage suggests the same via API would have been ~$2000 over the last month (we work 5 hours a day, 4 days a week, almost always with Claude).
Are you part time?
In a way. Those are my company's working hours.
Yup. Getting to try three or so prompts that it messes up and then running out of quota for hours is entirely useless to me.
That's fine if you use it for private use. Doesn't work if you're building a product using Claude.
Is it considerably more cost effective than cline+sonnet api calls with caching and diff edits?
Same context length and throughput limits?
Anecdotally I find gpt4.1 (and mini) were pretty good at those agentic programming tasks but the lack of token caching made the costs blow up with long context.
If you use Claude Code with a subscription and run `ccusage` [0] you can get an idea of your "true usage" and cost.
[0] https://github.com/ryoppippi/ccusage
I'm on the basic $20/mo sub and only ran into token cap limitations in the first few days of using Claude Code (now 2-3 weeks in) before I started being more aggressive about clearing the context. Long contexts will eat up tokens caps quickly when you are having extended back-and-forth conversations with the model. Otherwise, it's been effectively "unlimited" for my own use.
YMMV I'm using the $100/mo max subscription and I hit the limit during a focused coding session where I'm giving it prompts non-stop.
Unfortunately there's no easy tool to inspect usage. I started a project to parse the Claude logs using Claude and generate a Chrome trace with it. It's promising but it was taking my tokens away from my core project.
Check out ccusage, it sounds like the tool you’re describing: https://github.com/ryoppippi/ccusage
That's neat. According to the tool I'm consuming ~300m tokens per day coding with a (retail?) cost of ~125$/day. The output of the model is definitely worth $100/mo to me.
This is a good bar to know. I see the warnings but not sure how much I really have left.
Do you mostly use opus?
Neat tool thanks!
ccusage on GitHub.
Yes, it’s much better.
It uses way less tokens or much more effectively when running locally.
Is there a way to sign up for Claude code that doesn't involve verifying a phone number with Anthropic? They don't even accept Google Voice numbers.
Maybe I'm out of touch, but I'm not handing out my phone number to sign up for random SaaS tools.
It's maybe the leading subscription based tool in our field, not a random SaaS tool.
Sure, no contest on that. They still don't need my phone number.
They have zero need for a phone number.
There are a lot of fraudsters out there who will happily create thousands of accounts with valid CCs that will fail on first actual charge.[0]
I wouldn't be surprised if asking for a phone number lowers the fraud rate enough to compensate for the added friction.
[0] Incidentally, this is also why many AI API providers ask for your money upfront (buy credits) unless you're big enough and/or have existing relationship with them.
Come on now. You're about to run their cli and let it send any random file on your machine to their API intentionally. Trust them a little.
use a burner
Well, it's expensive compared to other models. But it's often much cheaper than human labor.
E.g. if need a self-contained script to do some data processing, for example, Opus can often do that in one shot. 500 line Python script would cost around $1, and as long as it's not tricky it just works - you don't need back-and-forth.
I don't think it's possible to employ any human to make 500 line Python script for $1 (unless it's a free intern or a student), let alone do it in one minute.
Of course, if you use LLM interactively, for many small tasks, Opus might be too expensive, and you probably want a faster model anyway. Really depends on how you use it.
(You can do quite a lot in file-at-once mode. E.g. Gemini 2.5 Flash could write 35 KB of code of a full ML experiment in Python - self-contained with data loading, model setup training, evaluation, all in one file, pretty much on the first try.)
In every price comparison I make. Claude (API) always comes out cheapest if you manage to keep most of your context cached. 90% price reduction for input is crazy.
Cached prices: $.31 for Gemini Pro / Mtok, $1.50 for claude opus 4.1 / Mtok
There's additional storage costs with google caching, around $3.75 for 5 minutes/Mtok, and Claude Opus is $3.75 for 5minute Cache Writes / Mtok.
For cached reads Gemini Pro is 5X cheaper than Opus and like $0.01 more than Sonnet.
Large models are for querying the model
Small models are for querying the context
Opus is cheap if you use it for its niche
> Large models are for querying the model
> Small models are for querying the context
I respectfully disagree.
My experience is that large models are capable of understanding large contexts much better. Of course they are more expensive and slower, too. But in terms of accuracy, large models are always better at querying the context.
GLM 4.5 / Kimi K2 / Qwen Coder 3 / Gemini Pro 2.5
I'm confused by how Opus is presented to be superior in nearly every way for coding purposes yet the general consensus and my own experience seem to be that Sonnet is much much better. Has anyone switched to entirely using Opus from Sonnet? Or maybe switching to Opus for certain things while using Sonnet for others?
I don't doubt Opus is technically superior, but it's not practically superior for me.
It's still pretty much impossible to have any LLM one-shot a complex implementation. There's just too many details to figure out and too much to explain for it to get correct. Often, there's uncertainty and ambiguity that I only understand the correct answer (or rather less bad answer) after I've spent time deep in the code. Having Opus spit out a possibly correct solution just isn't useful to me. I need to understand _why_ we got to that solution and _why_ it's a correct solution for the context I'm working in.
For me, this means that I largely have an iteratively driven implementation approach where any particular task just isn't that complex. Therefore, Sonnet is completely sufficient for my day-to-day needs.
I've been having a great time with Windsurf's "Planning" feature. Have a nice discussion with Cascade (Claude) all about what it is that neerds to happen - sometimes a very long conversation including test code. Then when everything is very clear, make it happen. Then test and debug the results with all that context. Pretty nice.
This is basically what I do. I have a specific "planning mode" prompt I work through.
It's very, very helpful. However, there are still a lot of problems I only discover/figure out after I've been working in the code.
Can you explain what you do exactly? Do you enable plan mode and use with chat...?
In Zed I switch the AI panel to ask mode and chat with the agent about different approaches and have it draft patches. Then when I think there's a design worth trying, switch to Write mode and have it implement that change + run the tests and diagnostics to verify the code at least compiles, tests pass and follows our style guides. Finally a line by line review + review of the test coverage (in terms of interface surface area) before submitting a PR for another human review.
After watching a few videos trying to understand how people were using LLMs and getting useful results I found that even making a simpler version of the fancy planning mode in the LLM IDEs via the instructions.md produced hugely better productivity gains.
I started adding an instruction file along the lines of "Always tell me your plan to solve the issue first with short example code, never edit files without explicit confirmation of your plan" at the start and it is like a day and night difference in how useful it becomes. It also starts to feel like programming again where you can read through various files and instead of thinking in your head, you write out your thoughts. You end up getting confirmation or push back on errors that you can clean up.
Reading through a sort of wrong sort of right implementation spread across various files after every prompt just really sucked.
I'm not one shotting massive amounts of files, but I am enjoying the lack of grunt work.
Could you share some of the videos that you watched ? Can you make a video yourself ? That will help a lot of us.
You can also always have it create design docs and mermaid diagrams for each task. Outline the why much easier earlier, shifting left
That's essentially what I do, but that doesn't (and cannot) entirely solve the problem.
A major part of software engineering is identifying and resolving issues during implementation. Plans are a good outline of what needs to be done, but they're always incomplete and inaccurate.
Every time that Sonnet is acting like it has brain damage (which is once or twice a day), I switch to Opus and it seems to sort things out pretty fast. This is unscientific anicdata though, and it could just be that switching models (any model) would have worked.
This seems like a case of reversion to the mean. When one model is performing below average, changing anything (like switching to another model) is likely to improve it by random chance...
Anthropic say Opus is better, benchmarks & evals say Opus is better, Opus has more parameters and parameters determine how much a NN can learn.
Maybe Opus just is better
Even if it's better on average, doesn't mean it's better for every possible query
This is a great use case for sub-agents IMO. By default, sub-agents use sonnet. You can have opus orchestrate the various agents and get (close to) the best of both worlds.
AFAIK subagents inherit the default model since v1.0.64. At least that's the case for me with the Claude Code SDK — not providing a specific model makes subagents use claude-opus-4-1-20250805.
In this case I don't think the controller needs to be the smartest model. I use sonnet as the main driver and pass the heavy thinking (via zen mcp) onto Gemini pro for example, but I could use openai or opus or all of them via OpenRouter.
Subagents seem pretty similar to using zen mcp w/ OpenRouter but maybe better or at least more turnkey? I'll be checking them out.
Amp (ampcode.com) uses Sonnet as its main model and has GPT o3 as a special purpose tool / subagent. It can call into that when it needs particularly advanced reasoning.
Interestingly I found that prompting it to ask the o3 submodel (which they call The Oracle) to check Sonnet's working on a debugging solution was helpful. Extra interesting to me was the fact that Sonnet appeared to do a better job once I'd prompted that (like chain of thought prompting, perhaps asking it to put forward an explanation to be checked actually triggered more effective thinking).
Is there a way to get persistent sub-agents? I'd love to have a bunch of YAML files in my repository, one for each sub-agent, and have those automatically used across all Claude Code instances I have on multiple machines (I dev on laptop and desktop), or across the team.
In my experience the best use for subagents is saving context.
Example: you need to review some code to see if it has proper test coverage.
If you use the "main" context, it'll waste tokens on reading the codebase and running tests to see coverage results.
But if you launch an agent (a subprocess pretty much), it can use a "disposable" context to do that and only return with the relevant data - which bits of the code need more tests.
Now you can either use the main context to implement the tests or if you're feeling really fancy launch another sub-agent to do it.
Yep: https://docs.anthropic.com/en/docs/claude-code/sub-agents
Thanks!
Great, now even computers need to leave the IC track if they want continued career progression.
Maybe context rot? If model's output seems to be getting worse or in a rut, then try just clearing context / starting a new session.
Switching models with the same context, in this case.
switching models great best practice whether get stuck or not
can look at primal check the mean or dual get out of local minima
in all cases, model, tokenizer, etc is just enough different that will generally pay off in spaces quickly
They both seem to behave differently depending on how loaded the system seems to be.
I have suspected for a long time that hosted models load shed by diverting some requests to lesser models or running more quantized versions under high load.
I think OpenRouter saves tokens by summarizing queries through another model, IIRC.
Exactly that.
> yet the general consensus and my own experience seem to be that Sonnet is much much better
Given that there’s nothing close to scientific analysis going on, I find it hard to tell how big the “Sonnet is overall better, not just sometimes” crowd is. I think part of the problem is that “The bigger model is better” feels obvious to say, so why say it? Whereas “the smaller model is better actually” feels both like unobvious advice and also the kind of thing that feels smart to say, both of which would lead to more people who believe it saying it, possibly creating the illusion of consensus.
I was trying to dig into this yesterday, but every time I come across a new thread the things people are saying and the proportions saying what are different.
I suppose one useful takeaway is this: If you’re using Claude Max and get downgraded from Opus to Sonnet for a few hours, you don’t have to worry too much about it being a harsh downgrade in quality.
Opus seems better to me on long tasks that require iterative problem solving and keeping track of the context of what we have already tried. I usually switch to it for any kind of complicated troubleshooting etc.
I stick with Sonnet for most things because it's generally good enough and I hit my token limits with it far less often.
Same. I'm on the $200 plan and I find Opus "better", but Sonnet is more straight forward. Sonnet is, to me, a "don't let it think" model. It does great if you give it concrete and small goals. Anything vague or broad and it starts thinking and it's a problem.
Opus gives you a bit more rope to hang yourself with imo. Yes, it "thinks" slightly better, but still not good enough to me. But it can be good enough to convince you that it can do the job.. so i dunno, i almost dislike it in this regard. I find Sonnet just easier to predict in this regard.
Could i use Opus like i do Sonnet? Yes definitely, and generally i do. But then i don't really see much difference since i'm hand-holding so much.
I use both. Sonnet is faster and more cost efficient. It's great for coding. Where Opus is noticeably better is in analysis. It surpasses Sonnet for debugging, finding patterns in data, creativity and analysis in general. It doesn't make a lot of sense to use Opus exclusively unless you're on a max20 plan and not hitting limits. Using Opus for design and troubleshooting and Sonnet for everything else is a good way to go.
Im on the Max plan and generally Opus seems to do better work than Sonnet. However, that’s only when they allow me to use Opus. The usage limits, even on the max plan, are a joke. Yesterday I hit the limits within MINUTES of starting my work day.
I'm a bit confused by people hitting usage limits so quickly.
I use Opus exclusively and don't hit limits. ccusage reports I'm using the API-equivalent of $2000/mo
You always have to ask which plan they're paying for. Sometimes people complain about the $20 per month plan...
There's no Opus quota on that plan at all.
In this case I'm replying to someone who lead with "I'm on the Max plan" but I realize now that's ambiguous, maybe they are on 5x while I'm on 20x.
That's insane. Are you accounting for caching? If not, there's no way this is going to last
I'm using ccusage to get the number, I think it just looks at your history and calculates based on tokens vs API pricing. So I think it wouldn't account for caching.
But I totally agree there's no way it lasts. I'm mostly only using this for side projects and I'm sitting there interacting with it, not YOLO'ing, I do sometimes have two sessions going at the same time but I'm not firing off swarms or anything crazy. Just have it set to Opus and I chat with it.
Claude Code definitely reports cached tokens, and I think CCusage does too, so it wouldn’t make sense for the calculation to be based on full pricing when they have the cached values.
Is this on x5? Because ever since they booted all the freeloaders I’ve not once seen the “you are approaching usage limits” message. Anyway, the “you are approaching usage limits” message shows up when you are over 50% of your tokens for that timeframe, so it’s not sure useful.
Yeah, you need to actively cherry pick which model to use in order to not waste tokens on stuff that would be easily handed by a simpler model.
same here constantly hit the Opus limits after minutes on Max plan
If I'm using cursor then sonnet is better, but in claude code Opus 4 is at least 3x better than Sonnet. As with most things these days, I think a lot of it comes down to prompting.
This is interesting. I do use Cursor with almost exclusively Sonnet and thinking mode turned on. I wonder if what Cursor does under the hood (like their indexing) somehow empowers Sonnet more. I do not have much experience with using Claude Code.
I now eagerly await Sonnet 4.1, only because of this release.
Opus really shines for completing long-running tasks with no supervision. But if you are using Claude Code interactively and actively steering it yourself, Sonnet is good enough and is faster.
I don't believe anyone saying Sonnet yields better results than Opus though, as my experience has been exactly the opposite. But trade-off wise, I can definitely see it being a better experience when used interactively because of its speed and lower cost.
I use opus or gemini 2.5 pro for plan mode and sonnet for act mode in Cline. https://cline.bot
It's my experience that Opus is better at solving architectural challenges where sonnet struggles.
With aggressive Claude Code use I didn't find Sonnet better than Opus but I did find it faster while consuming far fewer tokens. Once I switched to the $100 Max plan and configured CC to exclusively use Sonnet I haven't run into a plan token limit even once. When I saw this announcement my first thing was to CMD-F and see when Sonnet 4.1 was coming out, because I don't really care about Opus outside of interactive deep research usage.
My opinion of Opus is that it takes the correct action 19/20 times, where Sonnet takes the correct action 18/20 times. It’s not strictly necessary to use Opus, but if you have the subscription already it’s just a pure win.
100% opus all the time. Sonnet seems to get confused much faster and need more hand holding in my experience.
I've found with limited context provided in your prompt, opus is just awful compared to even gpt-4.1, but once I give it even just a little bit more of an explanation, it jumps leagues ahead.
I notice that on the "Agentic Coding" benchmark cited in the article Sonnet 4 outperformed Opus 4 (by 0.2%), and under performs Opus 4.1 (by -1.8%).
So this release might change that consensus? If you believe the benchmarks are reflective of reality anyways.
> If you believe the benchmarks are reflective of reality anyways.
That's a big "if." But yeah, I can't tell a difference subjectively between Opus and Sonnet, other than maybe a sort of placebo effect. I'm more careful to write quality prompts when using Opus, because I don't want to waste the 5x more expensive tokens.
I feel the same way. I usually use Opus to help with coding and documentation, and I use Sonnet for emails and so on.
Yes, Opus is very noticeably better at programming in both Rust and Zig in my experience. I wish it were cheaper!
It's ridiculously overpriced in the API. Just like o3 used to be.
Opus is superior to understand the big picture and the direction.
Sonnet is great at banging it out.
Strategy I'm playing with, we'll see how good of results I get, is to prompt Opus to analyze and plan but not implement.
E.g. prompt to read a paper, read some source, then write out a terse document meant to be read by machine not human.
Then switch to Sonnet, have it read that document, and do the actual implementation work.
Just more ancedata, but I entirely agree. I can't say that I am happy with Sonnet's output at any point, really, but it still occasionally works, whereas Opus has been a dumpster fire every single time.
That’s very strange. Sonnet is hot garbage and Opus is a miracle, for me. I also don’t see anyone praising sonnet anywhere.
They restarted Claude Plays Pokemon with the new model: https://www.twitch.tv/claudeplayspokemon
(He had been stuck in the Team Rocket hideout (I believe) for weeks)
The finest of AI, probably using electricity/water for 100s of homes can not even beat a very simple children game with millions of texts guides etc. about it.
When can we replace doctors with it?
Alright, well, Opus 4.1 seems exactly as useless as Opus 4 was, but it's probably eating my tokens faster. Wish they let you tell somehow.
At least Sonnet 4 is still usable, but I'll be honest, it's been producing worse and worse slob all day.
I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
I've also noticed Sonnet starting to degrade. It's developing some of the behaviours that put me off the competition in the first place. Needless explanations, filler in responses, wanting to put everything in lists, even increased sycophancy.
Major AI companies are not doing nearly enough to address the sycophancy problem.
I get that it's not an easy problem to solve, but how is Anthropic supposed to solve the actual alignment problem if they can't even stop their production LLMs from glazing the user all the time? And OpenAI is somehow even worse.
I feel like this is just related to my projects getting bigger. Claude Code is trying to keep up with my project evolving from 2k lines of code to 100k lines. Of course it’s going to feel worse.
I think it is how our expectations of the latest model change over time.
I expect to be completely blown away by GPT-5 in the first few days and then over time I will figure out the limitations of the model. Then I will be less impressed because you don't know what it can't do at first.
My project is basically the same size as when I started using it.
> I've basically wasted the morning on Claude Code when I should've just been doing it all myself.
Welcome to the machine
https://www.youtube.com/watch?v=tBvAxSx0nAM&t=45s
Other than it starting out trying to produce a full and complete web app (or whatever) for my daily yak shaving session instead of the normal "let's talk about and work through this thing" the new Opus 4.1 seems to 'get it' a lot quicker than the old daffy robot did. It asked pertinent questions to understand the system we are working on and accomplished the goal of updating the design document so I don't have to keep explaining details at the start of every chat session. Something, by the way, it always previously failed to do causing me to have to explain stuff each and every time before forward progress could be made.
I do agree it did hit the token limit a lot quicker than before where I could chat for hours without worrying about it.
Either way, still have one last yak to shave for this project so we'll see how efficient it is with that. If it accomplishes the task before burning through all the tokens then win, win, I suppose.
The article says "We plan to release substantially larger improvements to our models in the coming weeks."
Sonnet 4 has definitely been the best model for our product's use case, but I'd be interested in trying Haiku 4 (or 4.1?) just due to the cost savings.
I'm surprised Anthropic hasn't mentioned anything about Haiku 4 yet since they released the other models.
it is barely an improvement according to their own benchmarks. not saying thats a bad thing, but not enough for anybody to notice any difference
i think its probably mostly vibes but that still counts, this is not in the charts
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
That is a big improvement.
That's why they named it 4.1 and not 4.5
When it's "that's why they incremented the version by a tenth instead of a half" you know things have really started to slow for the large models.
Opus 4 came out 10 weeks ago. So this is basically one new training run improvement.
And in 52 weeks we've gone 3.5->4.1 with this training improvement, meanwhile the 52 weeks prior to that were Claude -> Claude 3. The absolute jumps per version delta also used to be larger.
I.e. it seems we don't get much more than new training run levels of improvement anymore. Which is better than nothing, but a shame compared to the early scaling.
Is it really a bigger jump to go from plausible to frequently useful, than from frequently useful to indispensable?
Why is there supposed to be no step between frequently useful and indispensable? Quickly going from nothing to frequently useful (which involved many rapid hops between) was certainly surprising, and that's precisely the lost momentum.
They released this because competitors are releasing things
They need to leave some room to release 10 more models. They could crank benchmarks to 100% but then no new model is needed lol? Pretty sure these pretty benchmark graphs are all completely staged marketing numbers since they do solve the same problems they are being trained on – no novel or unknown problematic is presented to them.
I am still very early, but output quality wise, yes, there does not seem to be any noticeable improvement in my limited personal testing suite. What I have noticed though is subjectively better adherence to instructions and documentation provided outside the main prompt, though I have no way to quantify or reliably test that yet. So beyond reliably finding Needles-in-the-Haystack (which Frontier models have done well on lately), Opus 4.1 seems to do better in following those needles even if not explicitly guided to compared to Opus 4.
I will only add that it's interesting that in the results graphic, they simply highlighted Opus 4.1 - choosing not to display which models have the best scores - as Opus 4.1 only scored the best on about half of the benchmarks - and was worse than Opus 4.0 on at least one measure.
"You pay $20/mo for X, and now I'm giving you 1.05*X for the same price." Outrageous!
Good! I'm glad they are just giving us small updates. Opus 4 just came out, if you have small improvements, why not just release them? There's no downside for us.
I don't think this could even be called an improvement? It's small enough that it could just be random chance
I’ve always wondered about this actually. My assumption is that they always “pick the best” result from these tests.
Instead, ideally they’d run the benchmark tests many times, and share all of the results so we could make statistical determinations.
This is the bit I'm most interested in:
> We plan to release substantially larger improvements to our models in the coming weeks.
This is so people don't immediately migrate for GPT5
This has been the worse Claude day ever. Just fell apart. Not sure if the release is why, but cursing in documents and can not fix a bug after hours of back and forth.
You're prompting it wrong !
This likely won't move the needle for Opus use over Sonnet while the cost remains the same. Using OpenRouter rankings (https://openrouter.ai/rankings) as a proxy, Sonnet 3.7 and Sonnet 4 combined generates 17x more tokens than Opus 4.
Am I the only one super confused about how to even get started trying out this stuff? Just so I wouldn't be "that critic who doesn't try the stuff he criticizes," I tried GitHub Copilot and was kind of not very impressed. Someone on HN told me Copilot sucks, use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.
Let's see: we have Claude Code vs. Claude the API vs. Claude the website, and they're totally different from each other? One is command line, one integrates into your IDE (which IDE?) and one is just browser based, I guess. Then you have the different pricing plans, Free, Pro, and Max? But then there's also Claude Team and Claude Enterprise? These are monthly plans that only work with Claude the Website, but Claude Code is per-request? Or is it Claude API that's per-request? I have no idea. Then you have the models: Claude Opus and Claude Sonnet, with various version numbers for each?? Then there's Cline and Cursor and GOOD GRIEF! I just want to putz around with something in VSCode for a few hours!
I'm not sure what's complicated about what you're describing? They offer two models and you can pay more for higher usage limits, then you can choose if you want to run it in your browser or in your terminal. Like what else would you expect?
Fwiw I have a Claude pro plan and have no interest in using other offerings so I'm not sure if they're super simple (one model, one interface, one pricing plan)?
When people post this stuff, it's like, are you also confused that Nike sells shoes AND shorts AND shirts, and there's different colors and skus for each article of clothing, and sometimes they sell direct to consumer and other times to stores and to universities, and also there's sales and promotions, etc, etc?
It's almost as if companies sell more than one product.
Why is this the top comment on so many threads about tech products?
In this case, they tried something and were told they were doing it wrong, and they know there's more than one way to do it wrong - wrong model, wrong tool using the model, wrong prompting, wrong task that you're trying to use it for.
And of course you could be doing it right but the people saying it works great could themselves be wrong about how good it is.
On top of that it costs both money and time/effort investment to figure out if you're doing it wrong. It's understandable to want some clarity. I think it's pretty different from buying shoes.
> I think it's pretty different from buying shoes.
Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Are you a construction worker, a banker, a cashier or a driver? Are you walking 5 miles everyday or mostly sedentary? Do you require steel toed shoes? How long are you expecting them to last and what are you willing to pay? Are you going to wear them on long runs or take them river kayaking? Do they need to be water resistant, waterproof or highly breathable? Do you want glued, welted, or stitch down construction? What about flat feet or arch support? Does shoe weight matter? What clothing are you going to wear them with? Are you going to be dancing with them? Do the shoes need a break in period or are they ready to wear? Does the available style match your preferences? What about availability, are you ok having them made to order or do you require something in stock now?
By comparison I can try 10 different AI services without even needing to stand up for a break while I can't buy good dress shoes in the same physical store as a pair of football cleats.
> Shoe shopping is pretty complex, more so than trialing an AI model in my opinion.
Oh c'mon, now you're just being disingenuous, trying to make an argument for argument's sake.
No, shoe shopping is not more complicated than trialing a LLM. For all of those questions about shoes you are posing, either a) a purchaser won't care and won't need to ask them, or b) they already know they have specific requirements and will know what to ask.
With an LLM, a newbie doesn't even know what they're getting into, let alone what to ask or where to start.
> By comparison I can try 10 different AI services without even needing to stand up for a break
I can't. I have no idea how to do that. It sounds like you've been following the space for a while, and you're letting your knowledge blind you to the idea that many (most?) people don't have your experience.
It sounds like you're generally unfamiliar with using AI to help you at all? Or maybe you're also being disingenuous? It's insanely easy to figure this stuff out, I literally know a dozen people who are not even engineers, have no programming experience, who use these tools. Here's what Claude (the free version at claude.ai) said in response to me saying "i have no idea how to use AI coding assistants, can you succinctly explain to me what i need to do? like, what do i download, run, etc in order to try different models and services, what are the best tools and what do they do?":
Here's a quick guide to get you started with AI coding assistants:
## Quick Start Options (Easiest)
*1. Web-based (Nothing to Download)* - *Claude.ai* - You're here! I can help with code, debug, explain concepts - *ChatGPT* - Similar capabilities, different model - *GitHub Copilot Chat* - Web interface if you have GitHub account
*2. IDE Extensions (Most Popular)* - *Cursor* - Full VS Code replacement with AI built-in. Download from cursor.com, works out of the box - *GitHub Copilot* - Install as VS Code/JetBrains extension ($10/month), autocompletes as you type - *Continue* - Free, open-source VS Code extension, lets you use multiple models
*3. Command Line* - *Claude Code* - Anthropic's terminal tool for autonomous coding tasks. Install via `npm install -g @anthropic-ai/claude-code` - *Aider* - Open-source CLI tool that edits files directly
## What They Do
- *Autocomplete tools* (Copilot, Cursor) - Suggest code as you type, finish functions - *Chat tools* (Claude, ChatGPT) - Explain, debug, design systems, write full programs - *Autonomous tools* (Claude Code, Aider) - Actually edit your files, make changes across codebases
## My Recommendation to Start
1. Try *Cursor* first - download it, paste in some code, and ask it questions. It's the most beginner-friendly 2. Or just start here in Claude - paste your code and I can help debug, explain, or write new features 3. Once comfortable, try GitHub Copilot for in-line suggestions while coding
The key is just picking one and trying it - you don't need to understand everything upfront!
Just play with the 'free tier' on whatever website does the AI thing and figure it out.
Maybe there's a need to try ten different ones but I just stuck with one and can now convince it to do what I want it to do pretty successfully.
Ya know, in the over half a century I've been on this planet, choosing a new pair of shoes is so low on my 'life's little annoyances' list that it doesn't even rise above the noise of all the stupid random things which actually do annoy me.
Maybe the problem is I don't take shoes seriously enough? Something to work on...
You also learned about your shoe needs over the course of a lifetime. A caregiver gave you your first pair and you were expected to toddle around at most with them. You outgrew and replaced shoes as a child, were placed into new scenarios requiring different footwear as you grew up, learning and forming opinions about what's appropriate functionally, socially, economically as you went. You learned what stores were good for your needs, what brands were reputable, what styles and fits appealed to you. It took you more than a decade at minimum to achieve that.
If you allow yourself to be a novice and a learner with AI and LLMs and don't expect to start out as a "shoe expert" where you never even think about this in your life and it's not even an annoyance, you'll find that it's the exact same journey.
And in all the years that LLMs have been available I've yet to find a subscription plan confusing.
Is it though? People complain about sore feet and hear they wear the wrong kind of shoes so they go to the store where they have to spend money to find out while trying to navigate between dress shoes, minimal shoes, running shoes, hiking shoes etc etc., they have to know their size, ask for assistance in trying them on...
Because the offerings are not simple. Your Nike example is silly; everyone knows what to do with shoes and shorts and shirts, and why they might want (or not want) to buy those particular items from Nike.
But for someone who hasn't been immersed in the "LLM scene", it's hard to understand why you might want to use one particular model of another. It's hard to understand why you might want to do per-request API pricing vs. a bucketed usage plan. This is a new technology, and the landscape is changing weekly.
I think maybe it might be nice if folks around here were a bit more charitable and empathetic about this stuff. There's no reason to get all gatekeep-y about this kind of knowledge, and complaining about these questions just sounds condescending and doesn't do anyone any good.
> Why is this the top comment on so many threads about tech products?
Because you overestimate the difference that the representative person understands.
A more accurate analogy is that Nike sells green-blue shoes and Nike sells blue-green shoes, but the blue-green shoes add 3 feet to your jump and green-blue shoes add 20 mph to your 100 yard dash sprint.
You know you need one of them for tomorrow's hurdles race but have no idea which is meaningful for your need.
Also, the green-blue shoes charge per-step, but the blue-green shoes are billed monthly by signing up for BlueGreenPro+ or BlueGreenMax+, each with a hidden step limit but BlueGreenMax+ is the one that gives you access to the Cyan step model which is better; plus the green-blue shoes are only useful when sprinting, but the blue-green shoes can be used in many different events, but only through the Nike blue-green API that only some track&field venues have adopted...
When you walk into a store, you can see and touch all of these products. It's intuitive.
With all this LLM cruft all you get is essentially the same old chat interface that's like the year 2000 called and wants its on-line chat websites back. The only thing other than a text box that you usually get is a model selector dropdown squirreled away in a corner somewhere. And that dropdown doesn't really explain the differences between the cryptic sounding options (GPT-something, Claude Whatever...). Of course this confuses people!
Claude.ai, ChatGPT, etc. are finished B2C products. They're black boxes, encapsulated experiences. Consumers don't want to pick a model, or know what model they're using; they just want to "talk to AI", and for the system to know which model is best to answer any given question. I would bet that for these companies, if their frontend observes you using the little model override button, that gets instrumented as an "oops" event in their metrics — something they aim to minimize.
What you're looking for, are the landing pages of the B2B API products underlying these B2C experiences. That would be https://www.anthropic.com/claude, https://openai.com/api/, etc. (In general, search "[AI company] API".)
From those B2B landing pages, you can usually click through to pages with details about each of their models.
Here's the model page corresponding to this news announcement, for example: https://www.anthropic.com/claude/opus
(Also, note how these B2B pages are on the AI companies' own corporate domains; whereas their B2C products have their own dedicated domains. From their perspective, their B2C offerings are essentially treated as separate companies that happen to consume their APIs — a "reference use-case" — rather than as a part of what the B2B company sells.)
Hey, I'm open to the idea that I'm just stupid. But, if people in your target market (software developers) don't even understand your product line and need a HOWTO+glossary to figure it out, maybe there's also a branding/messaging/onboarding problem?
My hot take is that your friend should show you what they’re using, not just dismiss Copilot and leave you hanging!
Eh, this seems like a take that reeks a bit of "everyone is stupid except me".
I do know the answer to OP's question but that's because I pickle my brain in this stuff. It is legitimately confusing.
The analogy to different SKUs strikes me also inaccurate. This isn't the difference between shoes, shirts, and shorts - it's more as if a company sells three t-shirts but you can't really tell what's different about them.
It's Claude, Claude, and Claude. Which ones code for you? Well, actually, all of them (Code, web/desktop Claude, and the API can all do this)
Which ones do you ask about daily sundry queries? Well, two of them (web/desktop Claude, but also the API, but not Code). Well, except if your sundry query is about a programming topic, in which case Code can also do that!
Ok, if I do want to use this to write code, which one should I use? Honestly, any of them, and the company does a poor job of explaining why you would use each option.
"Which of these very similar-seeming t-shirts should I get?" "You knob. How are posts like this even being posted." is just an extremely poor way to approach other people, IMO.
> It's Claude, Claude, and Claude. Which ones code for you?
Thanks for articulating the confusion better than I could! I feel it's a similar branding problem as other tech companies have: I'm watching Apple TV+ on my Apple TV software running on my Apple TV connected to my Google TV that isn't actually manufactured by Google. But that Google TV also has an Apple TV app that can play Apple TV+.
It's a bit worse than a branding problem honestly, since there's legitimate overlap between products, because ultimately they're different expressions of the same underlying LLMs.
I'm not sure if you ever got a good rundown, but the tl;dr is that the 3 products ("Desktop", Code, and API) all expose the same underlying models, but are given different prompts, tools, and context management techniques that make them behave fairly differently and affect how you interact with them.
- The API is the bare model itself. It has some coding ability because that's inherent to the model - you can ask it to generate code and copy and paste it for example. You normally wouldn't use this except that if you're using some Copilot-type IDE integration where the IDE is doing the work of talking to the model for you and integrating it into your developer experience. In that case you provide API key and the IDE does the heavy lifting.
- The desktop app is actually a half-decent coder. It's capable of producing specific artifacts, distinguishing between multiple "files" it's writing for you, and revisiting previously-written code. "Oh, actually rewrite this in Go." is for example a thing it can totally do. I find it useful for diagnosing issues interactively.
- "Claude Code" is a CLI-only wrapper around the model. Think of it like Anthropic's first-party IDE integration, except there's not an IDE, just the CLI. In this case the integration gives the tool broad powers to actually navigate your filesystem, read specific files, write to specific files, run shell commands like builds and tests, etc. These are all functions that an IDE integration would also give you, but this is done in a Claude-y way.
My personal take is: try Claude Code, since as long as you're halfway comfortable with a CLI it's pretty usable. If you really want a direct IDE integration you can go with the IDE+API key route, though keep in mind that you might end up paying more (Claude Code is all-you-can-eat-with-rate-limits, where API keys will... just keep going).
Wow. After 50 replies to what I thought wasn't such a weird question, your rundown is the most enlightening. Thank you very much.
FWIW it's probably because a lot of us have been following along and trying these things from the start so the nuances seem more obvious but also I feel that some folks feel your question is a bit "stupid", like "why are you suddenly interested in the frontier of these models? where were you for the last 2 years?"
And to some extent it is like the PC race. Imagine going to work and writing software for whatever devices your company writes software for in whatever toolchain your company uses. Then 2-3 years after the PC race began heating up, asking "Hey I only really write code for whatever devices my employer gives me access to. Now I want to buy one of these new PCs but I don't really understand why I'd choose an Intel over a Motorolla chipset or why I'd prioritize more ROM or more RAM, and I keep hearing about this thing called RISC that's way better than CISC and some of these chips claim to have different addressing modes that are better?"
Also when it comes to API integrations, I find some better than others. Copilot has been pretty crummy for me but Zed's Agent Mode seems to be almost as good as Claude Code. I agree with the general take that Claude Code is a good default place to start.
Claude Code running in a terminal can connect to your IDE so you can review its proposed changes there. I’ve found this to be a nice drop in way to try it out without having to change your core workflow and tools too much. Check out the /ide command for details.
If anything, Anthropic has the product lineup that makes the most sense. Higher numbers mean better model. Haiku < Sonnet < Opus which translates to length/size. Free < Pro < Max.
Contrast to something like OpenAI. They've got gpt4.1, 4o, and o4. Which of these are newer than one another? How do people remember which of o4 and 4o are which?
Which Nike shoe is best for basketball? The Nike Dunk, Air Force 1, Air Jordan, LeBron 20, LeBron XXI Prime 93, Kobe IX elite, Giannis Freak 7, GT Cut, GT Cut 3, GT Cut 3 Turbo, GT Hustle 3, or the KD18?
At least with those you can buy whatever you think is coolest. Which Claude model and interface should the average programmer use?
What's the average programmer? Is it someone who likes CLI tools? Or who likes IDE integration? Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.
The environment isn't the only difference, it's not "do you prefer CLI or IDE or Web" because they behave differently. Claude Code and Claude web and Claude through Cursor won't give you identical outputs for the same question.
It's not like running a tool in your IDE or CLI where the only difference is the interface. It would be like if gcc ran from your IDE had faster compile times, but gcc run from the CLI gives better optimizations.
The fact that no one is recommending any baseline to start with proves the point that it's confusing. And we haven't even touched on Sonnet v Opus
> Different strokes for different folks and surely the average programmer understands what environment they will be most comfortable in.
That's a silly claim to me, we're talking about a completely new environment where you prompt an AI to develop code, and therefore an "average programmer" is unlikely to have any meaningful experience or intuition with this flow. That is exactly what GP is talking about - where does he plug in the AI? What tradeoffs are there to different options?
The other day I had someone judge me for asking this question by dismissively saying "dont say youve still been using ChatGPT and copy/paste", which made me laugh - I don't use AI at all, so who was he looking down on?
To me that's the silly argument. How many different tools have you ever used? New build system? New linter? How did you know if you wanted to run those on the command line or in your IDE?
And it seems the story you shared sort of proves the point: the web interface worked fine for you and you didn't need to question it until someone was needlessly rude about it.
Because few seem to want to expend the effort to dive in and understand something. Instead they want the details spoonfed to them by marketing or something.
I absolutely loathe this timeline we're stuck in.
This is like being told to buy Nike shoes. Then when you proudly display your new cleats, they tell you "no, I meant you should by basketball shoes. The cleats are terrible."
Because I think that claude has gone beyond tech niche at this point..
Or maybe that's me, but still whether its through the likes of those vibe coding apps like lovable bolt etc.
at the end of the day, Most people are using the same tool which is claude since its mostly superior in coding (questionable now with oss models, but I still use it through kiro).
People expect this stuff to be simple when in reality its not and there is some frustation I suppose.
Not sure is this is sarcasm I'm assuming not.
You're comparing well understood products that are wildly different to products with code names. Even someone who has never wore a t-shirt will see it on a mannequin and know where it goes.
I'm sorry but I cannot tell what the difference is between sonnet and opus. Unless one is for music...
So in this case you read the docs. Which is, in your analogy, you going to the Nike store and reading up on if a tshirt goes on your upper or lower body.
Surely anyone interested in taking out a Claude subscription knows broadly what they're going to use an LLM for.
It's more like going to the Nike store and asking about the difference between the Vaporfly 3 and the Pegasus 41. I know they're all shoes and therefore go on my feet, but I don't know what the difference is unless one is better for riding horses?
On the contrary, I'm confused about why you're confused.
This is a well-known and documented phenomenon - the paradox of choice.
I've been working in machine learning and AI for nearly 20 years and the number of options out there is overwhelming.
I've found many of the tools out there do some things I want, but not others, so even finding the model or platform that does exactly what I want or does it the best is a time-consuming process.
You need Claude Pro or Max. The website subscription also allows you to use the command line tool—the rate limits are shared—and the command line tool includes IDE integration, at least for VSCode.
Claude Code is currently best-in-class, so no point in starting elsewhere, but you do need to read the documentation.
> You need Claude Pro or Max.
Actually, to try it out, prepaid token billing is fine. You are not required to have a subscription for claude code cli. Even just $5 gave me enough breathing room to get a feeling for its potential, personally. I do not touch code often these days so I was relieved not to have to subscribe and cancel again just to play around a little and have it write some basic scripts for me.
Correct. Claude Code Max with Opus. Don’t even bother with Sonnet.
I wouldn't be too prescriptive. I have Pro, and it's fine. I'm not an incredibly heavy user (yet?); I've hit the rate limits a couple times, but not to the point where I'm motivated to spend more.
I haven't tried it myself, but I've heard from people that Opus can be slow when using it for coding tasks. I've only been using Sonnet, and it's performed well enough for my purposes.
Sonnet works fine in many cases. Opus is smarter, and custom 'agents' can be set to use either.
I prefer configuring it to use Sonnet for things that don't require much reasoning/intelligence, with Opus as the coordinator.
Opus is slow, so sessions should be used in parallel, likely across work trees. You shouldn't sit and wait on an Opus agent.
> use Claude. But I have no idea what the right way to do it is because there are so many paths to choose.
Anthropic has this useful quick start guide: https://docs.anthropic.com/en/docs/claude-code/quickstart
What exactly did you try with GitHub copilot? It’s not an LLM itself, just in interface for an LLM. I have copilot in my professional GitHub account and I can choose between chat-gpt and Claude.
Claude Code has two usage modes: pay-per-token or subscription. Both modes are using API under the hood, but with subscription mode you are only paying a fixed amount a month. Each subscription tier has some undisclosed limits, cheaper plans have lower usage limits. So I would recommend paying $20 and trying the Claude Code via that subscription.
I’m looking for cursor alternatives after confusing pricing changes. Is Claude code an option? Can be integrated on an editor/ide for similar results?
My use case so far is usually requesting mechanic work I would rather describe than write myself like certain test suites, and sometimes discovery on messy code bases.
Claude Code is really good for this situation.
If you like an IDE, for example VS Code you can have the terminal open at the bottom and run Claude Code in that. You can put your instructions there and any edits it makes are visibile in the IDE immediately.
Personally I just keep a separate terminal open and have the terminal and VSCode open on two monitors - seems to work OK for me.
No Opus in the $20 tier though sadly
As far as I can tell - that seems to have changed today!
Actually I think I was wrong, the PR material was just vague about it.
What does Opus do extra?
It's a much larger, more capable LLM than Claude Sonnet.
I mean day to day. How is the coding experience different?
VSCode has a pretty good Gemini integration - it can pull up a chat window from the side. I like to discuss design changes and small refactorings ("I added this new rpc call in my protobuf file, can you go ahead and stub out the parts of code I need to get this working in these 5 different places?") and it usually does a pretty darn good job of looking at surrounding idioms in each place and doing what I want. But gemini can be kind of slow here.
But I would recommend just starting using Claude in the browser, talk through an idea for a project you have and ask it to build it for you. Go ahead and have a brain storming session before you actually ask it to code - it'll help make sure the model has all of the context. Don't be afraid to overload it with requirements - it's generally pretty good at putting together a coherent plan. If the project is small/fits in a single file - say a one page web app or a complicated data schema + sql queries - then it can usually do a pretty good job in one place. Then just copy+paste the code and run it out of the browser.
This workflow works well for exploring and understanding new topics and technologies.
Cursor is nice because it's an AI integrated IDE (smoother than the VSCode experience above) where you can select which models to use. IMO it seems better at tracking project context than Gemini+VSCode.
Hope this helps!
Lets see: We have GitHub, and GitHub Enterprise Server, and a GitHub API. Then there's the command line and a desktop version, and one that is just browser based I guess. Then you have different pricing plans, Free, Team, and Enterprise? How is Enterprise different than GitHub Enterprise Server? It's very easy to find evidence to confirm our bias.
Claude code is actually one of the most straightforward products I've used as far as onboarding goes. You download the tool, and follow the instructions. You can use one of the 3 plans, and everything else is automatic. You can figure out token usage and what models and versions to use and how to use MCP servers and all of that -- there's a lot of power -- but you don't need to do ANY of that to get started trying it out.
You're not being:
> That critic who doesn't try the stuff he criticizes
You're being:
> That critic who is trying to confirm their biases
Claude Code is the superior interface in my opinion. Definitely start there.
Cursor + Claude 4 = best quality + UX balance. Pay up for 20/month subscription.
Cursor imports in your VSCode setup. Setting it up should be trivial.
Use Agent mode. Use it in a preexisting repo.
You're off the races.
There is a lot more you can do, but you should start seeing value at this point.
If you're looking for a coding assistant, get Claude Code, and give it a try. I think you need the Pro plan at a minimum for that ($20/mo; I don't think Free includes Claude Code). Don't do the per-request API pricing as it can get expensive even while just playing around.
Agree that the offering is a bit confusing and it's hard to know where to start.
Just FYI: Claude Code is a terminal-based app. You run it in the working directory of your project, and use your regular editor that you're used to, but of course that means there's no editor integration (unlike something like Cursor). I personally like it that way, but YMMV.
Yes. You basically need an LLM to provide guidance on product selection in this brave new world.
It is actually one of my most useful use cases of this tech. Nice to have a way to ask in private so you don’t get snarky answers like: it’s just like buying shoes!
Claude Code CLI.
Thanks. With the CLI, can you get Copilot-ish things like tab-completion and inline commands directly in your IDE? Or do you need to copy/paste to and from a terminal? It feels like running a command on the IDE and then copying the output into your IDE is a pretty primitive way to operate.
My advice is this:
1) Completely separate in your mind the auto-completion features from the agentic coding features. The auto-completion features are a neat trick but I personally find those to be a bit annoying overall, even if they sometimes hit it completely right. If I'm writing the code, I mostly don't want the LLM autocompletion.
2) Pay the $20 to get a month of Claude Pro access and then install Claude Code. Then, either wait until you have a small task in mind or your stuck on some stupid issue that you've been banging your head on and then open your terminal and fire up Claude Code. Explain to it in plain English what you want it to do. Pretend it's a colleague that you're giving a task to over Slack. And then watch it go. It works directly on your source code. There is no copying and pasting code.
3) Bookmark the Claude website. The next time you'd Google something technical, ask it Claude instead. General questions like "how does one typically implement a flizzle using the floppity-do framework"? "I'm trying to accomplish X, what are my options when using this stack?". General questions like that.
From there you'll start to get it and you'll get better at leverage the tool to do what you want. Then you can branch out the rest of the tool ecosystem.
Interesting about the auto-completion. That was really the only Copilot feature I found to be useful. The idea of writing out an English prompt and telling Copilot what to write sounded (and still sounds) so slow and clunky. By the time I've articulated what I want it to do, I might as well have written the code myself. The auto-completion was at least a major time-saver.
"The card game state is a structure that contains a Deck of cards, represented by a list of type Card, and a list of Players, each containing a Hand which is also a list of type Card, dealt randomly, round-robin from the Deck object." I could have input the data structure and logic myself in the amount of time it took to describe that.
I think you should embrace a bit of ambiguity. Don't treat this like a stupid computer where you have to specify everything in minute detail. Certainly the more detail you give, the better to an extent. But really: Treat it like you're talking to a colleague and give it a shot. You don't have to get it right on the first prompt. You see what it did and you give it further instructions. Autocomplete is the least compelling feature of all of this.
Also, I don't remember what model Copilot uses by default, especially the free version, but the model absolutely makes a difference. That's why I say to spend the $20. That gives you access to Sonnet 4 which is where, imo, these models took a giant leap forward in terms of quality of output.
Is Opus as big a leap as sonnet4 was?
Thanks, I shall give it a try.
One analogy I have been thinking about lately is GPUs. You might say "The amount of time it takes me to fill memory with the data I want, copy from RAM to the GPU, let the GPU do it's thing, then copy it back to RAM, I might as well have just done the task on the CPU!"
I hope when I state it that way you start to realize the error in your thinking process. You don't send trivial tasks to the GPU because the overhead is too high.
You have to experiment and gain experience with agent coding. Just imagine that there are tasks where the overhead of explaining what to do and reviewing the output are dwarfed by the actual implementation. You have to calibrate yourself so you can recognize those tasks and offload them to the agent.
There's a sweet spot in terms of generalization. Yes, painstakingly writing out an object definition in English just so that the LLM can write it out in Java is a poor use of time. You want to give it more general tasks.
But not too general, because then it can get lost in the sauce and do something profoundly wrong.
IMO it's worth the effort to know these tools, because once you have a more intuitive sense for the right level of abstraction it really does help.
So not "make this very basic data structure for me based on my specs", and more like "rewrite this sequential logic into parallel batches", which might take some actual effort but also doesn't require the model to make too many decisions by itself.
It's also pretty good at tests, which tends to be very boilerplate-y, and by default that means you skip some cases, do a lot of brain-melting typing, or copy-and-paste liberally (and suffer the consequences when you missed that one search and replace). The model doesn't tire, and it's a simple enough task that the reliability is high. "Generate test cases for this object, making sure to cover edges cases A, B, and C" is a pretty good ROI in terms of your-time-spent vs. results.
Is there any more agent-oriented approach where it just push/pulls a git repo like a normal person would, instead of running it on my machine? I'd like to keep it a bit more isolated and having it push/pull its own branches seems tidier.
Claude does the coding, and edits your files. You just sit back and relax. You don't do any tab completion etc.
Download Claude Code
Create a new directory in your terminal
Open that directory, type in "Claude" to run Claude
Press Shit + Tab to go into planning mode
Tell Claude what you want to build - recommend something simple to start with. Specify the languages, environment, frameworks you want, etc.
Claude will come up with a plan. Modify the plan or break it into smaller chunks if necessary
Once plan is approved, ask it to start coding. It will ask you for permissions and give you the finished code
It really is something when you actually watch it go.
> I just want to putz around with something in VSCode for a few hours!
I just googled "using claude from vscode" and the first page had a link that brought me to anthropic's step by step guide on how to set this up exactly.
Why care about pricing and product names and UI until it's a problem?
> Someone on HN told me Copilot sucks, use Claude.
I concur, but I'm also just a dude saying some stuff on HN :)
If you want your own cheap IDE integration, you can set up VSCode with Continue extension, ollama running locally, and a small agent model. https://docs.continue.dev/features/agent/model-setup.
If you want to understand how all of this works, the best way is to build a coding agent manually. Its not that hard
1. Start with Ollama running locally and Gemma3 QAT models. https://ollama.com/library/gemma3
2. Write a wrapper around Ollama using your favorite language. The idea is that you want to be able to intercept responses coming back from the model.
3. Create a system prompt that tells the model things like "if the user is asking you to create a file, reply in this format:...". Generally to start, you can specify instructions for read file, write file, and execute file
4. In your wrapper, when you send the input chat prompt, and get the model response back, you look for those formats, and make the wrapper actually execute the action. For example if the model replies back with the format to read file, you read the file from your wrapper code and send it back to the model.
Every coding assistant is basically this under the hood with just a lot more fluff and their own IDE integration.
The benefit of doing your own is that you can customize it to your own needs, and when you direct a model with more precision even the small models perform very well with much faster speed.
OP is asking for where to get started with Claude for coding. They're confused. They just want to mess around with it in VSCode. And you start talking about Ollama, PAT, coding your own wrapper, composing a system prompt etc.!?
OP is trying to get LLMs to assist with coding. Implying that coding is something he is capable of, and coding your own wrapper is a great way to get familiarity with these systems.
Download Cursor and try it through that, IMO that's currently the most polished experience especially since you can change models on the fly. For more advanced usecases, CLI is better but for getting your feet wet I think Cursor is the best choice.
Thanks. Too bad you need to switch editors to go that path. I assume the Cursor monthly plans are not the same as the Claude monthly plans and you can't use one for the other if you want to experiment...
Cursor is built on VSCode.
Kilo Code for VSCode is pretty solid. Give it a try.
You just described all of your options in detail - what's the problem? Pick one. Seems like you've got a very thorough grasp on how to get started trying the stuff out, but it requires you to choose how you want to do that.
Github Copilot and Claude code are not exactly competitors.
Github Copilot is autocomplete, highly useful if you use VS Code, but if you are using e.g. Jetbrains then you have other options. Copilot comes with a bunch of other stuff that I rarely use.
Claude code is project-wide editing, from the CLI.
They complement each other well.
As far as I'm concerned the utility of the AI-focused editors has been diminished by the existence of Claude code, though not entirely made redundant.
This isn't correct. GitHub Copilot now totally competes with Claude Code. You can have it create an entire app for you in "Agent" mode if you're feeling brave. In fact, seeing as Copilot is built directly into Visual Studio when you download it, I guess they have a one-up.
Copilot isn't locked to a specific LLM, though. You can select the model from a panel, but I don't think you can plug in your own right now, and the ones you can select might not be SOTA because of that.
I didn't mean it doesn't attempt to compete, I mean it doesn't actually compete. Claude code for agents, Copilot for autocomplete (depending on your editor/IDE).
For single-line autocomplete, which is how I use it, pretty much anything will do the job. I use Copilot only because it integrates well with VS Code. I find the other features to be inferior.
I use Copilot for the same reason (it's already there in Visual Studio). But I think we're talking about different things -- did you try Agent mode in Copilot? (the naming of all these things is getting confusing)
Sonnet 4 in copilot agent mode has been doing great work for me lately. Especially once you realise that at least 50% of the work is done before you get to copilot, as architectural and product specs and implementations plans.
Is Copilot's Agent Mode any good, though?
Ehhh... I wouldn't use it for anything important right now. It often screws up by truncating code files then asking itself "where did all those functions go?" and having to rewrite them from scratch.
When it works, it's great though. I've used it to vibe-code some nice little desktop apps to automate things I needed and it produced way more polished UI than I would have spent the time doing, and the code is pretty much how I would have written it myself. I just set it going and go do some other task for 10 mins and come back to see what changes it made.
Opencode https://github.com/sst/opencode provides a CC like interface for copilot. It's a slightly worse tool, but since copilot with Claude 4 is super cheap, I ended up preferring it over CC. Almost no limits, cheaper, you can use all the Copilot models, GH is not training on your data.
> Github Copilot is autocomplete... comes with a bunch of other stuff that I rarely use.
That bunch of other stuff includes the chat, and more recently "Agent Mode". I find it pretty useful, and the autocomplete near useless.
honestly - copilot free mode; and just play with the agentic stuff can give you a good idea. Attach it to Roo and you'll get a good idea. Realize that if you paid to use a better model; you'd get better results as free doesn't have a ton of premium tokens.
try asking it ?
All the tools, copilot,claude, gemini in vscode are all completely worthless unless in Agent Mode. I have no idea why none of these tools dont default to Agent mode.
Is there any tool like Claude Code that can go into the same "automatic feedback and coding loop" (I don't know if it has an official name) but compatible with using different LLMs.
I've used Aider for a while, and I kind of liked if, but it felt like it needed way more manual work, and I also want to use different models, probably locally hosted. Haven't used Aider in 2 or 3 months, so I don't know if it already has evolved in that way...
edit: in the other hand, the automatic feedback loop means it sometimes go very crazy and the API costs skyrocket easily. But maybe that's another reason to run it locally.
Claude Code Router (https://github.com/musistudio/claude-code-router) lets you use Claude Code with other, non-Anthropic models.
opencode, but note that all the self-hosted llms are much worse at coding than claude code with opus/sonnet.
there's also claude-code-proxy to make claude code use other models.
Claude code proxy technically works with openai but tool use breaks every now and then on o3-mini, making it unusable for me
o3 and o3-pro are just so good. Sonnet goes off the deep end too often and Opus, in my experience, is not as strong at reasoning compared to OpenAI, despite the higher costs. Rarely do we see a worse, more expensive product win - but competition is good and I’m rooting for Anthropic nonetheless!
OpenAI also has Flex processing[1] for o3. I've spent most of my time with Gemini 2.5, but lately been trying out a ton of o3 as it seems to work quite well and I get really cheap tokens (~95% of my agentic tokens are cached which is 75% discount and flex mode adds 50% for $0.25 / million input tokens)
[1] https://platform.openai.com/docs/guides/flex-processing?api-...
Which agents support flex mode?
I've made my own fork of Codex that always uses flex, or you can route agents through litellm and make it add the service_tier parameter. I haven't really seen native support for it anywhere.
o3 feels pretty good to me as well but o3-pro has consistently one shotted problems other LLMs got stuck on.
I'm talking multiple tries of claude 4 opus, Gemini 2.5 pro, o3 etc resulting in sometimes hundreds of lines of code.
Versus o3-pro (very slowly) analyzing and then fixing something that seemed completely unrelated in a one or two line change and truly fixing the root cause.
o3-pro level LLMs at reduced cost and increased speed will already be amazing..
Off the deep end?
It picks a bad path forward and keeps doubling down on it
Probably referring to it's tendency to over-complicate things to the point you have to step in and be like "WTF are you even talking about... Wouldn't it be a lot simpler to just use the original, well planned out design?"
Which it does a lot...
Cheekily announcing during oAI's oss model launch :D
Why is everything releasing today?
If they release before GPT-5, they don't have to compare to GPT-5 in their benchmarks. It's a big PR win to be able to plausibly claim that your model is the best coding model at the time of release.
Could it be nobody wanted to be first and overshadowed, nor the only one left out - and it cascaded after the first announcement? My first hunch, though, was that it had been agreed upon. Game theory I think tells us that releasing same day in the pattern ABC BCA CAB etc would be lowest risk and highest average gain?
Opus 4.1 is now set as default model in Claude Code - just a heads-up.
Not for me.
The improved Opus isn’t about achieving significantly better peak performance for me. It’s not about pushing the high end of the spectrum. Instead, it’s about consistently delivering better average results - structuring outputs more effectively, self-correcting mistakes more reliably, and becoming a trustworthy workhorse for everyday tasks.
Have been using it in Claude Code with Max Plan for one day. The rate of acceptance is noticeably higher.
Has anyone tested it yet? How's it acting?
Tested it on a refactor of Zig code. It worked fine, but was very slow.
No obvious gains I feel from quick chats, but too early to tell.
These benchmark gains aren't that high, so I doubt it is that obvious.
waiting for this, too.
It's interesting that Anthropic maintains current prices for prior state of the art models when doing a new release. Why offer a model with worse performance for the same price? What incentives are they trying to create?
> What incentives are they trying to create?
One obvious explanation is that pricing is strongly related to the price to them, and that their only incentive is for people to use an expensive model of they really need it.
I forget which one of the GPT models was better, faster, and cheaper than the previous model. The incentive there is obviously, "If you want to use the old model for whatever reason, fine, but we really want you to use the new one because costs us less to run."
I'm guessing it's mostly for legacy reasons. When 3.7 came out many people were not happy with it and went back to 3.5; I guess supporting older models for a while makes sense.
Funny Open AI and Anthropic seems to be coordinating their releases on the same day
just ran the LLM to SQL benchmark over opus-4.1 and it didn't top previous version :thinking: => https://llm-benchmark.tinybird.live/
How does running it multiple times performs?
LLMs are non-deterministic, I think benchmarks should be more about averages of N runs, rather than single shot experiments.
Their limits are just … a real road blocker
huh?
Claude Mad is tens of hours of opus a month, or you can pay per token and have unlimited.
Or did you mean “I wish it was cheaper”?
Ha - the $200 plan should be renamed to "Claude Mad Max" :)
Claude Code has honestly made me at least 10x more productive. I’ve burned through about 3 billion tokens and have been consistently merging 5+ PRs a day, tackling tons of tech debt, improving GitHub Actions, and making crazy progress on product work
only 10x? I'm at least 100x as productive. I only type at a measly 100wpm, whereas Claude can output 100+ tokens a second
I'm outputting a PR every 6 minutes. The reviewers are using Claude to review everything. It used to take a day to add 100 lines to the codebase.. now I can add 100 lines in one prompt
If I want even more productivity (at risk of making the rest of my team look slow) I can tell Claude to output double the lines and ship it off for review. My performance metrics are incredible
So no human reads the actual code that you push to production? Are you not worried about security risks, spaghetti code, and other issues? Or does Claude magically make all of those concerns go away?
forgot the /s
Sorry lol, sometimes difficult to separate the hype boys from actual sarcasm these days
Not sure if joking...?
This is only the beginning. I can see myself having 100 Claude tasks running concurrently - the only problem is edits clash between files. I'm working on having Claude solve this by giving each instance its own repo to work with, then I ask the final Claude to mash it all together as best it can
What's 100x productivity multiplied by 100 instances of Claude? 10,000x productivity
Now to be fair and a bit more realistic it's not actually 10000x because it takes longer to push the PR because the file sizes are so big. Let's call it 9800x. That's still a sizable improvement
Big if true
I also have this feeling that I'm 2-10x more productive. But isn't it curious how a lot of devs feel this way, but no devs that I know have the experience that any of their colleagues have become 2-10x more productive?
<raises hand> Our automated test folks were chronically behind, struggling to keep up with feature development. I got the two assigned to the team that was the most behind set up with Claude Code. Six weeks later they are fully caught up, expanding coverage, and integrating AI code review into our build pipeline.
It's not 10x, but those guys do seem like they've hit somewhere around 2x improvement overall.
10x means to me that i can finish a month of work in max 2 days and go cloud watching. What does it mean for you?
Sometimes 10x can mean that I start things that I would have never started before, knowing it would take a long time. Or that I can have any of the agentic stuff "explore" libs, stacks and frameworks I wanted to look at, but had no time. Or distill some vague docs and blog posts to find common use cases for tech x. And so on.
It's not always a literal 10x time for taskA w/ AI vs taskA w/o AI...
A 60 minute script becomes 6 minutes
What type of work do you do and what type of code do you produce?
Because I've found it to work pretty amazingly for things that don't need to be exact (like data modeling) or don't have any security implications (public apps). But for everything else I end up having to find all the little bugs by reading the code line by line, which is much slower than just writing the code in the first place.
How do you maintain high confidence in the code it generates ?
My current bottleneck is having to review the huge amounts of code that these models spit out. I do TDD, use auto-linting and type-checking.... but the model makes insidious changes that are only visible on deep inspection.
You have to review your code for quality and bugs and errors now just as you did last month or last year. Did you never write bugs accidentally before?
We're all bottlenecked on reviewing now. That's a good thing.
There was a greater awareness of exactly what I'd written. By definition, I would not have written those bugs in, as long as I had known edge cases in my mind.
Lapses of judgement and syntax errors happen, but they're easier to spot because you know exactly what you're looking at. When code is written by a model, I have to review it 3 times.
1st to understand the code. 2nd to identify lapses in suspicious areas. 3rd to confirm my suspicions through interactive tests, because the model can use patterns I'm unfamiliar with, and it takes me some googling to confirm if certain patterns used by the model are outright bugs or not. The biggest time sink is fixing an identified bug, because now you're doing it in someone-else's (model's) legacy code rather than a greenfield feature implementation.
It's a big productivity bump. But, if reviewing is the bottleneck, then that upper bounds the productivity gains at ~4x for me. Still incredible technology, but the death of software-engineering that it is claimed to be.
The only way you could be 10x more productive is omit you were doing nothing before.
can you share your workflow?
> 1 min read
What the point of these?
Kind of interesting that we live in an area of AI super advanced, but still make basic UI/UX mistake. The tagline of this blog post shouldn't be "1 min read".
It's not even accurate. I timed myself not reading fast but not slow, took me 3 min 30s. Maybe the images need be OCRed to make the estimation more accurate.
Claude lost me after I used it for a day. Their pricing model is bonkers. There is no way any developer in their right mind would go with Claude.
Their API pricing is bonkers, their subscription is a great deal for what you get
Claude plus failed me today badly compared to chatGPT plus.
I uploaded a web design of mine (jpeg) and asked Claude to create the html/css. Asked GPT to do the same. GPT's code looked the closet to the design I created and uploaded. Just five to ten small tweaks and I was done vs. Claude it would have taken me almost triple the steps.
I actually subscribed to both today (resubscribed to GPT) and going to keep testing which one is the better front-end developer (i am, but got to embrace AI ).
Will the price for 4 go down? I still find Opus completely unusable for the cost/performance, as someone who spends thousands per month on tokens. There's really no noticeable difference from Sonnet, at nearly 10x the price.
Well wait another 24hrs…
Is it just me, or is Opus 4.1 substantially worse in Claude Code than Opus 4.0 was? I feel like I'm using Sonnet.
It's making really stupid errors and I have to work three times as much to get the same results as last week.
Is it just me or is it super slow?
Notice how Anthropic has never open sourced any of their models.
This makes them (Anthropic) worse than OpenAI in terms of openness.
Since in this case as we all know. [0]
"What will permanently change everything is open source and transparent AI models that are smaller and more powerful than GPT-3 or even GPT-4."
[0] https://news.ycombinator.com/item?id=34865626
On the other hand, they have always exposed their raw chain of thought, so you know exactly what you're paying for, unlike OpenAI who hides it. Similarly they allow an actual thinking budget rather than vague "low, medium, high", again unlike OpenAI. They also allow API access to all their models without draconic send-us-your-personal-data-KYC, once more unlikely OpenAI.
They might not fit your personal definition of "openness", but they do fit many other equally valid interpretations of that contept.
For me this is the big news of the day. Looks insane.