Claude Opus 4.8 is here. Is it as good as they say?
I got a few hours of early-access testing with Anthropic’s newly released model, Opus 4.8. I walk through real coding, design, and strategy tasks across Claude Code and Claude Cowork, and give you my unfiltered view on what impressed me and what didn’t.

What you’ll learn:
• Where Opus 4.8 excels: greenfield prototypes, one-shot features, and fast execution
• Where it struggles: the last 10%, edge cases in existing codebases, and hallucinations
• How Opus 4.8 compares to Opus 4.7 on business strategy work
• Why I’m still reaching for Opus 4.7 on data-heavy strategy and roadmap work
• The new features shipping alongside the model: dynamic workflows with parallel subagents and effort control in Claude.ai and Cowork
• The prompting and harness strategy I’d use to get the most out of it

In this episode, we cover:
(00:00) Introduction to Opus 4.8
(00:44) Benchmark performance and pricing
(01:53) First coding test: Building a prototyping tool
(03:00) Where it failed: The last 10% problem
(03:27) The hallucination problem
(04:23) Testing Opus 4.8 on existing codebases
(05:24) The ambition test: Building games for a 9-year-old
(07:03) Business strategy test: 4.7 vs 4.8
(08:23) The roadmap test
(09:17) Final verdict

References:
• System Card: Claude Opus 4.8: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf
• Introducing Claude Opus 4.8 on X: https://x.com/claudeai/status/[redacted card]?s=20

Where to find Claire Vo:
• ChatPRD: https://www.chatprd.ai/
• Website: https://clairevo.com/
• LinkedIn: https://www.linkedin.com/in/clairevo/
- Published May 28, 2026
- Uploaded Jun 12, 2026
- File type: POD
- Source: podcasters.spotify.com
Full transcript
AI-generated transcript with timestamped sections.
[00:04] - Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we have a very special mini episode because Anthropic just dropped Opus 4.8, [00:17] their latest state-of-the-art coding model. And I got a few hours of early access, and I'm here to share my very early thoughts about where this model is intended to perform well, [00:28] where it did a great job and totally impressed me, and where there's still a little bit further to go. Let's get to it. [00:33] As you can tell, I am not in my regular How I AI studio, and that's because I am so excited to give you my early thoughts on Opus 4.8 [00:41] and couldn't wait between meetings to share what I thought. [00:44] So to get started, I want to talk about what this model is, what Anthropic has told us about its benchmarks, performance, and what it's good at. So Anthropic is shipping Opus 4.8. It is supposed to be their step-change model for agents. And there's a couple things they've called out that this model does particularly well. [01:03] It's supposed to be more honest, less design slop, [01:07] longer-horizon autonomy on long-running tasks, and enterprise-ready. So it means it follows its instructions. [01:13] And they're saying that on SWE-bench Pro, they're hitting 69.2%, [01:18] which is almost 5 points higher than Opus 4.7, almost 10 points higher than GPT 5.5, and 15 points higher than Gemini 3.1. [01:27] Now this model is not cheap. It's $5 per million input tokens and $25 per million output tokens.
[01:34] And then, same as 4.7, effort defaults to high, and fast mode can be a lot faster. [01:40] This is what they say; this is what you're going to read on the blog post. And so on paper, this is a very exciting model. But I want to tell you my personal experience using this model and where I thought it did a really good job, [01:50] and again, where it did not do a perfect job. And so when I was giving feedback to the team, I said, [01:56] surprise, surprise, LOL, it's a good coding model, [01:59] in that when I opened up Claude Code and asked it to do [02:03] a fairly complex one-shot, [02:05] brand new surface area task, it did a pretty good job. So [02:09] I asked Opus 4.8, in Claude Code, to build a prototyping [02:15] capability in ChatPRD. So we make PRDs. I said, let's just go whole hog. Let's compete with the big boys. Let's make an entire prototyping tool. [02:22] And I gave it some architecture decisions I wanted to make, what platforms I wanted to use, how I wanted it to function. [02:28] It went through plan and then it autonomously coded for, I would say, about 20 minutes [02:34] and shipped it. And when I pushed this live to my preview branch, [02:37] it worked. [02:38] And so I would say, for a one-shot feature, it did quite a good job. The code was right and it followed the architecture I wanted. [02:45] Where it failed was this last 10%. And this is really going to be my theme of this episode. [02:52] It does really, really well [02:54] until it doesn't do well. And I found it did not do well consistently over time with the same types of trouble. So [03:00] I'm curious, as you all get your hands on this model, if you have the same experience I did,
[03:04] where it does, like, really, really, really well and then struggles in edge cases and the details. [03:09] So, [03:10] what it nailed here is it did take the spec, it planned the work, it shipped the feature. [03:14] But then as soon as I got it live and started trying to take it to the next level, the next level, the next level, it really struggled [03:21] and started to ship bugs. And even more than its inability to finish that last 10%, when it was bug hunting, [03:28] it hallucinated. And I am going to tell you, I have not seen a straight-up hallucination in a very, very, very long time. [03:36] But over my experience early testing Opus 4.8, both on business use cases as well as coding use cases, [03:43] it [03:43] 100% made up things based on a hypothesis, not data. And this was really interesting to me. This was on [03:51] high effort. And so I don't think it was effort or reasoning. There's something about this model where it's really not grounding itself as effectively as I've seen in other models. [04:01] Again, this was a one-shot, but then I very specifically prompted it on scoped surface areas for follow-ups, like [04:08] "I saw a bug in the preview branch," [04:11] and got these hallucinations. And so this is a really [04:14] interesting reflection of this bug. I'm going to have to run at it a little bit more to see if this holds over time with coding use cases. [04:20] But it was kind of the theme of my test here. Okay, the headline is a little dramatic. It says, in real codebases, the edges destroy it. [04:28] This is not Opus 4.8. This is just a Claude Cowork fail here. It doesn't have the screen, the screenshots. I'll have to show you the GitHub for this.
[04:35] But basically what I saw is when I pointed it at existing [04:39] code, it also struggled to sort of insert itself and understand the edges of where it was supposed to work. [04:44] So let me give you an example of this. I had a couple of branches in flight that I needed to rebase, that I needed to bring up to base because we had shipped a big underlying PR and it kind of messed up [04:54] the state of the code. And so I asked Opus 4.8 to [04:58] rebase and check these branches for code. [05:00] And as you can see here, I had to do cycle after cycle of [05:04] rebase and fixes because it was continuing to ship really edge-case [05:10] bugs [05:11] into the code. And again, this was my experience. I thought it did a really good job one-shot on a surface area. [05:17] But then when you got into the specifics, it struggled to understand the elevation at which it should [05:23] be operating. [05:24] The third Git coding use case that I tried [05:27] was a fun one, which is I pulled up Claude Code and asked it, just what are some fun things we can one-shot with Claude Code that my nine-year-old would think is rad? And [05:35] I really tried to push it to say, make it really interesting. Think about the edges of [05:41] agentic coding. And [05:43] aside from the code quality itself, which I struggled with, sort of had highs and lows, [05:48] the other thing I reflected on when I was coding with Opus 4.8 is it just wasn't [05:53] ambitious enough. And so it gave me this awesome prompt, which was build a game, then play it yourself by watching the screen and tweaking the difficulty until it's fun for a nine-year-old. Amazing. This is a state-of-the-art coding agent. [06:04] It's going to cook. [06:05] Let me show you what it actually shipped.
[06:08] It shipped [06:09] this. Which is, like, fine, of course magic. Like, I would have never been able to ship this by myself without a lot of effort, [06:16] but not pushing the edges of agentic coding. And [06:19] even when I said, great, let's make it [06:21] 3D, let's do something even more fun, [06:24] it shipped something like this, which again is super cool. I would have never been able to do this. [06:29] But... [06:31] it's not 10x, agentic-coding, blow-my-mind impressive. [06:35] And so this is where I really struggled with Opus 4.8: [06:39] I kept saying more, more, do better, do better. And it just wasn't as ambitious as I've seen [06:45] other models be. So [06:48] in terms of coding, I think it does a totally serviceable job. [06:51] I wouldn't say it's bad at coding. I would just say my experience has been it struggles with the last 10%. [06:56] It's not exceptional at orienting itself inside existing codebases. [07:00] And then it's just not that ambitious. [07:03] Now let's talk about [07:05] business work. So I also tested Opus 4.8 in Claude Cowork. [07:09] And I tested it on strategy, and I gave it this very broad prompt, and I tested [07:15] 4.7 versus 4.8. And I basically said, based on what you can gather about my last three months, [07:19] where am I spending my time versus where my priorities should be if I want to 10x my business? [07:24] I gave it access to all the same business context. [07:27] And then once it did that analysis, I said, please write me a strategy prompt. [07:32] And this is where the performance of Opus 4.7 versus Opus 4.8 [07:37] really became apparent.
[07:39] Opus 4.7 was very numbers-anchored. You can see this table here. I obfuscated some of the numbers, but... [07:45] It was very numbers-anchored. It was very structured and rooted in [07:50] real data. [07:52] While both of these exercises did have access to the same data, [07:56] Opus 4.8 had a harder time discovering the relevant data, [08:00] and it over-rotated on small data points and took them as [08:04] truth, as opposed to [08:06] what I experienced Opus 4.7 doing, [08:08] which is it zoomed out a lot more and put everything in context. [08:12] Now, again, [08:13] this is... [08:14] mutual one-shots. It was basically two-shot. It was like, analyze my time and then give me a strategy to grow my business. [08:20] But the difference between these two was very, very high. I then asked it a follow-up prompt to build a roadmap. [08:27] And again, 4.7, very anchored in specifics, very good strategy. [08:32] And 4.8 was incredibly hand-wavy. And in fact, with Opus 4.8, it gave me a roadmap. And then I said, we have all this. Did you search through GitHub? Did you look online? [08:43] And what's really funny again with the hallucination is, you see here, "No, I didn't." This is a common thing that I had Opus 4.8 say to me. [08:50] No, I didn't search GitHub. No, I didn't actually look up that data. No, I didn't actually validate that bug. [08:56] Now again, this was early access, so I'm not 100% sure if this is prompting error, if it's the shape of the model, if it's the harness that needs to be tuned. [09:04] But I consistently got this experience of the model hallucinating or over-rotating on a hypothesis that it had
[09:11] as opposed to being anchored in true code truth or in true business truth. [09:16] And so, [09:17] I honestly would continue to reach for Opus 4.7, which I think did an exceptional job on strategy, [09:24] versus 4.8, which I think was a lot more hand-wavy and just over-rotated on things I didn't think were important. [09:30] Now, that being said, what positively impressed me? [09:33] Voice is great. [09:35] Claude is not an annoying girlfriend, is what I would say. [09:38] It was easy to read. It didn't have slop tells. It was token-efficient. It felt like it was talking enough, but not too much. [09:46] And it was fast. Now, I got early access. Who knows what the production latency is? [09:51] But [09:52] with fast mode, I anticipate you'll have this fast experience. So I think the ergonomics were very nice. Now, if we zoom out and I say, [09:59] the writing was very good, and then [10:02] Opus 4.7 wrote this slide. I don't know if I love this slide that much. So hopefully 4.8 would have done a better job with the voice and ergonomics. But I do think the experience of using the model was very nice. I had no complaints. It was not annoying. It did not have tics and tells. [10:18] Just the outputs were not exactly what I wanted. [10:21] So here's my theory, and this is what I saw. It's just over-tuned and has kind of narrow [10:26] vision. So it's smart, it's fast, it's efficient, [10:29] but it's overly confident absent true validation. That's what I would want you to walk away with from my review of Opus 4.8. [10:36] It really latches onto specific data points, specific code points. [10:40] It draws conclusions from them and then says, this must be truth. And so it sort of misses the forest for the trees
[10:46] both in coding and in strategy. And [10:50] this might be part of its efficiency. Like, I thought it was super efficient. But does that come at the cost of accuracy? And would I rather have a long-running, [10:59] sort of relentless [11:01] coding model really going deep and validating its own opinion [11:05] before shipping? So I didn't quite experience, [11:08] I would say, this more honest and long-horizon autonomy. [11:13] I did see it was fast. I did find it was enjoyable to work with. I think it followed instructions well, but it stayed too much in scope, if that makes sense, because it didn't zoom out [11:22] and contextualize the work that it's doing. [11:25] My verdict: I mean, all these models are great. They're all magic. So like, let's be real. Every model is magic. The fact that I could do any of this in just a couple hours is pretty genius. [11:36] But... [11:37] I would use it for greenfield prototypes. It's really impressive on a one-shot. I think its design is better. It got rid of the italics emphasis words, which were driving me crazy [11:46] from Claude design. [11:47] It's good at tool use. It's fast. It's not annoying. [11:51] Where I would test it and really figure out the right prompting strategy and the right harness strategy is with [11:57] existing codebases and branches with real edge cases, [12:00] with strategy work that requires you to think about numbers. And again, you can probably prompt this, but I would just think about that prompting. [12:07] And I would just double-check where it's really confident, because my experience was its confidence was not rooted in fact. [12:14] Again, I'm really excited to see this model come out. There's a couple more features as well in Claude Code, as well as Claude.ai and Cowork. In Claude Code,
[12:23] You now have dynamic workflows, which can let you spin off hundreds of parallel subagents. [12:28] And in Claude.ai and Cowork, you now can set effort control from low to max, which you were able to do in Claude Code. So these are all [12:36] really interesting shifts in both the harness and the model. [12:40] I would say [12:41] it's a good model. It's not the most amazing model. It didn't blow my mind. It has some quirks to it that I think you need to be aware of. But I'm definitely going to keep testing it because, with the benchmarks, with the work that's gone into the product, I think it's a model [12:54] worth keeping your eye on. [12:55] So that's it. That's my quick review of Opus 4.8 that just came out today from Anthropic. [13:00] I'd love to hear your experience with it, especially how it does in coding, how it does in design, [13:05] and whether or not it gives you strategy anchored in reality. [13:08] Thanks for joining How I AI. [13:11] Thanks so much for watching. If you enjoyed the show, please like and subscribe here on YouTube or even better, leave us a comment with your thoughts. [13:19] You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. [13:36] See you next time.