I’ve been around technology long enough that very little excites me, and even less surprises me. But shortly after OpenAI’s ChatGPT was released, I asked it to write a WordPress plugin for my wife’s e-commerce site. When it did, and the plugin worked, I was indeed surprised.
That was the beginning of my deep exploration into chatbots and AI-assisted programming. Since then, I’ve subjected 14 large language models (LLMs) to four real-world tests.
Also: Only 8% of Americans would pay extra for AI, according to ZDNET-Aberdeen research
Unfortunately, not all chatbots can code alike. It has been just over two years since that first test, and even now, four of the 13 LLMs I tested can’t create working plugins.
The short version
In this article, I’ll show you how each LLM performed against my tests. There are now five chatbots I recommend you use.
Two of them, ChatGPT Plus and Perplexity Pro, cost $20 per month each. The free versions of the same chatbots do well enough that you could probably get by without paying. Two other recommended products are from Google and Microsoft. Google’s Gemini Pro 2.5 is free, but you’re limited to so few queries that you really can’t use it without paying.
Also: I tested 10 AI content detectors – and these 5 correctly identified AI text every time
Microsoft has several Copilot licenses, which can get pricey, but I used the free version with surprisingly good results. The final one, Claude 4 Sonnet, is the free version of Claude. Oddly enough, the free version beat the paid-for version, so we’re not recommending Claude 4 Opus.
But the rest, whether free or paid, are not so great. I won’t risk my programming projects with them or recommend that you do, until their performance improves.
I’ve written lots about using AIs to help with programming. Unless it’s a small, simple project like my wife’s plugin, AIs can’t write entire apps or programs. But they excel at writing a few lines and are not bad at fixing code.
Rather than repeat everything I’ve written, go ahead and read this article: How to use ChatGPT to write code.
If you want to understand my coding tests, why I’ve chosen them, and why they’re relevant to this review of the 13 LLMs, read this article: How I test an AI chatbot’s coding ability.
The AI coding leaderboard
Let’s start with a comparative look at how the chatbots performed, as of this installment of our best-of roundup:
Next, let’s look at each chatbot individually. I’m back up to discussing 14 chatbots, because we’re splitting out Claude 4 Sonnet and Claude 4 Opus as separate tests. GPT-4 is no longer included since OpenAI has sunsetted that LLM. Ready? Let’s go.
ChatGPT Plus
- Pro: Passed all tests
- Pro: Solid coding results
- Pro: Mac app
- Con: Hallucinations
- Con: No Windows app yet
- Con: Sometimes uncooperative
- Price: $20/mo
- LLM: GPT-4o, GPT-3.5
- Desktop browser interface: Yes
- Dedicated Mac app: Yes
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 4 of 4
ChatGPT Plus with GPT-4o passed all my tests. One of my favorite features is the availability of a dedicated app. When I test web programming, I have my browser set on one thing, my IDE open, and the ChatGPT Mac app running on a separate screen.
Also: I put GPT-4o through my coding tests and it aced them – except for one weird result
In addition, Logitech’s Prompt Builder, which can be activated with a mouse button, can be set up to use the upgraded GPT-4o and connect to your OpenAI account, allowing for a simple thumb tap to run a prompt, which is very convenient.
The only thing I didn’t like was that one of my GPT-4o tests resulted in a dual-choice answer, and one of those answers was wrong. I’d rather it just gave me the correct answer. Even so, a quick test confirmed which answer would work. Still, that issue was a bit annoying.
Perplexity Pro
- Pro: Multiple LLMs
- Pro: Search criteria displayed
- Pro: Good sourcing
- Con: Email-only login
- Con: No desktop app
- Price: $20/mo
- LLM: GPT-4o, Claude 3.5 Sonnet, Sonar Large, Claude 3 Opus, Llama 3.1 405B
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: No
- Tests passed: 4 of 4
I seriously considered listing Perplexity Pro as the best overall AI chatbot for coding, but one failing kept it out of the top slot: how you log in. Perplexity doesn’t use a username/password or passkey and doesn’t have multi-factor authentication. All the tool does is email you a login PIN. The AI doesn’t have a separate desktop app, as ChatGPT does for Macs.
What sets Perplexity apart from other tools is that it can run multiple LLMs. While you can’t set an LLM for a given session, you can easily go into the settings and choose the active model.
Also: Can Perplexity Pro help you code? It aced my programming tests – thanks to GPT-4
For programming, you’ll probably want to stick with GPT-4o, because that model aced all our tests. But it might be interesting to cross-check your code across the different LLMs. For example, if you have GPT-4o write some regular expression code, you might consider switching to a different LLM to see what that model thinks of the generated code.
As we’ll see below, most LLMs are unreliable, so don’t take the results as gospel. However, you can use the results to check your original code. It’s kind of like an AI-driven code review.
Just remember to switch back to GPT-4o.
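Whichever model wrote the regular expression, a handful of known inputs will tell you more than a second chatbot’s opinion. Here is a minimal sketch of that kind of quick check. The test cases, the two candidate patterns, and the `check_pattern` helper are all hypothetical, invented for illustration; they are not taken from my actual test suite.

```python
import re

# Hypothetical cases for a "dollars with optional two-digit cents" validator:
# each entry is (input text, whether the pattern should accept it).
CASES = [
    ("12.50", True),
    ("100", True),
    ("0.99", True),
    ("12.345", False),
    ("abc", False),
    ("", False),
    ("1.", False),
]


def check_pattern(pattern: str, cases=CASES) -> list:
    """Return the inputs on which the pattern disagrees with the expected result."""
    compiled = re.compile(pattern)
    failures = []
    for text, expected in cases:
        # fullmatch requires the whole string to match, not just a prefix.
        matched = compiled.fullmatch(text) is not None
        if matched != expected:
            failures.append(text)
    return failures


# Two made-up candidates, standing in for answers from two different LLMs.
candidate_a = r"\d+(\.\d{2})?"
candidate_b = r"\d*\.?\d*"

if __name__ == "__main__":
    for name, pattern in [("A", candidate_a), ("B", candidate_b)]:
        failures = check_pattern(pattern)
        status = "passes every case" if not failures else f"fails on {failures!r}"
        print(f"Candidate {name} {status}")
```

The point is not these particular patterns. It is that a table of inputs you already know the answer to settles the question in seconds, no matter which chatbot sounded more confident.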
Gemini Pro 2.5
- Price: Free for limited use, then token-based pricing
- LLM: Gemini Pro 2.5
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 4 of 4
The last time I looked at Gemini, it failed miserably. Not quite as bad as Copilot at the time, but bad. Gemini Pro 2.5, however, has performed quite admirably. My only real issue with it is access. I found myself cut off from the free version after only running two of the four tests.
Also: Gemini Pro 2.5 is a stunningly capable coding assistant – and a big threat to ChatGPT
I waited a day and then ran the third test, and got cut off again. Finally, on the third day, I ran my fourth test. Clearly, you can’t do any real programming if you can only ask one or two questions before being shut down. So, if you sign up with Gemini Pro 2.5, be aware that Google charges by tokens (basically, the amount of AI you use). That can make it quite difficult to predict your expenses.
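To get a rough feel for why token-based billing is hard to budget, here is a back-of-the-envelope sketch. Every number in it is a placeholder: the per-million-token rates, the token counts per prompt, and the usage pattern are assumptions made up for illustration, not Google’s actual pricing. Check the current rate card before relying on anything like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per one million tokens."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Estimate the dollar cost of one request from its token counts."""
    input_cost = (input_tokens / 1_000_000) * rates.input_per_million
    output_cost = (output_tokens / 1_000_000) * rates.output_per_million
    return input_cost + output_cost


def estimate_monthly(prompts_per_day: int, working_days: int,
                     input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Scale a single-request estimate up to a month of similar requests."""
    per_request = estimate_cost(input_tokens, output_tokens, rates)
    return prompts_per_day * working_days * per_request


# Placeholder rates, NOT real pricing.
PLACEHOLDER_RATES = Rates(input_per_million=1.00, output_per_million=10.00)

if __name__ == "__main__":
    # Assumed workload: 40 prompts a day, 22 working days,
    # about 2,000 tokens in and 1,500 tokens out per prompt.
    monthly = estimate_monthly(40, 22, 2_000, 1_500, PLACEHOLDER_RATES)
    print(f"Estimated monthly cost: ${monthly:.2f}")
```

The arithmetic is trivial. The hard part is that you rarely know your token counts in advance, and pasting in a large source file or getting a long answer can multiply them, which is why the bill is difficult to predict.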
Microsoft Copilot
- Price: Free for basic Copilot, or fees for other Copilot licenses
- LLM: Undisclosed
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 4 of 4
In all my previous analyses of Microsoft Copilot, the results were the worst of the LLMs. Copilot got nothing right. It was astonishing how bad it was. But I said then that, “The one positive thing is that Microsoft always learns from its mistakes. So, I’ll check back later and see if this result improves.”
Also: I retested Microsoft Copilot’s AI coding skills in 2025 and now it’s got serious game
And boy, did it ever. This time out, Microsoft passed all four of my tests. Even better, it did this with the free version of Copilot. Yes, Microsoft has many paid programs for Copilot, but if you want to give it a spin, point yourself to Copilot and use it.
Claude 4 Sonnet
- Price: Free
- LLM: Claude 4
- Desktop browser interface: No
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 4 of 4
This is one of those times when AI implementations can be real head-scratchers. In our previous tests, Claude 4 Sonnet finished at the bottom of the barrel, failing all four of our tests. This time, however, Sonnet passed every test. So, what’s the head-scratcher? Opus, the fee-paid version of the Claude 4 model, didn’t do as well: it failed half the tests.
Also: Anthropic’s free Claude 4 Sonnet aced my coding tests – but its paid Opus model somehow didn’t
So, yes. The free version worked like a champ. And the one you’re paying anywhere from $20 to $250 a month for, depending on the plan? Well, that one failed half of the tests. Go figure.
Grok
- Pro: Different LLM than ChatGPT
- Pro: Good descriptions
- Pro: Free access
- Con: Only available in browser mode
- Con: Free access likely only temporary
- Price: Free (for now)
- LLM: Grok-1
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 3 of 4
I have to say, Grok surprised me. I guess I didn’t have high hopes for an LLM that seemed tacked on to the social network formerly known as Twitter. However, X is now owned by Elon Musk, and two of Musk’s companies, Tesla and SpaceX, have towering AI capabilities.
It’s unclear how much Tesla and SpaceX AI DNA is in Grok, but we can assume there will likely be more work. As of now, Grok is the only LLM not based on OpenAI LLMs that made it into the recommended list.
Also: X’s Grok did surprisingly well in my AI coding tests
Grok did make one mistake, but it was a relatively minor one that a slightly more comprehensive prompt could easily remedy. Yes, it failed the test. But by passing the others and even doing an almost perfect job on the one it failed, Grok earned itself a spot as a contender.
Stay tuned. This is an AI to watch.
ChatGPT (free version)
- Con: Prompt throttling
- Con: Could cut you off in the middle of whatever you’re working on
- Price: Free
- LLM: GPT-4o, GPT-3.5
- Desktop browser interface: Yes
- Dedicated Mac app: Yes
- Dedicated Windows app: No
- Multi-factor authentication: Yes
- Tests passed: 3 of 4 in GPT-3.5 mode
ChatGPT is available to anyone for free. While both the Plus and free versions support GPT-4o, which passed all my programming tests, the free app has limitations.
OpenAI treats free ChatGPT users as if they’re in the cheap seats. If traffic is high or the servers are busy, the free version of ChatGPT will only make GPT-3.5 available to free users. The tool will only allow you a certain number of queries before it downgrades or shuts you off.
Also: How to use ChatGPT to write code – and my favorite trick to debug what it generates
I’ve had several occasions when the free version of ChatGPT effectively told me I’d asked too many questions.
ChatGPT is a great tool, as long as you don’t mind it shutting down. Even GPT-3.5 did better on the tests than all the other chatbots, and the test it failed was for a fairly obscure programming tool produced by a lone programmer in Australia.
So, if budget is important to you and you can wait when you’re cut off, then use ChatGPT for free.
Perplexity (free version)
- Pro: Free
- Pro: Passed most tests
- Pro: Range of research tools
- Con: Limited to GPT-3.5
- Con: Throttles prompt results
- Price: Free
- LLM: GPT-3.5
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: No
- Tests passed: 3 of 4
I’m threading a pretty fine needle here, but because Perplexity AI’s free version is based on GPT-3.5, the test results were measurably better than the other AI chatbots.
Also: 5 reasons why I prefer Perplexity over every other AI chatbot
From a programming perspective, that’s pretty much the whole story. However, from a research and organization perspective, my ZDNET colleague Steven Vaughan-Nichols prefers Perplexity over the other AIs.
He likes how Perplexity provides more complete sources for research questions, cites its sources, organizes the replies, and offers questions for further searches.
So, if you’re programming, but also working on other research, consider the free version of Perplexity.
DeepSeek V3
- Pro: Free
- Pro: Open source
- Pro: Efficient resource usage
- Con: Weak general knowledge
- Con: Small ecosystem
- Con: Limited integrations
- Price: Free for chatbot, fees for API
- LLM: DeepSeek MoE
- Desktop browser interface: Yes
- Dedicated Mac app: No
- Dedicated Windows app: No
- Multi-factor authentication: No
- Tests passed: 3 of 4
While DeepSeek R1 is the new reasoning hotness from China that has all the pundits punditing, the real power right now (at least according to our tests) is DeepSeek V3. This chatbot passed almost all of our coding tests, doing as well as the (now mostly discontinued) ChatGPT 3.5.
Also: I tested DeepSeek’s R1 and V3 coding skills – and we’re not all doomed (yet)
Where DeepSeek V3 fell down was in its knowledge of somewhat more obscure programming environments. Still, it beat Google’s Gemini, Microsoft’s Copilot, and Meta’s Meta AI, which is quite an accomplishment. We’ll be keeping a close watch on each DeepSeek model, so stay tuned.
Chatbots to avoid for programming help
I tested 13 LLMs, and nine passed most of my tests this time around. The other chatbots, including a few pitched as great for programming, only passed one of my tests.
Also: The five biggest mistakes people make when prompting an AI
I’m mentioning them here because people will ask, and I did test them thoroughly. Some of these bots are fine for other work, so I’ll point you to their general reviews if you’re curious about their functionality.
DeepSeek R1
Unlike DeepSeek V3, the advanced reasoning version, DeepSeek R1, didn’t showcase its reasoning capabilities in our programming tests. Oddly, the new failure area was one that’s not all that hard, even for a basic AI: the regular expression code for our string function test.
Also: I tested DeepSeek’s R1 and V3 coding skills – and we’re not all doomed (yet)
But that’s why we’re running these real-world tests. It’s never clear where an AI will hallucinate or just plain fail, and before you go believing all the hype about DeepSeek R1 taking the crown away from ChatGPT, run some programming tests. So far, while I’m impressed with the much-reduced resource utilization and the open-source nature of the product, its coding quality output is inconsistent.
GitHub Copilot
GitHub’s Copilot integrates quite seamlessly with VS Code. The AI makes asking for coding help quick and productive, especially when working in context. That’s why it’s so disappointing that the code the AI outputs is often very wrong.
Also: I put GitHub Copilot’s AI to the test – and it just might be terrible at writing code
I can’t, in good conscience, recommend you use the GitHub Copilot extensions for VS Code. I’m concerned that the temptation will be too great to insert blocks of code without sufficient testing, and that GitHub Copilot’s produced code is not ready for production use. Try again next year.
Claude 4 Opus
In a completely baffling turn of events, the paid-for version of the Claude 4 model, Opus, failed half of my tests. What makes this result baffling is that the free version, Claude 4 Sonnet, passed all of them. I don’t know what to say other than AI can be weird.
Also: Anthropic’s free Claude 4 Sonnet aced my coding tests – but its paid Opus model somehow didn’t
Meta AI
Meta AI is Facebook’s general-purpose AI. As you can see above, it failed three of our four tests.
Also: 15 ways AI saved me time at work in 2024 – and how I plan to use it in 2025
The AI generated a nice user interface, but with zero functionality. It also found my annoying bug, which is a fairly serious challenge. Given the specific knowledge required to find the bug, I was surprised that the AI choked on a simple regular expression challenge. But it did.
Meta Code Llama
Meta Code Llama is Facebook’s AI explicitly designed for coding help. It’s something you can download and install on your server. I tested the AI running on a Hugging Face AI instance.
Also: Can Meta AI code? I tested it against Llama, Gemini, and ChatGPT – it wasn’t even close
Weirdly, even though both Meta AI and Meta Code Llama choked on three of four of my tests, they choked on different problems. AIs can’t be counted on to give the same answer twice, but this result was a surprise. We’ll see if that changes over time.
But I like [insert name here]. Does this mean I have to use a different chatbot?
Probably not. I’ve limited my tests to day-to-day programming tasks. None of the bots has been asked to talk like a pirate, write prose, or draw a picture. In the same way we use different productivity tools to accomplish specific tasks, feel free to choose the AI that helps you complete the task at hand.
The only issue is if you’re on a budget and are paying for a pro version. Then, find the AI that does most of what you want, so you don’t have to pay for too many AI add-ons.
It’s only a matter of time
The results of my tests were pretty surprising, especially given the significant improvements by Microsoft and Google. However, this area of innovation is improving at warp speed, so we’ll be back with updated tests and results over time. Stay tuned.
Have you used any of these AI chatbots for programming? What has your experience been? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.