I pitted Claude 3.5 Sonnet against AI coding tests ChatGPT aced – and it failed creatively

David Gewirtz/ZDNET

Last week, I received an email from Anthropic saying that Claude 3.5 Sonnet was available. According to the AI company, “Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations.”

The company added: “Claude 3.5 Sonnet is ideal for complex tasks like code generation.” I decided to see if that was true.

Also: How to use ChatGPT to create an app

I’ll subject the new Claude 3.5 Sonnet model to my standard set of coding tests, which I’ve run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot’s coding ability – and you can too, which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.

OK, let’s dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.

1. Writing a WordPress plugin

At first, this seemed to have so much promise. Let’s start with the user interface Claude 3.5 Sonnet created based on my test prompt.

[Screenshot: the plugin’s user interface as generated by Claude 3.5 Sonnet. Screenshot by David Gewirtz/ZDNET]

This is the first time an AI has decided to put the two data fields side-by-side. The layout is clean and looks great.

Claude also decided to do something else I’ve never seen an AI do. This plugin can be created using just PHP code, which is the code running on the back end of a WordPress server.

But some AI implementations have also added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).

Also: How I test an AI chatbot’s coding ability – and you can too

In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that’s a feature of PHP), or you can put the code in three separate files: one for PHP, one for JavaScript, and one for CSS.

Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.

But Claude just provided one PHP file and then, when it ran, auto-generated the JavaScript and CSS files into the plugin’s home directory. This is both fairly impressive and somewhat wrong-headed. It’s cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on the settings of the OS configuration, and there’s a very high chance it could fail.

I allowed it in my testing environment, but I would never allow a plugin to rewrite its own code in a production environment. That’s a very serious security flaw.

Also: How to use ChatGPT to write code: What it can and can’t do for you

Despite the fairly creative nature of Claude’s code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That’s sad because, as I said, it had so much promise.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Interface: good, functionality: fail
  • ChatGPT GPT-4o: Interface: good, functionality: good
  • Microsoft Copilot: Interface: adequate, functionality: fail
  • Meta AI: Interface: adequate, functionality: fail
  • Meta Code Llama: Complete failure
  • Google Gemini Advanced: Interface: good, functionality: fail
  • ChatGPT 4: Interface: good, functionality: good
  • ChatGPT 3.5: Interface: good, functionality: good

2. Rewriting a string function

This test is designed to evaluate how well the AI rewrites code to better fit a given need, in this case dollars and cents conversions.

The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like “000123” are treated as “123”. It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it’s smart enough to return “0” for any weird or unexpected input, which prevents the code from abnormally ending in an error.

Also: Can AI detectors save us from ChatGPT? I tried 6 online tools to find out

One failure is that it won’t allow decimal values alone to be entered. So if the user entered 50 cents as “.50” instead of “0.50”, it would fail the entry. Based on how the original text description for the test is written, it should have allowed this input form.

Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users would not be able to enter inputs that contained only values for cents.
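To make the requirement concrete, here is a minimal Python sketch of the behavior described above. It is not the code from the test or from any of the AIs; the function name `normalize_amount` and the regular expression are assumptions chosen for illustration. It strips leading zeros, accepts up to two decimal places, rejects negative values, returns “0” for unexpected input, and accepts a cents-only entry like “.50”.

```python
import re

# Hypothetical pattern (an assumption, not the original test code):
# optional whole-dollar digits, then an optional decimal point followed
# by at most two cent digits. Signs, letters, and a third decimal digit
# all fail to match.
_AMOUNT = re.compile(r"(\d*)(?:\.(\d{0,2}))?")


def normalize_amount(text: str) -> str:
    """Return a cleaned dollars-and-cents string, or "0" for invalid input."""
    match = _AMOUNT.fullmatch(text.strip())
    if match is None:
        return "0"  # negative numbers, letters, too many decimals, etc.

    dollars = match.group(1)
    cents = match.group(2) or ""
    if not dollars and not cents:
        return "0"  # empty string or a lone "."

    # Drop leading zeros but keep a single "0" so ".50" becomes "0.50".
    dollars = dollars.lstrip("0") or "0"
    return f"{dollars}.{cents}" if cents else dollars


if __name__ == "__main__":
    for sample in ["000123", ".50", "0.50", "12.345", "-5", "abc"]:
        print(repr(sample), "->", normalize_amount(sample))
```

In this sketch, “000123” comes back as “123” and “.50” comes back as “0.50”, while “12.345”, “-5”, and “abc” all fall through to “0”.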

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Succeeded
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

3. Finding an annoying bug

The big challenge of this test is that the AI is tasked with finding a bug that’s not obvious and that can only be solved correctly with knowledge of the WordPress platform. It’s also a bug I did not immediately see on my own and, initially, asked ChatGPT to solve (which it did).

Claude not only got this right, catching the subtlety of the error and correcting it, but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).

Also: Fake reviews are a big problem – and here’s how AI could help fix it

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Succeeded
  • ChatGPT GPT-4o: Succeeded
  • Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
  • Meta AI: Succeeded
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Failed
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Succeeded

So far, we’re at two out of three fails. Let’s move on to our last test.

4. Writing a script

This test is designed to see how far the AI’s programming knowledge goes into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it’s just one of many such apps on the Mac.

However, when testing in ChatGPT, ChatGPT knew how to “speak” Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.

Also: From AI trainers to ethicists: AI may obsolete some jobs but generate new ones

Unfortunately, Claude does not have that knowledge. It did write an AppleScript that attempted to speak to Chrome (that’s part of the test parameter), but it ignored the essential Keyboard Maestro component.

Worse, it generated code in AppleScript that would generate a runtime error. In an attempt to ignore case for the match in the test, Claude generated the line:

if theTab's title contains input ignoring case then

This is pretty much a double error because the “contains” statement is case insensitive and the phrase “ignoring case” does not belong where it was placed. It caused the script to error out with an “Ignoring can’t go after this” syntax error message.

Here are the aggregate results of this and previous tests:

  • Claude 3.5 Sonnet: Failed
  • ChatGPT GPT-4o: Succeeded but with reservations
  • Microsoft Copilot: Failed
  • Meta AI: Failed
  • Meta Code Llama: Failed
  • Google Gemini Advanced: Succeeded
  • ChatGPT 4: Succeeded
  • ChatGPT 3.5: Failed

Overall results

Here are the overall results of the four tests:
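The per-test lists above can be tallied with a short Python sketch. The pass/fail values below are transcribed from those lists under two assumptions that are mine rather than the article’s: for the WordPress plugin test, only “functionality: good” counts as a pass, and “Succeeded but with reservations” counts as a pass.

```python
# Pass/fail per test, in order: WordPress plugin, string function,
# bug hunt, script. Values are transcribed from the per-test lists;
# the scoring rules are assumptions stated in the lead-in.
RESULTS = {
    "Claude 3.5 Sonnet":      [False, False, True,  False],
    "ChatGPT GPT-4o":         [True,  True,  True,  True],
    "Microsoft Copilot":      [False, False, False, False],
    "Meta AI":                [False, False, True,  False],
    "Meta Code Llama":        [False, True,  False, False],
    "Google Gemini Advanced": [False, False, False, True],
    "ChatGPT 4":              [True,  True,  True,  True],
    "ChatGPT 3.5":            [True,  True,  True,  False],
}


def tally(results: dict[str, list[bool]]) -> dict[str, int]:
    """Count how many tests each AI passed."""
    return {name: sum(passed) for name, passed in results.items()}


if __name__ == "__main__":
    scores = tally(RESULTS)
    # Highest score first; ties keep the order used in the lists.
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score} of {len(RESULTS[name])} passed")
```

Under those assumptions, Claude 3.5 Sonnet passes one of the four tests, while ChatGPT GPT-4o and ChatGPT 4 pass all four.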

I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It’s not that it can’t program. It just can’t program correctly.

Also: I used ChatGPT to write the same routine in 12 top programming languages. Here’s how it did

I keep looking for an AI that can best the ChatGPT solutions, especially as platform and programming environment vendors start to integrate these other models directly into the programming process. But, for now, I’m going back to ChatGPT when I need programming help, and that’s my advice to you as well.

Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.