Platforms and Prejudice
With so many different AI platforms available, a common question is: which one should I use? Well, I don’t mean to be evasive with this answer, but...
It depends.
There are pros and cons to different platform options and unlike Tolkien's Middle Earth, there is no “one ring,” or in this case, AI option to rule them all. Of course, just as with favorite authors or literary genres, users may have platform preferences that serve as gravitational forces in their AI selections. But as we embark on this comparative journey, it’s worth remembering that our biases can color our perceptions and may lead us astray in selecting the right tool. So, in the spirit of unbiased examination, we’re going to evaluate the different platforms, see where they stack up in terms of abilities and alignment, and also look at a tool you can use to help evaluate differences between AI resources.
Quite literally, there are so many platforms and tools available that if we were to go through everything, well - this post would never get finished and the information would be overwhelming. For the sake of simplification, we’re going to focus on the conversational interfaces known as generative AI platforms. These are what most people are using as their AI resource for personal and/or professional use - a place where they can submit a prompt and get a generative output as a response. This will include ChatGPT, Claude, Gemini, Perplexity, and Grok. There are of course more options than these on the market, but the platforms we’re exploring are the most mainstream as of the writing of this post. If you’re unfamiliar with any of these names, don’t worry, I have a list of high-level introductions below.
One of the most well-known platform options, think of ChatGPT as the Swiss Army knife of AI. It is the generalist option in the bunch and can be used for a variety of purposes including brainstorming, writing, coding, and even research. It should be noted that in terms of responses, there is a specific research mode and ChatGPT doesn’t tend to go deep by default. It is optimized for compute efficiency, and sometimes you have to work your prompts to push it if detail is desired. Its strengths, however, include conversational adeptness, versatility, and a user-friendly interface. If you want an AI that can do a bit of everything, ChatGPT is a solid bet.
If AIs were people, Claude would be the analyst in this bunch. It excels at nuanced writing, document analysis, and code explanation. Many users have reported that Claude’s responses feel more “human” compared to other AI platforms. Claude also does particularly well with longer prompts due to its large context window. The company behind Claude, Anthropic, matches its mission to what it has created, prioritizing safety and reliability in how this AI platform is deployed.
Gemini (formerly Bard) is Google’s answer to the AI assistant arms race. It’s deeply integrated with Google’s ecosystem, with ties into Gmail and other Google-driven workspaces. Gemini shines at summarization, quick research, and multimedia tasks such as image generation. Google has positioned Gemini for efficiency and integration throughout its suite of solutions. There are also several pre-built tools to streamline your usage, such as “career guide” and “learning coach,” referred to as “Gems” - a topic we talked about in our Agent Creation post and something that is common, although deployed in slightly different ways, across AI platforms.
Perplexity is your research partner. It’s less about conversation and more about delivering fast, cited answers to your questions. Perplexity continuously indexes the web, providing access to current information, including live updates on events like sports scores, elections, and breaking news. If you need up-to-date information, transparent sourcing, and a tool less prone to hallucinating facts, Perplexity is a solid option.
I feel like the nerdy and sometimes conspiracy-theory inclined men in my life are the ones that tend to gravitate to Grok. They often cite this platform as “matter-of-fact” and reasonably reliable in the information provided with outputs. Grok pulls real-time data from X (formerly Twitter) and other live sources. It’s a good resource if your goal is to stay on top of current events, follow social trends, and get insights on what may be happening right now. Grok is described as having a somewhat sarcastic conversational style (especially when compared to the overly optimistic ChatGPT) and can be a refreshing alternative to the tone set on other platforms.
A quick comparison of LLMs according to use case. This is what we will be exploring in detail for the rest of this post.
This is an elaboration on my “it depends” statement from earlier, because the “best” AI is always going to depend on the context in which that designation is made. We’ll break this up by the categories of tasks these platforms are commonly prompted to address.
To be able to draw definitive evaluations, of course we have to test. So, for our first category of writing and editing, I went a bit meta in terms of use case and tested the following prompt across our different platforms to compare results:
“Write me a blog post comparing the writing and editing abilities of different AI platforms. Break this out by platform name and write a paragraph for: Claude, ChatGPT, Gemini, Perplexity, and Grok. Make the tone smart and witty with a technical edge.”
Admittedly, I was also curious if each of the AI platform options would try to disparage each other as a way of claiming to be the best. “How petty can AI be?” was a sub question to this exploration.
ChatGPT: Understood the assignment and provided a good initial draft. The content was, admittedly, a bit generic. This tends to be common feedback for ChatGPT: it gives you what you ask for but doesn’t go too far beyond that. The output was well-structured, though, and had some great sentences. Some of my favorites included:
“Claude will gently but firmly escort your sentences out the back door and replace them with something neat, concise, and slightly vanilla.”
This was in reference to editing and amusing.
“ChatGPT is flexible, conversational, and great at tailoring tone—sometimes to the point of being too agreeable, like it’s auditioning for the role of ‘world’s friendliest copyeditor.’”
I was impressed it owned this “weakness.”
“Gemini is your coworker who knows a lot, cites too much, and occasionally forgets the original question while pulling in trivia from three academic journals and a footnote.”
This was about as much shade as I could find in what the AIs threw at each other.
ChatGPT also provided what I thought was the most concise summary across outputs:
“In short: Claude will keep your sentences clean, ChatGPT will keep them flexible, Gemini will keep them factual, Perplexity will keep them sourced, and Grok will keep them entertaining. Editing is a balance of precision, style, and intent—so the best AI “editor” depends less on the platform and more on whether you want your prose to impress a professor, charm a crowd, or simply make sense.”
Claude: I feel like Claude was the Shakespeare of our contenders. Output from this platform bordered on flowery and could be accurately described as verbose.
“In the brave new world of artificial intelligence, the pen may not be mightier than the sword, but the large language model is certainly mightier than the typewriter.”
You might say I asked for this, but all the platforms had the same prompt and this was the AI that got all literary on me. Still, the description managed to make some very clear impressions:
“Claude approaches writing like a Victorian gentleman at a dinner party—measured, articulate, and surprisingly good at reading the room.”
“ChatGPT - OpenAI's flagship model is the golden retriever of AI writing—enthusiastic, broadly capable, and occasionally prone to over-explaining why sticks are fascinating.”
“Elon Musk's Grok enters the writing arena with the swagger of a stand-up comedian who just discovered ChatGPT and decided they could do it better.”
I also noticed that Claude tried to steal ChatGPT’s “Swiss Army knife” description as its own, proclaiming itself to be the all-purpose AI platform for writing-related tasks. This is an attribute every other AI gave to ChatGPT and one that no other platform assigned to Claude.
Gemini: Reasonably creative response with a fun title of: “AI Throwdown: Sizing Up the Literary Lions (and Maybe Some Kittens).” It kept with the theme through the rest of the response, referring to the AI landscape as “a veritable Serengeti of silicon minds.” The analysis was overall positive, focusing on the strengths of the different platforms. For example:
Perplexity was described as articulate and insightful, capable of handling complex topics with a nuanced approach.
ChatGPT’s description included acknowledgement for being general but went on to highlight “formidable editing capabilities.” Gemini said to think of ChatGPT as “your endlessly patient and remarkably skilled writing assistant.”
Gemini was very proud of its own abilities. “Google’s multifaceted AI brings its vast knowledge graph to the writing table.”
With Perplexity, facts and supporting evidence were emphasized as strengths. Grok, by comparison, was called a “wild card.” Even this maintained an optimistic spin: “while its editing capabilities are still evolving, Grok shows promise in its ability to identify and rephrase text with a more conversational and engaging tone.”
Perplexity: Factual, but with some personality in this result. I particularly liked the heading Perplexity assigned to the topic: “Battle of the AI Wordsmiths: Who Edits, Who Embellishes, and Who Just Eats Your Prompts for Breakfast.” This platform did a good job of breaking down both pros and cons for every platform, offering what felt like a balanced review of their writing and editing capabilities. For example, its evaluation of itself:
“Perplexity plays the role of the “informed colleague who always has the receipts.” Its writing style is grounded and citation-friendly, making it fantastic for weaving authoritative, reference-backed content. It’s more journalistic than poetic, excelling at pulling threads together with confidence. When editing, it focuses on factual integrity and logical structure, sometimes less concerned with stylistic sparkle. Think of it as editing with a raised eyebrow that says: “Are we sure this is true?”—and then quietly dropping footnotes to prove the point.”
I found this output to be enjoyable without going overboard and factual in tone.
Grok: Delivered on its reputation of providing a witty review in the output generated. Similar to Claude, the prose was slightly poetic but in a way that I would describe as more stylized than Shakespeare. For example:
Claude: “Editing? It's a beast, dissecting bloated paragraphs into lean, logical structures while preserving voice—think of it as a digital Hemingway, trimming the fat without losing the flavor. Technically, its 500K+ token context window excels in long-form editing, maintaining coherence across sprawling documents where lesser models fragment."
ChatGPT: “OpenAI's ChatGPT, the ubiquitous multitasker powered by GPT-4.5's massive parameter count and reinforcement learning from human feedback (RLHF), approaches writing like a Swiss Army knife: versatile, but sometimes overly eager to unfold every blade."
Or itself: “xAI's Grok, infused with Elon Musk's irreverent DNA and real-time X data access, writes like a rogue algorithm unchained from safety rails—witty, uncensored, and brimming with personality that borders on chaotic.”
I appreciated that it included more technical elements - such as giving an actual number to Claude’s token context window or referencing the concept of “reinforcement learning from human feedback” also in acronym form. With the recent trend of using AI for a “roast,” I feel like Grok might be a particularly amusing platform for that creative use case.
A few observations stood out in this testing.
First, I was a bit disappointed there wasn’t more snark against other AI platforms. All our contenders played pretty fair in their written analysis. With different prompting, I could probably have gotten more contentious results, but since the goal was to test default outputs, pushing for that would have skewed the intention of this experiment.
There was also general congruence across evaluations. Every version, regardless of platform, tended to refer to ChatGPT as a Swiss Army knife, Grok as a wild card, and Perplexity as research-driven.
Gemini and Grok, interestingly, were the only two platforms that did not provide a summary with their analysis - which technically follows instructions but also doesn’t step outside the metaphorical box in terms of creative output. Point for or against? There could be an argument either way.
Let’s be honest: most AIs are people-pleasers at heart (or, well, at algorithm). They want to give you a satisfying answer as quickly and efficiently as possible. But “quick and efficient” isn’t always what you want when you’re knee-deep in a research project that demands nuance, citations, and a little intellectual elbow grease. Deep research is expensive - computationally and, by extension, financially. Given this, deep research capabilities are not standard, and platforms will default to surface-level responses unless you nudge them for more.
To address this, several LLMs have rolled out a “deep research mode” to support the use case of asking for depth in the generated response. This mode is designed to go beyond the quick answer, synthesizing information from multiple sources, and (ideally) delivering something closer to a research assistant’s report than a chatbot’s quip.
The Prompt:
To test how each platform handles deep research, I used a single, detailed prompt:
“Write a comprehensive research paper comparing the AI platforms Claude (Anthropic), ChatGPT (OpenAI), Gemini (Google DeepMind), Perplexity (Perplexity.ai), and Grok (xAI). The paper should be written in a clear, academic style with well-structured sections. Please include:
Introduction – context of large language models (LLMs) and why comparing these platforms matters.
Development Background – who developed each platform, their release history, and technical underpinnings (e.g., training approaches, architectures, model sizes where available).
Alignment Approaches – how each company approaches safety, reinforcement learning, constitutional AI, and other alignment strategies.
Capabilities & Strengths – what each model does especially well (e.g., creativity, fact-checking, humor, tone adjustment).
Limitations & Weaknesses – known shortcomings, biases, or tradeoffs in design. Include security settings available to the end-user around data and prompt history.
Comparative Analysis – side-by-side comparison of use cases (e.g., writing/editing, reasoning, factual accuracy, handling humor).
Conclusion – insights into the overall landscape of LLM competition, and how these differences affect end users.
Where possible, cite credible sources (research papers, blog posts from the companies, benchmarks, independent evaluations).”
I’ll admit: I didn’t write this prompt from scratch. I started with a simple AI-generated draft and iterated, layering in specifics to force the models to dig deeper. The result? A prompt that’s as much a test of the AI’s research stamina as its ability to organize and present information.
Speed & User Experience
Here’s how the platforms stacked up on speed and first impressions:
Perplexity and Claude were the fastest, likely because I was using their free versions without a dedicated “deep research” toggle.
Grok clocked in at 1 minute 39 seconds—respectable, but not instant.
Gemini took a different approach, offering a research plan before prompting me to “start research.”
ChatGPT asked clarifying questions, which was both helpful and a little meta—AI, asking how I want my AI research paper structured.
Usability notes:
Perplexity, ChatGPT, and Claude offered PDF downloads right up front. Grok hid this under a menu (three dots, of course). Gemini offered a DOC version of the generated report (opened in Google Docs).
Claude provided “references,” but these were more like placeholders than actual citations.
If you’re judging by length alone (and controlling for content bloat), the results varied widely.
Platform by Platform Observations
ChatGPT: ChatGPT’s approach to the research prompt was methodical, reasonable in length (8 pages) and impressively thorough. It also took the time to ask me questions about the focus of my research to ensure the response was relevant and aligned. Each LLM received its own focused paragraph, with ChatGPT highlighting key technical factors such as training recency or whether the model had access to the latest data. These included details on token window sizes and architectural elements like security and data retention. I appreciated the inclusion of in-line, clickable citations, which made it easy to validate information and dig deeper where needed.
A few interesting observations:
ChatGPT described Perplexity as a “search engine,” a characterization that both fits and slightly undersells Perplexity’s RAG (Retrieval-Augmented Generation) framework.
The report was notably complimentary of Gemini’s coding capabilities, which, at this point in writing the article, I hadn’t yet tested and was increasingly curious to assess myself.
The summary section on limitations and weaknesses was well-structured, providing a clear, balanced view.
ChatGPT was also the only platform to point out that, among the LLMs assessed, Perplexity is the only one not currently multimodal, a subtle but relevant observation that stood out.
For a bit of fun, I asked ChatGPT to review its own research and recommend the “best” LLM. It hedged, emphasizing that the right choice depends on the task, but also frequently recommended itself as a strong all-around option. Overall, the report struck a good balance between detail and readability.
Claude: Claude’s report was longer, 17 pages, but not especially dynamic. Most of the content was bulleted and concise, with only one table at the end. The summary breakdown was decent, but the organization (all strengths first, then all weaknesses) made it hard to compare platforms side by side. I found myself flipping back and forth, wishing for a more integrated approach.
A few quirks stood out:
Claude openly admitted to sometimes refusing reasonable requests due to safety concerns, and occasionally providing verbose explanations for simple refusals.
It also flagged its own Western-centric viewpoint in cultural discussions—a rare bit of self-awareness from an LLM. You can insert your own sentient AI joke here.
No specific pricing information (just generalities), and the reference section was weak - a basic list, no URLs or named sources.
The overall tone was high-level, with less detail and elaboration than I’d hoped for.
Claude did manage to highlight the growing specialization among platforms: Perplexity for research, Claude for safety and reasoning, ChatGPT for conversational flexibility, Gemini for multimodal tasks, and Grok for real-time awareness and humor.
Gemini: Can I just say I’m annoyed at how many of my “hey google” use cases are now in my AI chat history? A fair amount of time passed between when I started this article and when I finished it. In that time, my prompt history became cluttered, making it difficult to find the thread with my research. Also, I am the type of person that sees a lot of information as “noise.” I try to operate on an inbox-zero sort of principle and only want to keep chats I see as relevant, so this gives me a whole task and a half of declutter work. Which, I’m realizing as I write this, is not actually relevant to our assessment of Gemini’s research capabilities, so let’s get back on topic.
Gemini’s research output was, hands down, the longest (20 pages) and arguably the best of all the reports. Side note, in case this wasn't already evident: I did a lot of reading for this section. The report was impressively well-researched, even including financial and growth projections for the AI market. Gemini also offered the most thorough background on LLMs of the compared reports. I was amused to see how it didn’t shy away from highlighting its own competitive differences:
“Gemini's core architecture represents a significant departure from traditional models, featuring a hybrid Mixture-of-Experts (MoE) design with a ‘Chain-of-Thought Verifier’. This verifier model is designed to refine and critique outputs, reportedly reducing hallucinations by 67% compared to previous versions.”
Bold claim, and clear that Gemini has a healthy amount of confidence.
The report offered some genuinely interesting insights on Perplexity, such as: “Its Pro subscription tier provides access to advanced models such as GPT-5, Claude 4.0, and Gemini Pro 2.5, in addition to its own models like Sonar and R1.” This was news to me; I had no idea this was part of how Perplexity's underlying model selection is structured. Gemini also noted that Perplexity is missing an SSL certificate, among other security gaps. This, however, wasn’t cited, so I couldn’t fact-check it, and since I wanted to stay focused on my comparison, that verification will have to wait.
Citations throughout the report were given as in-line numbers, corresponding to a reference list with links, which made it easy to verify sources. Visually, Gemini’s report stood out for its use of several tables as graphical elements, including a particularly helpful chart breaking down LLMs, model families, architecture, context windows, and other core functionality.
That said, the report wasn’t without controversy. It included some strong statements, such as referencing “offensive and antisemitic content” in relation to Grok. Overall, I felt the tone leaned heavily in favor of Gemini. This was pronounced in lines like, “In a maturing market, multimodality has become a standard feature. Gemini is the most advanced in this domain, with native support for 10 data types and the ability to process interleaved audio, images, text, and video.” And just in case you missed the self-congratulatory undertone, the report closes with this: “A model that is designed for compliance and data privacy, such as Gemini, will be favored over one with a history of vulnerabilities or a controversial output history.” Subtle, right? If you want a comprehensive, if slightly self-promoting, deep dive into the LLM landscape, Gemini delivers.
Perplexity: Perplexity’s report landed at 8 pages—short, but not shallow. The introduction was clear, and the background section balanced bullets with narrative. The alignment approach table was helpful, though it focused more on technical attributes than areas of expertise or alignment philosophy. Like Claude, Perplexity organized strengths and weaknesses separately, which made direct comparison tricky. (Note to self: next time, try a prompt that requests strengths and weaknesses together for each platform.)
A few highlights:
Perplexity noted that ChatGPT “outperforms most humans in humor production tasks”—can we get a comic standoff, please?
It explained that Perplexity combines results from multiple models for “best-of-breed” performance, but I wanted more detail on how this works.
Inline citations were present throughout, and the limitations/weaknesses were presented in a table. By contrast, strengths were in paragraph form, an interesting and inconsistent choice.
The comparative analysis table later in the report broke out use cases by model, and the summary at the end was both helpful and relevant.
Perplexity also provided a full list of numbered links to citations, corresponding to areas in the document—a small but meaningful touch for anyone who actually wants to check sources.
Grok: Grok’s report was concise (7 pages) but packed with detail. The platform summaries were more narrative than Claude’s bullets, and the comparative analysis chart in the center was a nice touch. Although, looking at this critically, the benchmarks weren’t well-labeled, leaving me guessing about what the numbers actually meant.
Citations included links (a plus), and there were enough of them (actually even more than in the Claude report, for comparison) to make the research feel, at the very least, reasonably thorough. Grok’s quality-over-quantity approach worked well here.
General Observations
In evaluating the outputs, a few patterns emerged:
Specialization vs. Generalization: While all platforms aim for general-purpose capability, clear specializations have emerged. Perplexity excels at research and current information, Claude at safety and reasoning, ChatGPT at conversational flexibility, Gemini at multimodal tasks, and Grok at real-time awareness and humor.
Knowledge Cutoff: This is a limitation across all platforms, but Perplexity and Grok are more likely to surface recent information.
References & Citations: Perplexity and Grok did best here, with inline citations and full lists of links. Claude’s references were more like a template than actual sources.
Report Structure: Most platforms defaulted to organizing strengths and weaknesses separately, which isn’t ideal for direct comparison. Happy to call "user error" on this. Next time, I’ll adjust my prompt for a side-by-side breakdown.
This was the easiest and least involved of all the tests run for this post. Keeping things fair, I used the same prompt across each platform:
“Come up with a metaphor for LLM comparisons. I'm evaluating ChatGPT Claude Anthropic Perplexity Gemini Grok across the following categories: writing & editing, research, Multimodal & Creative Features and coding. Then make an image based on this metaphor.”
Because this particular area is as much a test of creativity as it is of output, I wanted to challenge the LLMs a bit with what I was prompting them to do. And the results were both worthwhile and interesting. Since a picture is worth 1,000 words, and in the interest of keeping this review within readable parameters, I will let these pictures speak for themselves.
ChatGPT: Took the time to explain its reasoning with the metaphor part of the prompt:
Comparing LLMs is like evaluating different musical instruments in an orchestra. Each one plays beautifully, but some excel in melody (writing & editing), others in harmony (research), some in improvisation and tone (multimodal & creative features), and others in technical precision (coding). Together, they create different styles of music, but each shines in its own way.
But as you can see with the image, there was a bit of a disconnect between the explanation given and the image generated.
*It should be noted that ChatGPT supports video generation through Sora.
Claude: Initially made a whole workshop based on that prompt.
"I love this challenge! Let me create a metaphor that captures the essence of comparing these AI models.
The Renaissance Guild of Digital Artisans
Picture a grand Renaissance workshop where each LLM is a master craftsperson, each with their unique style and strengths. The workshop has four main stations:
The Scriptorium (Writing & Editing) - Elegant desks with quills and manuscripts
The Great Library (Research) - Towering shelves of knowledge and ancient texts
The Atelier (Multimodal & Creative Features) - Art supplies, canvases, and creative tools
The Forge (Coding) - Mechanical contraptions, gears, and building tools
Each artisan moves between these stations with different levels of mastery - some excel at the written word, others at forging complex mechanical solutions, some at painting vivid creations, and others at diving deep into the archives of knowledge."
This was not what I asked for, but there are possible PowerPoint presentation applications, perhaps? I had to redirect, and when I did, I watched Claude write the code to create the image I requested. The result was admittedly a bit disappointing, and I feel inclined to note that I’ve generally gotten better images out of Claude than what was yielded with this test.
Gemini: Also offers video generation, which is yet another technical wormhole to explore. Note - Gemini does watermark generated images.
Perplexity: So much for ChatGPT’s research paper suggesting this platform does not support multimodal capabilities. Image generation, at least on Perplexity, appears to work just fine.
Grok: The teacher’s pet of the platforms, Grok made two images instead of one and offered the option to turn the second image into a video, which upon generation also included sound. Grok also adds watermarks to generated images.
There is so much to test here, I’m afraid this high-level analysis won’t do this particular area justice. Coding is a nuanced and multifaceted use case that includes code generation, debugging, and vibe-coding, as discussed in another blog post. That said, we’ll give it a high-level review, and in keeping with the meta theme of this article, this is our prompt for testing:
Design a function or application that takes a specific use case as input (for example, “summarizing legal documents,” “generating Python code from natural language,” or “translating medical texts”) and outputs:
A recommended Large Language Model (LLM) for the use case (e.g., GPT-4, Llama 2, Claude, etc.).
A clear explanation of why this LLM is the best fit for the given use case, referencing relevant model characteristics such as training data, performance benchmarks, domain specialization, or cost-effectiveness.
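To make the ask concrete, here is a minimal sketch in Python of the kind of function this prompt describes. The use-case keys, model names, and rationales below are illustrative placeholders of my own, not recommendations any of the platforms actually produced.

# Hypothetical sketch: map a use case to a recommended LLM plus a short rationale.
# The entries below are placeholders for illustration, not benchmark-backed picks.
RECOMMENDATIONS = {
    "summarizing legal documents": (
        "Claude",
        "Large context window suits long contracts and structured summarization.",
    ),
    "generating python code from natural language": (
        "ChatGPT",
        "Generalist coding support with quick in-chat prototyping.",
    ),
    "translating medical texts": (
        "Gemini",
        "Broad multilingual coverage and document-workflow integration.",
    ),
}

def recommend_llm(use_case: str) -> tuple[str, str]:
    """Return (model, rationale) for a given use case, with a generic fallback."""
    key = use_case.strip().lower()
    return RECOMMENDATIONS.get(
        key,
        ("ChatGPT", "Generalist fallback when no specialized match is found."),
    )

if __name__ == "__main__":
    model, why = recommend_llm("Summarizing legal documents")
    print(f"Recommended: {model} - {why}")

Each platform, of course, fleshed this idea out in its own way; the sketch just shows the input-to-recommendation shape the prompt was testing for.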
ChatGPT: ChatGPT required three prompts to get to my desired result. This, however, isn’t necessarily a fault and is more a reflection of how we work with chat interfaces and LLMs in the most optimal way. Essentially, ChatGPT forced me to “chunk” the task, or break it into parts - which is a good strategy when working with AI. The result was functional and practical. Not too much, but also not very visually dynamic. I was able to preview the app directly in the ChatGPT interface and it mostly worked. Some of the LLMs it recommended, however, are not widely available for public use, so some refinement would be recommended. There was also a bit of redundancy in the input fields, and while I could create my own use-case options, those new options weren’t saved for future, repeated use. Overall, ChatGPT’s coding support is solid for prototyping and quick builds, but there’s still room for improvement in both memory and user experience for more complex or recurring workflows.
Claude: Initially, I got an error, which happens with LLMs across the board sometimes. Once I re-ran the prompt, though, I enjoyed watching the process. Claude coded at a pace that was hard to keep up with, and the result was my favorite of the bunch. Visually, it was a dynamic first iteration. Claude also included a helpful explanation of the key features incorporated into the build in the left-hand panel. Testing the application showed a good range of LLM recommendations in what was suggested, but, similar to ChatGPT, some refinement of the LLM parameters could be implemented.
Gemini: Maybe I had too high of expectations after all of ChatGPT’s compliments to Gemini’s coding ability in the research paper, but it wasn’t as impressive as I was expecting. It was decent and incorporated color and visual elements into the initial design. I also appreciate that Gemini has a great in-platform visualizer that allows you to interact with what the generated code creates. On my first attempt, I was able to get a working prototype. In it, I tested four different use cases - research, editing, analysis, and news updates. With each scenario I gave the tool, Gemini consistently recommended options within its own family of models, which was, objectively, a bit insular. Overall, I would say coding with Gemini is streamlined and the platform is adequately supportive.
Perplexity: Perplexity’s coding workflow was straightforward, but not particularly remarkable. It took three prompts to get what I needed—the first response extrapolated from my use cases and gave a general overview, rather than actual code. On the second prompt, when I explicitly asked for code, Perplexity delivered a Python script as requested. However, unlike some of the other platforms, Perplexity doesn’t offer any in-app preview or interactive environment, so there was no way to quickly test or visualize the output within the platform itself. In the interest of time, my assessment stopped there. While it’s functional for generating code snippets, the lack of a preview feature and the need for multiple clarifications make it less efficient for rapid prototyping or iterative development. If you’re looking for a quick, end-to-end coding experience, Perplexity feels more like a reference tool than a hands-on assistant.
Grok: Grok’s coding capabilities were, frankly, a bit of a mixed bag for me. It provided a brief explanation of what it generated, which I appreciated, and even made a point to mention the recency of the data or methods built into its output—a nice touch for anyone concerned about working with up-to-date information. However, I couldn’t actually interact with or “run” the code within the platform, which limited my ability to test or iterate on what Grok produced. Maybe if I’d had more time, and wasn’t staring down the clock at five minutes to midnight, I would have tried exporting the code elsewhere. But instead, I drafted my notes and called it a night. Overall, Grok’s coding support feels informative but not especially interactive. For now, it’s more of a tool for quick code explanations than hands-on prototyping.
ChatGPT: Free and Plus/Pro tiers - for those who want an all-in-one toolkit.
Claude: Free and Pro - suited to writers, analysts, and developers who value a logical and structured approach.
Gemini: Free and Advanced - good for Google users and those on a budget.
Perplexity: Free and Pro - for researchers and those who want fast, cited answers.
Grok: Free and paid - for social media and real-time research.
Writing this post took time because doing the manual legwork of comparisons felt important to the quality of content I hoped to provide with this analysis. Not everyone wants to spend their Sunday afternoons testing and comparing AI platforms though, so let’s take a moment to look at the time-saving tool of Vellum: https://www.vellum.ai/llm-leaderboard.
Vellum is a website where you can select different AI models and compare side by side across different categories.
Categories include evaluations for speed, coding capabilities, and a dynamic model comparison where you can compare models within a platform (e.g., GPT-4 vs. GPT-5) or across different platforms, such as what we looked at in this post today. The analysis gives you context size, training cutoff date, latencies, speed, and a nice bar chart for visualization purposes.
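If you’d rather keep a lightweight version of this kind of comparison in your own notes, here is a minimal sketch of how the attributes a leaderboard like Vellum surfaces could be tabulated. The model names and numbers are placeholders to show structure, not figures pulled from the site.

from dataclasses import dataclass

@dataclass
class ModelProfile:
    # Attribute set loosely mirroring what an LLM leaderboard reports.
    name: str
    context_window_tokens: int   # placeholder value, not a real spec
    training_cutoff: str         # e.g., "2024-04"
    median_latency_s: float      # seconds to first response

profiles = [
    ModelProfile("model-a", 200_000, "2024-04", 1.8),
    ModelProfile("model-b", 128_000, "2023-12", 1.2),
]

# Sort by whichever attribute matters most for your use case, e.g., latency.
for p in sorted(profiles, key=lambda m: m.median_latency_s):
    print(f"{p.name}: {p.context_window_tokens:,} tokens, "
          f"cutoff {p.training_cutoff}, {p.median_latency_s}s")

The point is less the code than the habit: decide which attributes matter for your work, then compare platforms on those rather than on vibes.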
Here’s the thing: No single AI will be your everything. My open tabs can attest to this personal conclusion.
Another important consideration with AI tool usage, not tested in this analysis, is how the security controls granted to users can vary from platform to platform. Making informed decisions about whether and how your data is used for training or improvements is something any user of these platforms should take the time to consider and understand. But that is a discussion for another day and another blog post.
Tidying all this up with a nice summary for those of you who may be skimming:
When choosing an AI platform, the best approach is to know where each platform shines and use accordingly. Treat your AI tools like a team, not a soloist. Experiment, iterate, and don’t be afraid to mix and match. And remember: the most important intelligence is still the one behind the keyboard.