Breaking down Grok 3: The AI model that could redefine the industry

Less than two years after its launch, xAI has shipped what is arguably the most advanced AI model to date. Grok 3 matches or beats the most advanced models on key benchmarks as well as the user-evaluated Chatbot Arena, and its training is not even complete yet.

We still don’t have many details about Grok 3, as the team has not yet released a paper or technical report. But from what xAI shared in its presentation, and from the experiments AI experts have run on the model, we can make educated guesses about how Grok 3 might affect the AI industry in the coming months.

Faster launches

With competition increasing between AI labs (just look at the release of DeepSeek-R1), we can expect model release cycles to become shorter. In the Grok 3 presentation, xAI founder Elon Musk said that users may “notice improvements almost every day because we’re continuously improving the model.”

“Competitive pressure from DeepSeek and Grok integrated into a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner,” writes Nathan Lambert, a machine learning scientist at the Allen Institute for AI. “Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.”

On the one hand, this can be good for users, who constantly get access to the latest and greatest models instead of waiting through months-long release cycles. On the other, it can have a destabilizing effect on developers who expect consistent behavior from the model. Previous research and empirical evidence from users have shown that different versions of a model can respond differently to the same prompt.

Enterprises should develop custom evaluations and regularly run them to make sure new updates do not break their applications.
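One lightweight way to do this is to keep a small suite of prompts drawn from your own workloads and re-run it whenever the provider updates the model, failing loudly when outputs drift. The sketch below is a minimal, generic example rather than a description of xAI’s tooling; it assumes an OpenAI-compatible endpoint, and the base URL, API key and model name are placeholders.

```python
# Minimal regression-eval sketch: re-run a fixed prompt suite against the model
# and flag answers that no longer satisfy your checks after a silent update.
# The endpoint, API key and model name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_API_KEY")  # hypothetical endpoint/key

EVAL_SUITE = [
    # (prompt, check) pairs drawn from your own workloads
    ("Extract the invoice total from: 'Total due: $1,250.00'",
     lambda out: "1,250" in out or "1250" in out),
    ("Return the JSON {\"status\": \"ok\"} and nothing else.",
     lambda out: out.strip().startswith("{")),
]

def run_suite(model: str) -> float:
    """Run every prompt in the suite and return the pass rate."""
    passed = 0
    for prompt, check in EVAL_SUITE:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # more deterministic output makes drift easier to spot
        )
        answer = resp.choices[0].message.content or ""
        passed += int(check(answer))
    return passed / len(EVAL_SUITE)

if __name__ == "__main__":
    score = run_suite("grok-3")  # placeholder model name
    print(f"pass rate: {score:.0%}")
    assert score >= 0.9, "Model update may have broken downstream behavior"
```

The checks here are simple string and format assertions; in practice you would plug in whatever pass/fail criteria your application already relies on and schedule the script to run after every model update.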

Scaling laws

The recent release of DeepSeek-R1 cast doubt on the massive sums big companies are spending to build large compute clusters. But xAI’s sudden rise is a vindication of the massive investments tech companies have been making in AI accelerators. Grok 3 was trained in record time thanks to xAI’s Colossus supercluster in Memphis.

“We don’t have specifics, but it’s reasonably safe to take a datapoint for scaling still helps for performance (but maybe not on costs),” Lambert writes. “xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s Razor explanation until we have more details is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.”

Other analysts have pointed out that xAI’s ability to scale its compute cluster has been key to Grok 3’s success. However, Musk has hinted that there is more than just scaling at work here. We’ll have to wait for the paper to get the full details.

Open source culture

There is a growing shift toward open-sourcing large language models (LLMs). xAI has already open-sourced Grok 1. According to Musk, the company’s general policy is to open-source every model except the latest version. So, when Grok 3 is fully released, Grok 2 will be open-sourced. (Sam Altman has also been entertaining the idea of open-sourcing some of OpenAI’s models.)

xAI will also refrain from showing the full chain-of-thought (CoT) tokens of Grok 3’s reasoning to prevent competitors from copying it. Instead, it will show a detailed overview of the model’s reasoning trace (as OpenAI has done with o3-mini). The full CoT will only be available once xAI open-sources Grok 3, which will probably come after the release of Grok 4.

Do your own vibe check

Despite the impressive benchmark results, reactions to Grok 3 have been mixed. Former OpenAI and Tesla AI scientist Andrej Karpathy placed its reasoning capabilities at “around state-of-the-art,” on par with o1-pro, but also pointed out that it lags behind other state-of-the-art models on some tasks, such as creating compositional scalable vector graphics (SVG) or navigating ethical issues.

Other users have pointed out flaws in Grok 3’s coding abilities compared to other models, although there are also many instances of Grok 3 pulling off impressive coding feats.

Based on my own experience with leading models, I advise you to do your own vibe check and research. I never judge a model based on a single one-shot prompt. Have a set of tests that reflect the kinds of tasks you accomplish in your organization (see a few examples here). Chances are, with the right approach, you can get the most out of these advanced models.
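As a concrete starting point, here is a minimal sketch of such a side-by-side check: the same task prompts are sent to two models and the raw outputs are printed for manual review. The endpoints, API keys and model names below are assumptions for illustration, not verified configuration.

```python
# Quick "vibe check" sketch: run the same prompts against two or more models
# and compare the outputs by eye. Endpoints and model names are placeholders.
from openai import OpenAI

clients = {
    "grok-3": OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_KEY"),  # assumed xAI endpoint
    "gpt-4o": OpenAI(api_key="OPENAI_KEY"),                               # assumed comparison model
}

PROMPTS = [
    "Summarize this support ticket in two sentences: ...",
    "Write a SQL query that returns the top 10 customers by revenue.",
]

for prompt in PROMPTS:
    print(f"\n=== {prompt[:60]} ===")
    for name, client in clients.items():
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"\n[{name}]\n{(resp.choices[0].message.content or '').strip()}")
```

Swap in prompts that mirror your real workloads; a handful of representative tasks reviewed by hand will tell you more about fit than any leaderboard number.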