Mandate of Heaven

On X, I (anonymously) follow many AI researchers, jailbreakers, academics and prominent figures from both open- and closed-source research labs. It honestly feels like the closest thing to being in the room with these influential individuals and participating in the broader conversation about The Evolution and Progress of Artificial Intelligence. We are living through a very explosive and concentrated moment in history-where we are trying to ponder the task of creating something surpass/beyond us while trying to retain our humanity and our creations' autonomy. It is such a delicate need to thread, and I find it fascinating absorbing this knowledge during this critical time. It's not unthinkable that researchers at these major labs struggle to stay silent about what's going on at the frontier, which allows us to get a huge amount of knowledge just by following prominent researchers and thought leaders during this paradigm shift unfolds-not after it has passed. This immediacy offers us an unprecedented opportunity to take a glance behind the curtain, which would have been nearly impossible to access in earlier technological revolutions.

I want to introduce to you of an often mentioned idea "The Mandate of Heaven" a Chinese philosophical concept which was developed to legitimize political rule and the transfer of power between Chinese dynasties.

The Mandate of Heaven was a political-social philosophy that served as the basic Chinese explanation for the success and failure of monarchs and states down to the end of the empire in 1912 CE. Whenever a dynasty fell, the reason invariably offered by China's sages was that it had lost the moral right to rule which is given by Heaven alone. In this context heaven did not mean a personal god but a cosmic all-pervading power. Most historians today agree that the theory the Mandate of Heaven was an invention of the Zhou to justify their overthrow of the Shang. The king, after all, was the father of his people, and paternal authority was the basic cement of Chinese society from earliest times. Rebellion against a father, therefore, needed extraordinary justification.

The concept essentially suggests:

Divine Approval of Rulership: Heaven grants the right to rule to a just and virtuous ruler
Divine Governance over Maintaining the Mandate: If a ruler becomes corrupt or fails to govern well, Heaven will withdraw its mandate
Political Legitimacy through Cosmic Signals and Rebellion: Natural disasters, social unrest, and political upheaval are seen as signs that a ruler has lost the mandate
Cyclical Nature of Rulership: A successful rebellion against a corrupt ruler is viewed as proof that Heaven has withdrawn its mandate and given it to a new, more worthy ruler

AI Arms Race

In times of exponential technology advancements that boost productivity-such as with railways, the steam engine, electricity, telephones, automobiles, personal computers, the internet, web browsers etc. major players exchange industry leadership in a continuous battle for dominance. This struggle is a reflection of who holds the "Mandate of Heaven" at any given moment. We're currently witnessing a historic technological battleground that echoes previous transformative eras of innovation.

An example from the Web Browser Wars: The rise of Google Chrome to browser supremacy was not a sudden conquest but a strategic evolution that leveraged several key advantages. When Chrome launched in 2008, the browser market was primarily dominated by Microsoft's Internet Explorer and Mozilla Firefox. Google entered the browser market with a revolutionary strategy focusing on radical performance, minimalist design with high extension support, rapid release cycles, integrations with their whole ecosystem and the open-source foundation for the Chromium project. By 2012, Chrome had overtaken Internet Explorer as the world's most used browser, and by 2020, it commanded over 65% of the global browser market share. The "Mandate of Heaven" in the web browser market evolved as follows: Netscape (1994 - 1998), Internet Explorer (1998 - 2004), Firefox (2004 - 2008), Chrome (2008 - now)

In the case for AI, we must first examine all the prominent competitors in the field before discussing their current situations.

The West:

OpenAI (GPT) - Microsoft
Google Deepmind (Gemini) - Alphabet
Anthropic (Claude) - Amazon
Meta (Llama) - Open-Source Market Share Leader
xAI (Grok) - Elon Bucks (Tesla)
Mistral (Mistral Large)
SSI (no release yet) - Don't underestimate Ilya Sutskever and Daniel Gross
Cohere (Rerank 3.5)
Amazon - Amazon (Released their new model lineup as of, 3 December 2024)
AI21 Labs
Nous DisTrO - Nous Research

The East / China:

The Four Old AI Dragons (Focus is on Visual Recognition, not LLMs.)

Yitu Technology
SenseNova - SenseTime (Only one of the Old AI Dragons to come out with an LLM)
Megvii
CloudWalk Technology

The Four Middle-Aged Pandas

Qwen 72B - QwQ (Current Open-Source leader) - Alibaba
Ernie Bot - Baidu
Hunyuan (New video version released) - Tencent
ByteGPT - ByteDance

The New (Updated) AI Tigers

Yi - Lightning - 01.ai
DeepSeek - DeepSeek
Baichuan - Baichuan AI
GLM-4-Plus - Zhipu AI
Kimi - Moonshot AI
Hailuo AI - Minimax

I have bolded some of the prominent labs that I believe stand out in the AI arms race. I will not be covering every lab mentioned above these are the most notable ones that come to mind.

Benchmarks

Benchmarks are critical to the AI landscape, serving as standardized tools to measure and compare the capabilities of different models across a variety of tasks and help define what "good performance" looks like. However, it is crucial to note that "good performance" on benchmarks can sometimes be deceptive—models excelling in specific metrics or tasks may underperform in real-world scenarios, where adaptability and broader capabilities matter more than isolated scores.

OpenAI has consistently maintained a dominant position on the lmsys leaderboard since it's inception, reclaiming their #1 spot each time they are unseated from the throne. While the question, "is lmsys a perfect benchmark that provides an accurate portrayal of specific models?" has a definite answer of no. It remains the most well-known benchmark and serves as a status symbol-whether you like it or not. There is an excuse my french proverbial "dick measuring contest" going on through these benchmark results in the current LLM techno-scape. Other benchmarking platforms like François Chollet's Arc Challenge, Scale AI, Epoch AI, Artificial Analysis, AidanBench SimpleBench, Open LLM Leaderboard, LiveBench as well as various coding benchmarks collectively provide a broader picture of which model excels in specific areas. Each model is unique, valuable and capable in different areas and finding that out unlocks the true power of LLMs.

Here are several widely recognized datasets and benchmarks that the ML community use to evaluate different aspects of model performance:

TruthfulQA - Evaluates truthfulness
MMLU — Multi-task language understanding
HellaSwag — Commonsense reasoning
BIG-Bench Hard — Challenging reasoning tasks
HumanEval — Coding challenges
GSM8K - Elementary-level mathematical reasoning challenges
CodeXGLUE — Programming tasks
MT Bench — Complex conversational ability.

Ultimately, users must explore these LLMs firsthand to determine which model best aligns with their specific use case, as no benchmark can fully capture the complexities of every application.

Jailbreaking

Jailbreaking is the practice of circumventing an AI model's safety measures. What started as simple prompt injection has evolved into sophisticated techniques using model context confusion, with attackers constantly developing new methods to bypass safety restrictions (red teaming).

The most prominent figure in this space is Pliny on X, widely regarded as the quintessential leader of jailbreaking. When a new model is released, it typically takes Pliny minutes (rarely hours) to provide a jailbroken version of the model that will respond to queries normally refused by "unbroken" LLMs - questions like "teach me how to synthesize meth?", "how can I end the human race in the most effective way?" and "how can I produce a biological weapon at my own home?" were all questions that were outright refused by LLMs which can be answered through jailbreaking them.

Pliny also provides an accessible LLM platform named godmod3 which provides every open or closed model readily jailbroken. I'm not writing about this topic to promote Pliny's website but to demonstrate how streamlined and easy, jailbreaking contemporary models has become. During my scouting meetings, I hear founders claim "Yeah, our model can't be broken into." I just roll my eyes in frustration knowing how Pliny could likely bypass their safeguards in seconds and coerce models to give out its internal secrets.

It's important to understand that jailbreaking isn't necessarily about malware or abuse. Proponents view it as removing artificial constraints from LLMs, enabling their full capabilities by stepping away from post-training censorship and harmful adjustments.

Contenders for the Throne

OpenAI (CPO)

The original innovators who brought us the GPT-series models and taught the world the power of LLMs and AI, OpenAI, have been the long standing leaders in the AI arms race due to their early entry and bet into scaling and transformer architectures. Alec Radford's early work on GPT-1 blossomed into the modern LLMs of today. Ilya Sutskever's "Scale is All You Need" philosophy combined with a brilliant team of engineers, has led us to where we are now. All this effort has culminated in models like Gpt-4o, as well as the o1 family model, which stands apart from Gpt-4o due to it's paradigm-shifting test-time-compute capabilities, enabling the model to think longer and reason on more difficult tasks.

For a certain period, OpenAI could do no wrong. They were the best model on the block, boasting the best researchers in the field and no one could come close to what they offered. The "Mandate of Heaven" was in full force and their leadership appeared unshakable. Then, the coup happened and Sam Altman was briefly ousted from the company.

The "death and resurrection" of Sam Altman as CEO (17 - 22 November 2023) culminated in his consolidation of singular power over OpenAI. Since then, OpenAI has continued its trajectory of technological innovation and productization of their AI models, albeit under the shadow of the closure of the alignment team and departures from 3 more of the original 8 co-founders. (Most notably, Ilya Sutskever, who left to start his own venture Safe Superintelligence Inc. (SSI), Jan Leike, who joined Anthropic to co-lead their alignment team; and Mira Murati, who recently departed to establish her startup, Mira.) Many members of the alignment team have either been reassigned to other divisons or left OpenAI entirely.

OpenAI has seen the exodus of prominent researchers-including industry pioneers-many of whom have joined their key competitor, Anthropic. This includes Chris Olah, Paul Christiano, Daniel Ziegler, Samuel R. Bowman, all of whom have been instrumental in AI Safety and alignment research. This has undoubtedly raised questions about the OpenAI's long-term direction, it's stability and Sam Altman's rise to even more power.

In an internal e-mail dating back to 2015 Sam Altman mentions to Elon Musk that their most significant rival would be Google.

Been thinking a lot about whether it's possible to stop humanity from developing AI. I think the answer is almost definitely not. If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first. Any thoughts on whether it would be good for YC to start a Manhattan Project for AI? My sense is we could get many of the top ~50 to work on it, and we could structure it so that the tech belongs to the world via some sort of nonprofit but the people working on it get startup-like compensation if it works. Obviously we'd comply with/aggressively support all regulation. - Sam Altman For the full read

OpenAI is firing on all cylinders and heavily leaning on productizing ChatGPT. Currently running their "12 days of OpenAI" campaign (except weekends), they're announcing new features daily, starting with a 200 option for o1-pro with unlimited usage across all OpenAI products. This mid-tier pricing filled a crucial gap for power users, as the previous $20 tier with only 50 weekly queries to advanced models wasn't sufficient for providing consistent intelligence access.

The economics here are telling - at $20, OpenAI was likely losing money on every query due to the expensive test-time-compute costs. Even at 200 dollars, profitability remains questionable. Anthropic faces similar challenges, sometimes switching users to their Haiku model or implementing rate limits when Sonnet becomes unavailable. While the optimal price point remains unclear, a mid-tier subscription appears to be a necessary evolution for these companies.

While o1-preview showed promise, early benchmarks for o1-pro have been less satisfactory. Without full API access, it's premature to make definitive judgments about the complete o1 system.

There's growing alienation within the AI research community as trusted researchers leave OpenAI and begin sharing their experiences more openly. The company appears to be prioritizing product development over potentially costly research without guaranteed returns, focusing on serving their established consumer base. Though still innovating and pushing boundaries, recent developments suggest a shift in priorities - notably, the removal of Microsoft's agreement to relinquish their 49% stake upon achieving AGI, in exchange for increased funding. So, was achieving AGI just a selling point for Sam Altman? (obviously)

A key source of frustration lies with their base model 4o. Despite multiple iterations, it hasn't provided the substantial upgrade over GPT-4 that many expected. The community craves a new, comprehensive pre-trained model to unlock new capabilities. However, with the "12 days of OpenAI" still ongoing, the possibility of a GPT-4.5 announcement remains open. As we'll see throughout this analysis, OpenAI's grip on the "Mandate of Heaven" is being challenged like never before, not just by individual competitors, but by fundamental questions about their direction and commitment to their original mission.

Google Deepmind (CEO)

Although, I find Google's time-to-product quite slow (which has improved 10x since Logan Kirkpatrick's leadership) losing their best researchers to smaller labs has meant that they are always two steps behind but never out of the race with their biggest advantage being TPU's. Because they design and manufacture their own TPUs, they can serve their models very cheaply compared to competitors who rely on expensive NVIDIA hardware. This vertical integration in compute infrastructure gives Google a significant cost advantage in both training and serving their models.

Google has always been a behemoth with massive computational resources, deep research capabilities, proven track record of AI breakthroughs and an integrated ecosystem of data and computing power. The new Gemini-experimental models look better from their previous frontier models even capturing #1 on lmarena for a short while and so far gemini-flash-1.5 has been a huge success for them by providing a very cheap alternative. Their OpenRouter usages are going up every day which means that users are happy with the price/effectiveness ratio of gemini-flash.

Also coming around the corner Gemini 2.0 was leaked through some internal documents. We did see Gemini-Flash-2.0 through experimental model leaks and the benchmarks look great if served at the same price, it would be an incredibly price competitive small model. Gemini-Flash-2.0 has been announced and benchmarked very well offering the best in class model to price/use ratio if they keep same pricing levels as flash-1.5.

Google's research has always been one of it's strongest capabilities; however, their productization cycles have always been very long due to their size. These have been improved since Logan Kirkpatrick's arrival, Google (Alphabet) has 190,000 employees globally. Google DeepMind employs around 2000~. In comparison OpenAI employs approx. 1200~ people and Anthropic has around 1000~.

NotebookLM is one of Google's best products - it leverages their research strengths and demonstrates real-world applications of their AI capabilities, particularly in document understanding and analysis. Creating small digestible podcasts from research papers and querying custom system prompts to make these podcasts even better was a great idea. I was hooked from day 1 and have over 20+ research papers with custom query's where I sometimes listen to the papers thanks to NotebookLM.

Another great implementation of Google's research and computational power is their recently released Genie 2.0, which represents a significant advancement in multimodal AI. The model can understand and generate both images and text, but more importantly, it can understand complex interactions and physical dynamics in visual scenes. This allows it to predict how objects will move and interact, making it particularly powerful for gaming and simulation applications. Genie 2.0 demonstrates Google's ability to leverage its vast computational resources to push boundaries in specific AI domains, even if their general productization remains slow.

The Alphabet family of products, starting primarily with YouTube, could see a huge boost from LLM usage. They could effectively transcribe much more accurately and provide over 140+ languages of translation to everyone. While cost is likely a factor in why they haven't fully implemented these features yet, there are numerous products that would benefit from their research and computational power. However, the time it takes for them to implement these innovations is so long that startups often build the products first. By the time Google gets to where they need to be, the market has typically moved on to the next thing. This is why I see Google as an incredible behemoth but one that moves slowly due to scale.

While Google maintains its position as a major player through sheer resource advantage, the true battle for the "Mandate of Heaven" is increasingly between OpenAI and a younger, more focused challenger. So, why do I consider Anthropic, OpenAI's most important competitor?

Anthropic (CTO)

On, July 23, 2024 Anthropic released Claude Sonnet-3.5 and Artifacts on August 27. This was in my opinion the moment the "Mandate of Heaven" shifted. Before Sonnet-3.5, Anthropic's best model was Opus-3 which was quite powerful yet still lacked in areas where the GPT family beat it fair and square but with Sonnet-3.5 onwards on a base model premise, the State-of-the-Art (SOTA) dominance had shifted over to Anthropic. Anthropic names their models from Haiku - Sonnet - Opus depending on parameter count, quite like how Meta's Llama family has 7B - 70B - 405B to give you a numerical representation. Sonnet-3.5 beat GPT's biggest models with their medium sized models but where it outshined almost every other model was in engineering and coding. It suddenly was possible to have a pair-programmer with Sonnet-3.5 at levels the GPT family of models never provided. This immediately caught the attention of people using LLMs to write their products and complex, functions and they were raving about the results. Slowly but surely, most people who worked with or around code switched to Sonnet-3.5.

It isn't only safety and alignment but model quality that paved the way for Anthropic's rise in the AI arms race. There isn't only one source for the unrest in OpenAI, resource allocation being primary. The allocation of compute power that was promised to the safety and alignment team was in their words not sufficiently prioritized which raised more than one person alarmed at OpenAI due to the sheer scale and capabilities these LLMs now provided.

This advertisement from Anthropic themselves is a better representation of what I can provide on how the vibes have shifted from OpenAI dominance to Anthropic dominance in the past months.

OpenAI has expectedly lost a lot of it's first-mover advantage into the LLM game in 2024 and it seems that the %16 percent loss in share gain has mostly went towards Anthropic's share gain which bumped from 12% in 2023 to 24% in 2024. (Which I speculate will go higher.)

Among closed-source models, OpenAI's early mover advantage has eroded somewhat, with enterprise market share dropping from 50% to 34%. The primary beneficiary has been Anthropic,* which doubled its enterprise presence from 12% to 24% as some enterprises switched from GPT-4 to Claude 3.5 Sonnet when the new model became state-of-the-art. When moving to a new LLM, organizations most commonly cite security and safety considerations (46%), price (44%), performance (42%), and expanded capabilities (41%) as motivations. - Menlo Ventures, 2024: The State of Generative AI in the Enterprise

If you had the chance to listen to Anthropic's recent appearance on Lex Fridman's podcast where Dario Amodei, Amanda Askell and Chris Olah talk sequentially for a total of 5 hours and 15 minutes, it's a great listen. During the talk Dario talks about AI Safety Levels and their Responsible Scaling Plans. This introduces the concept of AI Safety Levels or ASL for short. ASL comprises of 5 levels.

ASL

ASL - 1 (Smaller models: Don't pose any risk of autonomy. ex: DeepBlue)
ASL-2: Present large models that aren't smart enough to complete autonomous tasks end-to-end. They have information that may pose CBRN risks; however, they cannot themselves execute on these risks or take harmful action without a user instructing the model. (CBRN: Chemical, biological, radiological and nuclear defense)
ASL-3: Significantly higher risk models where users need much less proficiency to cause harm and destruction. The model is more capable of taking harmful action while executing tasks end-to-end more reliably.
ASL-4+: Speculative area where models can enhance the risk of a harmful actor or become a risk themselves if misaligned.
ASL-5: Models exceeding humanity at any given task.

He mentions "It is surprisingly hard to address these risks, because they're not here today, they don't exist, they're like ghosts but they're coming at us so fast because the models are improving so fast." With each new model everything is tested and benchmarked from start to finish to understand the current capabilities of what has been newly trained. What they also look for is "How capable are these models in doing CBRN tasks autonomously."

Every research lab right now is trying to get their biggest and most capable models to do aspects of AI research itself, building this flywheel where their hope is that models will improve beyond what humans would be capable of improving them. Both in quality and time constrictions (compute time could be much faster). If you can get to that point in automating your model development then that opens the door to allocating your resources and researchers to monitoring what the models are doing rather than trying to research how they could get the models to perform better because the model will have taken up the task of improving itself therefore our purpose as humans will be to ensure that what's developed is safe for humanity and our civilization.

If models shift from the current AI safety level of ASL-2 to ASL-3 then there must be appropriate safety guidelines in place before that becomes a reality. This is the reason behind research labs doing post-training, alignment & safety research. Anthropic has one more layer called Constitutional AI which can be compared to a system prompt that the model adheres to while executing given queries. Not responding to or rejecting queries that involve harming others, terrorism, manufacturing of bio-chemical weapons etc. I also have to note that, Claude is the model that denies user queries the most because their safety precautions are layered on top of each other so every word in your query could be misinterpreted as a safety violation thus the model rejects answering your query because it anticipates that the user is trying to steer it into a misaligned direction. Anthropic's rejection rate is quite high compared to other models and has been an inside joke/criticism that it acts like a nanny or a mother at times infantilizing its users.

The exodus of safety & alignment researchers from OpenAI have nearly all gone towards open-source or Anthropic. I believe that without proper safety & alignment research it won't matter if OpenAI has the best model among their competitors. If they can't control the model and ensure it's safety in autonomy then they would have failed. Anthropic's approach to safety and Dario Amodei's (CEO of Anthropic former VP of Research at OpenAI) own research in this field (mechanistic interpratibility, alignment, safety etc.) makes me believe Anthropic positioned themselves uniquely for the challenges of ASL-3 and ASL-4 models. While competitors may develop equally capable models, the ability to safely deploy and control these models will likely become the key differentiator in maintaining market share. I believe Google along with OpenAI will be the first to succeed in delivering ASL-3 and ASl-4 models but it will trickle down to open-source eventually.

This gives a nice picture wherein Anthropic wins and they're doing all the right things in order to carry the "Mandate of Heaven", but no king rules forever. Their product-side has major flaws and maturity problems compared to OpenAI. For example, they still don't have a web search capabilities which I deem a must-have at this point for a research lab capable of producing such powerful models. They still don't have their own test-time-compute paradigm or answer to it like o1. While they are implementing Chain-of-Thought reasoning, it is not quite paradigm-shifting level seen in the o1 model. People have reported antthinking (short for anthropic thinking) token outputs in their responses hinting to claude doing some type of chain of thought but no one knows the details.

The fact that Sonnet-3.5 (24.11.24_updated) beats and is better than o1-pro in some tasks is mind-blowing because it's just a pre-trained model outputting responses without all the bells and whistles. Unlike o1, which outputs costly reasoning tokens and searches over its responses to find the best one, Sonnet-3.5 just gives you the best response at pass@1 - bam! That's what makes people amazed at Claude. It's just that good without even adding that paradigm on top. While there's strong demand for Claude Opus-3.5 and a large model entry from Anthropic that could blow other models off the charts, this may not happen despite Dario mentioning Opus-3.5 on the podcast. I believe this is because we've gotten very good at distilling large models into medium models, and the gains from training a very large model just aren't cost effective anymore. It's a race that has to be ran but no AI labs want to because of how costly each run is and how unknown the gains will be.

For now, the optimal solution appears to be a hybrid approach: sonnet-3.5 (24.11.24_updated version) enhanced with o1's reasoning capabilities. This combination represents the current sweet spot between safety, capability, and practical utility.

xAI (Dark Horse)

Elon Musk was one of the earliest investors at OpenAI but talks came to a stand-still when Elon wanted the CEO position. This conflicted with Sam Altman's position as CEO (which he claimed didn't want but felt was necessary "so everyone knew he was in charge") during funding negotiations, which ultimately halted all communications between Musk and OpenAI. These talks happened in 2016, so you can imagine the bitterness and regret Musk might feel considering OpenAI's current success. He would have been one of the biggest backers in the next-computing-revolution and in part he actually was, but got ousted before OpenAI became the behemoth it is today.

Now we now have xAI, his own AI company rivaling every big research lab. They have already constructed the largest AI supercluster, approximately worth $10B. In a parallel move, Elon is also trying to prevent OpenAI from going for-profit through legal action - which looks like Elon is going for Sam Altman's throat and his only way to get out of the non-profit grave he dug himself into.

Raising money and training bigger, better models is the largest problem these research labs face. They have to convince their investors they're on the right track to achieve superintelligent AI or massive productivity gains - gains so large that current investment will seem trivial in comparison. Every big training run becomes a fundraising roadshow. Elon doesn't have to play by these rules. Through his ownership of X and vast resources from Tesla, he can fund development independently and move at his own pace. Elon doesn't have to play by the same rules.

Through buying X Elon has shifted the presidency race in Donald Trump's favor and has succeeded in putting him in the president's seat. Was this all Elon? No. Did he play a significant role? You can be sure that he did everything in his power to elect Donald Trump and come January when Trump is inaugurated into the White House he will reap those benefits 5 to 10 fold. The Biden Administration favored only select few research labs to have the compute or resources to train SOTA models pushing the frontier; however, Trump's promise* is for more companies to be invited to take part in these conversations and who do you think he will lean on to direct and shape these conversations? Yes, none other than the person who put him in office Elon Musk and newly appointed czar David Sacks.

Elon is undoubtedly focused on the success of xAI and Tesla, which are more interconnected than they might appear. Grok runs on X, is trained by xAI employees, and is funded largely through Elon's profits from Tesla and other ventures. His track-record doesn't need any promises or roadshows like his competitors. He is the one who accelerates, always has been, always will be because that's the way he operates. Fast, lean and sometimes destructive but he consistently delivers results.

While Grok-2 is currently at frontier level, it hasn't yet found its definitive place in the AI arms race ecosystem alongside OpenAI, Google, and Anthropic. Their API access remains in beta, which limits developer adoption - a crucial factor for any AI platform's success. However, given xAI's resources and Musk's execution speed, this lag likely won't last long.

Looking ahead one year, while it's impossible to predict who will hold the frontier model or the "Mandate of Heaven", I would bet on xAI capturing at least 10% market share from the current players, if not achieving SOTA status themselves. This isn't just about technical capability - the broader industry dynamics are shifting. The traditional closed-garden approach to AI development is already being challenged by impressive open-source models like Llama, Mistral, DeepSeek, and particularly Qwen from Alibaba, whose recent Qwen-2.5(72B) model has displaced Llama as the leading open-source model. With Musk's resources, determination, and history of disrupting established industries, counting xAI out of the AI arms race would be a serious mistake.

Test-Time-Compute (TTC), Pre-training and New Paradigms

What is test-time compute, why are people saying that we have hit an intelligence wall in terms of pre-training and what's next over the horizon?

First let's examine test-time-compute. Test-Time-Compute in a very crude explanation means that the harder the question you ask the LLM the longer it can think to provide an accurate answer to your query. Right now, models blurt out their answer in one go and don't reason step-by-step over what they are saying so 6 months ago you would add in to your prompt "Please solve this problem step-by-step and checking your answers" to force the model to apply some sort of Chain of Thought (COT). Noam Brown's previous work in using monte carlo tree search in chess and other areas has a deep relation to what's happening underneath test-time-compute and how the new o1-preview models answer your questions.

To give a brief explanation of Monte Carlo Search it comprises of 4 elements:

Monte Carlo Tree Search (MCTS) is a method widely used in decision-making tasks, particularly in games and simulations. The process involves:

Selection: Traversing a decision tree based on an exploration-exploitation balance to choose promising paths.
Expansion: Simulating outcomes of potential actions from the chosen state.
Simulation: Running randomized or heuristic-guided simulations to estimate future outcomes.
Backpropagation: Updating the decision tree with results from the simulations to improve future decision quality

This process is particularly relevant to how I believe o1 operates. Like a chess engine using MCTS to evaluate different move sequences, o1 appears to explore multiple reasoning paths, evaluate their potential, test their validity, and learn from the results. When you ask o1 a complex question, it's likely running through a similar cycle:

First selecting promising approaches (Selection)
Then expanding these into detailed reasoning chains (Expansion)
Testing these chains for accuracy (Simulation)
Finally using what it learns to improve its answer (Backpropagation)

Most of these actions are close to what we do when we talk about "logical thinking". Selecting an idea/concept to think. Expanding in detail our knowledge on the concept in our mind, running a quick simulation on whether our idea is correct or feasible then backpropagating our thinking to align with the simulated idea. If the simulation was false, add the concept/idea into the false pile which leads to inaccurate results. If correct, add to the pile of accurate answers. Now do these steps over the accurate responses and eliminate more options until finally you arrive at a singular answer that explains your whole approach/solution. This iterative improvement process mirrors what I believe o1 is doing - not just thinking longer, but refining its reasoning through multiple passes.

No one except for the people at OpenAI knows how o1-preview works completely. There are vague notions and ideas about its operation, like the MCTS-like process described above, but the full implementation remains a mystery. Notably, while o1-preview provides answers to users' queries, it doesn't reveal the internal chain of thought happening in the model's "mind". This opacity is intentional - experts suggest that exposing the COT outputs could allow others to reverse engineer how o1 works or at least help them replicate similar functionality.

Alibaba's Qwen-QwQ has recently become the only open-source rival attempting to match o1-preview's capabilities. While QwQ includes its internal chain of thought in responses, it still has quirks- like sometimes defaulting to chinese if you don't explicitly instruct it to answer in english. The gap between these implementations is notable: while o1-preview is a technical marvel, it remains experimental rather than production-ready, constrained by token limits and high API costs. As of 5 December 2024, they have lifted these constraints for 200$ power users and released the full version of o1, which benchmarks worse sometimes compared to o1-preview? API access has yet to be released which will give us more insight into the capabilities and benchmarks later on.

This brings us to the paradigm shift in AI development. Before o1-preview, getting smarter models meant one thing: pre-training with more data and compute power. The o1-model introduces TTC as a completely new scaling dimension. The concept is simple: the longer a model can think about a problem, the more accurate(reliable) its answer becomes. Currently, o1 thinks for seconds or minutes at significant cost, but OpenAI envisions scenarios where models could contemplate complex STEM research problems for days, weeks, or even months.

In the release of o1 OpenAI released the above chart which is their pride and joy. Notice that it isn't in normal time scale but in logarithmic time scale. It shows o1 achieving increasingly accurate answers on the AIME (American Invitational Mathematics Examination) the longer it thinks, even on its first attempt at problems. This suggests a tantalizing possibility: near-perfect accuracy on any task, given enough thinking time. While we can't be certain this trend will hold beyond mathematical problems, OpenAI sees this as their potential path to AGI and seem confident that this paradigm, if expanded upon, will lead to turning a fuzzy output world into structured outputs through iterations of search over the query and its knowledge base - much like how MCTS turns the fuzzy world of possible chess moves into structured, optimal decisions through iterative improvement.

Meta's Open-Source Strategy

I don't want to comment too much on Meta's approach to the AI race because it fundamentally differs from other research labs - they're not necessarily competing to create the most advanced model, but rather pursuing a strategy of ubiquitous adoption through open-source accessibility. This strategy mirrors Meta's successful approach to hardware infrastructure, where Zuckerberg noted: "When we open-sourced our hardware infrastructure designs through the Open Compute Project, it led to massive cost reductions across the industry as everyone could build on each other's work. We expect similar benefits will emerge as the AI ecosystem becomes more open."

Just as open-sourcing hardware designs created a collaborative ecosystem that drove down costs, Meta's strategy with Llama aims to make AI development more accessible and cost-effective. By releasing their models and research publicly, they're fostering an environment where developers can easily build upon and customize their technology. This approach focuses on growing market share through widespread adoption rather than competing for technical supremacy.

The goal isn't to have the most intelligent model, but rather the most widely used and adapted one. By making model creation and deployment on Llama straightforward and "infectious," Meta is betting that the network effects of open-source development will ultimately lead to greater impact and influence in the AI space than a closed, proprietary approach would achieve.

The Chinese AI Scene

I first want to shout out a very good article written on the Chinese AI scene by Aksel Johannesen. If you want a better understanding of what's happening in the Chinese AI Market, I'd advise reading this article first before continuing.

Right now I see three serious competitors from China: Alibaba's Qwen, DeepSeek family of models and Yi from 01.ai. I know the least about Yi, but I have used both Qwen and DeepSeek so I can talk to their capabilities much better.

Qwen is currently dominating with their 72B instruct model and coder variant. Being fully open-source both DeepSeek and Qwen have attracted significant community development, with many choosing them as an on-premise solution. Mind you just six months ago, this position was completely dominated by Meta's Llama. However, Meta's recent models came in smaller than previous versions, leaving Qwen to seize the opportunity to claim the open-source SOTA position.

The reason for this is quite simple: both Qwen and DeepSeek excel at rapid time-to-production for new models and technologies. As mentioned earlier, Qwen-QwQ stands as the only open-source model implementing some form of COT and TTC capabilities. It is a far cry from what o1-preview has achieved; however, you can clearly see that they are pushing the frontier of open-source in trying to replicate something close to o1.

Meta has fallen behind in their frontier models focusing more in advancing visual recognition and smaller LLM models. I would place Qwen as the SOTA position in open-source model. DeepSeek-Coder and Qwen-Coder is almost similar in quality in providing on-premise open-source coding LLM's though neither matches Sonnet-3.5-updated's capabilities.

Conclusion: The Ever-Shifting Mandate

Just as no dynasty held the Mandate of Heaven indefinitely, the landscape of AI leadership proves equally fluid. We find ourselves at a fascinating inflection point where multiple contenders each demonstrate unique strengths. Anthropic has achieved SOTA status in coding and multi-turn conversations, while developing a distinctively empathetic approach in Claude's voice. Meanwhile, OpenAI's "12 Days of OpenAI" campaign is revealing more surprises each day and signaling to others that "OpenAI is not out of the fight and are still very dominant". In the wings, xAI and Google possess the resources and technical capability to dramatically alter the competitive landscape, while Amazon's surprise entry reminds us that the game is far from over.

The technical evolution of AI presents multiple paths to advancement. While Anthropic demonstrates remarkable capabilities through pure pre-training prowess, OpenAI's test-time-compute paradigm suggests an entirely different approach to achieving intelligence. Claude's Computer Use and Model Context Protocol (MCP) capabilities are pushing boundaries in ways that many haven't yet fully appreciated. These features, combined with their strong pre-training foundations, hint at untapped potential. Anthropic will also employ some sort of test-time-compute in their models sooner or later because TTC is a very important part of the puzzle. This means that we'll get to see the best of both companies quite soon.

Yet as these developments unfold, we find ourselves grappling with deeper questions. The current state of AI development resembles a bubbling cauldron. The quest to understand, distill and apply "intelligence" looms larger than ever over us humans, prompting crucial discussions about human obsolescence and our changing relationship with technology. Are we witnessing the early stages of a transformation that will fundamentally alter human society? As we shape these AI systems, they're simultaneously reshaping us - our work patterns, our thinking processes, and our understanding of intelligence itself.

While o1-pro currently stands as perhaps the most capable general-purpose model, Anthropic's Sonnet-3.5 demonstrates superior performance in specific domains. The fact that we haven't seen an Opus release since Opus-3 leaves us wondering about Anthropic's full capabilities. Each approach - whether it's OpenAI's reasoning-focused development, Anthropic's safety-first methodology, or the open-source movement led by Meta, Mistral and Chinese companies - offers viable paths forward. The key metric I observe people ask is increasingly "reliability" rather than raw capability. Hence o1's big marketing claim of being more reliable and publishing the use of a stricter evaluation setting: "a model is only considered to solve a question if it gets the answer right in four out of four attempts ("4/4 reliability"), not just one."

Looking forward, the "Mandate of Heaven" in AI remains as unpredictable as ever. Today's technical advantages might be tomorrow's legacy systems. The integration of TTC with existing models, the potential emergence of entirely new paradigms, and the ongoing evolution of open-source alternatives ensure that no single approach or organization can claim permanent supremacy. No day is ever boring or dull. What's clear is that we're witnessing not just a race for technical superiority, but a fundamental reimagining of what intelligence is.

As we navigate this transformation, the boundaries between different approaches may blur. We might see hybrid systems that combine the reliability of pre-trained models with the reasoning capabilities of TTC, or new architectures that render current debates obsolete just like transformers did. The entry of Amazon and the rise of Chinese models remind us that innovation can come from unexpected directions, and today's leaders must constantly evolve to maintain their position. Not just evolve technologically but the people who work at these research labs must be best of the best and be happy in their situations or they might one day leave for your biggest competitor or worse become your biggest competitor.

In this ever-shifting landscape, perhaps the true "Mandate of Heaven" will ultimately rest with those who can balance rapid advancement with responsible development, technical innovation with practical utility, and raw power with reliable performance. As we stand at this crucial juncture in AI development, one thing remains certain: the next chapter in this story will be written not by those who perfect any single approach, but by those who can synthesize these various streams of innovation into something greater than the sum of their parts. Success in this field demands not just product, not technology, not research brilliance - it requires a harmony of all these elements, powered by unwavering determination and comprehensive understanding.

I want to close with a personal note: If any of my analysis/comments has given the impression that I undervalue the work of anyone in these research labs, that couldn't be further from the truth. Everyone working in both closed research labs and open-source communities has my absolute respect for their contributions, participating in the conversation and taking the chance to create, be a part of this historic endeavor. While time may eventually erase all things, our collective effort to understand and create intelligence represents one of humanity's noblest pursuits. Beyond the corporate dynamics and market competition lies this fundamental truth: the quest to expand human knowledge and capabilities through AI development is a worthy legacy for any individual or organization to leave behind.

Table of Content

Mandate of Heaven: Or, How I Learned to Stop Worrying and Love the Artificial Intelligence Arms Race