Gemini Pro vs GPT-4: An in-depth comparison of LLM models


If you’re following the war between generative AI and LLM models, your head must be buzzing with all the new launches, variations, capabilities, and terminology. Earlier this year Meta launched Llama 2, which is already stale news. Just in the past few weeks, OpenAI and Google launched their highly anticipated multimodal LLMs, marking a significant leap in the evolution of AI capabilities. These models are the next iteration of AI models and are expected to have superior reasoning and problem-solving skills, along with the ability to go beyond text when processing inputs. It is now the ultimate showdown: Gemini Pro vs GPT-4. ChatGPT and Google’s Bard (now renamed Gemini) already use these models and make AI available to everyone. Alternatives such as Bind AI can now leverage these powerful AI models too, democratizing AI for everyone.

What is a Multimodal LLM? 

First things first: what is a multimodal LLM? Is it the same as an LLM, or should we call it an LMM (Large Multimodal Model)? A multimodal LLM is a model that can process and generate responses based on inputs that are not just text, but also other types such as images, code, and potentially audio and video. They are still large language models in the sense that they are trained on huge amounts of data; the data just isn’t limited to text. OpenAI specifically calls GPT-4 a large multimodal model. This significantly expands the types of applications that can be built using AI.
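
To make this concrete, here’s a minimal sketch of a single multimodal request using OpenAI’s Python SDK (openai>=1.0) as one example; the image URL is a placeholder you’d swap for your own:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# One message can mix text and an image; the model reasons over both.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable GPT-4 variant
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this picture?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```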

Here are some examples of potential applications that can be built using a multimodal LLM model:

  • Smart Sunglasses: Think of a smarter version of Google Glass or Spectacles by Snap. One of the biggest reasons Google Glass failed (apart from being weirdly awkward to wear) was its lack of meaningful functionality beyond taking pictures. Imagine truly smart glasses, not the exploding pair Tom Cruise wears in MI-2, that analyze everything you see and hear in real time, make sense of it, and let you converse or take actions based on it. For this to work, though, the model would need to run extremely fast and process a huge amount of data in real time.
  • Nutrition App: An application where you take a picture of your food and it describes the ingredients and evaluates the nutritional composition: calories, grams of protein, carbs, and fats (see the sketch after this list).
  • Create websites by scribbling a mockup on a piece of paper: Imagine you draw four boxes in a notebook, take a picture, and ask an LLM to produce a UI design and the HTML code you can actually use on your website.
  • Solve Math or Science problems by taking a screenshot: While we don’t recommend it, this is an app your kids would love! In fact, this was one of the use cases included in Google’s evaluation of the model.
From Google’s Gemini paper: An example of Gemini verifying a student’s solution to a physics problem. The model correctly recognizes all of the handwritten content, verifies the reasoning, and even generates correct LaTeX.
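
As an illustration of the nutrition-app idea above, here’s a hedged sketch using Google’s google-generativeai Python SDK; the API key and file name are placeholders, and error handling is omitted:

```python
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

image = PIL.Image.open("lunch.jpg")  # hypothetical photo of a meal
model = genai.GenerativeModel("gemini-pro-vision")

# Ask the multimodal model to identify ingredients and estimate macros.
response = model.generate_content([
    "List the ingredients you can identify in this meal and estimate "
    "calories, protein, carbs and fat per serving.",
    image,
])
print(response.text)
```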

Now, let’s dive into our two contenders for multimodal models: Google Gemini and OpenAI GPT-4. Let’s start with Google.

What is Google Gemini Pro?

Google Gemini was unveiled by CEO Sundar Pichai and Demis Hassabis of DeepMind as a breakthrough multimodal AI system from Google’s DeepMind research division. It’s a cutting-edge multimodal LLM that builds upon the capabilities of its predecessors, PaLM and LaMDA.

The Gemini product line consists of three variants – Ultra, Pro and Nano – each optimized for different use cases. Gemini Ultra targets complex, multifaceted tasks through its enormous computational capabilities. Gemini Pro provides excellent scalability for diverse enterprise applications. Meanwhile, Gemini Nano’s efficiency enables on-device deployment on resource-constrained hardware. Imagine this model sitting on your smartphone and answering questions without needing any internet access at all; that would be game-changing!

Unlike Google’s prior language models with narrow, single-domain capabilities, Gemini stands out for its ability to process multiple data types through unified internal representations. Its architecture is designed to handle diverse inputs including text, code, and images, and in the future, audio and video streams. This is achieved through powerful multimodal transformers trained on massive proprietary datasets from Google services.

Google Gemini multimodal LLM: high-level architecture

What is GPT-4?

Note: OpenAI recently launched GPT-4o, which outperforms GPT-4 in cost, speed and reasoning.

GPT-4 is a multimodal LLM from OpenAI, the creators of ChatGPT. Its predecessors, GPT-3.5 and GPT-3, are excellent models for text completion and reasoning, but they cannot take images or other modalities as input. It’s important to note that GPT-4 is not a successor to DALL·E 3, OpenAI’s text-to-image system for creating an image from a given prompt. If you are curious, the two models use different architectures. DALL·E uses a diffusion architecture, which operates like a creative alchemist, transforming noise into rich, meaningful images layer by layer. GPT-4, by contrast, uses a transformer architecture, which excels at capturing long-range dependencies within text sequences, meaning it can understand relationships between words and concepts even if they’re far apart in a sentence or paragraph.

 

From the OpenAI GPT-4 technical report: An example of GPT-4 responding to an image with a user-defined prompt.
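
To make the DALL·E/GPT-4 contrast above concrete, here’s a minimal sketch (again using OpenAI’s Python SDK) showing the two systems side by side: DALL·E 3 turns text into an image, while GPT-4 turns text into text. The prompts are illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# DALL·E 3: diffusion model, text in -> image out
image = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    n=1,
    size="1024x1024",
)
print(image.data[0].url)  # URL of the generated image

# GPT-4: decoder-only transformer, text in -> text out
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Describe a lighthouse at dawn in one sentence."}],
)
print(chat.choices[0].message.content)
```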

 

Gemini Pro vs GPT-4: Comparing model size, architecture and reasoning abilities 

Now let’s comprehensively compare Gemini and GPT-4 on the following parameters: model scale, architecture, and published benchmarks. Since Gemini Nano is a smaller model intended for devices such as smartphones, we’ll limit the comparison to GPT-4 versus Gemini Pro and Gemini Ultra, based on the information available.

Model Size and Parameters

Gemini’s multimodal LLM is available in three sizes: Ultra, Pro, and Nano. Each is tuned for a different degree of complexity and scalability in order to accommodate a range of user requirements and computing capabilities.

Gemini Nano is the smallest of the three Gemini models, with 1.8 to 3.25 billion parameters depending on the variant, considerably fewer than the rumored 1+ trillion parameters of GPT-4 or even the other Gemini models. It is noteworthy because, despite its tiny size, it is designed to run well on devices like smartphones, which speaks to Google’s emphasis on usability and accessibility. Imagine the model running on your smartphone: even without any internet connection it could still power immensely valuable functionality such as navigation, web search, image processing, and problem solving. If this actually works as anticipated, it will be phenomenal, especially in countries and regions where internet connectivity isn’t the best.

Gemini Pro is a much larger model than Gemini Nano, though Google has not publicly disclosed its exact parameter count. This model already powers Bard, Google’s ChatGPT competitor, and is available via API as well. While it is a powerful model, it is expected to be more comparable to GPT-3.5 than to GPT-4. We’ll dive deeper into the benchmarks in the next section and do a head-to-head comparison.
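
For reference, here’s roughly what calling Gemini Pro looks like with Google’s google-generativeai SDK; this is a sketch, and the API key and prompt are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Explain what a multimodal LLM is in two sentences."
)
print(response.text)
```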

Gemini Ultra is expected to be the most potent version, with a much larger parameter count. Details about the model haven’t been publicly disclosed yet, and we’re awaiting an official launch date from Google.

GPT-4, the far more powerful successor to the already remarkable GPT-3.5, is rumored to have as many as 1.7 trillion parameters, reflecting extensive and deep learning capacity. It is expected to offer better reasoning, problem solving, and multimodal capabilities.

Comparison of Gemini Pro, GPT-4, and GPT-3.5 Turbo

Architecture

Understanding these models’ design is essential to appreciating their strengths. Although the transformer architecture serves as the foundation for both Gemini and GPT-4, Gemini’s multimodal capabilities enable it to handle and synthesize a variety of data types, including text, code, audio, images, and video. This adaptability means Gemini can reportedly process and comprehend a wider variety of data inputs, which increases its usefulness in a variety of contexts.

Built upon the foundation of its predecessors, Gemini takes a modular approach. Instead of a single monolithic network, it reportedly utilizes specialized sub-modules for different tasks like factual language understanding, commonsense reasoning, and even humor generation. This modularity allows for fine-tuning each component, potentially leading to greater proficiency in specific areas.

In contrast, GPT-4 sticks to the decoder-only architecture that made its predecessors successful, but scales it up significantly, with a parameter count reportedly in the trillion range. This sheer size promises increased capacity for learning and generating complex, nuanced language. Like all transformers, GPT-4 relies on attention mechanisms that let it focus on the relevant parts of the input during processing, which helps with coherence and accuracy.
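
For intuition, here’s a toy NumPy sketch of the core operation both models share: scaled dot-product self-attention, with the causal mask that makes an architecture decoder-only. This is a deliberate simplification of the real multi-head, heavily optimized versions:

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask:
    the core block inside decoder-only transformers."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token similarities
    mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal = future tokens
    scores = np.where(mask == 1, -1e9, scores)       # hide the future (the "causal" part)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over visible positions
    return weights @ v                               # mix value vectors by attention weight

# Toy usage: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```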

Gemini Pro vs GPT-4 benchmarks for reasoning, math, code generation and other areas

For evaluating LLMs there is a set of standard tests and benchmarks typically used for an apples-to-apples comparison of models. As an example, the MMLU benchmark measures a model’s performance across 57 tasks (such as elementary mathematics, US history, computer science, and law), with 15,908 questions divided into a few-shot development set, a validation set, and a test set. HellaSwag, DROP, and MATH are some of the other commonly used benchmarks. We’ll use the published benchmarks displayed in the table below across nine different evaluations comparing Gemini Ultra, Gemini Pro, GPT-4 and GPT-3.5.
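
Under the hood, these evaluations are conceptually simple. Here’s a hedged sketch of an MMLU-style multiple-choice scoring loop; `ask_model` is a hypothetical stand-in for whichever model API you are scoring:

```python
def evaluate_mmlu(questions, ask_model):
    """Score a model on MMLU-style multiple-choice questions.

    `ask_model` is a hypothetical callable: prompt string in, answer string out.
    """
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{letter}. {choice}" for letter, choice in zip("ABCD", q["choices"])
        )
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt)
        if reply.strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)

# Toy example in the MMLU format (question, four choices, gold letter):
sample = [{
    "question": "What is 7 * 8?",
    "choices": ["54", "56", "64", "72"],
    "answer": "B",
}]
# accuracy = evaluate_mmlu(sample, ask_model=my_llm_call)  # my_llm_call is yours to supply
```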

MMLU, HellaSwag and other benchmark comparisons for Google Gemini Pro vs Gemini Ultra vs GPT-4

Looking at the benchmarks, Google Gemini Ultra is a serious competitor in the AI arena. Its achievement on the Massive Multitask Language Understanding (MMLU) test is particularly noteworthy: an astounding 90.0%, indicating broad knowledge across many disciplines. Gemini Ultra also performs exceptionally well on multi-step reasoning problems, as demonstrated by its Big-Bench Hard score of 83.6%. It exhibits good performance in both reading comprehension (DROP) and commonsense reasoning (HellaSwag), with the latter coming in below GPT-4. With scores of 74.4% on Python code generation and 53.2% on tough math questions, it is equally adept at difficult quantitative problems. Of special importance is Gemini’s multimodal training approach, which incorporates text, images, audio, and more, using a large amount of proprietary data from Google’s wide range of services and Google’s powerful TPUv5 chips. Because of this, Gemini has a distinct advantage in generating content that is more logical, fluid, relevant, and adaptive.

However, GPT-4 is a titan in the realm of generative AI. It is renowned for its capacity to produce prose that is both cohesive and fluent across a wide range of topics. GPT-4’s results are noteworthy on benchmarks such as DROP (80.9%), Big-Bench Hard (83.1%), and MMLU (86.4%). It posts a higher commonsense reasoning score than Gemini (95.3% on HellaSwag). In code generation, GPT-4 performs admirably, though Gemini Ultra edges it out (74.4% vs 67.0% on HumanEval). Despite these outstanding accomplishments, GPT-4 still has issues with high operating costs, ethical dilemmas, and a lack of transparency in its results. We’ll do an evaluation against real-world use cases and post a separate article later.

The comparison of these AI models should take into account their potential influence on several areas, not only the numbers. With these models able to generate original text, graphics, and more, the emergence of generative AI represents a significant change in how content is produced. But despite these advantages, both Gemini and GPT-4 struggle with challenges including resource costs, environmental impact, ethical dilemmas, and the difficulty of maintaining content quality.

Here’s a detailed assessment of each benchmark and how the Gemini and GPT-4 multimodal LLMs perform (a short sketch after the list tabulates these scores):

  • Multiple-choice Questions (MMLU): Gemini Ultra leads with 90.04%, followed by GPT-4 with 87.29%. Gemini Pro is at 79.13%. These models are well-suited for tasks that require understanding nuanced differences between options.
  • Grade-school Math (GSM8K): Gemini Ultra again performs the best at 94.4%, with GPT-4 at 92.0%. Gemini Pro has 86.5%. These models would excel in educational applications, particularly in teaching and learning environments.
  • Math Problems (MATH): Gemini Ultra shows superiority in math problems with 53.2%, GPT-4 has 52.9%, and Gemini Pro lags at 32.6%. They can be used in computational contexts where mathematical reasoning is required.
  • BIG-Bench-Hard: Gemini Ultra scores 83.6%, GPT-4 83.1%, and Gemini Pro 75.0%. These models could be used in complex problem-solving tasks that involve understanding and generating natural language.
  • Python Coding (HumanEval): Gemini Ultra leads with 74.4%, with Gemini Pro at 67.7% and GPT-4 close behind at 67.0%. These models are suited for software development assistance and educational tools for learning to code.
  • Natural Language to Code (Natural2Code): Gemini Ultra performs at 74.9%, with GPT-4 at 73.9% and Gemini Pro at 69.6%. They are particularly useful in programming-related fields for code generation and understanding.
  • Reading Comprehension (DROP): Gemini Ultra leads with 82.4%, GPT-4 follows at 80.9%, and Gemini Pro is at 74.1%. These models can be implemented in applications that require reading and understanding complex texts.
  • Common-sense Multiple Choice (HellaSwag): GPT-4 leads with 95.3%, followed by Gemini Ultra at 87.8% and Gemini Pro at 84.7%. These models could be beneficial in systems that require a high level of common-sense reasoning.
  • Machine Translation (WMT23): Gemini Ultra has 74.4%, GPT-4 is not listed, and Gemini Pro is at 71.7%. These models could be used in multilingual translation services and international communication tools.
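
As promised above, here’s a short sketch that tabulates the scores quoted in this list (values are from this article; WMT23 has no published GPT-4 number) and prints the leader per benchmark:

```python
scores = {
    "MMLU":           {"Gemini Ultra": 90.04, "GPT-4": 87.29, "Gemini Pro": 79.13},
    "GSM8K":          {"Gemini Ultra": 94.4,  "GPT-4": 92.0,  "Gemini Pro": 86.5},
    "MATH":           {"Gemini Ultra": 53.2,  "GPT-4": 52.9,  "Gemini Pro": 32.6},
    "BIG-Bench-Hard": {"Gemini Ultra": 83.6,  "GPT-4": 83.1,  "Gemini Pro": 75.0},
    "HumanEval":      {"Gemini Ultra": 74.4,  "GPT-4": 67.0,  "Gemini Pro": 67.7},
    "Natural2Code":   {"Gemini Ultra": 74.9,  "GPT-4": 73.9,  "Gemini Pro": 69.6},
    "DROP":           {"Gemini Ultra": 82.4,  "GPT-4": 80.9,  "Gemini Pro": 74.1},
    "HellaSwag":      {"Gemini Ultra": 87.8,  "GPT-4": 95.3,  "Gemini Pro": 84.7},
    "WMT23":          {"Gemini Ultra": 74.4,  "Gemini Pro": 71.7},  # GPT-4 not listed
}

for bench, results in scores.items():
    leader = max(results, key=results.get)
    print(f"{bench}: {leader} leads at {results[leader]}%")
```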

Each model has its strengths and ideal application areas, and the choice of which to use would depend on the specific requirements of the task at hand.

Summary

To sum up, the introduction of Google Gemini and the continued development of GPT-4 represent important turning points in the AI landscape. Whether they will fundamentally change the experiences we know today remains to be seen. If you are looking to leverage any of these models for building your own experiences, here’s a summary of which model performs best:

  • Google Gemini Pro: Comparable to OpenAI GPT-3.5. It has multimodal capabilities that GPT-3.5 lacks, but its overall performance and reasoning trail GPT-4. It may be best for applications where a very high degree of accuracy is not critical, or where tasks are more constrained and don’t require the cutting-edge capabilities of the more advanced models.
  • Google Gemini Ultra: Appears to be the most versatile and robust across different types of tasks. It would be best suited for applications that require high accuracy and a broad understanding of various domains, such as advanced tutoring systems, complex problem-solving tools, and versatile AI assistants. At least based on the benchmarks, it looks close to or better than OpenAI GPT-4. Since it’s not released yet, we’ll wait and see how it actually does.
  • OpenAI GPT-4: A significant capability enhancement over its predecessor, and currently the most capable model publicly available. GPT-4 shows particular strength in tasks requiring common-sense reasoning and reading comprehension, making it ideal for applications in content creation, summarization, and advanced customer service bots that need to understand and process large volumes of text.

If you’ve tested these models yourself, let us know your thoughts. And if you are building your own LLM applications, please try Bind!

 

