Is Google's new AI model Gemini truly superior to ChatGPT?
"Generative AI"

Google DeepMind recently announced Gemini, its new AI model to compete with OpenAI's ChatGPT. Both are examples of "generative AI," which learns to find patterns in its input training data in order to generate new data (pictures, words, or other media); ChatGPT is a large language model (LLM), which focuses on producing text.

Just as ChatGPT is a conversational web app built on the neural network known as GPT (trained on huge volumes of text), Google offers a conversational web app called Bard, which was based on a model called LaMDA (trained on dialogue). However, Google is now upgrading Bard based on Gemini.
What sets Gemini apart from earlier generative AI models such as LaMDA is that it is a "multi-modal model." This means it works directly with a variety of input and output formats: text, images, audio, and video. Consequently, a new abbreviation is surfacing: LMM, or large multimodal model, not to be confused with LLM.
In September, OpenAI unveiled GPT-4 Vision, a model that can work with images, audio, and text. However, unlike Gemini, it is not a fully multimodal model.
For instance, while ChatGPT-4 (which is powered by GPT-4V) can operate with audio inputs and generate speech outputs, OpenAI has confirmed that it does so by converting speech to text on input using a separate deep learning model called Whisper, and by converting text back to speech on output using yet another model. In other words, GPT-4V itself works purely with text.

Likewise, ChatGPT-4 can generate images, but it does so by creating text prompts that are passed to DALL-E 2, a separate deep learning model that turns written descriptions into images.
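To make the distinction concrete, here is a minimal sketch of what such a composed pipeline looks like when stitched together by a developer, assuming the OpenAI Python SDK; the model names, file names, and prompts are illustrative, not a description of ChatGPT's internal code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Speech in: a separate model (Whisper) converts audio to text.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Reasoning: the language model itself works purely with text.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Speech out: another separate model turns the text back into audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.stream_to_file("answer.mp3")

# 4. Images: the language model writes a text prompt that is handed to DALL-E.
image = client.images.generate(model="dall-e-2", prompt=answer, n=1, size="1024x1024")
print(image.data[0].url)
```

Each step hands off to a different specialised model, which is what "not fully multimodal" means in practice.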
Google, on the other hand, designed Gemini to be "natively multimodal." This means the core model directly handles a range of input types (text, images, audio, and video) and can directly generate them as output too.
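By contrast, a natively multimodal model accepts mixed media in a single call, with no transcription or captioning model in between. A minimal sketch, assuming the google-generativeai Python package and the publicly available Gemini Pro Vision model; the file name and prompt are illustrative:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # illustrative placeholder

# One model call receives an image and text together -- no intermediate
# speech-to-text or prompt-to-image model is involved.
model = genai.GenerativeModel("gemini-pro-vision")
photo = Image.open("cups_and_ball.jpg")
response = model.generate_content([photo, "Which cup is the ball under?"])
print(response.text)
```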
The Verdict
The differences between these two approaches may sound academic, but they are significant. According to Google's technical report and other qualitative testing conducted so far, the currently available public version of Gemini, known as Gemini 1.0 Pro, is generally not as good as GPT-4 and is closer in capability to GPT-3.5.
Google also released results suggesting that Gemini 1.0 Ultra, a more powerful version of the model, surpasses GPT-4. These claims are difficult to assess for two reasons. First, Google has not yet released Ultra, so the results cannot currently be independently verified.
Second, Google released a somewhat misleading demonstration video, which makes the company's claims harder to evaluate. In the video, the Gemini model appears to provide smooth, engaging commentary on a live video feed.
As Bloomberg first reported, however, the demonstration was not carried out in real time. The model had reportedly been trained beforehand on specific tasks, such as the three-cups-and-ball trick, in which Gemini keeps track of which cup the ball is under. To accomplish this, it was given a sequence of still images showing the presenter's hands swapping the cups.
Optimistic Outlook
Despite these problems, I believe Gemini and large multimodal models are the most exciting developments in generative AI. That is both because of what they may become capable of and because of the competitive landscape of AI technology. As I noted in a recent piece, GPT-4 was trained on roughly 500 billion words, essentially all of the good-quality, publicly available text there is.
The performance of deep learning models is often driven by increasing model complexity and the amount of training data. Since we have nearly run out of fresh training data for language models, this raises the question of how further advances can be made. Multimodal models, however, open up vastly expanded reservoirs of training data in the form of images, audio, and video.
AIs like Gemini, which can be trained directly on all of this data, are likely to have far greater capabilities in the future. For example, I expect that models trained on video will develop sophisticated internal representations of "naïve physics": the basic understanding humans and animals have about gravity, motion, causation, and other physical phenomena.
Google's Gemini also marks the emergence of a major competitor that will help push the field forward. OpenAI is almost certainly developing GPT-5, and we can expect it, too, to be multimodal and to demonstrate remarkable new capabilities.
All of that said, what I am most eager to see is the emergence of large multimodal open-source and non-commercial models, which I expect in the coming years.
A few aspects of how Gemini is being rolled out also appeal to me. For instance, Google has announced Gemini Nano, a much lighter version that can run directly on mobile phones.
Lightweight models like this have significant privacy benefits and lessen the environmental impact of AI computing, and I have little doubt that competitors will follow suit.
About the Creator
Raymond Inosanto
Raymond is a Filipino writer whose life and writing are intertwined in a narrative of courage, vulnerability, and resilience.

