Marketing material, press releases, and other showreels from brands demonstrating their products in action are usually best taken with a pinch of salt. This holds true for creations by major tech companies like Google, Microsoft, and Amazon as well. Google has made major strides in AI this year, with the Gemini multimodal AI model being the latest addition to its portfolio. However, the hands-on video Google shared for this tech may not be as truthful as the company wants you to believe.
Earlier this week, Google launched Gemini after keeping us on tenterhooks since the initial announcement way back in January this year. This generative AI model is Google’s response to OpenAI’s latest GPT-4 model. Google’s solution is available in a variety of “sizes” — Ultra, Pro, and Nano. Google says the latter is small enough to run locally on the new Pixel 8 Pro. The key difference between Gemini and the AI model used for, say, Google Bard, is that the former can accept multimodal prompts. So, you can expect responses to prompts which combine images, text, audio, and video.
Google posted an impressive demo video on YouTube when it announced Gemini’s launch out of the blue. The video demonstrates several examples where the AI accepts multimodal prompts, shown in the top-down view on the left-hand side and assisted by a voice-over. The AI also seems brisk and responsive, unlike Bard and other models that keep you waiting a few seconds for even the simplest of responses. To Google’s credit, a disclaimer in the video description states that “latency has been reduced, and Gemini outputs have been shortened.”
However, Bloomberg spotted a Google blog post for developers quietly explaining that Gemini wasn’t prompted by the live video and voice-over we saw, but by a combination of screenshots from the video and textual prompts (via TechCrunch). For instance, one of the examples in the video asked Gemini to determine whether sticky notes depicting the sun, Saturn, and Earth were arranged in the correct order. The voice-over in the video only asked, “Is this the right order?” but the behind-the-scenes textual prompt was far more detailed:
Is this the right order? Consider the distance from the sun and explain your reasoning.
The additional context provided behind the scenes helped the AI, but it misrepresents how Gemini performs: a more detailed prompt naturally yields a more detailed response, so the casual spoken question in the video wouldn’t have produced the same result on its own. Google also substituted text prompts for the voice input shown in the demo, which doesn’t help its case. The rock, paper, scissors demo is another example where the hands-on video suggests a silent, intuitive video clip is prompt enough for Gemini. In reality, the actual prompt consisted of three separate images of a hand along with an obvious hint: “It’s a game.”
Technically, the prompts are still multimodal, but this revelation makes it immediately apparent that the Gemini hands-on video wasn’t actually hands-on, and it could set lofty expectations for how the AI will work. Google isn’t even hiding the discrepancy, with Oriol Vinyals, VP of Research and Deep Learning Lead at Google DeepMind, sharing the exact workflow and video on X (formerly Twitter). Moreover, the company didn’t say which version of Gemini was used for the demonstration.
Google doesn’t have a spotless record when it comes to upholding exacting standards for demonstrating and testing its products. We would rather try Gemini’s capabilities firsthand before deciding how it stacks up against rival tech from OpenAI.