Google Gemini is a large-scale language model trained jointly on image, audio, video, and text data, with the goal of combining strong generalist capabilities across modalities with cutting-edge understanding and reasoning performance. This deployment uses Gemini Pro on Vertex AI, which offers enhanced performance, scalability, and deployability.
Gemini is designed to process and reason across different input types, including text, images, video, and code. On the Labelbox platform, Gemini supports a wide range of image and language tasks, such as text generation, question answering, classification, visual understanding, and mathematical reasoning.
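As a sketch of how such tasks reach the model, a text-generation call to Gemini's `generateContent` endpoint carries a list of role-tagged content parts plus an optional generation config. The helper below builds that request body; the field names follow the public Gemini API, but treat the exact shape and the temperature default as assumptions for illustration rather than a definitive client implementation.

```python
import json

def build_generate_content_request(prompt: str, temperature: float = 0.2) -> str:
    """Build a minimal JSON body for a Gemini `generateContent` request.

    Field names follow the public Gemini API; the generationConfig value
    here is an illustrative default, not a recommendation.
    """
    body = {
        # Each message is a role-tagged list of parts; a text-only prompt
        # needs a single user turn with one text part.
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]}
        ],
        "generationConfig": {"temperature": temperature},
    }
    return json.dumps(body)

request_body = build_generate_content_request(
    "Summarize the MMLU benchmark in one sentence."
)
```

Multimodal inputs extend the same structure: an image or video part is added alongside the text part within the same `parts` list.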
Gemini is Google’s largest and most capable model to date. It is the first AI model to surpass human experts on the Massive Multitask Language Understanding (MMLU) benchmark, and it reports state-of-the-art performance on multimodal tasks.
There is a continued need for research and development on reducing the “hallucinations” generated by LLMs. LLMs also struggle with tasks that require high-level reasoning, such as causal understanding, logical deduction, and counterfactual reasoning.
Technical report: Gemini: A Family of Highly Capable Multimodal Models