DataGemma: open models linking LLMs to Google’s Data Commons

Large language models (LLMs), which drive today’s advances in artificial intelligence, are becoming increasingly sophisticated. These models can sift through large amounts of text, generate summaries, suggest new ideas, and even draft programming code. Despite these capabilities, LLMs sometimes confidently present information that is inaccurate. This problem, known as “hallucination,” is a significant challenge in generative AI.

Today, we are sharing research that addresses this challenge directly by grounding LLMs in real-world statistical data to reduce hallucination. Alongside this research, we are introducing DataGemma, the first open models designed to connect LLMs with extensive real-world data drawn from Google’s Data Commons.

Data Commons: An extensive repository of publicly accessible, reliable data

Data Commons is a publicly available knowledge graph containing over 240 billion data points across hundreds of thousands of statistical variables. It sources this public data from trusted organizations such as the United Nations (UN), the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and census bureaus. By combining these datasets into a unified set of tools and AI models, Data Commons helps policymakers, researchers, and organizations gain accurate insights.

Think of Data Commons as a vast, continuously growing repository of reliable, publicly available data on a wide range of subjects, including health, economics, demographics, and the environment. You can explore this information in plain language through our AI-driven natural language interface. For instance, you can ask which countries in Africa have seen the greatest increase in electricity access, how income relates to diabetes in US regions, or any other question that interests you.

Utilizing Data Commons to combat hallucination

As generative AI adoption grows, we aim to ground those experiences by integrating Data Commons into Gemma, our family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. These DataGemma models are now available to researchers and developers.

DataGemma will expand the capabilities of Gemma models by drawing on the knowledge in Data Commons to improve LLM factuality and reasoning through two distinct approaches:

1. RIG (Retrieval-Interleaved Generation) enhances the capabilities of our language model, Gemma 2, by proactively querying trusted sources and fact-checking against the information in Data Commons. When prompted to generate a response, DataGemma is programmed to identify instances of statistical data and retrieve the corresponding answer from Data Commons. While the RIG methodology is not new, its specific application within the DataGemma framework is unique.
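
To make the RIG flow concrete, here is a minimal sketch in Python. It is not the DataGemma implementation. It assumes the model emits a hypothetical inline marker of the form [DC("query") -> "guess"] wherever it states a statistic, and it uses a small in-memory dictionary (FAKE_DATA_COMMONS, with placeholder values) as a stand-in for Data Commons. The marker format, function names, and data are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for Data Commons: a tiny in-memory table keyed by
# natural-language query. The values are placeholders, not real statistics.
FAKE_DATA_COMMONS = {
    "population of Exampleland": "1000",
}

# Hypothetical marker format: [DC("query") -> "the model's own guess"]
MARKER = re.compile(r'\[DC\("([^"]+)"\)\s*->\s*"([^"]*)"\]')


def retrieve(query):
    """Return the stored value for a query, or None if nothing matches."""
    return FAKE_DATA_COMMONS.get(query)


def interleave(draft):
    """Replace each marker with the retrieved value.

    When retrieval fails, keep the model's guess and flag it as unverified.
    """
    def substitute(match):
        query, guess = match.group(1), match.group(2)
        value = retrieve(query)
        if value is None:
            return f"{guess} (unverified)"
        return value

    return MARKER.sub(substitute, draft)


if __name__ == "__main__":
    draft = (
        'Exampleland has [DC("population of Exampleland") -> "900"] residents '
        'and [DC("exports of Exampleland") -> "5"] major exports.'
    )
    print(interleave(draft))
```

In this sketch the first marker is replaced by the stored value, while the second has no match in the stand-in table, so the model's own figure is kept and flagged. A real system would also need to map each query to the right statistical variable and entity, which this example deliberately leaves out.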

https://blog.google/technology/ai/google-datagemma-ai-llm/