12/13/2024 | News release | Distributed by Public on 12/13/2024 16:25
How C3 AI designed a generative AI agent solution for efficient question-answering of structured data
By Jack Lin, Lead Data Scientist, C3 AI
Making informed business decisions often replies on extracting insights from structured data - highly organized information stored in databases. While this data holds immense value, analyzing it can be time consuming and complex. Typically, this involves crafting intricate database queries, which even skilled analysts may spend hours or days refining, ultimately delaying decision making.
Recent advances in generative AI are revolutionizing how we interact with structured data. Large language models (LLMs) can now translate natural language queries into database queries (text-to-SQL for relational databases), retrieving and processing data faster than ever before. Once the queries are executed, LLMs can process the resulting tables to provide clear, actionable answers to the original questions. Yet, creating a production-ready, scalable generative AI application for question-answering structured data presents unique challenges.
The C3 AI Structured DB Agent, part of C3 Generative AI, solves these challenges by simplifying database access and delivering advanced analytics. With this tool, organizations can unlock the full potential of their structured data and make decisions faster and more effectively.
Building a reliable AI-powered solution for structured data requires overcoming the following hurdles:
The C3 AI Structured DB Agent is a powerful multi-hop system designed to navigate these challenges and deliver precise, actionable answers. Here's how it works:
The agent converts natural language queries into database queries using LLMs, leveraging context and domain-specific knowledge to improve accuracy. Given a user query, the agent retrieves the most relevant few-shot examples from long-term memory and the most relevant data tables and columns from the C3 AI Data Model. This information is then sent to the LLM for synthesizing the database query.
If query terms don't perfectly match database entries, the agent applies fuzzy matching to align them. For example, after fuzzy matching, the query "How many events happened in the US?" would have the filter string "U.S." matching the data value in the database. The database query is then executed, and the table data is sent to the LLM.
The agent processes multi-step queries and performs necessary calculations, integrating data from multiple sources when required. Depending on the user queries, the agent may generate and execute Python code to pre-process the database query and post-process the table.
If errors occur during execution, the agent self-corrects using feedback and retries until the task is successfully completed.
The results are presented as text summaries, tables, and visualizations, providing easy-to-interpret insights.
The agent is powered by the C3 AI Data Model, which standardizes relationships between data elements and the C3 AI Unified Data Lake, which consolidates fragmented data sources into a single, central view. These features eliminate the need for multiple query languages and ensure efficient integration across diverse data environments. Without the C3 AI Platform and model-driven architecture, these features would not be possible.
The C3 AI Structured DB Agent is built to handle complex datasets while ensuring secure and reliable operations.
The agent uses a reflection mechanism to address program errors and refine outputs. It observes validation scripts, other LLMs, and even humans, and adjusts its approach as needed. This type of mechanism ensures robust performance, though there's a tradeoff between accuracy and speed. We can configure the system to limit the number of reflections or total time spent before stopping, ensuring a balance between precision and efficiency.
To scale across databases with numerous tables and columns, the agent uses a retrieval-augmented (RAG) approach. Documentation of data model is stored in a vector store, allowing the agent to retrieve only the most relevant tables and columns to for each query. This avoids hitting the context window limit and reduces noise, improving accuracy.
To address ambiguity in queries, domain-specific information and few-shot examples are also stored in the system's long-term memory. When a query is processed, the agent retrieves the most relevant examples to construct prompts tailored to the task. These examples can include small Python scripts for specific operations, atomic tasks, or feedback-driven refinements.
To ensure security and reliability, the agent implements robust guardrails. Inputs (including user queries) and outputs are carefully examined to prevent prompt attacks, filter harmful or inappropriate responses, and protect sensitive data such as personally identifiable information (PII). These guardrails make the agent enterprise-ready, capable of operating in production environments.
Robust guardrails are implemented to ensure securityThe C3 AI Structured DB Agent has been benchmarked using the Defog text-to-SQL dataset and demonstrated superior performance compared to general-purpose LLMs, even without fine-tuning. This highlights its optimized design for structured data applications.
Benchmarking C3 AI's Performance with Defog Text-to-SQLOne multinational food company adopted C3 Generative AI to simplify its complex data analysis processes. C3 Generative AI, built with the C3 AI Structured DB Agent, was able to quickly aggregate and analyze key metrics across facilities and products, producing analytical insights, including monthly trends, moving averages, metric correlations, and facility outliers. C3 Generative AI was able to answer questions and prompts like:
Key Benefits:
The C3 AI Structured DB Agent represents a significant advancement in the field of question-answering with structured data. By leveraging state-of-the-art techniques, it addresses many of the inherent challenges associated with querying and extracting insights from complex databases. The agent's ability to handle single-hop/multi-hop and mathematical/statistical queries ensures that users receive accurate and multimodal answers to their queries efficiently.
With robust security, seamless scalability, and the ability to deliver real-time insights, the agent - and C3 Generative AI - empowers organizations to make smarter, faster decisions.
Learn how C3 Generative AI produced a 90% time savings while providing a 90%+ in response accuracy for a large multinational manufacturing group.
Jack Lin is a Lead Data Scientist on the Generative AI Data Science team at C3 AI, where he leads and develops advanced applications for question-answering across both structured and unstructured data sources. His current work focuses on leveraging Plan & Execute Agents, Structured Database Agents, and Retrieval-Augmented Generation (RAG) to build performant, robust, and scalable generative AI solutions. Jack is also an active voice in the generative AI field, sharing insights through articles on Towards Data Science on Medium (https://medium.com/@jacklingenai). He received his Ph.D in Quantitative Computational Biology from Baylor College of Medicine.