10/14/2024 | News release

AI and the future of unstructured data

Data's the gas that makes the AI engines hum. And many companies aren't taking full advantage of the treasure trove of unstructured data at their fingertips because they're not sure how to fill the tank.

That's why businesses that have the tools to process unstructured data are catching investors' attention. Just last month, Salesforce made a major acquisition to power its Agentforce platform, one of a number of recent investments in unstructured data management providers.

"Gen AI has elevated the importance of unstructured data, namely documents, for RAG as well as LLM fine-tuning and traditional analytics for machine learning, business intelligence and data engineering," says Edward Calvesbert, Vice President of Product Management at IBM watsonx and one of IBM's resident data experts. "Most data being generated every day is unstructured and presents the biggest new opportunity."

We wanted to learn more about what unstructured data has in store for AI. So we sat down with Calvesbert and Dave Donahue, Head of Strategy for data science company Unstructured, which closed a $40 million investment round with IBM, Nvidia and Databricks in March, to get their take on unstructured data's importance, and where it's headed next.

Q: Is a company's unstructured data more valuable than structured data when implementing AI?

Edward Calvesbert, IBM: Unstructured data (language, images, et cetera) is the "new" data that foundation models feed on and can help interpret, so it's what's in focus right now. But just like with structured data, unstructured data has to be governed: classified, assessed for quality, filtered for PII (personally identifiable information) and objectionable content, and deduplicated. So successful strategies will apply many of the traditional structured data management capabilities to unstructured data.
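For a sense of what that governance pass can look like in code, here is a minimal, illustrative sketch. The quality gate, PII patterns and helper names are assumptions for the example, not IBM's implementation:

```python
import hashlib
import re

# Illustrative PII patterns; a real pipeline would use a dedicated
# PII-detection service rather than a handful of regexes.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN-style numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email addresses
]

def redact_pii(text: str) -> str:
    """Mask anything matching a PII pattern."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates by content hash; near-duplicate detection
    (e.g. MinHash) would slot in here in a production pipeline."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def govern(docs: list[str]) -> list[str]:
    """Apply the steps named above: filter for quality, redact, dedupe."""
    kept = [d for d in docs if len(d.split()) > 5]  # crude quality gate
    return dedupe([redact_pii(d) for d in kept])
```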

Dave Donahue, Unstructured: Unstructured data is not inherently more valuable than structured data, but generally speaking, large organizations produce four times as much unstructured data as structured data. So the question is, do you want to be using more of your data, and especially human-generated unstructured data, when implementing AI? The answer should be a resounding "Yes."

Q: For AI to be successful, it obviously needs "good" data. But what does that look like in practice?

Calvesbert: "Good enough" is a moving target and depends on the use case. A knowledge base for RAG to improve the semantic search, Q&A and summarization for customer support agents requires the document knowledge base to be complete, accurate and fresh. Data for fine-tuning a model requires a set of human-curated examples of prompt/response pairs. Documents processed into tables or graph databases to drive analytical use cases require effective extraction of entities or values. In almost all cases, the data needs to be classified, filtered and governed in the context of the lifecycle of the use case.

Donahue: At the enterprise or company level, "good" data is clean, structured and enriched. This preprocessing pipeline should minimize information loss between the original content and the LLM-ready version. Unstructured enables companies to transform their unstructured data into a standardized format, regardless of file type, and enrich it with additional metadata. This allows organizations to mitigate the three key challenges they grapple with when using LLMs: they're frozen in time, they tend to make things up, and they know nothing about your specific organization straight out of the box.
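Donahue's company maintains the open-source unstructured Python library, which implements this kind of preprocessing. A minimal sketch of the pattern he describes (exact APIs and extras vary by version, so treat this as indicative rather than definitive):

```python
# Requires: pip install "unstructured[all-docs]"
from unstructured.partition.auto import partition

# partition() detects the file type and returns a list of typed elements
# (Title, NarrativeText, Table, ...): one standardized format regardless
# of whether the input was a PDF, HTML page or Word document.
elements = partition(filename="quarterly_report.pdf")  # placeholder file

# Each element carries text plus metadata (source file, page number, ...)
# that can be stored alongside embeddings to ground an LLM's answers.
records = [
    {"type": el.category, "text": el.text, "metadata": el.metadata.to_dict()}
    for el in elements
]
```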

Related: Build a modern data architecture

Q: Can you walk us through a use case where a company was sitting on a gold mine of unstructured data, but hadn't figured out how to harness it with AI? What difference did implementing AI make?

Calvesbert: A major telecommunications client that we worked with started with an internal knowledge base for customer support agents, which reduced the time required to get an answer to clients and improved the accuracy of that answer. It spread organically, like wildfire, within the call center, at which point the company had to step back and start working on governance and price performance. Internally, we've implemented a marketing-automation use case where IBM's brand guidelines and examples were ingested to generate new marketing content and curate it for consistent quality and tone.

Donahue: We are working with a global consumer packaged goods company to help them develop new product ideas. You may ask, "What does that have to do with unstructured data?" Well, historically, it would take marketing and product teams months to analyze mountains of sales data, product feedback information and demographic information to generate new ideas or concepts they could test with the end users in those specific markets. What if we could help take that process from months to hours? What if we could generate new ideas for products that are grounded in the data that the teams could rapidly test?

That is the power of harnessing your unstructured data to create business value. Now, that CPG company is leveraging their data across several of their brands to develop and test new product ideas to bring to market.

Q: If a company doesn't have enough unstructured data, can they still implement AI? What should their next steps be?

Calvesbert: Every company has documents (think of what they provide new employees to onboard them), and that's enough to get started with RAG and semantic search.

Donahue: 80% of a company's data is unstructured, whether it's emails, memos, internal messaging platforms (like Slack or Microsoft Teams) or business presentations. The question is, what do you want to do with that data? Create efficiencies for engineers currently doing similar data cleaning work? Develop new product ideas based on sales and marketing data? There are countless possibilities and opportunities for AI. Identify an objective. Identify the data required. Start small.
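As one way to start small with the semantic search Calvesbert mentions, here is a minimal sketch over a handful of onboarding-style documents, using the sentence-transformers library and a small public embedding model; the documents and model choice are illustrative, and any embedding API could be swapped in:

```python
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# A handful of onboarding-style documents (illustrative content).
docs = [
    "New hires receive a laptop from IT support on day one.",
    "Expense reports are submitted through the finance portal by month end.",
    "Use the corporate VPN when accessing internal systems remotely.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public model
doc_vecs = model.encode(docs, normalize_embeddings=True)

def search(query: str, top_k: int = 2) -> list[str]:
    """Return the documents closest to the query by cosine similarity."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # dot product == cosine on normalized vectors
    return [docs[i] for i in np.argsort(scores)[::-1][:top_k]]

print(search("How do I get reimbursed?"))
# In a RAG setup, the hits would be prepended to the LLM prompt as context.
```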

Q: Have you seen any interesting trends in data and data management over the past year?

Calvesbert: I think lakehouse architectures and open table formats, namely Iceberg, have become mainstream and the dominant data management architecture for new data/workloads. Vector capabilities have been delivered natively in many operational/analytical databases so that gen AI workloads can be infused into existing applications. We're starting to see the industry realize that RAG alone isn't going to be enough for certain enterprise use cases that require additional contextualization based on non-obvious relationships (GraphRAG) and improved precision from transactional records (SQL-RAG). Clients are also realizing that implementing a user-authorization model that respects the access controls in place with enterprise content management systems is a critical challenge to overcome to scale gen AI across the enterprise.
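As one concrete illustration of vector capabilities landing inside an existing operational database, here is a sketch using PostgreSQL with the pgvector extension; the connection string, table schema and embedding dimension are placeholders, and other databases offer comparable features:

```python
# Requires: pip install psycopg, plus PostgreSQL with pgvector installed.
import psycopg

query_vec = "[" + ",".join(["0.1"] * 384) + "]"  # stand-in query embedding

with psycopg.connect("dbname=appdb user=app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id bigserial PRIMARY KEY, body text, embedding vector(384))"
    )
    # <=> is pgvector's cosine-distance operator: nearest-neighbor
    # retrieval runs in the same database as the operational data.
    rows = conn.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (query_vec,),
    ).fetchall()
```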

Donahue: We're beginning to see data science and machine learning engineering teams work more closely with data engineering teams. Data engineering teams have grown up around the rise of data warehousing and business intelligence applications over the last decade and historically have operated in the world of SQL, structured databases and business analytics processes designed for data analysts and C-suite consumers. As enterprises have leaned into LLMs, the appetite for large volumes of preprocessed data has exploded. However, these consumers tend to operate in the world of Python, vector databases and fast and disposable user interfaces. Over time, we expect mature data engineering teams to increasingly take on responsibility for supplying gen AI teams with enterprise-ready data.

Q: What are your predictions for data trends in 2025 and beyond?

Calvesbert: I think clients are looking to simplify their data estates and the associated costs and risks. To that end, multi-model databases and multi-engine lakehouse architectures will continue to successfully compete for workloads with siloed databases as clients look to consolidate on a reduced number of data platforms. Text-to-SQL models are getting very good, which will dramatically reduce the barrier to working with data for a broad range of use cases beyond business intelligence.
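A sketch of the text-to-SQL pattern Calvesbert describes: hand the model the schema, ask for a single read-only query, then execute it defensively. The llm_complete function is a hypothetical stand-in for whichever model endpoint a team uses, and the schema is illustrative:

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, placed_on DATE);"

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in: plug in your model provider's completion call."""
    raise NotImplementedError

def text_to_sql(question: str) -> str:
    """Ask the model for one SELECT statement grounded in the schema."""
    prompt = (
        "Given this SQLite schema:\n" + SCHEMA + "\n"
        "Write exactly one SELECT statement (no writes) that answers: "
        + question
    )
    return llm_complete(prompt)

def run_readonly(db_path: str, sql: str) -> list[tuple]:
    """Execute generated SQL defensively: SELECT-only, read-only connection."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are executed")
    # Open read-only so a bad generation cannot modify data.
    with sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) as conn:
        return conn.execute(sql).fetchall()
```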

Similarly, the proliferation of agents will infuse data into an exploding volume and variety of automated workflows. Some of these agentic workflows will revolutionize many knowledge worker activities and create exciting new opportunities. Imagine processing an internal or external conversation with clients and immediately mapping it to products in a catalog or to an opportunity record in a CRM system, including an automated assessment of progression status and propensity to close.

Donahue: In contrast to the modern data stack, in which Snowflake, BigQuery and Databricks established "data gravity" in the data warehousing space, we have yet to do the same for unstructured data. And since unstructured data is four times as voluminous as structured data and growing exponentially each year, the stakes couldn't be higher for the next generation of storage solutions for LLMs. The jury is still out on what combination of vector, graph, object or other types of storage will become dominant, and which vendors in each category will prevail, but the winners will likely be clear in the next 18 to 24 months.

eBook: How to leverage the right databases for gen AI