The Integration of LLMs in the Contemporary Data Stack: A 2023 Overview

When ChatGPT first launched over a year ago, it gave internet users an ever-ready AI assistant to converse and collaborate with. The tool handled numerous tasks, from generating natural language content such as essays to analyzing complex information. Before long, its rapid rise in popularity put a spotlight on the underlying technology: the GPT series of large language models (LLMs).

Fast forward to today, and LLMs, not just the GPT series but others as well, are pivotal in both individual tasks and large-scale business operations. Companies are now using commercial model APIs and open-source options to automate repetitive activities and improve efficiency across key functions. Imagine an AI helping marketing teams create ad campaigns or speeding up customer support by pulling relevant data at the right moment.

The impact has been remarkable. Yet there’s one area where the influence of LLMs has received less attention: the modern data stack.

LLMs transforming the data stack

Data is crucial for high-performing large language models. Properly trained models can assist teams in working with their data, whether for experimentation or complex analytics.

In the past year, as ChatGPT and similar tools expanded, companies offering data tools incorporated generative AI to streamline their workflows, making tasks easier for their clients. The aim was simple: harness the power of language models to improve customer experience in data handling, saving time and resources, and allowing clients to focus on more critical tasks.

Perhaps the most significant shift came when vendors rolled out conversational querying. This capability lets users get answers from structured data (organized in rows and columns) by asking questions in natural language. It removes the need to write intricate SQL queries and gives teams, including non-technical members, an intuitive text-to-SQL experience: the LLM translates the prompt into SQL, runs the query against the dataset, and returns the answer.
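The translate, execute, and answer loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fake_llm_to_sql` is a hypothetical stand-in for a real model call and simply returns canned SQL, the `sales` table is invented sample data, and the read-only check is a simplistic guard rather than anything a vendor is known to ship.

```python
import sqlite3


def fake_llm_to_sql(question: str) -> str:
    """Hypothetical stand-in for an LLM call: maps a question to canned SQL."""
    canned = {
        "total revenue by region": (
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY region"
        ),
    }
    # Normalize the question so trailing punctuation does not matter.
    return canned[question.lower().strip("? ")]


def answer(conn: sqlite3.Connection, question: str, translate=fake_llm_to_sql):
    """Translate a natural-language question to SQL, run it, return the rows."""
    sql = translate(question)
    # Simplistic safety check (an assumption): only run SELECT statements.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only read-only SELECT statements are executed")
    return conn.execute(sql).fetchall()


# Invented sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 30.0)],
)

rows = answer(conn, "Total revenue by region?")
print(rows)
```

In a real product, the translation step would call a model with the table schema in the prompt, and the generated SQL would need far stricter validation before it touches production data.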

Many vendors have introduced this capability. Notably, Databricks, Snowflake, Dremio, Kinetica, and ThoughtSpot have all launched relevant tools. Kinetica initially used ChatGPT but now employs its native LLM. Snowflake offers two tools: a copilot for conversational query interactions and a Document AI tool for extracting data from unstructured sources like images and PDFs. Databricks also entered this space with its ‘LakehouseIQ’.

Several startups are also targeting this domain. For instance, California-based DataGPT markets an AI analyst capable of running thousands of queries in real time and delivering results conversationally.

Helping with data management and AI efforts

Beyond generating insights and answers from text inputs, LLMs assist with manual data management and play a crucial role in developing robust AI products.

In May, Intelligent Data Management Cloud (IDMC) provider Informatica launched Claire GPT, a conversational AI tool based on multiple LLMs. It allows users to discover, interact with, and manage their IDMC data assets using natural language. Claire GPT handles data discovery, pipeline creation and editing, metadata exploration, and more.

To aid teams in building AI products, Refuel AI from California offers a specialized LLM for data labeling and enrichment. A paper published in October 2023 noted that LLMs are effective in de-noising datasets, a critical step in creating robust AI models.

LLMs also assist with data integration and orchestration in data engineering. They can generate the code these tasks require, such as converting diverse data types into a common format or connecting different data sources, and can even supply query templates for constructing Airflow DAGs.
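As a rough illustration of the "common format" case, the snippet below shows the kind of small normalization helper an LLM might be asked to draft. Everything in it is assumed for the example: the field names (`date`, `order_date`, `amount`, `total`), the two date formats, and the target schema are hypothetical and do not come from any vendor's tooling.

```python
from datetime import datetime

# Date formats we assume the two hypothetical sources use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_record(record: dict) -> dict:
    """Convert a record from either source into one common schema."""
    raw_date = record.get("date") or record.get("order_date")
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue
    else:
        # No format matched, so fail loudly instead of guessing.
        raise ValueError(f"unrecognized date format: {raw_date!r}")

    # Amounts may arrive as numbers or as strings with thousands separators.
    raw_amount = record.get("amount", record.get("total"))
    amount = float(str(raw_amount).replace(",", ""))

    return {"order_date": parsed.isoformat(), "amount": amount}


source_a = {"date": "2023-10-05", "amount": "1,200.50"}
source_b = {"order_date": "05/10/2023", "total": 99}

print(normalize_record(source_a))
print(normalize_record(source_b))
```

Generated code like this still needs human review; an ambiguous date such as 05/10/2023 parses differently depending on which format is assumed.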

Much more to come

Despite only a year in the spotlight, LLMs have already prompted significant changes in enterprises. As these models advance and innovation continues in 2024, we’ll see even more applications for language models across the enterprise data stack, including in the evolving field of data observability.

Monte Carlo, a notable vendor, has released Fix with AI, a tool that identifies data pipeline issues and suggests fixes. Acceldata also recently acquired Bewgle for LLM integration in data observability.

However, as these applications proliferate, ensuring that these language models perform accurately will be more critical than ever. Even minor errors can skew results and disrupt the customer experience.