The Revolutionary Impact of Combining Observability with Generative AI

We live in a digital world, where the ability to keep essential software systems and services running smoothly is crucial for businesses. Any downtime or performance issues can lead to significant problems, from losing potential customers to a competitor’s site to employees being unable to complete their work on time.

For site reliability engineers (SREs) and DevOps professionals, maintaining the reliability of critical websites and applications can seem like an endless struggle. However, there’s promising news: Generative AI, known for its intuitive Q&A interface, can enhance traditional observability methods, helping to solve reliability, security, and speed challenges more effectively and efficiently.

Traditionally, monitoring and observability involve identifying patterns and diagnosing issues so that SREs and DevOps teams can address unexpected problems. Generative AI can simplify and speed up this process, allowing these professionals to respond to incidents with greater agility and confidence.

Consider a newly hired on-call engineer who doesn’t yet fully understand all the systems in the organization. If they receive an alert about an unfamiliar system in the middle of the night, they can use an AI assistant to quickly get the information they need. By asking questions like “What is the purpose of this system?” or “What other systems does this one connect to?”, they receive immediate, contextual information summarized by a large language model (LLM).

The advantage here is that the engineer can interact with the LLM in plain English. They don’t need to know complex query languages or data models. They can simply ask questions as they would ask a more experienced colleague and receive instant answers that help them understand their environment and troubleshoot issues.

Generative AI does more than just answer questions; it can also proactively summarize information for SREs. For instance, an on-call engineer might receive a summary of an issue in their Slack channel, detailing all steps taken so far and who has been involved, even before an alert wakes them up. This means they can respond to problems almost immediately without spending time piecing together the situation.
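
To make the idea of a proactive summary concrete, here is a minimal Python sketch of how the raw material for such a digest might be assembled before it is handed to an LLM or posted to a chat channel. The `TimelineEvent` type and `build_incident_digest` function are hypothetical names invented for illustration; no real chat or LLM API is called, and a production system would likely pass this text to a model for summarization rather than use it as is.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    """One recorded step in an incident (hypothetical structure)."""
    timestamp: datetime
    actor: str
    action: str


def build_incident_digest(title: str, events: list[TimelineEvent]) -> str:
    """Assemble a plain-text digest: who was involved and what was done, in order."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    people = ", ".join(sorted({e.actor for e in ordered})) or "nobody yet"
    lines = [f"Incident: {title}", f"Involved: {people}", "Steps so far:"]
    for event in ordered:
        # Each step is shown with its time of day, the person, and the action.
        lines.append(f"  {event.timestamp:%H:%M} {event.actor}: {event.action}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        TimelineEvent(datetime(2024, 1, 1, 2, 15), "bob", "restarted checkout pods"),
        TimelineEvent(datetime(2024, 1, 1, 2, 5), "alice", "acknowledged alert"),
    ]
    print(build_incident_digest("Checkout latency", sample))
```

The digest is deliberately deterministic: the LLM's job in this picture is to condense and explain it, not to invent the timeline.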

LLMs can also provide an overview of the playbook used in similar past situations, allowing the engineer to quickly implement a solution or even instruct the LLM to execute the playbook. This approach leverages the entire organization’s knowledge base, enabling engineers to make effective decisions quickly, regardless of their experience level.

Companies like T-Mobile Netherlands are already using this technology to support their network operations, planning, and customer service teams, ensuring greater network availability and faster resolution of network issues.

Generative AI currently acts as a helpful assistant, explaining things and providing context. In the near future, it will evolve into an AI agent capable of automating many responses for engineers. For example, if the agent recognizes a recurring alert and knows the corresponding playbook, it can automatically execute the necessary actions and provide a summary to the engineer, potentially saving them from a sleepless night.
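
The agent behavior described above can be sketched in a few lines of Python. This is an illustrative assumption, not a description of any vendor's product: the `AlertAgent` class, the `min_occurrences` threshold, and the playbook callables are all hypothetical. The sketch runs a playbook automatically only when the alert has been seen before and a matching playbook exists; everything else is escalated to a human.

```python
from collections import Counter
from typing import Callable

# A playbook is modeled as a function that performs the fix and returns a short result.
Playbook = Callable[[], str]


class AlertAgent:
    """Hypothetical agent that auto-runs playbooks for recurring, known alerts."""

    def __init__(self, playbooks: dict[str, Playbook], min_occurrences: int = 2):
        self.playbooks = playbooks
        self.min_occurrences = min_occurrences
        self.seen: Counter = Counter()

    def handle(self, alert_key: str) -> str:
        self.seen[alert_key] += 1
        playbook = self.playbooks.get(alert_key)
        # Escalate if the alert is unknown or has not yet recurred often enough.
        if playbook is None or self.seen[alert_key] < self.min_occurrences:
            return f"ESCALATE {alert_key}: paging on-call engineer"
        result = playbook()
        # The returned string doubles as the summary sent to the engineer.
        return f"AUTO-RESOLVED {alert_key}: {result}"


if __name__ == "__main__":
    agent = AlertAgent({"disk_full": lambda: "rotated logs"})
    print(agent.handle("disk_full"))  # first occurrence is escalated
    print(agent.handle("disk_full"))  # recurrence triggers the playbook
```

Requiring a recurrence before acting is one conservative design choice; a real deployment would also want approval gates and audit logging around any automated action.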

As LLMs integrate observability data with other organizational data, such as ERP, financials, or security information, engineers will be able to ask more complex, business-critical questions. This could include inquiries like “What was the revenue impact the last time this incident occurred?” or “What was the operational impact on our supply chain?”
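
A question such as “What was the revenue impact the last time this incident occurred?” ultimately depends on joining incident time windows with business records. The Python sketch below shows one simple way that join might look, under stated assumptions: orders are `(timestamp, amount)` pairs, and impact is estimated by comparing revenue during the incident with revenue in the equally long window just before it. The function names and the baseline method are hypothetical; real analyses would account for seasonality and other factors.

```python
from datetime import datetime


def revenue_in_window(orders, start, end):
    """Sum order amounts with a timestamp in the half-open window [start, end)."""
    return sum(amount for ts, amount in orders if start <= ts < end)


def estimated_revenue_impact(orders, incident_start, incident_end):
    """Estimate lost revenue as baseline minus actual revenue during the incident.

    The baseline is the window of equal length immediately before the incident,
    which is a simplifying assumption made for this sketch.
    """
    duration = incident_end - incident_start
    baseline = revenue_in_window(orders, incident_start - duration, incident_start)
    during = revenue_in_window(orders, incident_start, incident_end)
    return baseline - during


if __name__ == "__main__":
    orders = [
        (datetime(2024, 1, 1, 9, 30), 100.0),
        (datetime(2024, 1, 1, 9, 45), 50.0),
        (datetime(2024, 1, 1, 10, 15), 40.0),
    ]
    impact = estimated_revenue_impact(
        orders, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)
    )
    print(f"Estimated revenue impact: {impact:.2f}")
```

An LLM sitting on top of such data would translate the engineer's plain-language question into a calculation like this and then explain the result.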

Generative AI offers a revolutionary new tool for observability professionals. It doesn’t replace SREs or DevOps teams but reduces the routine tasks they handle daily, allowing them to focus on higher-level problem-solving. By helping them quickly access the most relevant information and make better decisions, the combination of generative AI and observability data is more than just an advancement; it’s a game changer.

Abhishek Singh is GM, Observability at Elastic.