
From 6+ Clicks to 2: 0-1 Design of an Event Investigation Workflow, Saving Customers $350K/Hour
Overview:
Creating a user-centered, centralized dashboard for Event data investigation seemed like a relatively straightforward task. And sure, why wouldn’t users want a domain-agnostic view of all of the Events within their infrastructure?
But as I interviewed users and subject matter experts and ran workshops with the product manager and the engineering and domain leads, it became clear that a centralized view of Events alone would neither solve user pain nor delight our users. As a design leader, my goal is to advocate for our users without causing development bottlenecks. This initiative required a strategic approach so that I could achieve both. Keep reading to see how I made it happen…
Executive Summary:
Problem: Site Reliability Engineers at enterprise SaaS organizations are charged with responding to and resolving application performance issues when health rule violations occur or SLOs are otherwise not being met. However, at the time of this initiative, no dedicated workflow existed for this use case.
The Pain: Responding to an Alert is a complex process. Users must log into the Cloud Observability Platform, then form and test hypotheses about where the root cause lies by jumping between multiple tools and toiling over the details on each page to correlate Event data across microservice applications. Each minute of downtime can cost a company $5K, or ~$350K per hour.
“It took me 2 redeploys to figure this out; very expensive to redeploy as now all pods are image pull backoff…”
The Solution: I designed and delivered a unified Event Explorer that consolidated cross-domain Event investigation into a single, intelligent dashboard.
The Impact:
🎯 Reduced investigation interaction from 6 clicks to 2 clicks total, regardless of scale
📈 Eliminated SRE toil of entity-by-entity searching across tens or hundreds of microservices
✅ Achieved design and product consistency with unified search and filtering, replacing independent searches for each event type
🚀 Delivered consistent performance at any scale of system complexity
💡 Influenced domain teams to adopt a front-end data onboarding schema to achieve unified observability
Logistics
Role: Lead Product Designer
Duration: 4 weeks MVP, followed by iterative improvements
Team: Collaborated with Platform PM, 8 domain engineers, TPM stakeholders
My Responsibilities:
End-to-end UX strategy and research
Cross-functional stakeholder alignment
Technical constraint navigation and data schema influence
Design system implementation and delivery
Approach
This initiative was born of dated user feedback and requests for a domain-agnostic view of Event data. The PM rightly recognized that this was a common feature among our competitors, and proposed a data table as part of the requirements. But it was through my lean UX and research approach that I uncovered the real issue.
Research Insight: Users don't need to see the breakdown; a centralized view of Events won’t add enough value. SREs want certainty about what to do next.
On the surface, I was charged with creating an Event explorer feature with a domain-agnostic view of Events. However, through contextual inquiries, user interviews, and competitive research, I discovered an opportunity to future-proof our product.
I proposed a user-centered design strategy that would standardize Event data as it enters the UI, so that users can intuitively assess patterns, determine the most efficient plan for resolution, and correlate across MELT data in context.
Why This Mattered:
Before, users couldn't query across domains (e.g., “Show me all AIOps Events” or “Show me all Licensing and Audit Events”). Additionally, there were no dedicated cross-MELT (Metrics, Events, Logs, Traces) correlation workflows. All of this meant that users needed to be domain architecture experts to effectively search for the desired group of Events.
This insight shifted my approach from designing a data table interface to enabling intuitive investigation workflows.
Design Goal: Simplify navigational complexity and provide cross-MELT data in the context of each Event.
Design Strategy
1. Deliver Immediate User Value (4-week MVP)
Unified data table with standardized presentation of Events within monitored environment
Front-end data transformation to support user-centered filtering and data table design
2. Fast-Follow UX Enhancements (Phase 2)
ML-powered insights for pattern recognition and action-oriented insight generation
Predictive filtering based on investigation patterns
Contextual navigation to related entities and source entities
Design Decisions
FRONT END DATA SCHEMA TRANSFORMATION
Before: Independent searches for each unique event type across multiple entity pages
After: Unified search and filter refinement across all events simultaneously
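The before/after above can be illustrated with a minimal TypeScript sketch of a front-end transformation. It is not the product's actual schema: the domain labels, field names, and the `normalize` and `filterEvents` helpers are all hypothetical, chosen only to show how per-domain events might be mapped into one shape so that a single filter works across all of them.

```typescript
// All names and fields below are hypothetical, for illustration only.
type Severity = "info" | "warning" | "critical";

// A raw event as a domain might emit it: each domain uses its own field names.
interface RawEvent {
  domain: string;
  fields: Record<string, string>;
}

// The single shape that the table, search, and filters work against.
interface UnifiedEvent {
  domain: string;
  eventType: string;
  severity: Severity;
  entityId: string;
  message: string;
}

// Per-domain mapping from unified field names to that domain's raw field names.
interface FieldMap {
  eventType: string;
  severity: string;
  entityId: string;
  message: string;
}

const fieldMaps: Record<string, FieldMap> = {
  k8s: { eventType: "reason", severity: "level", entityId: "pod", message: "note" },
  apm: { eventType: "kind", severity: "sev", entityId: "service", message: "summary" },
};

// Anything that is not explicitly "critical" or "warning" falls back to "info".
function toSeverity(value: string | undefined): Severity {
  if (value === "critical") return "critical";
  if (value === "warning") return "warning";
  return "info";
}

// Map one raw event into the unified shape; unknown domains are skipped.
function normalize(raw: RawEvent): UnifiedEvent | undefined {
  const map = fieldMaps[raw.domain];
  if (!map) return undefined;
  return {
    domain: raw.domain,
    eventType: raw.fields[map.eventType] ?? "unknown",
    severity: toSeverity(raw.fields[map.severity]),
    entityId: raw.fields[map.entityId] ?? "unknown",
    message: raw.fields[map.message] ?? "",
  };
}

interface EventFilter {
  domains?: string[];
  severity?: Severity;
  text?: string;
}

// One filter applied to every event, regardless of the domain it came from.
function filterEvents(events: UnifiedEvent[], f: EventFilter): UnifiedEvent[] {
  const text = f.text ? f.text.toLowerCase() : undefined;
  return events.filter(
    (e) =>
      (!f.domains || f.domains.indexOf(e.domain) !== -1) &&
      (!f.severity || e.severity === f.severity) &&
      (!text || e.message.toLowerCase().indexOf(text) !== -1)
  );
}

// Made-up sample data.
const rawEvents: RawEvent[] = [
  { domain: "k8s", fields: { reason: "ImagePullBackOff", level: "critical", pod: "checkout-7d9", note: "Image pull failed" } },
  { domain: "apm", fields: { kind: "HealthRuleViolation", sev: "warning", service: "checkout", summary: "Latency above threshold" } },
  { domain: "apm", fields: { kind: "Deploy", service: "cart", summary: "New version deployed" } },
];

const unifiedEvents = rawEvents
  .map((r) => normalize(r))
  .filter((e): e is UnifiedEvent => e !== undefined);

const criticalEvents = filterEvents(unifiedEvents, { severity: "critical" });
```

Keeping the per-domain differences in a mapping table, rather than in the table or filter code, is one way a new domain could be onboarded without touching the search and filter logic.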
DATA ACCESS
Before: Navigate to each microservice → assess metrics → repeat for each entity
After: Single click to access nested row data with related entity metrics (mixed-fidelity explorations)
Above: Design explorations for supporting cross-MELT correlation. For this to work, I had to balance providing enough context against impeding performance. Due to time constraints, we ultimately landed on a simpler solution for this initiative, with plans to explore the nested row as a future design system pattern.
Final Designs
Design Principles
1. Data-First, Not Interface-First
Instead of just consolidating existing views, I redesigned how users could query and explore event relationships.
2. Logical Grouping by Domain Namespace
Before I knew what its function was, I devised a data transformation schema that mirrored the Schema Browser's familiar domain organization, allowing users to navigate Events the same way they think about their environment's architecture.
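As a rough illustration of grouping by domain namespace, the sketch below assumes event types are identified by a fully qualified `namespace:name` string. That format, the sample type names, and the `groupByNamespace` helper are assumptions made for this example; they are not taken from the Schema Browser or the shipped design.

```typescript
// Hypothetical: event types are assumed to look like "namespace:name".
function groupByNamespace(eventTypes: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const fullyQualified of eventTypes) {
    const sep = fullyQualified.indexOf(":");
    // Types without a namespace prefix are collected under "unqualified".
    const ns = sep === -1 ? "unqualified" : fullyQualified.slice(0, sep);
    const name = sep === -1 ? fullyQualified : fullyQualified.slice(sep + 1);
    if (!groups[ns]) groups[ns] = [];
    groups[ns].push(name);
  }
  return groups;
}

// Made-up sample type names.
const namespaceGroups = groupByNamespace([
  "k8s:pod_crash",
  "k8s:image_pull_backoff",
  "apm:health_rule_violation",
  "license_audit",
]);
```

A grouping like this could drive a filter menu in which users first pick a domain and then an event type within it, mirroring how they already picture their environment.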
3. Cross-Functional Influence
I worked with domain teams and front-end UI engineers to uncover constraints and blockers to adoption of the proposed data transformation schema, ensuring the design solution was technically feasible, sustainable, and could scale intuitively.
Reception and Feedback
User
“Showing the entity (or entities) which are impacted by the event is already quite an improvement to current event UIs...”
Stakeholder
“...this [solution] is spot on and, as I think I've mentioned to Gatha, it is something we really, really need (and realized that we don't want to build a custom one for [omitted] but ideally, just get the general Event Explorer with the right query pre-defined).”
Subject Matter Expert