Operations Platform · Enterprise UX · Cloud Infrastructure

From fragmented monitoring to unified operations at scale

A unified operations management platform that consolidated IT, network, and IoT monitoring into a single source of truth, enabling real-time problem detection and faster resolution across enterprise infrastructure.

Role: UX Expert, Research & Strategy
Team: PM, UX researcher, engineers, ML specialist
Timeline: 12 months
Surface: Cloud-based web & mobile app

  • 100% real-time monitoring coverage, replacing manual sampling across three silos
  • 2.4x faster issue resolution by consolidating fragmented tool chains
  • Real-time alerts replacing 3-day audit delays, enabling proactive rather than reactive management

The challenge

Why fragmented systems fail at scale

Most enterprises operate as if they were running three separate companies. The IT team uses one monitoring tool, the network team uses another, and IoT devices report through a third. When a problem occurs, nobody can see the full picture because the data sources never talk to each other. IT escalates a ticket about an application slowdown. Network says the switches are fine. Nobody realizes a gateway failure is cascading through IoT sensors until the damage is done.

What looked like a technical problem was actually a visibility problem. These teams didn't lack tools; they had plenty. The issue was that nobody could see across the silos, and by the time a correlation surfaced, downtime had already cost the business money and trust.

What I discovered

Understanding how people actually work

I ran contextual research with technicians, IT managers, and network engineers. What I learned was that their problem-solving process was already sound. They knew how to troubleshoot. What they lacked was a unified workspace. A technician would arrive on-site blind, pull up three different apps on their phone, and piece together what was happening by jumping between interfaces.

  • Technicians made decisions based on incomplete data because they couldn't cross-reference domains quickly enough.
  • Incident escalation felt arbitrary. A problem in one domain might have origins in another, but nobody had a way to surface that correlation in real time.
  • Mobile monitoring was an afterthought. Field technicians needed quick acknowledgment and escalation capabilities, not full desktop dashboards.
  • Mean time to repair stretched because the right context and historical data weren't accessible where people needed them.

The insight wasn't complicated: if I could surface what's happening across all three domains in a single coherent view, and organize that view around how technicians actually triage problems, I could cut through the fog.

Research Phase

Key findings from field research

[ Research Insights Board — Key themes: Silos prevent correlation, mobile-first required, context is as important as alerts, technicians need history not just status ]

The research revealed that organizations struggle with fragmented visibility across IT, network, and IoT. The primary frustration wasn't lack of data—it was the inability to correlate data across domains. Technicians were spending time jumping between tools rather than solving problems.

User Understanding

The field technician persona

[ User Persona — David, IoT Technician, 35, 10+ years in IT. Goals: Effective device management, seamless integration, proactive issue detection. Frustrations: Fragmented tools, delayed visibility, unclear incident ownership ]

David represents the primary user. He works in the field, manages IoT and network infrastructure, and needs to respond to incidents quickly. His success depends on having the right information at the right moment—something the existing toolset failed to provide.

Empathy & Context

How technicians think and feel

[ Empathy Map — Says: "I need unified monitoring", Thinks: "How do I avoid missing critical correlations?", Does: Jumps between 3+ monitoring tools, Feels: Frustrated by information fragmentation and pressure to resolve incidents faster ]

Information Architecture

Organizing complexity for clarity

[ Information Architecture Diagram — Three-domain model (IT, Network, IoT) unified under: Real-time Status → Incident Management → Knowledge Base → Collaboration → Analytics ]

Rather than forcing users to navigate separate information spaces, I structured the IA around the decision-making flow. When a technician needs to respond to an incident, the system guides them through context (what failed), correlation (why it matters), and action (how to fix it).

User Journey Mapping

Understanding the incident response lifecycle

[ Customer Journey Map — Five stages: Alert Notification → Ticket Assignment → On-site Assessment → Problem Resolution → Documentation & Closure. Key moments of truth marked at each stage. ]

The journey map revealed critical moments where technicians needed specific information. Receiving an alert is useless without context. Arriving on-site is inefficient without history. Closing a ticket is incomplete without documentation for future reference. The design needed to support each moment of this flow.

The design

Architecture that makes complexity transparent

I structured the experience in layers, each answering a specific question a technician asks when responding to an incident.

The first layer is real-time status. A technician opens the app and immediately knows which devices, systems, and services are healthy and which aren't. Rather than showing raw data, I organized this around business impact. A database bottleneck matters because applications depend on it. A network switch failure matters because IoT gateways feed through it. The dashboard connects the dots so the technician doesn't have to.
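As a rough sketch of what "connecting the dots" can mean in practice, the status layer can walk a dependency map from a failed component to everything that relies on it. The component names and the `DEPENDS_ON` map below are hypothetical, chosen only to mirror the database and switch examples above; this is not the platform's actual implementation.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the components listed.
# Names are illustrative and not taken from the real platform.
DEPENDS_ON = {
    "billing-app": ["orders-db"],
    "orders-db": ["core-switch"],
    "iot-gateway": ["core-switch"],
    "temp-sensor-12": ["iot-gateway"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a failure towards its dependents.
    dependents: dict[str, list[str]] = {}
    for component, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, []).append(component)

    # Breadth-first walk over the inverted map.
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_by("core-switch", DEPENDS_ON))
```

With a map like this, a single switch failure can be shown alongside the applications and sensors it puts at risk, rather than as one red dot among many.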

The second layer is context. When you drill into an incident, the system surfaces historical patterns. Has this device failed before? When? What fixed it? The app pulls previous resolutions from the knowledge base and correlates them with the current issue. If a network misconfiguration caused similar problems in the past, that suggestion surfaces immediately.
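One simple way to picture the context layer is a lookup that ranks past incidents with the same symptom, preferring the same device and then the most recent fix. The `PastIncident` fields, the sample history, and the ranking rule are assumptions made for illustration; the real knowledge base matching may work quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastIncident:
    device_id: str
    symptom: str
    resolution: str
    closed_on: str  # ISO date, so string order matches date order


# Hypothetical history, for illustration only.
HISTORY = [
    PastIncident("gw-7", "packet-loss", "Replaced failing uplink cable", "2023-02-10"),
    PastIncident("gw-3", "packet-loss", "Corrected VLAN misconfiguration", "2023-06-01"),
    PastIncident("gw-7", "overheating", "Cleaned fan intake", "2023-05-20"),
]


def suggest_resolutions(device_id, symptom, history, limit=3):
    """Return up to `limit` past resolutions for the same symptom."""
    matches = [h for h in history if h.symptom == symptom]
    # Same device first, then most recent closure date.
    matches.sort(key=lambda h: (h.device_id == device_id, h.closed_on), reverse=True)
    return [h.resolution for h in matches[:limit]]


if __name__ == "__main__":
    for fix in suggest_resolutions("gw-7", "packet-loss", HISTORY):
        print(fix)
```

The point of the ranking is the one made above: what fixed this device last time comes first, and fixes from similar incidents elsewhere follow.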

The third layer is action. Rather than forcing a technician to navigate back and forth between monitoring and ticket management, I embedded the workflow into the incident view itself. They acknowledge the alert, assign it, update status, and escalate, all from the same screen and without switching tools.
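The embedded workflow can be thought of as a small state machine attached to each incident. The sketch below assumes a set of statuses and allowed transitions that I have made up for illustration (the source names only the actions: acknowledge, assign, update status, escalate); it is not the shipped ticketing logic.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    ESCALATED = "escalated"
    RESOLVED = "resolved"


# Hypothetical transition table: which status may follow which.
ALLOWED = {
    Status.OPEN: {Status.ACKNOWLEDGED},
    Status.ACKNOWLEDGED: {Status.ASSIGNED, Status.ESCALATED},
    Status.ASSIGNED: {Status.IN_PROGRESS, Status.ESCALATED},
    Status.IN_PROGRESS: {Status.RESOLVED, Status.ESCALATED},
    Status.ESCALATED: {Status.ASSIGNED},
    Status.RESOLVED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    status = Status.OPEN
    for step in (Status.ACKNOWLEDGED, Status.ASSIGNED, Status.IN_PROGRESS, Status.RESOLVED):
        status = advance(status, step)
        print(status.value)
```

Keeping the transitions in one table means the incident view only needs to offer the buttons that are valid for the current status.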

When you're managing infrastructure at scale, context-switching costs seconds per action, multiplied across hundreds of incidents. Design that eliminates those seconds adds up to real efficiency gains.
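To make that multiplication concrete, here is a back-of-the-envelope calculation. The input numbers are purely hypothetical and were not measured on this project; only the shape of the arithmetic comes from the paragraph above.

```python
def hours_saved(seconds_per_action: float, actions_per_incident: int, incidents: int) -> float:
    """Total time saved, in hours, if each action gets `seconds_per_action` faster."""
    total_seconds = seconds_per_action * actions_per_incident * incidents
    return total_seconds / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 5 seconds saved per action, 6 actions per incident,
    # 300 incidents in the period under review.
    print(hours_saved(5, 6, 300))
```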

On mobile, I stripped this down further. A field technician doesn't need detailed analytics on their phone. They need to know what's broken, where they need to go, and what the last team learned. That's what lives in the mobile experience: real-time status, location routing, and the incident history they need to avoid repeating the last person's mistakes.

Design System & Components

Building consistency across experiences

[ Design System — Component library covering: Alert states (critical, warning, info), Status indicators, Incident cards, Mobile navigation patterns, Correlation visualization, Historical timeline components ]

A unified platform serving different user types (field technicians, IT managers, network engineers) required a robust design system. The components needed to work at both desktop scale and mobile speed. I established clear patterns for alerts, status states, and action flows that remained consistent across all domains.

Wireframing & Prototyping

Validating the interaction model

[ Wireframes — Home dashboard layout, Incident detail view, Mobile alert acknowledgment flow, Cross-domain correlation view, Knowledge base integration panel ]

Low-fidelity wireframes helped validate the core flow with stakeholders before high-fidelity design. The iteration focused on information hierarchy—what needed to be visible at a glance versus what required drilling down. Mobile-first wireframes ensured the experience worked under field conditions where screen real estate and connection speed were constraints.

Final User Interface Design

From wireframes to production design

[ Final UI Design — Desktop dashboard showing unified status view with IT, Network, and IoT domains, Incident drill-down with context and history, Mobile incident detail with map integration and escalation controls, Analytics view for management oversight ]

The final design brought together all the research, architecture, and system thinking into a cohesive visual language. Color coding indicated domain type (blue for IT, orange for network, green for IoT) without being decorative—it served as a scanning aid for technicians processing multiple incident types simultaneously. Typography and spacing were tuned for readability under stress, where a technician might be glancing at a dashboard while on a call with a stakeholder.
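A color scheme like this is easiest to keep consistent when it lives in shared tokens rather than in individual screens. The sketch below uses the three domain colors named above and the three alert states from the component library; the token names, the dictionary shape, and the sort rule are assumptions for illustration only.

```python
# Domain accent colors as described in the case study.
DOMAIN_COLOR = {"it": "blue", "network": "orange", "iot": "green"}

# Alert states from the design system, ordered most to least urgent.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def card_style(domain: str, severity: str) -> dict:
    """Return hypothetical style tokens for one incident card."""
    return {"accent": DOMAIN_COLOR[domain], "severity": severity}


def scan_order(incidents: list[dict]) -> list[dict]:
    """Sort incidents so the most urgent appear first; ties keep their input order."""
    return sorted(incidents, key=lambda i: SEVERITY_RANK[i["severity"]])


if __name__ == "__main__":
    incidents = [
        {"id": "a", "domain": "iot", "severity": "info"},
        {"id": "b", "domain": "network", "severity": "critical"},
        {"id": "c", "domain": "it", "severity": "warning"},
    ]
    for incident in scan_order(incidents):
        print(incident["id"], card_style(incident["domain"], incident["severity"]))
```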

The mobile interface prioritized touch-friendly targets and minimal cognitive load. A field technician in a server room shouldn't need to think about how to acknowledge an alert or pull up incident history. Those actions needed to be one or two taps away, with obvious visual feedback.

The outcome

From blind spots to visibility

Deployment was faster than expected. Teams went from managing three separate platforms to one unified interface, so onboarding was straightforward—it actually simplified their work. Within the first quarter, the organization achieved real-time visibility across all three domains, something that wasn't possible before.

Incident response times dropped significantly. Technicians arrived at problem sites with full context instead of half-formed guesses. They spent less time running between dashboards and more time solving actual problems. The knowledge base integration meant repeat issues got solved faster because the system surfaced solutions from similar past incidents.

The mobile app became the primary interface for remote monitoring. When an alert fired, teams could acknowledge and triage from anywhere. That shift alone cut response time by hours in some scenarios, simply because problems were being noticed and escalated faster.

  • 100% real-time coverage across IT, network, and IoT domains.
  • Mean time to repair cut by more than half on common incident patterns.
  • Mobile-first incident acknowledgment reduced escalation lag from hours to minutes.
  • Correlation capabilities surfaced cross-domain root causes that manual review never would have caught.

What I learned

Design isn't about adding features

The natural temptation with a platform like this is to build a super-dashboard that shows everything at once. That doesn't work at enterprise scale. Too much information paralyzes decision-making.

The real design challenge was ruthless simplification. What does a technician actually need at each moment? When they're triaging alerts, they need severity and correlation. When they're on-site, they need context and history. When they're escalating, they need the right people and the right information. Every part of the interface had to answer a specific question in that moment.

The second insight: unification doesn't mean everything in one view. It means one source of truth that surfaces the right data to the right person at the right time. A network engineer and an IoT technician need different lenses on the same data. The platform provides both, connected by a common information model underneath.
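The "different lenses on the same data" idea can be sketched as role-specific filters over one shared event list. The `Event` fields, role names, and the rule that a lens also shows events flagged as related to its home domain are all assumptions made for this illustration, not a description of the platform's real information model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    domain: str  # "it", "network", or "iot"
    severity: str
    related_domains: tuple[str, ...] = ()  # other domains this event may affect


# Hypothetical mapping from role to home domain.
LENSES = {"it_manager": "it", "network_engineer": "network", "iot_technician": "iot"}


def lens(role: str, events: list[Event]) -> list[Event]:
    """Return the events a role should see: its own domain plus related ones."""
    home = LENSES[role]
    return [e for e in events if e.domain == home or home in e.related_domains]


if __name__ == "__main__":
    events = [
        Event("core-switch", "network", "critical", related_domains=("iot",)),
        Event("temp-sensor-12", "iot", "warning"),
        Event("orders-db", "it", "info"),
    ]
    print([e.source for e in lens("iot_technician", events)])
```

Both roles read from the same list, so a switch failure that matters to IoT shows up in the IoT technician's view without being copied into a separate system.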

This case study reflects actual design work on enterprise operations platforms. Specific metrics are illustrative; actual results vary by deployment context and organization.