Skip to main content
Version: 2.0

Latency And Performance Indicators

For engineering managers, “the system is up” used to be a sufficient answer. Now, in a world obsessed with user experience, it’s wildly insufficient. Uptime is table stakes. What users feel – the responsiveness, the speed, the overall performance – is what truly dictates success. And that’s where understanding latency and performance indicators becomes your superpower.

This isn’t about chasing milliseconds for the sake of it. It’s about building a proactive, data-driven culture that anticipates and solves performance issues before they impact your users. As a manager, you don't need to be a performance tuning expert, but you do need to understand the key signals and how to use them to guide your team.

Why Latency Matters More Than Ever

Think about the last app that frustrated you with sluggish performance. Did you give it a second chance? Probably not. Slow applications bleed users, damage brand reputation, and ultimately impact your bottom line.

Latency – the delay between a request and a response – is a primary driver of perceived performance. While raw CPU usage or memory consumption are important, they’re often lagging indicators. Latency is a leading indicator of user experience. A high-latency system feels slow, even if all the underlying servers are technically healthy.

Beyond the Numbers: The Human Element

It’s important to remember that performance isn’t just about metrics. It's about how your application feels to the user. Encourage your team to:

  • Use real user monitoring (RUM) to understand the actual user experience.
  • Conduct usability testing to identify performance-related pain points.
  • Empathize with your users and understand their needs.

Key Performance Indicators (KPIs) to Track

Uptime monitoring (tools like UptimeToolbox.com, Xitoring.com, or UptimeObserver.com) is essential, but it’s just the starting point. Here’s a breakdown of KPIs you should be tracking, categorized by what they tell you:

  • Request Latency (P50, P90, P99): These percentiles show the time it takes to serve requests. P50 (median) is a good baseline, but P90 and P99 are critical. They reveal the experience of your slowest users, highlighting potential bottlenecks and scalability issues. Don’t ignore those outliers!
  • Error Rate: A spike in errors directly impacts the user experience. Track errors by endpoint, service, and error type to quickly pinpoint the source of problems.
  • Throughput (Requests per Second/Minute): This measures the capacity of your system. Monitoring throughput helps you understand how your system handles load and identify potential scaling needs.
  • Saturation (CPU, Memory, Disk I/O): These metrics tell you how close your resources are to their limits. High saturation often precedes performance degradation.
  • Database Performance (Query Latency, Connection Pool Usage): Database bottlenecks are common culprits for slow applications. Track query performance and ensure your connection pool is appropriately sized.

From Data to Action: Building a Performance-Focused Culture

Tracking metrics is useless without a plan for action. Here’s how to build a performance-focused culture within your team:

Setting SLOs

  1. Service Level Objectives (SLOs): Define clear SLOs for each critical service. These are target performance levels (e.g., “99.9% of requests should respond within 200ms”). SLOs provide a concrete goal for your team and help prioritize performance work.

Defining SLIs and Error Budgets

  1. Service Level Indicators (SLIs): SLIs are the metrics you use to measure progress towards your SLOs (e.g., request latency, error rate).
  2. Error Budgets: Based on your SLOs, define an acceptable amount of “error” or downtime. This gives the team some flexibility to innovate and take risks, while still maintaining a high level of reliability.

Proactive Monitoring & Alerting

  1. Proactive Monitoring & Alerting: Set up alerts that trigger when metrics deviate from expected levels. Avoid “alert fatigue” by focusing on actionable alerts and setting appropriate thresholds.
  2. Distributed Tracing: When things go wrong, you need to quickly pinpoint the root cause. Distributed tracing helps you follow a request as it travels through your system, identifying bottlenecks and performance issues.

Regular Performance Reviews

  1. Regular Performance Reviews: Dedicate time during sprint planning and retrospectives to discuss performance data, identify areas for improvement, and prioritize performance work.

The Engineering Manager's Role: Orchestration, Not Optimization

You don't need to be a performance expert to be an effective engineering manager. Your role is to:

  • Champion a performance-focused culture.
  • Ensure the team has the tools and resources they need to monitor and improve performance.
  • Facilitate communication and collaboration between developers, operations, and product managers.
  • Prioritize performance work alongside new features and bug fixes.

Teams often get bogged down chasing every millisecond of optimization, only to realize they were solving the wrong problems. The goal isn't to achieve perfect performance; it’s to deliver a consistently excellent user experience.

In conclusion, moving beyond basic uptime monitoring and embracing latency and performance indicators is no longer optional – it’s essential for success. As an engineering manager, you have a critical role to play in building a performance-focused culture and ensuring that your team has the tools and resources they need to deliver a consistently excellent user experience.