Instrumentation at Scale: Having Your Performance Cake and Eating It Too

Summary

Disclaimer: This summary has been generated by AI. It is experimental, and feedback is welcomed. Please reach out to info@qconsf.com with any comments or concerns.

This presentation by Brian Martin, an expert in performance optimization and distributed systems, provides a comprehensive overview of instrumentation strategies for high-performance systems. The talk highlights critical aspects of collecting performance metrics without compromising system speed and efficiency.

Key Topics Discussed:

  • Instrumentation Challenges: The balance between too much instrumentation, which can slow down systems, and too little, which can miss critical insights.
  • Technologies Used:
    • eBPF: Extended Berkeley Packet Filter, used for dynamic kernel instrumentation to gather detailed metrics without modifying kernel source code.
    • Prometheus: A widely used monitoring system for scraping and storing telemetry data.
  • Counter and Histogram Techniques: Discussion on cache-line aware design, atomic operations, and the importance of implementation choices in handling metrics.
  • Fearless Instrumentation: Techniques to instrument systems extensively without fearing performance degradation.

Libraries & Tools:

  • Rezolus: A system performance telemetry agent written in Rust that uses eBPF for kernel-level metrics collection.
  • metriken: A low-overhead metrics library optimized for performance-critical paths, used in IOP Systems projects.

Key Takeaways:

  • Effective instrumentation is crucial for gaining insights into system performance and resolving production issues.
  • Implementation details, especially for metrics like histograms, can significantly affect performance, emphasizing the need for well-considered design choices.
  • By applying the right techniques and using advanced tools, it is possible to build robust systems that offer detailed observability without impairing performance.
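
To make the point about histogram implementation choices concrete, here is a minimal sketch of one common low-overhead design: a fixed array of atomic buckets indexed by powers of two. It is an illustration only, not the talk's implementation and not the API of any library named above; `Log2Histogram`, `record`, and `percentile` are hypothetical names. Recording is a single relaxed atomic add with no allocation or locking, at the cost of coarse resolution, since a percentile can only be reported as the upper bound of its bucket.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical histogram with one bucket per power of two.
/// Bucket 0 holds the values 0 and 1; bucket i (i >= 1) holds [2^i, 2^(i+1)).
pub struct Log2Histogram {
    buckets: [AtomicU64; 64],
}

impl Log2Histogram {
    pub fn new() -> Self {
        Self {
            buckets: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Map a value to its bucket using the position of its highest set bit.
    fn index(value: u64) -> usize {
        (63 - (value | 1).leading_zeros()) as usize
    }

    /// Hot path: one relaxed atomic increment.
    pub fn record(&self, value: u64) {
        self.buckets[Self::index(value)].fetch_add(1, Ordering::Relaxed);
    }

    /// Upper bound of the bucket that contains the requested percentile
    /// (0.0 to 100.0), or None if nothing has been recorded.
    pub fn percentile(&self, p: f64) -> Option<u64> {
        let counts: Vec<u64> = self
            .buckets
            .iter()
            .map(|b| b.load(Ordering::Relaxed))
            .collect();
        let total: u64 = counts.iter().sum();
        if total == 0 {
            return None;
        }
        // Rank of the sample we are looking for, at least 1.
        let rank = ((p / 100.0) * total as f64).ceil().max(1.0) as u64;
        let mut seen: u64 = 0;
        for (i, c) in counts.iter().enumerate() {
            seen += *c;
            if seen >= rank {
                let upper = if i == 63 { u64::MAX } else { (1u64 << (i + 1)) - 1 };
                return Some(upper);
            }
        }
        None
    }
}
```

Because buckets from several histograms can be added element by element before computing a percentile, this layout also avoids one well-known reporting pitfall: averaging percentiles that were computed separately per host or per interval.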

The talk encourages deep instrumentation as a means to maintain visibility into system health and improve reliability and scalability.

This presentation is valuable for anyone involved in system performance tuning or interested in scalable instrumentation practices.

This is the end of the AI-generated content.


Abstract

In high-performance code, a single misplaced counter increment can cost more than the operation it’s measuring. That creates a paradox: instrument too much and you slow the system down; instrument too little and you miss the insights you need to continuously deliver.

This talk focuses on techniques for instrumenting latency-sensitive, high-throughput systems with minimal impact—approaches rooted in C and Rust, but with lessons that may apply more broadly. We’ll examine the true costs of metrics collection, the pitfalls of percentile reporting, and how to extend observability from application code down to the kernel using eBPF. Along the way, we’ll discuss cacheline-aware counter design, the trade-offs in struct and memory layout, and the value of a unified metrics framework for both application and infrastructure insights.
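
As a rough illustration of what cacheline-aware counter design can look like, the sketch below pads each counter shard to its own cache line so that threads incrementing different shards do not contend on the same line (false sharing). It is a minimal example under stated assumptions, not the design presented in the talk: the 64-byte line size is assumed (it varies by CPU), and `PaddedCounter`, `ShardedCounter`, and `demo_total` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// One counter per cache line. The 64-byte alignment is an assumption;
/// some CPUs use a different line size.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

/// Hypothetical sharded counter: writers update their own shard,
/// readers sum every shard.
pub struct ShardedCounter {
    shards: Vec<PaddedCounter>,
}

impl ShardedCounter {
    pub fn new(shards: usize) -> Self {
        assert!(shards > 0, "at least one shard is required");
        Self {
            shards: (0..shards)
                .map(|_| PaddedCounter(AtomicU64::new(0)))
                .collect(),
        }
    }

    /// Hot path: a relaxed atomic add on the caller's shard.
    pub fn add(&self, shard: usize, n: u64) {
        self.shards[shard % self.shards.len()]
            .0
            .fetch_add(n, Ordering::Relaxed);
    }

    /// Cold path: sum all shards. The result is a point-in-time
    /// approximation while writers are still running.
    pub fn value(&self) -> u64 {
        self.shards
            .iter()
            .map(|s| s.0.load(Ordering::Relaxed))
            .sum()
    }
}

/// Spawn `threads` writers that each add 1 to their own shard
/// `per_thread` times, then return the total.
pub fn demo_total(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(ShardedCounter::new(threads));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.add(t, 1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.value()
}
```

The trade-off is memory: each shard occupies a full cache line for 8 bytes of data, which is one reason struct and memory layout decisions matter when a process exposes many counters.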

Attendees will gain practical, language-level strategies for building observability into performance-critical systems—without sacrificing the speed their users expect.


Speaker

Brian Martin

Co-founder and Software Engineer @IOP Systems, Focused on High-Performance Software and Systems, Previously @Twitter

Brian is a software engineer who focuses on performance optimization and distributed systems. He worked at Twitter for 8 years, initially with the Cache Team and later as a member of the newly created Performance Team. After November 2022, Brian joined his teammates from Twitter as a co-founder of IOP Systems and continues to work on improving software and platform performance, efficiency, and reliability.

From the same track

Session Rust

The Rust High Performance Talk You Did Not Expect

Wednesday Nov 19 / 10:35AM PST

Rust runs faster, but it slows down engineers, right? This was our team’s assumption when we decided to rewrite our code from Kotlin into Rust. But we were wrong in completely unexpected ways.

Ruth Linehan

Software Engineer @Momento, Previously APIs/Webhooks @GitHub and @Puppet

Session Performance

When Every Bit Counts: How Valkey Rebuilt Its Hashtable for Modern Hardware

Wednesday Nov 19 / 01:35PM PST

Ever wondered what happens when a bunch of performance-obsessed developers decide their blazing-fast database isn't quite blazing-fast enough?

Madelyn Olson

Principal Engineer @AWS, Maintainer of the Open-Source Valkey Project

Session Python

Python, Numba, and Algorithm Design: Building Efficient Models in Financial Services

Wednesday Nov 19 / 03:55PM PST

The popularity of Python means insurance and financial services companies have a growing body of actuaries, quantitative developers, and software engineers capable of building innovative and customized solutions for both data management and modeling.

Chad Schuster

Principal @Milliman Focusing on Risk Management, Modeling, and Technology Consulting Services

Session Performance

Accelerating Performance by Incrementally Integrating Rust Into Existing Codebase

Wednesday Nov 19 / 02:45PM PST

In order to improve the performance of existing applications and services, we can identify the most performance-critical pieces and reimplement them in Rust as opposed to completely rewriting the applications from scratch.

Lily Mara

Staff Engineer @Discord, Author of "Refactoring to Rust", Previously Engineering Manager @OneSignal