Finding performance bottlenecks with Pyroscope and Alloy: An example using TON blockchain

Grafana Blog

Performance profiling for production systems has historically required choosing between invasive instrumentation and coarse-grained metrics. eBPF-based continuous profiling with tools like Alloy and Pyroscope changes this calculus by providing system-wide visibility without code modifications. The TON blockchain optimization case study reveals patterns directly applicable to LLM inference pipelines and ML serving infrastructure.

The setup is minimal. Alloy's pyroscope.ebpf component runs with root privileges and profiles all processes on a host. For C++ workloads, compile with CMake's RelWithDebInfo build type so optimized binaries still carry the debug symbols needed for readable flame graphs. For Python-based ML systems, the profiler captures native extensions and interpreter overhead without requiring manual instrumentation points. This matters for LLM serving, where you are often debugging performance across tokenizers (Rust/C++), model inference (CUDA kernels), and Python orchestration code.
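A minimal sketch of the Alloy pipeline this describes, assuming a Pyroscope server reachable at localhost:4040; the component labels ("all", "host", "backend") are arbitrary, and exact argument names should be checked against the Alloy reference for your version:

```alloy
// Discover all processes on the host (the component runs as root).
discovery.process "all" { }

// eBPF profiler: samples stacks system-wide, no code changes required.
pyroscope.ebpf "host" {
  forward_to = [pyroscope.write.backend.receiver]
  targets    = discovery.process.all.targets
}

// Ship profiles to a Pyroscope server (the address is an assumption).
pyroscope.write "backend" {
  endpoint {
    url = "http://localhost:4040"
  }
}
```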

The TON analysis identified three optimization categories that map cleanly to ML infrastructure challenges. First, cryptographic operations consumed 14 percent of execution time in SHA256 hashing for cell validation. The parallel in ML systems is tokenization and preprocessing. Teams often overlook that HuggingFace tokenizers can become bottlenecks at high throughput. Profiling revealed that switching SHA256 implementations yielded 2 percent gains, while batching multiple hash operations into single calls delivered 3.5 percent improvements. For LLM systems, this translates to batching tokenization requests and evaluating alternative tokenizer implementations when per-request latency matters.
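The batching win is easy to reproduce in miniature: an incremental SHA256 over many small chunks produces the same digest as one call over the joined buffer, but the single call amortizes per-call overhead. A sketch using Python's hashlib (chunk count and sizes are arbitrary):

```python
import hashlib

# 256 small chunks, standing in for many per-cell hash inputs.
chunks = [bytes([i]) * 64 for i in range(256)]

# Per-chunk path: one update() call, and its overhead, per chunk.
per_chunk = hashlib.sha256()
for c in chunks:
    per_chunk.update(c)

# Batched path: join once, hash in a single call.
batched = hashlib.sha256(b"".join(chunks))

# Digests are identical; only the call pattern (and its overhead) differs.
assert per_chunk.digest() == batched.digest()
```

The same reshaping applies to tokenization: passing a list of texts through a tokenizer in one call rather than looping per request trades many small calls for one amortized one.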

Second, data structure choices had outsized impact. Replacing std::map with std::unordered_set in cell deduplication code provided a 10 percent speedup by moving lookups from O(log n) tree traversal to O(1) average-case hashing. ML teams face similar choices in KV caching implementations, prompt deduplication, and RAG retrieval systems. Continuous profiling exposes whether your cache lookup strategy is actually the bottleneck or whether it is dominated by model forward passes. Many teams optimize cache eviction policies without realizing the lookup data structure itself is the problem.
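The same trade-off can be sketched in Python: deduplication against a sorted list with binary search (O(log n) lookups, the tree-like path) versus a hash set (O(1) average). Function names and the sample prompts are illustrative:

```python
from bisect import bisect_left, insort

def dedup_tree_like(keys):
    """Keep first occurrences using a sorted list: O(log n) lookups,
    analogous to std::map/std::set. (insort's list shift is O(n), so this
    illustrates lookup cost, not a faithful balanced tree.)"""
    seen, out = [], []
    for k in keys:
        i = bisect_left(seen, k)
        if i < len(seen) and seen[i] == k:
            continue
        insort(seen, k)
        out.append(k)
    return out

def dedup_hash(keys):
    """Same result with a hash set: O(1) average lookups,
    analogous to std::unordered_set."""
    seen, out = set(), []
    for k in keys:
        if k not in seen:
            seen.add(k)
            out.append(k)
    return out

prompts = ["a", "b", "a", "c", "b", "d"]
assert dedup_tree_like(prompts) == dedup_hash(prompts) == ["a", "b", "c", "d"]
```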

Third, platform-specific optimizations like assembly-optimized Ed25519 verification delivered 1.5 percent gains. The ML equivalent is choosing between different CUDA kernel implementations, FlashAttention versions, or quantization backends. Profiling shows whether switching from standard attention to FlashAttention-2 actually helps your specific workload, or if you're bottlenecked elsewhere in the inference pipeline.
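Whether a swap helps is an empirical question, so the selection itself should be measurement-driven. A generic sketch with timeit, using two interchangeable pure-Python stand-ins for competing kernel implementations (all names here are hypothetical):

```python
import timeit

def impl_standard(data):
    # Stand-in for a baseline implementation (e.g. standard attention).
    return sum(x * x for x in data)

def impl_optimized(data):
    # Stand-in for an alternative backend (e.g. FlashAttention-2).
    total = 0
    for x in data:
        total += x * x
    return total

def pick_faster(candidates, data, repeat=3, number=200):
    """Time each candidate on the actual workload; return the winner."""
    timings = {
        fn.__name__: min(timeit.repeat(lambda fn=fn: fn(data),
                                       repeat=repeat, number=number))
        for fn in candidates
    }
    return min(timings, key=timings.get), timings

data = list(range(1000))
winner, timings = pick_faster([impl_standard, impl_optimized], data)
```

The point is the harness, not the winner: the faster candidate on your workload may not be the one the benchmarks in a library's README suggest.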

The zero-instrumentation aspect is critical for production ML systems. Manual instrumentation with timing decorators or custom profilers creates maintenance burden and introduces observer effects. The TON contestants built custom tracing profilers with RAII timing blocks and memory allocation interceptors, demonstrating the pain of instrumentation-based approaches. For LLM serving, you want to profile tokenization latency, KV cache memory patterns, and GPU kernel execution without littering code with timing logic.
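For contrast, the instrumentation-based approach looks like this in Python: a context-manager timer (the RAII analogue) that must be threaded through every code path you care about, which is exactly the maintenance burden continuous profiling removes. A hypothetical sketch:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # accumulated seconds per labeled block

@contextmanager
def timed(name):
    """RAII-style timing block: starts on enter, records on exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

# Every instrumented path needs explicit wrapping:
with timed("tokenize"):
    tokens = "some prompt".split()
with timed("postprocess"):
    text = " ".join(tokens)
```

Every new code path needs its own wrapper, the wrappers themselves add overhead, and removing them later is a refactor; the eBPF profiler sees the same paths with no code changes.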

Practical application for ML teams means running Alloy on inference hosts and analyzing flame graphs to identify whether bottlenecks are in preprocessing, model execution, or postprocessing. For RAG systems, profiling reveals if you're spending more time on embedding generation, vector search, or context assembly. For agent systems, it exposes whether tool calling overhead dominates or if the LLM inference itself is the constraint.
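That stage-level breakdown is just an aggregation over profile samples. A sketch over hypothetical folded stacks (semicolon-separated frames with a sample count per stack, as in Brendan Gregg's flame-graph format):

```python
from collections import Counter

# Hypothetical folded-stack samples: frames joined by ';', plus a count.
samples = [
    ("serve;preprocess;tokenize", 40),
    ("serve;model;attention_kernel", 120),
    ("serve;model;mlp_kernel", 60),
    ("serve;postprocess;detokenize", 15),
]

def samples_by_stage(samples, depth=1):
    """Sum sample counts by the frame at `depth` (here: pipeline stage)."""
    totals = Counter()
    for stack, count in samples:
        frames = stack.split(";")
        totals[frames[min(depth, len(frames) - 1)]] += count
    return totals

stage_totals = samples_by_stage(samples)
# With this data the serving path is model-dominated:
# model 180, preprocess 40, postprocess 15.
```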

The cost-benefit calculation favors continuous profiling when you're optimizing for throughput or reducing infrastructure spend. A 10 percent speedup from data structure changes translates directly to serving 10 percent more requests per GPU hour. For teams running hundreds of inference instances, these gains compound quickly. The overhead of eBPF profiling is typically under 5 percent CPU, making it viable for always-on production profiling rather than one-off debugging sessions.
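The fleet math is simple to check. A sketch with assumed numbers (a 10,000 requests/s target and 50 req/s per GPU at baseline; both figures are illustrative):

```python
import math

def gpus_needed(target_rps, per_gpu_rps):
    """Instances required to meet a throughput target."""
    return math.ceil(target_rps / per_gpu_rps)

baseline = gpus_needed(10_000, 50)           # 200 GPUs
after_10pct = gpus_needed(10_000, 50 * 1.1)  # 182 GPUs after a 10% speedup
```

At this assumed scale, a single data-structure fix retires 18 GPUs, which is why always-on profiling at under 5 percent overhead pays for itself.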