Finding performance bottlenecks with Pyroscope and Alloy: An example using TON blockchain
Continuous profiling has moved from nice-to-have to essential tooling, but the C++ ecosystem still lags behind languages like Go where pprof is standard. The TON blockchain optimization contest provides a useful case study in how eBPF-based profiling can close this gap without requiring code changes or recompilation.
The setup here used Alloy's pyroscope.ebpf component, which attaches to running processes via eBPF and samples stack traces at regular intervals. The key advantage is that no instrumentation is required: you point it at a binary compiled with debug symbols (RelWithDebInfo in CMake), and it immediately starts collecting CPU profiles. This matters for C++ workloads where recompiling with instrumentation can change performance characteristics, or where you're analyzing third-party code you can't modify.
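As a sketch, a minimal Alloy configuration for this kind of setup looks roughly like the following. The component names (discovery.process, pyroscope.ebpf, pyroscope.write) are real Alloy components, but the labels, server URL, and the decision to profile all discovered processes are illustrative assumptions; consult the Alloy reference for the full attribute set.

```alloy
// Discover running processes on the host as profiling targets.
discovery.process "all" { }

// Attach eBPF sampling to the discovered targets and forward profiles.
pyroscope.ebpf "local" {
  targets    = discovery.process.all.targets
  forward_to = [pyroscope.write.backend.receiver]
}

// Ship profiles to a Pyroscope server (URL is an assumption).
pyroscope.write "backend" {
  endpoint {
    url = "http://localhost:4040"
  }
}
```

In practice you would usually filter or relabel the discovered targets rather than profile everything, but the three-component pipeline stays the same.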
The TON contest results show what you'd expect from blockchain validation: cryptographic operations dominate. SHA256 hashing consumed 14% of runtime in vm::DataCell::create because every cell in TON's DAG structure needs a cryptographic hash for integrity verification and deduplication. Ed25519 signature verification accounted for another significant share through vm::exec_ed25519_check_signature during smart contract execution.
What's interesting is the range of optimization approaches that worked. The simplest was swapping OpenSSL's SHA256 for SerenityOS's implementation, which yielded a 2% improvement. Nobody seems to know exactly why it's faster, but the flame graph diff confirmed it. More impactful was consolidating multiple SHA256 update calls into a single feed operation in CellChecker::compute_hash. This reduced function call overhead and improved cache locality, delivering a 20% speedup in DataCell::create and 3.5% overall. The lesson here is that how you use a crypto primitive often matters more than which implementation you choose.
The Ed25519 optimization went further, replacing OpenSSL with a handwritten x86_64 assembly implementation. This sacrifices portability for a further 1.5% gain. In production systems you'd need to weigh this carefully, but it demonstrates that even mature crypto libraries leave optimization headroom for platform-specific code.
The most dramatic gain came from a trivial-looking change: replacing std::map with std::unordered_set in CellStorageStat::add_used_storage for tracking visited cells. This delivered a 10% speedup because the code only needed membership testing for memoization, not ordering. The std::map's red-black tree imposed O(log n) lookups with pointer chasing and comparison overhead, while std::unordered_set's hash table provided O(1) average-case lookups with better cache behavior. This is a common antipattern: developers reach for ordered containers by default without considering whether ordering is actually required.
The contest submissions also revealed something telling about C++ tooling: multiple contestants built custom profilers. One implemented RAII-based tracing with static IDs for O(1) lookup, requiring manual PROFILER(name) macros throughout the code. The TON codebase itself includes a malloc/free interception profiler that captures full stack traces. These exist because C++ lacks the standardized, always-available profiling that Go developers take for granted with pprof.
eBPF-based profiling addresses this gap by working at the kernel level. It samples stack traces without modifying binaries or requiring cooperation from the runtime. The tradeoff is sampling granularity versus overhead: typical rates are 99 or 100 Hz. Sampling that sparsely misses short-lived function calls but captures the hot paths that matter for optimization.
For SRE and platform teams, the practical takeaway is that continuous profiling should run in production, not just during optimization contests. The Alloy setup shown here takes minutes to deploy and immediately exposes where CPU time goes. Flame graph diffs between releases make performance regressions visible before they impact users. The TON example shows this working without application changes, which matters when you're running diverse workloads across a fleet.