
Scaling a parallel program to modern supercomputers is challenging because of inter-process communication, code serialization, and resource contention. Performance analysis tools for finding such scaling bottlenecks are based on either profiling or tracing. Profiling incurs lower overhead but does not capture the detailed dependencies needed for root-cause analysis. Tracing collects all information, but at prohibitive overhead. In this work, we develop ScalAna, which uses static analysis techniques to achieve the best of both worlds: it provides the analyzability of traces at a cost similar to that of profiling. We leverage lightweight compiler and runtime techniques to generate a performance graph, and we apply a graph analysis algorithm to detect the root cause of scaling issues. We evaluate ScalAna with real applications on the Tianhe-2 supercomputer. Results show that our approach can effectively locate the root cause of scalability bottlenecks for real applications while incurring less than 6.38% overhead (1.89% on average) for up to 2,048 processes.