Scaling out Speculative Execution of Finite-State Machines with Parallel Merge
Finite State Machine (FSM) is a key component in many important applications, such as Huffman decoding, regular expression matching, and HTML tokenization. Due to its inherent dependency and unpredictable memory access, FSM computations are considered to be extremely difficult to parallelize. As such, significant research efforts have been made to accelerate FSM computations. Although they achieve promising performance results on multi-core machines, these methods are not scalable for emerging many-core architectures such as GPUs.
We analyzed the bottleneck of achieving scalability on GPUs is the sequential merge inherent to these methods. However, different from simple reduction loops, parallelization of FSM computations typically incorporates runtime check and re-execution, which may significantly affect performance. Based on these observations, our parallel merge implementations select efficient runtime check implementations and avoids unnecessary re-executions. Further, based on GPU architectural features, we develop optimization techniques to further improve performance.
We evaluate our parallel merge implementations on a set of representative algorithms. Experimental results show that our parallel merge implementation is several times more efficient than sequential merge implementations and achieves linear scalability on different GPUs.
Tue 25 FebDisplayed time zone: Tijuana, Baja California change
09:35 - 10:25
|Using Sample-Based Time Series Data for Automated Diagnosis of Scalability Losses in Parallel Programs|
|Scaling out Speculative Execution of Finite-State Machines with Parallel Merge|