Fault tolerance and HPC

With the emergence of petascale systems, fault tolerance has received a lot of attention over the last decade. Most of the existing fault tolerance techniques for parallel and distributed applications have been applied to high-performance computing (HPC) applications. Unfortunately, despite their relative success, existing approaches do not fit well with the challenging evolution of large-scale systems. When developing a fault tolerance technique for HPC architectures, scalability and performance are the two most important aspects to take into consideration. The goal of this sub-topic is to develop scalable fault tolerance techniques that help make future high-performance computing applications self-adaptive and able to survive faults.

Senior Researchers

