Instructor | Esteban Meneses, PhD (esteban DOT meneses AT acm DOT org) |
Institution | Instituto Tecnológico de Costa Rica |
Location | Centro Académico Barrio Amón |
Time | Thursdays 6:00-9:00pm |
Term | I semester, 2021 |
Teaching Assistant | Alex Saenz (alexsaenz AT estudiantec DOT cr) |
SESSION | DATE | TOPIC | PRESENTER |
---|---|---|---|
1 | February 18 | Introduction and History | Instructor |
2 | February 25 | Parallel Programming Design Patterns | Instructor |
3 | March 4 | Shared-memory Programming | Instructor |
4 | March 11 | Distributed-memory Programming | Instructor |
5 | March 18 | Performance Analysis and Scientific Visualization | Instructor |
6 | March 25 | Programming Models | 1. Barnum Castillo 2. Luis Esquivel 3. Marco Torres |
| | April 1 | Holy Week | |
7 | April 8 | Performance Models | 1. Kevin Umaña 2. Erick Quesada 3. Fabián Solano |
8 | April 15 | Midterm Exam | |
9 | April 22 | Interconnects | 1. Ricardo Montoya 2. -------------------- 3. Cristina Soto |
10 | April 29 | Performance Analysis | 1. Jose Pablo Araya 2. Alejandro Morales 3. Diego Jiménez |
11 | May 6 | Algorithms | 1. -------------------- 2. -------------------- 3. Izcar Muñoz |
12 | May 13 | Epidemic Simulations | 1. Cristina Soto 2. Cristian Arias 3. Ignacio Murillo |
13 | May 20 | Architecture | 1. Steven Solano 2. Ricardo Montoya 3. Oscar Blandino |
14 | May 27 | Job Scheduling | 1. Emmanuel Barrantes 2. Kevin Umaña 3. Eduardo Chavarría |
15 | June 3 | Fault Tolerance | 1. Jose Rodríguez 2. Esteban Chavarría 3. Fabián Solano |
16 | June 10 | Invited Presentation | Dr. Nikhil Jain, NVIDIA |
| | June 17 | | |
Reading List
Introduction
- Parallelization of a Denoising Algorithm for Tonal Bioacoustic Signals Using OpenACC Directives (Jorge Castro and Esteban Meneses – IEEE International Work Conference on Bioinspired Intelligence, IWOBI – 2018)
Programming Models
- Using Simple Abstraction to Reinvent Computing for Parallelism (Uzi Vishkin - Communications of the ACM - 2011)
- A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990)
- Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004)
Performance Models
- Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Parallel & Distributed Technology – 1993)
- LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996)
- Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009)
Interconnects
- Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009)
- Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008)
- There Goes the Neighborhood: Performance Degradation due to Nearby Jobs (Abhinav Bhatele et al – ACM/IEEE Supercomputing – 2013)
Algorithms
- How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM Principles and Practice of Parallel Programming – 2009)
- Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011)
- A Parallel Hashed Oct-Tree N-body Algorithm (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993)
Architecture
- Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU (Victor W Lee et al – International Symposium on Computer Architecture – 2010)
- Demystifying GPU microarchitecture through microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010)
- 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008)
Epidemic Simulations
- EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks (Christopher L Barrett et al – ACM/IEEE Supercomputing – 2008)
- Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters (Jae-Seung Yeom et al – IEEE International Parallel and Distributed Processing Symposium – 2014)
- PREEMPT: Scalable Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems (Marco Minutoli et al – ACM/IEEE Supercomputing – 2020)
Performance Analysis
- COZ: Finding Code that Counts with Causal Profiling (Charlie Curtsinger and Emery D. Berger – ACM Symposium on Operating Systems Principles – 2015)
- The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003)
- Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results (Torsten Hoefler and Roberto Belli – ACM/IEEE Supercomputing – 2015)
Job Scheduling
- A fair share scheduler (J Kay and P Lauder – Communications of the ACM – 1988)
- A Comparative Study of Job Scheduling Strategies in Large-scale Parallel Computational Systems (Aftab Ahmed Chandio et al – IEEE International Conference on Trust, Security and Privacy in Computing and Communications – 2013)
- Backfilling using system-generated predictions rather than user runtime estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007)
Fault Tolerance
- Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al – IEEE/ACM Supercomputing – 2010)
- Assessing Fault Sensitivity in MPI Applications (Charng-da Lu and Daniel A. Reed – ACM/IEEE Supercomputing – 2004)
- Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015)
Cloud Computing
- Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper)
- MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004)
- Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008)
Class Project
Final report sample: paper
An interesting extension to a class project
Prof. Martin Schulz's list of research ideas
Prof. Abhinav Bhatele's list of research ideas
Prof. Esteban Meneses's list of research ideas
Publication Venues
- Conferencia Latinoamericana de Estudios en Informática (CLEI) 2021: https://clei2021.cr/home
- Latin America High Performance Computing Conference (CARLA) 2021: http://carla2021.org
Resources
Prof. Gupta’s tips on presentations and reviews
Additional References
- [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM Principles and Practice of Parallel Programming – 2010)
- [Accelerators] A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014)
- [Accelerators] GPUs and the future of parallel computing (Stephen W Keckler et al – IEEE Micro – 2011)
- [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012)
- [Algorithms] Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009)
- [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986)
- [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988)
- [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997)
- [Algorithms] SUMMA: scalable universal matrix multiplication algorithm (R. A. van de Geijn and J Watts – Concurrency: Practice and Experience – 1997)
- [Algorithms] Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012)
- [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996)
- [Architecture] Designing reliable systems from unreliable components: the challenges of transistor variability and degradation (Shekhar Borkar – IEEE Micro – 2005)
- [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992)
- [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996)
- [Architecture] From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer – 2011)
- [Architecture] IBM POWER7 multicore server processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011)
- [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014)
- [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
- [Fault Tolerance] Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014)
- [Fault Tolerance] Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012)
- [Fault Tolerance] Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998)
- [Graph Processing] Pregel: a system for large-scale graph processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
- [Graph Processing] GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014)
- [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing (Charles E. Leiserson – IEEE Transactions on Computers – 1985)
- [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993)
- [Interconnects] Blue Gene/L torus interconnection network (N.R. Adiga et al – IBM Journal of Research and Development – 2005)
- [Interconnects] Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006)
- [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993)
- [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins)
- [Introduction] Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra)
- [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM Principles and Practice of Parallel Programming – 2009)
- [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015)
- [Languages] Compilers and More: MPI+X (Michael Wolfe – HPC Wire – 2014)
- [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007)
- [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998)
- [Load Balancing] A dynamic scheduling strategy for the Chare-Kernel system (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989)
- [Memory Consistency] Cohesion: a hybrid memory model for accelerators (John H. Kelm et al – ISCA – 2010)
- [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999)
- [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993)
- [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006)
- [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE Micro – 2008)
- [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007)
- [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010)
- [Programming Models] Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007)
- [Programming Models] The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007)
- [Programming Models] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012)
- [Scheduling] Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001)
- [Scheduling] Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001)
Benchmarks
- HPC Graph Analysis