Administrivia
Instructor: Esteban Meneses, PhD
Email: esteban DOT meneses AT acm DOT org
Institution: Instituto Tecnológico de Costa Rica
Location: Centro Académico Barrio Amón
Time: Tuesdays 6:00-9:00pm
Term: I semester, 2022
Teaching Assistant: Alex Saenz (alexsaenz AT estudiantec DOT cr)

 

Schedule
SESSION | DATE | TOPIC | READING | PRESENTER
1 | February 8 | Introduction, History | | Instructor
2 | February 15 | Parallel Computing Reasoning | INTRO1 | 1. Mariela Abdalah, Instructor
3 | February 22 | Parallel Programming Design Patterns | | Instructor
4 | March 1 | Programming Models | PROGMOD1, PROGMOD2, PROGMOD3 | 1. Daniel Piedra, 2. Gabriel Barboza, 3. Javier Cordero
5 | March 8 | Shared-memory Programming | | Instructor
6 | March 15 | Scalability | SCAL1, SCAL2, SCAL3 | 1. Deivid Calvo, 2. ----------, 3. Javier Buzano
7 | March 22 | Interconnects | INTER1, INTER2, INTER3 | 1. Jeison Meléndez, 2. Ángel Phillips, 3. Andrés Vargas
8 | March 29 | Distributed-memory Programming | | Instructor
9 | April 5 | Performance Models | PERFMOD1, PERFMOD2, PERFMOD3 | 1. ----------, 2. ----------, 3. Jeison Meléndez
  | April 12 | Holy Week | |
10 | April 19 | Performance Analysis | PERFAN1, PERFAN2, PERFAN3 | 1. Ángel Phillips, 2. ----------, 3. Javier Herrera
11 | April 26 | Performance Analysis, Scientific Visualization | | Instructor
12 | May 3 | Midterm Exam | |
13 | May 10 | Algorithms | INVITED, ALG1, ALG2 | Cristina Soto, Mariela Abdalah, Javier Buzano
14 | May 17 | Fault Tolerance | INVITED, FAULT2 | Elvis Rojas, Gabriel Barboza
15 | May 24 | Job Scheduling | INVITED, INVITED, SCHED1, SCHED2 | Alejandro Morales, Óscar Blandino, Deivid Calvo, Javier Cordero
16 | May 31 | Architecture | INVITED, ARCH1, ARCH2 | Diego Jiménez, Javier Herrera, Daniel Piedra
  | June 7 | Final Presentations | |
 

 

Reading List

Introduction

  • [INTRO1] Performance vs Programming Effort between Rust and C on Multicore Architectures: Case Study in N-Body (Manuel Costanzo, Enzo Rucci, Marcelo Naiouf, Armando De Giusti – XLVII Latin American Computing Conference (CLEI) – 2021)

Programming Models

  • [PROGMOD1] Models for practical parallel computation (D. B. Skillicorn – International Journal of Parallel Programming – 1991)
  • [PROGMOD2] A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990)
  • [PROGMOD3] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012) 

Scalability

  • [SCAL1] A Case for NOW (Networks of Workstations) (Thomas E. Anderson, David E. Culler, David A. Patterson - IEEE Micro - 1995)
  • [SCAL2] The Landscape of Parallel Computing Research: A View from Berkeley (Krste Asanovic et al - Technical Report, University of California at Berkeley - 2006)
  • [SCAL3] A survey of high-performance computing scaling challenges (Al Geist and Daniel A Reed - IJHPCA - 2015)

Interconnects

  • [INTER1] Blue Gene/L torus interconnection network (N.R. Adiga et al – IBM Journal of Research and Development – 2005)
  • [INTER2] Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008)
  • [INTER3] Evaluating HPC Networks via Simulation of Parallel Workloads (Nikhil Jain, Abhinav Bhatele, Sam White, Todd Gamblin, Laxmikant V. Kale - ACM/IEEE Supercomputing – 2016)

Performance Models

  • [PERFMOD1] Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Concurrency – 1993) 
  • [PERFMOD2] LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996) 
  • [PERFMOD3] Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009)

Performance Analysis

  • [PERFAN1] COZ: Finding Code that Counts with Causal Profiling (Charlie Curtsinger and Emery D. Berger – ACM Symposium on Operating Systems Principles – 2015)
  • [PERFAN2] The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003)
  • [PERFAN3] Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results (Torsten Hoefler and Roberto Belli – ACM/IEEE Supercomputing – 2015)

Algorithms

  • [ALG1] How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM Principles and Practice of Parallel Programming – 2009)
  • [ALG2] Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011)
  • [ALG3] A Parallel Hashed Oct-Tree N-body Algorithm (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993)

Architecture

  • [ARCH1] Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU (Victor W Lee et al – International Symposium on Computer Architecture – 2010)
  • [ARCH2] Demystifying GPU microarchitecture through microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010)
  • [ARCH3] 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008)

Job Scheduling

  • [SCHED1] A fair share scheduler (J Kay and P Lauder – Communications of the ACM – 1988)
  • [SCHED2] A Comparative Study of Job Scheduling Strategies in Large-scale Parallel Computational Systems (Aftab Ahmed Chandio et al – IEEE International Conference on Trust, Security and Privacy in Computing and Communications – 2013)
  • [SCHED3] Backfilling using system-generated predictions rather than user runtime estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007)

Fault Tolerance

  • [FAULT1] Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al – IEEE/ACM Supercomputing – 2010)
  • [FAULT2] Assessing Fault Sensitivity in MPI Applications (Charng-da Lu and Daniel A. Reed – ACM/IEEE Supercomputing – 2004)
  • [FAULT3] Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015)

Class Project

Final report sample: paper

An interesting extension to a class project

Prof. Martin Schulz's list of research ideas

Prof. Abhinav Bhatele's list of research ideas

Prof. Esteban Meneses's list of research ideas

Publication Venues

Resources

Kabré Supercomputer

Prof. Gupta’s tips on presentations and reviews

Additional References

        1. [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM Principles and Practice of Parallel Programming – 2010)
        2. [Accelerators] A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014) 
        3. [Accelerators] GPUs and the future of parallel computing (Stephen W Keckler et al – IEEE Micro – 2011)
        4. [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012)
        5. [Algorithms] Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009).
        6. [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986).
        7. [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988).
        8. [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997).
        9. [Algorithms] SUMMA: scalable universal matrix multiplication algorithm (R. A. van de Geijn and J Watts – Concurrency: Practice and Experience – 1997).
        10. [Algorithms] Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012) 
        11. [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996).
        12. [Architecture] Designing reliable systems from unreliable components: the challenges of transistor variability and degradation (Shekhar Borkar – IEEE Micro – 2005)  
        13. [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992).
        14. [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996).
        15. [Architecture] From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer Magazine – 2011) 
        16. [Architecture] IBM POWER7 multicore server processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011)
        17. [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014)
        18. [Cloud Computing] Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper) 
        19. [Cloud Computing] MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004)
        20. [Cloud Computing] Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008) 
        21. [Epidemic Simulations] EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks (Christopher L Barrett et al – ACM/IEEE Supercomputing – 2008) 
        22. [Epidemic Simulations] Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters (Jae-Seung Yeom et al – IEEE International Parallel and Distributed Processing Symposium – 2014) 
        23. [Epidemic Simulations] PREEMPT: Scalable Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems (Marco Minutoli et al – ACM/IEEE Supercomputing – 2020) 
        24. [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
        25. [Fault Tolerance] Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014)
        26. [Fault Tolerance] Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012) 
        27. [Fault Tolerance] Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998) 
        28. [Graph Processing] Pregel: a system for large-scale graph processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
        29. [Graph Processing] GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014)
        30. [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing (Charles E. Leiserson – IEEE Transactions on Computers – 1985).
        31. [Interconnects] Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009) 
        32. [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993).
        33. [Interconnects] A Comparative Study of Topology Design Approaches for HPC Interconnects (Md Atiqul Mollah et al - CCGRID - 2018).
        34. [Interconnects] Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006) 
        35. [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993).
        36. [Interconnects] There Goes the Neighborhood: Performance Degradation due to Nearby Jobs (Abhinav Bhatele et al - ACM/IEEE Supercomputing – 2013)
        37. [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins)
        38. [Introduction] Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra)
        39. [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM Principles and Practice of Parallel Programming – 2009)
        40. [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015)
        41. [Languages] Compilers and More: MPI+X (Michael Wolfe – HPC Wire – 2014)
        42. [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007).
        43. [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998).
        44. [Load Balancing] A dynamic scheduling strategy for the Chare-Kernel system (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989).
        45. [Memory Consistency] Cohesion: a hybrid memory model for accelerators (John H. Kelm et al – ISCA – 2010).
        46. [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999).
        47. [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993).
        48. [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006).
        49. [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE MICRO – 2008).
        50. [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007).
        51. [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010)
        52. [Programming Models] Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007)
        53. [Programming Models] The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007) 
        54. [Programming Models] Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004) 
        55. [Programming Models] Using Simple Abstraction to Reinvent Computing for Parallelism (Uzi Vishkin - Communications of the ACM - 2011)
        56. [Scheduling] Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001) 
        57. [Scheduling] Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001) 

Benchmarks

        1. HPC Graph Analysis