Instructor: Esteban Meneses, PhD
Email: esteban DOT meneses AT acm DOT org
Institution: Instituto Tecnológico de Costa Rica
Location: Centro Académico Barrio Amón
Time: Thursdays 6:00-9:00pm
Term: I semester, 2021
Teaching Assistant: Alex Saenz (alexsaenz AT estudiantec DOT cr)


 1  February 18  Introduction
 2  February 25  Parallel Programming Design Patterns            Instructor
 3  March 4      Shared-memory Programming                       Instructor
 4  March 11     Distributed-memory Programming                  Instructor
 5  March 18     Performance Analysis, Scientific Visualization
 6  March 25     Parallel Programming Models                     Students
    April 1      Holy Week
 7  April 8      Performance Models                              Students
 8  April 15     Midterm Exam
 9  April 22     Interconnects                                   Students
10  April 29     Parallel Algorithms                             Students
11  May 6        Parallel Computer Architectures                 Students
12  May 13       Accelerators                                    Students
13  May 20       Fault Tolerance                                 Students
14  May 27       Job Scheduling                                  Students
15  June 3       Graph Processing                                Students
16  June 10      Cloud Computing                                 Students
    June 17



Kabré Supercomputer

Final report sample: paper

An interesting extension to a class project


Prof. Gupta’s tips on presentations and reviews


Reading List


  1. Parallelization of a Denoising Algorithm for Tonal Bioacoustic Signals Using OpenACC Directives (Jorge Castro and Esteban Meneses – IEEE International Work Conference on Bioinspired Intelligence, IWOBI – 2018) HTML
  2. Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra) HTML


Accelerators

  1. Demystifying GPU Microarchitecture through Microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010) PDF
  2. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015) PDF
  3. A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014) PDF


Parallel Algorithms

  1. How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM Principles and Practice of Parallel Programming – 2009) PDF
  2. Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012) PDF
  3. Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011) PDF
  4. Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009) PDF ALT


Parallel Computer Architectures

  1. IBM POWER7 multicore server processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011) PDF
  2. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation (Shekhar Borkar – IEEE Micro – 2005) PDF
  3. 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008) PDF
  4. From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer Magazine – 2011) PDF

Cloud Computing

  1. Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper) PDF
  2. MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004) PDF
  3. Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008) PDF

Fault Tolerance

  1. Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998) PDF ALT
  2. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al – IEEE/ACM Supercomputing – 2010) PDF ALT
  3. Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014) PDF
  4. Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012) PDF

Graph Processing

  1. Pregel: a system for large-scale graph processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
  2. GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014) PDF


Interconnects

  1. Blue Gene/L torus interconnection network (N.R. Adiga et al – IBM Journal of Research and Development – 2005) PDF
  2. Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008) PDF
  3. Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006) PDF
  4. Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009) PDF

Performance Models

  1. Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Concurrency – 1993) 
  2. LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996) 
  3. Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009)

Parallel Programming Models

  1. Using Simple Abstraction to Reinvent Computing for Parallelism (Uzi Vishkin – Communications of the ACM – 2011)
  2. A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990)
  3. Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004) 


Job Scheduling

  1. A fair share scheduler (J Kay and P Lauder – Communications of the ACM – 1988) PDF ALT
  2. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001) PDF ALT
  3. Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001) PDF
  4. Backfilling using system-generated predictions rather than user runtime estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007) PDF ALT


Additional References

  1. [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM Principles and Practice of Parallel Programming – 2010) PDF
  2. [Accelerators] GPUs and the future of parallel computing (Stephen W Keckler et al – IEEE Micro – 2011) PDF
  3. [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012) PDF
  4. [Algorithms] A Parallel Hashed Oct-Tree N-body Algorithm  (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993).
  5. [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986).
  6. [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube  (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988).
  7. [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997).
  8. [Algorithms] SUMMA: scalable universal matrix multiplication algorithm (R. A. van de Geijn and J Watts – Concurrency: Practice and Experience – 1997).
  9. [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996).
  10. [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992).
  11. [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996).
  12. [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014) PDF
  13. [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
  14. [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing  (Charles E. Leiserson – IEEE Transactions on Computers – 1985).
  15. [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993).
  16. [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993).
  17. [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins) PDF
  18. [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM Principles and Practice of Parallel Programming – 2009) PDF
  19. [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015) HTML
  20. [Languages] Compilers and More: MPI+X  (Michael Wolfe – HPC Wire – 2014) HTML
  21. [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007).
  22. [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998).
  23. [Load Balancing] A dynamic scheduling strategy for the Chare-Kernel system (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989).
  24. [Memory Consistency] Cohesion: a hybrid memory model for accelerators (John H. Kelm et al – ISCA – 2010).
  25. [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999).
  26. [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993).
  27. [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006).
  28. [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE MICRO – 2008).
  29. [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007).
  30. [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010) PDF
  31. [Performance Models] The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003) PDF
  32. [Programming Models] Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007) PDF ALT
  33. [Programming Models] The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007) 
  34. [Programming Models] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012) PDF


  1. HPC Graph Analysis HTML