Instructor: Esteban Meneses, PhD
Email: esteban DOT meneses AT acm DOT org
Institution: Instituto Tecnológico de Costa Rica
Location: Centro Académico Barrio Amón
Time: Thursdays 6:00-9:00pm
Term: I semester, 2021
Teaching Assistant: Alex Saenz (alexsaenz AT estudiantec DOT cr)


Week  Date         Topic                                 Presenters
1     February 18  Introduction
2     February 25  Parallel Programming Design Patterns  Instructor
3     March 4      Shared-memory Programming             Instructor
4     March 11     Distributed-memory Programming        Instructor
5     March 18     Performance Analysis
                   Scientific Visualization
6     March 25     Programming Models                    1. Barnum Castillo
                                                         2. Luis Esquivel
                                                         3. Marco Torres
      April 1      Holy Week
7     April 8      Performance Models                    1. Kevin Umaña
                                                         2. Erick Quesada
                                                         3. Fabián Solano
8     April 15     Midterm Exam
9     April 22     Interconnects                         1. Ricardo Montoya
                                                         2. --------------------
                                                         3. Cristina Soto
10    April 29     Performance Analysis                  1. Jose Pablo Araya
                                                         2. Alejandro Morales
                                                         3. Diego Jiménez
11    May 6        Algorithms                            1. --------------------
                                                         2. --------------------
                                                         3. Izcar Muñoz
12    May 13       Epidemic Simulations                  1. Cristina Soto
                                                         2. Cristian Arias
                                                         3. Ignacio Murillo
13    May 20       Architecture                          1. Steven Solano
                                                         2. Ricardo Montoya
                                                         3. Oscar Blandino
14    May 27       Job Scheduling                        1. Emmanuel Barrantes
                                                         2. Kevin Umaña
                                                         3. Eduardo Chavarría
15    June 3       Fault Tolerance                       1. Jose Rodríguez
                                                         2. Esteban Chavarría
                                                         3. Fabián Solano
16    June 10      Invited Presentation                  Dr. Nikhil Jain, NVIDIA
      June 17




Reading List


  1. Parallelization of a Denoising Algorithm for Tonal Bioacoustic Signals Using OpenACC Directives (Jorge Castro and Esteban Meneses – IEEE International Work Conference on Bioinspired Intelligence, IWOBI – 2018) HTML

Programming Models

  1. Using Simple Abstraction to Reinvent Computing for Parallelism (Uzi Vishkin - Communications of the ACM - 2011)
  2. A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990)
  3. Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004) 

Performance Models

  1. Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Concurrency – 1993) 
  2. LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996) 
  3. Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009)


Interconnects

  1. Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009)
  2. Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008)
  3. There Goes the Neighborhood: Performance Degradation due to Nearby Jobs (Abhinav Bhatele et al – ACM/IEEE Supercomputing – 2013)


Algorithms

  1. How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM Principles and Practices of Parallel Programming – 2009)
  2. Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011) 
  3. A Parallel Hashed Oct-Tree N-body Algorithm (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993)


    Architecture

    1. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU (Victor W Lee et al – International Symposium on Computer Architecture – 2010)
    2. Demystifying GPU Microarchitecture through Microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010)
    3. 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008)

    Epidemic Simulations

    1. EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks (Christopher L Barrett et al – ACM/IEEE Supercomputing – 2008) 
    2. Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters (Jae-Seung Yeom et al – IEEE International Parallel and Distributed Processing Symposium – 2014) 
    3. PREEMPT: Scalable Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems (Marco Minutoli et al – ACM/IEEE Supercomputing – 2020) 

    Performance Analysis

    1. COZ: Finding Code that Counts with Causal Profiling (Charlie Curtsinger and Emery D. Berger – ACM Symposium on Operating Systems Principles – 2015)
    2. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003) 
    3. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results (Torsten Hoefler and Roberto Belli – ACM/IEEE Supercomputing – 2015)

    Job Scheduling

    1. A fair share scheduler (J Kay and P Lauder – Communications of the ACM – 1988) 
    2. A Comparative Study of Job Scheduling Strategies in Large-scale Parallel Computational Systems (Aftab Ahmed Chandio et al – IEEE International Conference on Trust, Security and Privacy in Computing and Communications – 2013)
    3. Backfilling using system-generated predictions rather than user runtime estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007) 

    Fault Tolerance

    1. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al –  IEEE/ACM Supercomputing – 2010) 
    2. Assessing Fault Sensitivity in MPI Applications (Charng-da Lu and Daniel A. Reed  – ACM/IEEE Supercomputing – 2004) 
    3. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015) 

    Cloud Computing

    1. Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper) 
    2. MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004)
    3. Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008) 

    Class Project

    Final report sample: paper

    An interesting extension to a class project

    Prof. Martin Schulz's list of research ideas

    Prof. Abhinav Bhatele's list of research ideas

    Prof. Esteban Meneses's list of research ideas

    Publication Venues


    Kabré Supercomputer

    Prof. Gupta’s tips on presentations and reviews

    Additional References

    1. [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM Principles and Practices of Parallel Programming – 2010) 
    2. [Accelerators] A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014) 
    3. [Accelerators] GPUs and the future of parallel computing (Stephen W Keckler et al – IEEE Micro – 2011) PDF
    4. [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012) PDF
    5. [Algorithms] Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009).
    6. [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986).
    7. [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988).
    8. [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997).
    9. [Algorithms] SUMMA: scalable universal matrix multiplication algorithm (R. A. van de Geijn and J Watts – Concurrency: Practice and Experience – 1997).
    10. [Algorithms] Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012) 
    11. [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996).
    12. [Architecture] Designing reliable systems from unreliable components: the challenges of transistor variability and degradation (Shekhar Borkar – IEEE Micro – 2005)  
    13. [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992).
    14. [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996).
    15. [Architecture] From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer Magazine – 2011) 
    16. [Architecture] IBM POWER7 multicore server processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011)
    17. [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014) PDF
    18. [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
    19. [Fault Tolerance] Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014)
    20. [Fault Tolerance] Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012) 
    21. [Fault Tolerance] Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998) 
    22. [Graph Processing] Pregel: a system for large-scale graph processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
    23. [Graph Processing] GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014) PDF
    24. [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing (Charles E. Leiserson – IEEE Transactions on Computers – 1985).
    25. [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993).
    26. [Interconnects] Blue Gene/L torus interconnection network (N.R. Adiga et al – IBM Journal of Research and Development – 2005).
    27. [Interconnects] Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006) 
    28. [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993).
    29. [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins) PDF
    30. [Introduction] Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra) HTML
    31. [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM Principles and Practices of Parallel Programming – 2009) PDF
    32. [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015) HTML
    33. [Languages] Compilers and More: MPI+X (Michael Wolfe – HPC Wire – 2014) HTML
    34. [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007).
    35. [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998).
    36. [Load Balancing] A dynamic scheduling strategy for the Chare-Kernel system (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989).
    37. [Memory Consistency] Cohesion: a hybrid memory model for accelerators (John H. Kelm et al – ISCA – 2010).
    38. [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999).
    39. [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993).
    40. [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006).
    41. [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE MICRO – 2008).
    42. [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007).
    43. [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010) PDF
    44. [Programming Models] Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007) PDF ALT
    45. [Programming Models] The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007) 
    46. [Programming Models] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012) 
    47. [Scheduling] Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001) 
    48. [Scheduling] Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001) 


    1. HPC Graph Analysis HTML