Instructor: Esteban Meneses, PhD
Email: esteban DOT meneses AT acm DOT org
Institution: Instituto Tecnológico de Costa Rica
Location: Centro Académico Barrio Amón
Time: Thursdays 6:00-9:00pm
Term: I semester, 2021
Teaching Assistant: Alex Saenz (alexsaenz AT estudiantec DOT cr)


Week  Date         Topic                                 Presenters
1     February 18  Introduction
2     February 25  Parallel Programming Design Patterns  Instructor
3     March 4      Shared-memory Programming             Instructor
4     March 11     Distributed-memory Programming        Instructor
5     March 18     Performance Analysis
                   Scientific Visualization
6     March 25     Programming Models                    1. Barnum Castillo
                                                         2. Luis Esquivel
                                                         3. Marco Torres
      April 1      Holy Week
7     April 8      Performance Models                    1. Kevin Umaña
                                                         2. Erick Quesada
                                                         3. Fabián Solano
8     April 15     Midterm Exam
9     April 22     Interconnects                         1. Ricardo Montoya
                                                         2. --------------------
                                                         3. Cristina Soto
10    April 29     Performance Analysis                  1. Jose Pablo Araya
                                                         2. Alejandro Morales
                                                         3. Diego Jiménez
11    May 6        Algorithms                            1. --------------------
                                                         2. --------------------
                                                         3. Izcar Muñoz
12    May 13       Epidemic Simulations                  1. Cristina Soto
                                                         2. Cristian Arias
                                                         3. Ignacio Murillo
13    May 20       Architecture                          1. Steven Solano
                                                         2. Ricardo Montoya
                                                         3. Oscar Blandino
14    May 27       Job Scheduling                        1. Emmanuel Barrantes
                                                         2. Kevin Umaña
                                                         3. Eduardo Chavarría
15    June 3       Fault Tolerance                       1. Jose Rodríguez
                                                         2. Esteban Chavarría
                                                         3. Fabián Solano
16    June 10      Invited Presentation                  Dr. Nikhil Jain, NVIDIA
      June 17




Reading List


  1. Parallelization of a Denoising Algorithm for Tonal Bioacoustic Signals Using OpenACC Directives (Jorge Castro and Esteban Meneses – IEEE International Work Conference on Bioinspired Intelligence, IWOBI – 2018)

Programming Models

  1. Using Simple Abstraction to Reinvent Computing for Parallelism (Uzi Vishkin – Communications of the ACM – 2011)
  2. A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990)
  3. Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004)

Performance Models

  1. Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Concurrency – 1993)
  2. LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996)
  3. Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009)


Interconnects

  1. Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009)
  2. Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008)
  3. There Goes the Neighborhood: Performance Degradation due to Nearby Jobs (Abhinav Bhatele et al – ACM/IEEE Supercomputing – 2013)


Algorithms

  1. How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM Principles and Practice of Parallel Programming – 2009)
  2. Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011)
  3. A Parallel Hashed Oct-Tree N-body Algorithm (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993)


    Architecture

    1. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU (Victor W. Lee et al – International Symposium on Computer Architecture – 2010)
    2. Demystifying GPU Microarchitecture through Microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010)
    3. 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008)

    Epidemic Simulations

    1. EpiSimdemics: An Efficient Algorithm for Simulating the Spread of Infectious Disease over Large Realistic Social Networks (Christopher L. Barrett et al – ACM/IEEE Supercomputing – 2008)
    2. Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters (Jae-Seung Yeom et al – IEEE International Parallel and Distributed Processing Symposium – 2014)
    3. PREEMPT: Scalable Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems (Marco Minutoli et al – ACM/IEEE Supercomputing – 2020)

    Performance Analysis

    1. COZ: Finding Code that Counts with Causal Profiling (Charlie Curtsinger and Emery D. Berger – ACM Symposium on Operating Systems Principles – 2015)
    2. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003)
    3. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses when Reporting Performance Results (Torsten Hoefler and Roberto Belli – ACM/IEEE Supercomputing – 2015)

    Job Scheduling

    1. A Fair Share Scheduler (J. Kay and P. Lauder – Communications of the ACM – 1988)
    2. A Comparative Study of Job Scheduling Strategies in Large-scale Parallel Computational Systems (Aftab Ahmed Chandio et al – IEEE International Conference on Trust, Security and Privacy in Computing and Communications – 2013)
    3. Backfilling Using System-Generated Predictions Rather than User Runtime Estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007)

    Fault Tolerance

    1. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al – IEEE/ACM Supercomputing – 2010)
    2. Assessing Fault Sensitivity in MPI Applications (Charng-da Lu and Daniel A. Reed – ACM/IEEE Supercomputing – 2004)
    3. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015)

    Cloud Computing

    1. Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper)
    2. MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004)
    3. Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008)

    Class Project

    Final report sample: paper

    An interesting extension to a class project

    Prof. Martin Schulz's list of research ideas

    Prof. Abhinav Bhatele's list of research ideas

    Prof. Esteban Meneses's list of research ideas

    Publication Venues

    • Conferencia Latinoamericana de Estudios en Informática (CLEI) 2021
    • Latin America High Performance Computing Conference (CARLA) 2021


    Kabré Supercomputer

    Prof. Gupta’s tips on presentations and reviews

    Additional References

    1. [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM Principles and Practice of Parallel Programming – 2010)
    2. [Accelerators] A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014)
    3. [Accelerators] GPUs and the Future of Parallel Computing (Stephen W. Keckler et al – IEEE Micro – 2011)
    4. [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012)
    5. [Algorithms] Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009)
    6. [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986)
    7. [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988)
    8. [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997)
    9. [Algorithms] SUMMA: Scalable Universal Matrix Multiplication Algorithm (R. A. van de Geijn and J. Watts – Concurrency: Practice and Experience – 1997)
    10. [Algorithms] Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012)
    11. [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996)
    12. [Architecture] Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation (Shekhar Borkar – IEEE Micro – 2005)
    13. [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992)
    14. [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996)
    15. [Architecture] From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer Magazine – 2011)
    16. [Architecture] IBM POWER7 Multicore Server Processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011)
    17. [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014)
    18. [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
    19. [Fault Tolerance] Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014)
    20. [Fault Tolerance] Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012)
    21. [Fault Tolerance] Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998)
    22. [Graph Processing] Pregel: A System for Large-Scale Graph Processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
    23. [Graph Processing] GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014)
    24. [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing (Charles E. Leiserson – IEEE Transactions on Computers – 1985)
    25. [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993)
    26. [Interconnects] Blue Gene/L Torus Interconnection Network (N.R. Adiga et al – IBM Journal of Research and Development – 2005)
    27. [Interconnects] Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006)
    28. [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993)
    29. [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins)
    30. [Introduction] Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra)
    31. [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM Principles and Practice of Parallel Programming – 2009)
    32. [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015)
    33. [Languages] Compilers and More: MPI+X (Michael Wolfe – HPC Wire – 2014)
    34. [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007)
    35. [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998)
    36. [Load Balancing] A Dynamic Scheduling Strategy for the Chare-Kernel System (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989)
    37. [Memory Consistency] Cohesion: A Hybrid Memory Model for Accelerators (John H. Kelm et al – ISCA – 2010)
    38. [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999)
    39. [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993)
    40. [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006)
    41. [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE MICRO – 2008)
    42. [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007)
    43. [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010)
    44. [Programming Models] Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007)
    45. [Programming Models] The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007)
    46. [Programming Models] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012)
    47. [Scheduling] Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001)
    48. [Scheduling] Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001)


    1. HPC Graph Analysis