Instructor Esteban Meneses, PhD


esteban DOT meneses AT acm DOT org
Institution Instituto Tecnológico de Costa Rica
Location Centro Académico Barrio Amón, room 108
Time Fridays 6:00-8:50pm
Term II semester, 2016


Week  Date          Topic                             Presenters
1     August 5      Introduction                      Instructor
2     August 12     History                           Instructor
3     August 19     Parallel Programming Patterns     Instructor
4     August 26     Shared-memory Programming         Instructor
5     September 2   Distributed-memory Programming    Instructor
6     September 9   Parallel-objects Programming      Instructor
7     September 16
8     September 23  Performance Models                1. Esteban Meneses, 2. Carlos Gamboa, 3. Carlos Gómez, 4. Manfred Calvo
9     September 30
10    October 7     Accelerators                      1. Carlos Gamboa, 2. Manfred Calvo, 3. Carlos Gómez
11    October 14
12    October 21    Programming Models                2. Manfred Calvo, 3. Carlos Gamboa, 4. Carlos Gómez
13    October 28
14    November 4    Algorithms                        1. Carlos Gómez, 3. Manfred Calvo, 4. Carlos Gamboa
15    November 11
      November 18
      November 25   Project Presentations             Students



Cadejos Cluster

Previous offerings: 2015-semester2

Final report sample: 2015-semester2

An interesting extension to a class project


Prof. Gupta’s tips on presentations and reviews


Reading List


Exascale Computing and Big Data (Daniel A. Reed and Jack Dongarra) HTML


Accelerators

  1. Demystifying GPU Microarchitecture through Microbenchmarking (Henry Wong et al – IEEE International Symposium on Performance Analysis of Systems and Software – 2010) PDF
  2. Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Devesh Tiwari et al – ACM/IEEE Supercomputing – 2015) PDF
  3. A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters (Matthias Noack et al – ACM/IEEE Supercomputing – 2014) PDF


Algorithms

  1. How Much Parallelism is There in Irregular Applications? (Milind Kulkarni et al – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2009) PDF
  2. Faster Topology-aware Collective Algorithms Through Non-minimal Communication (Paul Sack and William Gropp – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2012) PDF
  3. Parallel Random Numbers: As Easy as 1, 2, 3 (John K. Salmon et al – ACM/IEEE Supercomputing – 2011) PDF
  4. Millisecond-Scale Molecular Dynamics Simulations on Anton (David E. Shaw et al – ACM/IEEE Supercomputing – 2009) PDF ALT


Architecture

  1. IBM POWER7 multicore server processor (B. Sinharoy et al – IBM Journal of Research and Development – 2011) PDF
  2. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation (Shekhar Borkar – IEEE Micro – 2005) PDF
  3. 3D-Stacked Memory Architectures for Multi-core Processors (Gabriel H. Loh – International Symposium on Computer Architecture – 2008) PDF
  4. From Microprocessors to Nanostores: Rethinking Data-Centric Systems (Parthasarathy Ranganathan – IEEE Computer Magazine – 2011) PDF

Cloud Computing

  1. Above the Clouds: A Berkeley View of Cloud Computing (Michael Armbrust et al – White Paper) PDF
  2. MapReduce: Simplified Data Processing on Large Clusters (Jeffrey Dean and Sanjay Ghemawat – USENIX Symposium on Operating Systems Design & Implementation – 2004) PDF
  3. Improving MapReduce Performance in Heterogeneous Environments (Matei Zaharia et al – USENIX Symposium on Operating Systems Design & Implementation – 2008) PDF

Fault Tolerance

  1. Diskless Checkpointing (James S. Plank et al – IEEE Transactions on Parallel and Distributed Systems – 1998) PDF ALT
  2. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System (Adam Moody et al – IEEE/ACM Supercomputing – 2010) PDF ALT
  3. Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers (Esteban Meneses et al – IEEE Transactions on Parallel and Distributed Systems – 2014) PDF
  4. Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems (Jinsuk Chung et al – IEEE/ACM Supercomputing – 2012) PDF

Graph Processing

  1. Pregel: a system for large-scale graph processing (Grzegorz Malewicz et al – ACM SIGMOD International Conference on Management of Data – 2010)
  2. GraphX: Graph Processing in a Distributed Dataflow Framework (Joseph Gonzalez et al – USENIX Symposium on Operating Systems Design & Implementation – 2014) PDF


Interconnects

  1. Blue Gene/L torus interconnection network (N.R. Adiga et al – IBM Journal of Research and Development – 2005) PDF
  2. Technology-Driven, Highly-Scalable Dragonfly Topology (John Kim et al – International Symposium on Computer Architecture – 2008) PDF
  3. Adaptive Routing in High-Radix Clos Network (John Kim et al – ACM/IEEE Supercomputing – 2006) PDF
  4. Communication Requirements and Interconnect Optimization for High-End Scientific Applications (Shoaib Kamil et al – IEEE Transactions on Parallel and Distributed Systems – 2009) PDF

Performance Models

  1. Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures (Ananth Grama et al – IEEE Concurrency – 1993) PDF ALT
  2. LogP: A Practical Model of Parallel Computation (David E. Culler et al – Communications of the ACM – 1996) PDF
  3. Roofline: An Insightful Visual Performance Model for Multicore Architectures (Samuel Williams et al – Communications of the ACM – 2009) PDF
  4. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q (Fabrizio Petrini et al – ACM/IEEE Supercomputing – 2003) PDF

Programming Models

  1. A Bridging Model for Parallel Computation (Leslie G. Valiant – Communications of the ACM – 1990) PDF
  2. Parallel Programmability and the Chapel Language (Brad Chamberlain et al – International Journal of High Performance Computing Applications – 2007) PDF ALT
  3. Stream Processors: Programmability with Efficiency (William J. Dally et al – ACM Queue – 2004) PDF
  4. The Foundations for Scalable Multi-core Software in Intel® Threading Building Blocks (Alexey Kukanov et al – Intel Technology Journal – 2007) PDF


Scheduling

  1. A fair share scheduler (J Kay and P Lauder – Communications of the ACM – 1988) PDF ALT
  2. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling (A. Mu’alem and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2001) PDF ALT
  3. Core Algorithms of the Maui Scheduler (D. Jackson, Q. Snell, and M. Clement – International Workshop on Job Scheduling Strategies for Parallel Processing – 2001) PDF
  4. Backfilling using system-generated predictions rather than user runtime estimates (D. Tsafrir, Y. Etsion, and D. Feitelson – IEEE Transactions on Parallel and Distributed Systems – 2007) PDF ALT


Additional References

  1. [Accelerators] An Adaptive Performance Modeling Tool for GPU Architectures (Sara S. Baghsorkhi et al – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2010) PDF
  2. [Accelerators] GPUs and the future of parallel computing (Stephen W Keckler et al – IEEE Micro – 2011) PDF
  3. [Accelerators] A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors (Mark Gebhart et al – ACM Transactions on Computer Systems – 2012) PDF
  4. [Algorithms] A Parallel Hashed Oct-Tree N-body Algorithm (M.S. Warren and J.K. Salmon – ACM/IEEE Supercomputing – 1993).
  5. [Algorithms] Data Parallel Algorithms (W. Daniel Hillis and Guy L. Steele – Communications of the ACM – 1986).
  6. [Algorithms] Development of Parallel Methods for a 1024-processor Hypercube (John L. Gustafson et al – SIAM Journal on Scientific and Statistical Computing – 1988).
  7. [Algorithms] Highly Scalable Parallel Algorithms for Sparse Matrix Factorization (Anshul Gupta et al – IEEE Transactions on Parallel and Distributed Systems – 1997).
  8. [Algorithms] SUMMA: scalable universal matrix multiplication algorithm (R. A. van de Geijn and J Watts – Concurrency: Practice and Experience – 1997).
  9. [Architecture] The MIPS R10000 Superscalar Microprocessor (Kenneth C. Yeager – IEEE Micro – 1996).
  10. [Architecture] The Stanford DASH Multiprocessor (Daniel Lenoski et al – IEEE Computer – 1992).
  11. [Architecture] Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor (Dean M. Tullsen et al – ISCA – 1996).
  12. [Cloud Computing] Cloud-driven HPC (Amazon Web Services – HPC Wire – 2014) PDF
  13. [Fault Tolerance] MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes (George Bosilca et al – IEEE/ACM Supercomputing – 2002)
  14. [Interconnects] Fat-trees: Universal Networks for Hardware-efficient Supercomputing (Charles E. Leiserson – IEEE Transactions on Computers – 1985).
  15. [Interconnects] A Survey of Wormhole Routing Techniques in Direct Networks (Lionel M. Ni and Philip McKinley – IEEE Computer – 1993).
  16. [Interconnects] Deadlock-free Adaptive Routing in Multicomputer Networks Using Virtual Channels (William J. Dally and Hiromichi Aoki – IEEE Transactions on Parallel and Distributed Systems – 1993).
  17. [Introduction] How Will Rebooting Computing Help IoT? (Bichlien Hoang and Sin-Kuen Hawkins) PDF
  18. [Languages] OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization (Seyong Lee et al – ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming – 2009) PDF
  19. [Languages] Compilers and More: The Past, Present and Future of Parallel Loops (Michael Wolfe – HPC Wire – 2015) HTML
  20. [Languages] Compilers and More: MPI+X (Michael Wolfe – HPC Wire – 2014) HTML
  21. [Load Balancing] Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors (Sanjeev Kumar et al – ISCA – 2007).
  22. [Load Balancing] The Implementation of the Cilk-5 Multithreaded Language (Matteo Frigo et al – PLDI – 1998).
  23. [Load Balancing] A dynamic scheduling strategy for the Chare-Kernel system (Wennie Shu and Laxmikant V. Kale – IEEE/ACM Supercomputing – 1989).
  24. [Memory Consistency] Cohesion: a hybrid memory model for accelerators (John H. Kelm et al – ISCA – 2010).
  25. [Memory Consistency] Comparative Evaluation of Fine- and Coarse-Grain Approaches for Software Distributed Shared Memory (Sandhya Dwarkadas et al – HPCA – 1999).
  26. [Memory Consistency] Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors (Mark D. Hill et al – ACM Transactions on Computer Systems – 1993).
  27. [Memory System Design] Sequoia: Programming the Memory Hierarchy (Kayvon Fatahalian et al – IEEE/ACM Supercomputing – 2006).
  28. [Memory System Design] On-chip Memory System Optimization Design for FT64 Scientific Stream Accelerator (Mei Wen et al – IEEE MICRO – 2008).
  29. [Memory System Design] Comparing Memory Systems for Chip Multiprocessors (Jacob Leverich et al – ISCA – 2007).
  30. [Performance Models] Characterizing the Influence of System Noise on Large-Scale Applications by Simulation (Torsten Hoefler et al – IEEE/ACM Supercomputing – 2010) PDF
  31. [Programming Models] A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era (Javier Diaz et al – IEEE Transactions on Parallel and Distributed Systems – 2012) PDF


  1. HPC Graph Analysis HTML