
Programming massively parallel processors : a hands-on approach / David B. Kirk and Wen-mei W. Hwu.

By: Kirk, David B.
Material type: Text
Publication details: Burlington : Morgan Kaufmann ; Amsterdam [etc.] : Elsevier, c2010.
Description: XVIII, 258 p. : ill. ; 24 cm
ISBN:
  • 9780123814722 (pbk.)
  • 0123814723 (pbk.)
Contained works:
  • Hwu, Wen-mei W [aut]
Subject(s):
Holdings
Item type      Current library                   Call number  Copy number  Status     Date due  Barcode
Standard Loan  Thurles Library Main Collection   004.35 KIR   1            Available            39002100501106
Standard Loan  Thurles Library Main Collection   004.35 KIR   1            Available            39002100501098

Enhanced descriptions from Syndetics:

Programming Massively Parallel Processors discusses the basic concepts of parallel programming and GPU architecture. Various techniques for constructing parallel programs are explored in detail. Case studies demonstrate the development process, which begins with computational thinking and ends with effective and efficient parallel programs.

This book describes computational thinking techniques that will enable students to think about problems in ways that are amenable to high-performance parallel computing. It utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments. Students learn how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.
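
As a rough illustration of the CUDA programming model the description refers to (and which Chapters 3 and 4 of the book develop in detail), the following is a minimal vector-addition sketch, not taken from the book itself; the kernel name, vector size, and block size are illustrative choices.

    // Minimal CUDA sketch: one thread computes one output element.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Allocate and initialize host data.
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device memory and copy inputs to the GPU.
        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch one thread per element, 256 threads per block.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        // Copy the result back and spot-check one value.
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }

The host-side allocate/copy/launch/copy-back structure shown here corresponds to the topics listed in the table of contents below (device memories and data transfer, kernel functions and threading, thread organization).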

This book is recommended for advanced students, software engineers, programmers, and hardware engineers.

Table of contents provided by Syndetics

  • Preface (p. xi)
  • Acknowledgments (p. xvii)
  • Dedication (p. xix)
  • Chapter 1 Introduction (p. 1)
  • 1.1 GPUs as Parallel Computers (p. 2)
  • 1.2 Architecture of a Modern GPU (p. 8)
  • 1.3 Why More Speed or Parallelism? (p. 10)
  • 1.4 Parallel Programming Languages and Models (p. 13)
  • 1.5 Overarching Goals (p. 15)
  • 1.6 Organization of the Book (p. 16)
  • Chapter 2 History Of GPU Computing (p. 21)
  • 2.1 Evolution of Graphics Pipelines (p. 21)
  • 2.1.1 The Era of Fixed-Function Graphics Pipelines (p. 22)
  • 2.1.2 Evolution of Programmable Real-Time Graphics (p. 26)
  • 2.1.3 Unified Graphics and Computing Processors (p. 29)
  • 2.1.4 GPGPU: An Intermediate Step (p. 31)
  • 2.2 GPU Computing (p. 32)
  • 2.2.1 Scalable GPUs (p. 33)
  • 2.2.2 Recent Developments (p. 34)
  • 2.3 Future Trends (p. 34)
  • Chapter 3 Introduction To CUDA (p. 39)
  • 3.1 Data Parallelism (p. 39)
  • 3.2 CUDA Program Structure (p. 41)
  • 3.3 A Matrix-Matrix Multiplication Example (p. 42)
  • 3.4 Device Memories and Data Transfer (p. 46)
  • 3.5 Kernel Functions and Threading (p. 51)
  • 3.6 Summary (p. 56)
  • 3.6.1 Function declarations (p. 56)
  • 3.6.2 Kernel launch (p. 56)
  • 3.6.3 Predefined variables (p. 56)
  • 3.6.4 Runtime API (p. 57)
  • Chapter 4 CUDA Threads (p. 59)
  • 4.1 CUDA Thread Organization (p. 59)
  • 4.2 Using blockIdx and threadIdx (p. 64)
  • 4.3 Synchronization and Transparent Scalability (p. 68)
  • 4.4 Thread Assignment (p. 70)
  • 4.5 Thread Scheduling and Latency Tolerance (p. 71)
  • 4.6 Summary (p. 74)
  • 4.7 Exercises (p. 74)
  • Chapter 5 CUDA Memories (p. 77)
  • 5.1 Importance of Memory Access Efficiency (p. 78)
  • 5.2 CUDA Device Memory Types (p. 79)
  • 5.3 A Strategy for Reducing Global Memory Traffic (p. 83)
  • 5.4 Memory as a Limiting Factor to Parallelism (p. 90)
  • 5.5 Summary (p. 92)
  • 5.6 Exercises (p. 93)
  • Chapter 6 Performance Considerations (p. 95)
  • 6.1 More on Thread Execution (p. 96)
  • 6.2 Global Memory Bandwidth (p. 103)
  • 6.3 Dynamic Partitioning of SM Resources (p. 111)
  • 6.4 Data Prefetching (p. 113)
  • 6.5 Instruction Mix (p. 115)
  • 6.6 Thread Granularity (p. 116)
  • 6.7 Measured Performance and Summary (p. 118)
  • 6.8 Exercises (p. 120)
  • Chapter 7 Floating Point Considerations (p. 125)
  • 7.1 Floating-Point Format (p. 126)
  • 7.1.1 Normalized Representation of M (p. 126)
  • 7.1.2 Excess Encoding of E (p. 127)
  • 7.2 Representable Numbers (p. 129)
  • 7.3 Special Bit Patterns and Precision (p. 134)
  • 7.4 Arithmetic Accuracy and Rounding (p. 135)
  • 7.5 Algorithm Considerations (p. 136)
  • 7.6 Summary (p. 138)
  • 7.7 Exercises (p. 138)
  • Chapter 8 Application Case Study: Advanced MRI Reconstruction (p. 141)
  • 8.1 Application Background (p. 142)
  • 8.2 Iterative Reconstruction (p. 144)
  • 8.3 Computing F^H d (p. 148)
  • Step 1 Determine the Kernel Parallelism Structure (p. 149)
  • Step 2 Getting Around the Memory Bandwidth Limitation (p. 156)
  • Step 3 Using Hardware Trigonometry Functions (p. 163)
  • Step 4 Experimental Performance Tuning (p. 166)
  • 8.4 Final Evaluation (p. 167)
  • 8.5 Exercises (p. 170)
  • Chapter 9 Application Case Study: Molecular Visualization and Analysis (p. 173)
  • 9.1 Application Background (p. 174)
  • 9.2 A Simple Kernel Implementation (p. 176)
  • 9.3 Instruction Execution Efficiency (p. 180)
  • 9.4 Memory Coalescing (p. 182)
  • 9.5 Additional Performance Comparisons (p. 185)
  • 9.6 Using Multiple GPUs (p. 187)
  • 9.7 Exercises (p. 188)
  • Chapter 10 Parallel Programming and Computational Thinking (p. 191)
  • 10.1 Goals of Parallel Programming (p. 192)
  • 10.2 Problem Decomposition (p. 193)
  • 10.3 Algorithm Selection (p. 196)
  • 10.4 Computational Thinking (p. 202)
  • 10.5 Exercises (p. 204)
  • Chapter 11 A Brief Introduction To OpenCL (p. 205)
  • 11.1 Background (p. 205)
  • 11.2 Data Parallelism Model (p. 207)
  • 11.3 Device Architecture (p. 209)
  • 11.4 Kernel Functions (p. 211)
  • 11.5 Device Management and Kernel Launch (p. 212)
  • 11.6 Electrostatic Potential Map in OpenCL (p. 214)
  • 11.7 Summary (p. 219)
  • 11.8 Exercises (p. 220)
  • Chapter 12 Conclusion And Future Outlook (p. 221)
  • 12.1 Goals Revisited (p. 221)
  • 12.2 Memory Architecture Evolution (p. 223)
  • 12.2.1 Large Virtual and Physical Address Spaces (p. 223)
  • 12.2.2 Unified Device Memory Space (p. 224)
  • 12.2.3 Configurable Caching and Scratch Pad (p. 225)
  • 12.2.4 Enhanced Atomic Operations (p. 226)
  • 12.2.5 Enhanced Global Memory Access (p. 226)
  • 12.3 Kernel Execution Control Evolution (p. 227)
  • 12.3.1 Function Calls within Kernel Functions (p. 227)
  • 12.3.2 Exception Handling in Kernel Functions (p. 227)
  • 12.3.3 Simultaneous Execution of Multiple Kernels (p. 228)
  • 12.3.4 Interruptible Kernels (p. 228)
  • 12.4 Core Performance (p. 229)
  • 12.4.1 Double-Precision Speed (p. 229)
  • 12.4.2 Better Control Flow Efficiency (p. 229)
  • 12.5 Programming Environment (p. 230)
  • 12.6 A Bright Outlook (p. 230)
  • Appendix A Matrix Multiplication Host-Only Version Source Code (p. 233)
  • A.1 matrixmul.cu (p. 233)
  • A.2 matrixmul_gold.cpp (p. 237)
  • A.3 matrixmul.h (p. 238)
  • A.4 assist.h (p. 239)
  • A.5 Expected Output (p. 243)
  • Appendix B GPU Compute Capabilities (p. 245)
  • B.1 GPU Compute Capability Tables (p. 245)
  • B.2 Memory Coalescing Variations (p. 246)
  • Index (p. 251)

Author notes provided by Syndetics

David B. Kirk: Chief Scientist and NVIDIA Fellow at NVIDIA, a leader in visual computing technologies.
Wen-mei W. Hwu: Walter J. Sanders III Advanced Micro Devices Endowed Chair in Electrical and Computer Engineering in the Coordinated Science Laboratory of the University of Illinois at Urbana-Champaign.
