CUDA Application Design and Development

Author: Rob Farber

Published by Elsevier Inc

Foreword

Arguably, for any language to be successful, it must be surrounded by an ecosystem of powerful compilers, performance and correctness tools, and optimized libraries. --Jeffrey S. Vetter

Preface

CUDA (Compute Unified Device Architecture)

harness those tens of thousands of threads of execution (I like this verb.)

Book organization

Chapter 1. Introduces basic CUDA concepts and the tools needed to build and debug CUDA applications. Simple examples are provided that demonstrate both the Thrust C++ and C runtime APIs. Three simple rules for high-performance GPU programming are introduced.

Chapter 2. Using only techniques introduced in Chapter 1, this chapter provides a complete, general-purpose machine-learning and optimization framework that can run 341 times faster than a single core of a conventional processor. Core concepts in machine learning and numerical optimization are also covered, which will be of interest to those who desire the domain knowledge as well as the ability to program GPUs.

Chapter 3. Profiling is the focus of this chapter, as it is an essential skill in high-performance programming. The CUDA profiling tools are introduced and applied to the real-world example from Chapter 2. Some surprising bottlenecks in the Thrust API are uncovered. Introductory data-mining techniques are discussed and data-mining functors for both Principal Components Analysis and Nonlinear Principal Components Analysis are provided, so this chapter should be of interest to users as well as programmers.

Chapter 4. The CUDA execution model is the topic of this chapter. Anyone who wishes to get peak performance from a GPU must understand the concepts covered in this chapter. Examples and profiling output are provided to help understand both what the GPU is doing and how to use the existing tools to see what is happening.

Chapter 5. CUDA provides several types of memory on the GPU. Each type of memory is discussed, along with its advantages and disadvantages.

Chapter 6. With a performance difference of over three orders of magnitude between the fastest and slowest GPU memory, efficiently using memory on the GPU is the only path to high performance. This chapter discusses techniques and provides profiler output to help you understand and monitor how efficiently your applications use memory. A general functor-based example is provided to teach you how to write your own generic methods, like those in the Thrust API.

Chapter 7. GPUs provide multiple forms of parallelism, including multiple GPUs, asynchronous kernel execution, and a Unified Virtual Address (UVA) space. This chapter provides examples and profiler output to help you understand and utilize all forms of GPU parallelism.

Chapter 8. CUDA has matured to become a viable platform for all application development for both GPU and multicore processors. Pathways to multiple CUDA backends are discussed, along with examples and profiler output that show how to run effectively in heterogeneous multi-GPU environments. CUDA libraries are covered, as is how to interface CUDA and GPU computing with other high-level languages like Python, Java, R, and FORTRAN.

Chapter 9. With the focus on the use of CUDA to accelerate computational tasks, it is easy to forget that GPU technology is also a splendid platform for visualization. This chapter discusses primitive restart and how it can dramatically accelerate visualization and gaming applications. A complete working example is provided that allows the reader to create and fly around in a 3D world. Profiler output is used to demonstrate why primitive restart is so fast. The teaching framework from this chapter is extended to work with live video streams in Chapter 12.

Chapter 10. To teach scalability, as well as performance, the example from Chapter 3 is extended to use MPI (Message Passing Interface). A variant of this example code has demonstrated near-linear scalability to 500 GPGPUs (with a peak of over 500,000 single-precision gigaflops) and delivered over one-third of a petaflop (a petaflop being 10^15 floating-point operations per second) using 60,000 x86 processing cores.

Chapter 11. No book can cover all aspects of the CUDA tidal wave. This is a survey chapter that points the way to other projects that provide free working source code for a variety of techniques, including Support Vector Machines (SVM), Multi-Dimensional Scaling (MDS), mutual information, force-directed graph layout, molecular modeling, and others. Knowledge of these projects, and of how to interface with other high-level languages as discussed in Chapter 8, will help you mature as a CUDA developer.

Chapter 12. A working real-time video streaming example for vision recognition based on the visualization framework in Chapter 9 is provided. All that is needed is an inexpensive webcam or a video file so that you too can work with real-time vision recognition. This example was designed for teaching, so it is easy to modify. Robotics, augmented reality games, and data fusion for heads-up displays are obvious extensions to the working example and technology discussion in this chapter.