First of all, we may need a quick review of basic CUDA programming and the GPU execution model. Here, I want to use the diagram below to demonstrate how threads work on a GPU, using a very simple array-addition example. In this simple case, we only have a single block on the GPU, which contains, say, 500 threads (the actual number of threads may be larger than 500, but in this example we only need 500 of them). The part above the dashed line refers to the single block. The for loop in the middle of the bottom half is what we would usually write for the array-addition task on a CPU: we loop over all 500 indices and do the additions one by one, in a serial manner. On a GPU, the working scheme is different. Instead of a single 'worker' executing our commands one after another, we now have a whole bunch of 'workers' ready at the same time, waiting for our commands so that each can do its own job independently. Each 'worker' has its own unique ID - the thread index - which tells it which element of the array to work on.
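To make this concrete, here is a minimal sketch of the array-addition example in CUDA. The kernel name `addArrays` and the variable names are my own choices for illustration; the key point is that each thread reads its unique ID from the built-in `threadIdx.x` and handles exactly one element, replacing the serial for loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 500

// Each thread adds one pair of elements.
// threadIdx.x is the thread's unique ID within the block.
__global__ void addArrays(const float *a, const float *b, float *c) {
    int t = threadIdx.x;   // this thread's unique ID
    if (t < N) {           // guard in case more threads are launched than needed
        c[t] = a[t] + b[t];
    }
}

int main() {
    float ha[N], hb[N], hc[N];
    for (int i = 0; i < N; ++i) { ha[i] = i; hb[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(da, ha, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a single block of 500 threads - the picture above the dashed line.
    addArrays<<<1, N>>>(da, db, dc);
    cudaMemcpy(hc, dc, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[499] = %.1f\n", hc[499]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Notice that there is no loop over the array in the kernel: the loop has been 'unrolled' across the 500 threads, which all execute the same code at the same time on different data.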