

In the previous article we discussed Monte Carlo methods and their implementation in CUDA, focusing on option pricing. Today, we take a step back from finance to introduce a couple of essential topics, which will help us to write more advanced (and efficient!) programs in the future. In subsequent articles I will introduce multi-dimensional thread blocks and shared memory, which will be extremely helpful for several aspects of computational finance, e.g. option pricing under a binomial model and using finite difference methods (FDM) for solving PDEs.

As usual, we will learn how to deal with those subjects in CUDA by coding. In this article we will use a matrix-matrix multiplication as our main guide. By now you should have read my other articles about starting with CUDA, so I will not explain the "routine" part of the code (i.e. everything not relevant to our discussion). But don't worry, at the end of the article you can find the complete code. Also, if you have any doubt, feel free to ask me for help in the comment section.
Matrix-Matrix Multiplication

Before starting, it is helpful to briefly recap how a matrix-matrix multiplication is computed. Let's say we have two matrices, $A$ and $B$. Assume that $A$ is an $n \times m$ matrix, which means that it has $n$ rows and $m$ columns. Also assume that $B$ is an $m \times w$ matrix. The result of the multiplication $A*B$ (which is different from $B*A$!) is an $n \times w$ matrix, which we call $M$. That is, the resulting matrix has the same number of rows as the first matrix $A$ and the same number of columns as the second matrix $B$.

Why does this happen, and how does it work? The answer is the same for both questions. Let's take the cell $1$,$1$ (first row, first column) of $M$. The number inside it after the operation $M=A*B$ is the sum of all the element-wise multiplications of the numbers in $A$, row 1, with the numbers in $B$, column 1. In general, in the cell $i$,$j$ of $M$ we have the sum of the element-wise multiplications of the numbers in the $i$-th row of $A$ with the numbers in the $j$-th column of $B$. The following figure intuitively explains this idea:
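The same rule, written as a formula with the indices defined above:

$$M_{i,j} = \sum_{k=1}^{m} A_{i,k} \, B_{k,j}, \qquad i = 1, \dots, n, \quad j = 1, \dots, w.$$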

It should be pretty clear now why matrix-matrix multiplication is a good example for parallel computation. We have to compute every element of $M$, and each of them is independent of the others, so we can efficiently parallelise. We will see different ways of achieving this. The goal is to add new concepts throughout this article, ending up with a 2D kernel that uses shared memory to optimise the operations.
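To make this concrete before we build up to the optimised version, here is a minimal sketch of the naive approach, in which each thread computes one element of $M$. The kernel name, the float element type and the row-major storage layout are assumptions made for this illustration; it is not the final code of this article:

```cpp
// Naive sketch: one thread per output element of M = A * B.
// A is n x m, B is m x w, M is n x w, all row-major. Illustrative only.
__global__ void matrixMulNaive(const float* A, const float* B, float* M,
                               int n, int m, int w) {
    int row = blockIdx.y * blockDim.y + threadIdx.y; // row of M this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x; // column of M this thread owns

    if (row < n && col < w) {                  // guard threads outside the matrix
        float sum = 0.0f;
        for (int k = 0; k < m; ++k)            // dot product: row of A, column of B
            sum += A[row * m + k] * B[k * w + col];
        M[row * w + col] = sum;
    }
}
```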
Grids And Blocks

After the previous articles, we now have a basic knowledge of CUDA thread organisation, so we can better examine the structure of grids and blocks. When we call a kernel using the <<< >>> instruction, we automatically define dim3-type variables specifying the number of blocks per grid and the number of threads per block. In fact, grids and blocks are 3D arrays of blocks and threads, respectively.
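As a sketch of what such a configuration can look like, the following hypothetical snippet declares a 2D grid of 2D blocks sized to cover the $n \times w$ result matrix and launches the naive kernel sketched earlier; d_A, d_B and d_M are assumed to be already-allocated device pointers:

```cpp
// 16 x 16 = 256 threads per block; unspecified dimensions default to 1.
dim3 threadsPerBlock(16, 16);

// Round up so the grid covers the whole n x w output matrix.
dim3 blocksPerGrid((w + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (n + threadsPerBlock.y - 1) / threadsPerBlock.y);

// Launch configuration: blocks per grid first, then threads per block.
matrixMulNaive<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_M, n, m, w);
```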
