CUV tutorial
Getting Started with the CUV library
Know where to look up stuff you do not know
Get a feeling for the API documentation. A good place to start is the "Modules" page.
The main data structure
The main, if not the only, CUV class you'll work with is the `tensor`, whose documentation resides in CUV datastructures. A `tensor` is a multi-dimensional array:

```cpp
using namespace cuv;
tensor<float, host_memory_space> vector(extents[5]);       // 5-element vector
tensor<float, host_memory_space> matrix(extents[5][7]);    // 5x7 matrix
tensor<float, host_memory_space> cuboid(extents[5][7][2]); // ...
```
The template arguments denote:

- the type of the elements in the tensor,
- the memory space type, either `host_memory_space` or `dev_memory_space`. The former resides in RAM, the latter in GPU global memory. There are many tutorials available that introduce the GPU memory architecture; a good place to start reading might be this presentation. The technique used here is called tag dispatching, which you'll also use to overload functions based on the argument type.
A `tensor` has a few important member functions:

- `ndim()` returns the number of dimensions
- `shape()` returns an `std::vector<unsigned int>` which contains the sizes of all dimensions
- `shape(dim)` returns the size of dimension `dim`
- `ptr()` returns a pointer to the first element
- `stride(dim)` returns the number of items in linear memory that you have to skip to get to the next value in dimension `dim`
Element access
Single elements can be accessed using the `(...)` operator or the `[...]` operator. The latter accesses the elements of the tensor as if there were only one dimension:

```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[2][3]);
t(1, 0) = 6.f;
std::cout << t(1, 0) << std::endl;
std::cout << t[3] << std::endl; // equivalent!
```
Note that, as in MATLAB and NumPy, element access is slow: on the host it amounts to a function call per element, and on the GPU to an additional copy operation. Thus, you should not write code like this:
```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[1000]);
for (unsigned int i = 0; i < t.shape(0); ++i)
    t[i] = 0.f; // SLOW!
```
Instead, try to use existing functionality in CUV. If that does not work for you, write a function that makes use of `ptr()`, `ndim()`, `shape(dim)`, and `stride(dim)`, avoiding the element access operators and operating directly on the raw memory. Typically you need to write such a function twice, once for the GPU and once for the CPU. We'll leave this topic for another time.
Slicing
Parts of a `tensor` can be represented using a `tensor_view`. The slicing operation returns a `tensor_view`. The view is derived from `tensor`, so all operations on `tensor` can also take views. Note, however, that most operations only work on dense tensors, i.e. tensors where all memory between the first and the last element is part of the tensor.
Extracting a slice is done using the `indices` object and possibly degenerate index ranges, which you may know from MATLAB or NumPy. In the following, `index_range()` means all elements in the range (similar to ":" in MATLAB and NumPy), whereas `index_range(i, j)` means all indices from `i` inclusive to `j` exclusive:
```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[5][8]);

// extract second and third row of t (result has ndim=2)
tensor_view<float, host_memory_space> tv0 = t[indices[index_range(1, 3)][index_range()]];
// equivalent
tensor_view<float, host_memory_space> tv1 = t[indices[index_range(1, 3)]];

// 3rd row of t (result has ndim=2)
tensor_view<float, host_memory_space> tv2 = t[indices[index_range(2, 3)]];

// 3rd row of t (result has ndim=1)
tensor_view<float, host_memory_space> tv3 = t[indices[2][index_range()]];
// equivalent
tensor_view<float, host_memory_space> tv4 = t[indices[2]];
```
Note how a single number instead of a range decreases the number of dimensions of the view, while an `index_range` always keeps the number of dimensions constant.
Operations on tensors
Operations on tensors are split into three main flavours: BLAS-1, BLAS-2, and BLAS-3.¹

- BLAS-1: operations involving two vectors, or operations which work on multi-dimensional tensors as if they were vectors.
- BLAS-2: operations involving a 1-dimensional and an n-dimensional tensor. Examples are: summing the rows of a matrix, or adding a vector to every column of a matrix.
- BLAS-3: operations involving two \(>1\)-dimensional tensors, e.g. the matrix product.
The BLAS-1 functions currently come in three categories, depending on how many tensors are involved in the operation:
- null-ary functors: have no tensor parameters except the one they output, e.g. filling a tensor with zeros.
- scalar functors: have one tensor parameter, e.g. for determining \(\sin(x)\) for every \(x\) in the tensor.
- binary functors: have two tensor parameters, e.g. for determining the pointwise sum.
This split into BLAS levels and functor arities is reflected in the structure of the documentation.
Argument order
Throughout the library, the first argument contains the result of the operation. This example applies the logarithm to `argument` and writes the result to `result`:
```cpp
using namespace cuv;
tensor<float, host_memory_space> result(extents[5]);
tensor<float, host_memory_space> argument(extents[5]);
apply_scalar_functor(result, argument, SF_LOG);
apply_binary_functor(result, argument, argument, BF_ADD);
```
If you want to apply a transformation and write back to the same tensor, you can shorten this to
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_LOG);
apply_binary_functor(argres, argres, BF_ADD);
```
Scalar arguments can be passed after the functor. To add `1` to every element of a tensor, write:
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_ADD, 1.f);
```
For many operations, there are operators defined which are easier to read and do the same thing:
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_ADD, 1.f);
argres += 1.f; // equivalent!
```
Copying Semantics
The assignment operator \(=\) is quite tricky in C++. It can do various things, and CUV has slightly unorthodox behavior.
References
The most trivial case is an 'assignment' to a reference. This simply creates a new name for exactly the same variable:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space>& b = a;
```

- This operation is instantaneous and does not "copy" anything. `a` and `b` are now indistinguishable.
- If `a` goes out of scope, accessing `b` results in undefined behavior.
Different memory spaces
The second case we'll discuss is assignment when the two tensors have different memory spaces:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, dev_memory_space> b = a;
```

- `a` and `b` now have the same value, but are otherwise not connected in any way.
- This is a common operation to get data onto the GPU for fast processing, and back to the CPU to evaluate the results.
- Copying takes time approximately linear in the size of `a`.
- If `b` held any value before, it is discarded.
Same memory space
Our third case is when both tensors are in the same memory space:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space> b = a;
```

- This copies only the meta information, i.e. ndim, shape, strides, etc.
- Copying takes constant time (linear in the number of dimensions).
- The underlying memory is shared, i.e. changing `a` changes `b` and vice versa. However, reshaping `a` does not change the shape of `b`.
- If you want the tensors not to share memory, use `b = a.copy()`.
- If `b` held any value before, it is discarded.
Left-hand-side Views
Finally, there is the case where the left-hand side of the assignment is a view on another tensor in the same memory space. In this case, copying is mandatory, and the shapes on the left- and right-hand sides must be the same:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space> b(extents[10]);
b[indices[index_range(0, 5)]] = a;
```

- This only changes the memory of `b`, by copying the values of `a` into it.
A final word w.r.t. copying: copying only works if the source and the destination are either dense (all memory between the first and the last element belongs to the tensor and is copied) or 2D-copyable (for a matrix: the rows are contiguous, but the stride of the first dimension may be larger than the number of columns).
Footnotes:
1 BLAS is short for Basic Linear Algebra Subprograms, a well-known standard naming scheme for implementations of linear algebra operations.