CUV tutorial
Getting Started with the CUV library
Know where to look up stuff you do not know
Get a feeling for the API documentation. A good place to start is the "Modules" page.
The main data structure
The main, if not the only, CUV class you'll work with is the `tensor`, whose documentation resides in CUV datastructures. A `tensor` is a multi-dimensional array:

```cpp
using namespace cuv;
tensor<float, host_memory_space> vector(extents[5]);       // 5-element vector
tensor<float, host_memory_space> matrix(extents[5][7]);    // 5x7 matrix
tensor<float, host_memory_space> cuboid(extents[5][7][2]); // ...
```
The template arguments denote:

- the type of the elements in the tensor,
- the memory space type, either `host_memory_space` or `dev_memory_space`. The former resides in RAM, the latter in GPU global memory. There are many tutorials available that introduce the GPU memory architecture; a good place to start reading might be this presentation. The technique used here is called tag dispatching, which you'll also use to overload functions based on the argument type.
A `tensor` has a few important member functions:

- `ndim()` returns the number of dimensions
- `shape()` returns an `std::vector<unsigned int>` which contains the sizes of all dimensions
- `shape(dim)` returns the size of dimension `dim`
- `ptr()` returns a pointer to the first element
- `stride(dim)` returns the number of items in linear memory that you have to skip to get to the next value in dimension `dim`
Element access
Single elements can be accessed using the `(...)` operator or the `[...]` operator. The latter accesses the elements of the tensor as if there were only one dimension:

```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[2][3]);
t(1, 0) = 6.f;
std::cout << t(1, 0) << std::endl;
std::cout << t[3] << std::endl; // equivalent!
```
Note that, as in MATLAB and NumPy, element access is slow: on the host it amounts to a function call per element, and on the GPU to an additional copy operation. Thus, you should not write code like this:
```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[1000]);
for (unsigned int i = 0; i < t.shape(0); ++i)
    t[i] = 0.f; // SLOW!
```
Instead, try to use existing functionality in CUV. If that does not work for you, write a function that makes use of `ptr()`, `ndim()`, `shape(dim)`, and `stride(dim)`, avoiding the element access operators and operating directly on the raw memory. Typically you need to write such a function twice, once for the GPU and once for the CPU. We'll leave this topic for another time.
Slicing
Parts of a `tensor` can be represented using a `tensor_view`. The slicing operation returns a `tensor_view`. The view is derived from `tensor`, so all operations on `tensor` can also take views. Note, however, that most operations only work on dense tensors, i.e. tensors where all memory between the first and the last element is part of the tensor.
Extracting a slice is done using the `indices` object and possibly degenerate index ranges, which you may know from MATLAB or NumPy. In the following, `index_range()` means all elements in the range (similar to ":" in MATLAB and NumPy), whereas `index_range(i, j)` means all indices from `i` inclusive to `j` exclusive:
```cpp
using namespace cuv;
tensor<float, host_memory_space> t(extents[5][8]);

// extract second and third row of t (result has ndim=2)
tensor_view<float, host_memory_space> tv0 = t[indices[index_range(1, 3)][index_range()]];
// equivalent
tensor_view<float, host_memory_space> tv1 = t[indices[index_range(1, 3)]];

// 3rd row of t (result has ndim=2)
tensor_view<float, host_memory_space> tv2 = t[indices[index_range(2, 3)]];

// 3rd row of t (result has ndim=1)
tensor_view<float, host_memory_space> tv3 = t[indices[2][index_range()]];
// equivalent
tensor_view<float, host_memory_space> tv4 = t[indices[2]];
```
Note how a single number instead of a range decreases the number of dimensions of the view, while an `index_range` always keeps the number of dimensions constant.
Operations on tensors
Operations on tensors are split into three main flavours: BLAS-1, BLAS-2, and BLAS-3.¹

- BLAS-1: operations involving two vectors, or operations which work on multi-dimensional tensors as if they were vectors.
- BLAS-2: operations involving a 1-dimensional and an n-dimensional tensor. Examples are: summing the rows of a matrix, or adding a vector to every column of a matrix.
- BLAS-3: operations involving two \(>1\)-dimensional tensors, e.g. the matrix product.
The BLAS-1 functions currently come in three categories, depending on how many tensors are involved in the operation:
- null-ary functors: have no tensor parameters except the one they output, e.g. filling a tensor with zeros.
- scalar functors: have one tensor parameter, e.g. for determining \(\sin(x)\) for every \(x\) in the tensor.
- binary functors: have two tensor parameters, e.g. for determining the pointwise sum.
This split into BLAS levels and functor arities is reflected in the structure of the documentation.
Argument order
Throughout the library, the first argument contains the result of the operation. This example applies the logarithm to `argument` and writes the result to `result`:
```cpp
using namespace cuv;
tensor<float, host_memory_space> result(extents[5]);
tensor<float, host_memory_space> argument(extents[5]);
apply_scalar_functor(result, argument, SF_LOG);
apply_binary_functor(result, argument, argument, BF_ADD);
```
If you want to apply a transformation and write back to the same tensor, you can shorten this to
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_LOG);
apply_binary_functor(argres, argres, BF_ADD);
```
Scalar arguments can be passed after the functor. To add `1` to every element of a tensor, write:
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_ADD, 1.f);
```
For many operations, there are operators defined which are easier to read and do the same thing:
```cpp
using namespace cuv;
tensor<float, host_memory_space> argres(extents[5]);
apply_scalar_functor(argres, SF_ADD, 1.f);
argres += 1.f; // equivalent!
```
Copying Semantics
The assignment operator \(=\) is quite tricky in C++. It can do various things, and CUV has slightly unorthodox behavior.
References
The most trivial case is an 'assignment' to a reference. This simply creates a new name for exactly the same variable:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space>& b = a;
```

- This operation is instantaneous and does not "copy" anything. `a` and `b` are now indistinguishable.
- If `a` goes out of scope, accessing `b` results in undefined behavior.
Different memory spaces
The second case we'll discuss is assignment when the two tensors have different memory spaces:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, dev_memory_space> b = a;
```

- `a` and `b` now have the same value, but are otherwise not connected in any way.
- This is a common operation to get data onto the GPU for fast processing, and back to the CPU to evaluate the results.
- Copying takes time approximately linear in the size of `a`.
- If `b` held any value before, it is discarded.
Same memory space
Our third case is when both tensors are in the same memory space:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space> b = a;
```

- This copies only the meta information, i.e. ndim, shape, strides, etc.
- Copying takes constant time (linear in the number of dimensions).
- The underlying memory is shared, i.e. changing `a` changes `b` and vice versa. However, reshaping `a` does not change the shape of `b`.
- If you want the tensors not to share memory, use `b = a.copy()`.
- If `b` held any value before, it is discarded.
Left-hand-side Views
Finally, there is the case where the left-hand side of the assignment is a view on another tensor in the same memory space. In this case, copying is mandatory, and the shapes on the left- and right-hand sides must be the same:

```cpp
using namespace cuv;
tensor<float, host_memory_space> a(extents[5]);
tensor<float, host_memory_space> b(extents[10]);
b[indices[index_range(0, 5)]] = a;
```

- This only changes the memory of `b`, by copying the values of `a` into it.
A final word w.r.t. copying: copying only works if the source and the destination are either dense (all memory between the first and the last element belongs to the tensor and is copied) or 2D-copyable (for a matrix: the rows are contiguous, but the stride of the first dimension may be larger than the number of columns).
Footnotes:
1 BLAS is short for Basic Linear Algebra Subprograms, a well-known standard naming scheme for implementations of linear algebra operations.