sc :: elementary

apr'15

30

OpenACC on Jetson TK1 (ARMv7 with CUDA capability)

ARM architecture is becoming very popular and the number of widely available computers based on ARM processor is growing very fast. Starting with entry level single board computers (Raspberry Pi - ARMv6 and Raspberry Pi 2 - ARMv7) to higher end systems like NVIDIA's Jetson TK1 (ARMv7 based Tegra K1 processor), up to whole servers. With the trend of offloading large parts of the calculations to GPUs the CPU quite often play a smaller role. In that context ARM architecture based system may compete with x86 system even thought ARM offers limited capabilities comparing to x86.

As many ARM based systems are coupled with powerful GPUs all programming models for GPUs will be of greater interest. CUDA is fully supported for ARM architectures Jetson TK1 comes with fully working CUDA toolkit that can take advantage of 192 CUDA cores of the Tegra K1 processor (ARMv7 CPU + CUDA capable GPU).

Different situation is with OpenACC. Although NVIDIA is the biggest beneficiary of that standard (which has been created for its hardware), they have not released a free compiler supporting OpenACC. Among the available compilers supporting OpenACC we have CAPS, Cray compiler, PGI and Pathscale which are all commercial. The last two from that list are capable of generating code for ARM architecture. OpenUH is based on Open64 and targets x86 and few other architectures (not ARM). OpenARC has not been publicly released and an initial GCC implementation of OpenACC is only available for x86_64 systems at this moment. accULL is the only free compiler that works on ARM processors as it is based on CUDA and Python translation of the OpenACC directive. If you have ARM computer equipped with CUDA capable GPU and are interested in OpenACC programming follow the instructions below.

The only ARM computer with NVIDIA GPU that we had available is Jetson TK1. The Tegra K1 processor is equipped with CUDA capable GPU and 192 programmable cores. accULL works just fine on Tegra K1 and that setup may very well work as a development or learning machine.

Get the source code of accULL

Clone the source code of accULL from Bitbucket

$ mkdir -p /home/ubuntu/apps/acc
$ cd /home/ubuntu/apps/acc
$ hg clone https://bitbucket.org/ruyman/accull

You will also need to download all dependencies that we will need to make the clean installation of Python and packages used by accULL

$ wget https://www.python.org/ftp/python/2.7.9/Python-2.7.9.tar.xz
$ wget https://pypi.python.org/packages/source/S/Sphinx/Sphinx-1.3.1.tar.gz
$ wget http://lxml.de/files/lxml-3.4.2.tgz
$ wget http://www.dabeaz.com/ply/ply-3.4.tar.gz
$ wget https://pypi.python.org/packages/source/M/Mako/Mako-1.0.1.tar.gz

There are two ways you may proceed with the installation 1) install Python and all packages that accULL needs into a separate tree or 2) use existing Python installation and install required packages. We will follow the first route as it is more general and portable.

Install dependencies

Before we being the installation of accULL we will also need several developers packages from Ubuntu 14.04 repository. These are needed for proper installation of Python

$ sudo apt-get install zlib1g-dev libncurses5-dev
$ sudo apt-get install libreadline6 libreadline6-dev
$ sudo apt-get install libbz2-dev libxml2-dev libxslt-dev

At this moment you may proceed with the compilation and installation of Python. You may use the script below to install Python and its dependencies in an automatic fashion.

#!/bin/bash

ACCULLROOT=/home/ubuntu/apps/acc

mkdir -p $ACCULLROOT/source
cd $ACCULLROOT/source

wget https://www.python.org/ftp/python/2.7.9/Python-2.7.9.tar.xz
wget https://pypi.python.org/packages/source/S/Sphinx/Sphinx-1.3.1.tar.gz
wget http://lxml.de/files/lxml-3.4.2.tgz
wget http://www.dabeaz.com/ply/ply-3.4.tar.gz
wget https://pypi.python.org/packages/source/M/Mako/Mako-1.0.1.tar.gz

tar xvf Python-2.7.9.tar.xz
cd Python-2.7.9/
./configure --prefix=$ACCULLROOT
make -j2 
make install

list=(Sphinx-1.3.1.tar.gz lxml-3.4.2.tgz ply-3.4.tar.gz Mako-1.0.1.tar.gz)
for file in $list
do
  if [ "$file" == "*.tgz" ] 
    name=$(basename $file .tgz)
  then
    name=$(basename $file .tar.gz)
  fi

  tar zxvf $file
  cd $basename
  $ACCULLROOT/bin/python setup.py build
  $ACCULLROOT/bin/python setup.py install
  cd ..

done

At this moment everything what we need is installed and we may proceed with compilation and installation of accULL.

Install accULL OpenACC compiler

Having all dependencies for Python and all packages ready we may install the accULL compiler. Go to the source code directory and run build-accull.sh code.

$ cd /home/ubuntu/apps/acc/accull
$ ./build-accull.sh
...
...
#######################################
       accULL installation completed
#######################################
 * 
 * Add '. /home/ubuntu/apps/acc/accull/env-parameters.sh' to your shell profile.
 * 

To make sure that accull will be able to find all of the dependencies that we have just installed you need to edit file accull and change the value of variable PYTHONBIN to location of the Python binary that we have installed $ACCULLROOT/bin/python.

Test your installation of accULL compiler

The accULL should be now ready to use. We need to source file $ACCULLROOT/accull/env-parameters.sh to be able to use the compiler. After that command accull will be available to invoke the compiler

$ source $ACCULLROOT/accull/env-parameters.sh
$ accull -h

The last command displays all available parameters to accull command.

Directory $ACCULLROOT/accull/yacf/examples/acc contains many examples of simple and advanced OpenACC codes. Take a look as you may find there a trick that you were looking for your code.

We will use a simple test code to try the accull compiler

#include <stdio.h>
#include <math.h>

#define N 10000

int main(){
  
  int i,j;
  double val, tmp1, tmp2;
  double total = 0.0;

  #pragma acc parallel
  #pragma acc loop reduction(+:total) 
  for(i=0; i<N; i++){
    for(j=0; j<N; j++){

      tmp1 = sin((double)(i+j));
      tmp2 = cos((double)(i+j));

      val = tmp1*tmp1 + tmp2*tmp2;

      total += val/(N*N);
    }
  }

  printf("N value is %d\n",N);
  printf("Total is %f \n",total);

  return 0;

}

Compile the code using your new accULL setup

$ accull -v -o gpu.x  acc_test.c 

You should see verbose message from the accULL compiler with details of every step of the code conversion and compilation

 **************************************** 
                 _    _ _      _          
                | |  | | |    | |         
   __ _  ___ ___| |  | | |    | |         
  / _` |/ __/ __| |  | | |    | |         
 | (_| | (_| (__| |__| | |____| |____     
  \__,_|\___\___|\____/|______|______|    
 **************************************** 
 Release 0.4alpha 

* Verbose output
* Output executable name gpu.x
* Processing acc_test.c to ./accull_YoSTr/acc_test.c
* Building acc_test.c
* Running StS translation 
Parsing acc_test.c ....   OK 
 Migrating to Internal Representation ....  OK 
Mutating ...
 [c2frangollo] Generating code for _acc_test_random around line 12 
Warning: Following variables were not declared in any clause
val, i, j, tmp1, tmp2,  at region _acc_test_random
 Acc parallel loop i with nesting level 1
***** Kernel Statistics *******
 Flop count: (0 + ((10000 - 0) / 1)) * ((0 + ((10000 - 0) / 1)) * (1 + (1 + (1 + (1 + (1 + (1 + (1 + 0))))))))
 Mem Access: (0 + ((10000 - 0) / 1)) * ((0 + ((10000 - 0) / 1)) * 0)
 Div. Factor: 0
 Parallel kernel 1D: _acc_test_random_P_2
 Gangs: auto (resolved in execution time)
 Workers: auto (resolved in execution time)
 Following functions were called but not declared: 
['sin', 'cos']
 Will assume they are native 
 Following functions were called but not declared: 
['sin', 'cos']
 Will assume they are native 
 Following functions were called but not declared: 
['sin', 'cos']
 Will assume they are native 
 - * - 
 Filter AccDataFilter called 
 - * - 
OK
* StS translation appears to be OK
* Building the project directory ./accull_YoSTr/acc_test.c
* Building appears to be OK
* Copying result files back to origin
/usr/local/cuda/bin/nvcc -Xcompiler -fopenmp -lrt -lcuda -lrt -lcuda -arch=sm_20 acc_test.o _acc_test_random_P_2.o /home/ubuntu/apps/acc/accull/frangollo/src/libfrangollo.a -o gpu.x
* Finished

You can verify that the executable is linked to CUDA library

$ ldd ./gpu.x 
    librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0xb6eea000)
    libcuda.so.1 => /usr/lib/arm-linux-gnueabihf/tegra/libcuda.so.1 (0xb63b8000)
    libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb639d000)
    libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0xb6392000)
    libstdc++.so.6 => /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 (0xb62e7000)
    libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0xb627b000)
    libgcc_s.so.1 => /lib/arm-linux-gnueabihf/libgcc_s.so.1 (0xb6259000)
    libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb6172000)
    /lib/ld-linux-armhf.so.3 (0xb6f13000)

Execute the code and time it

$ time ./gpu.x 
N value is 10000
Total is 1.000000 

real    0m0.880s
user    0m0.024s
sys 0m0.040s

To compare this to a CPU version use gcc to compile the host version of this code

$ gcc -O3 acc_test.c -lm -o cpu.x

and execute it to get the comparison of the execution time

$ time ./cpu.x
N value is 10000
Total is 1.000000 

real    0m42.401s
user    0m42.336s
sys 0m0.017s

The short test above shows that OpenACC on ARM based host is possible and there is definite speedup. The small example above was more than 40 times faster. accULL is limited to OpenACC 1.0 and is supporting only C language.

All instructions and timings above are for NVIDIA Jetson TK1 platform with Tegra K1 processor. The instruction is however general and should work on all ARM systems with CUDA capable device. As more servers that connect ARM with CUDA devices are available and NVIDIA is fully supporting this platform for next generation of GPUs (Pascal) we may expect even more hardware like that.