# Rootbeer

The Rootbeer GPU Compiler lets you use GPUs from within Java. It allows you to use almost anything from Java on the GPU:

  1. Composite objects with methods and fields
  2. Static and instance methods and fields
  3. Arrays of primitive and reference types of any dimension

ROOTBEER IS PRE-PRODUCTION BETA. IF ROOTBEER WORKS FOR YOU, PLEASE LET ME KNOW AT PCPRATTS@TRIFORT.ORG

Be aware that you should not expect a speedup from the GPU for something simple like multiplying
each element of an array by a scalar. Serialization time is a large bottleneck, and you usually
need an algorithm that does O(n^2) to O(n^3) work per O(n) elements of data.

GPU PROGRAMMING IS NOT EASY, EVEN WITH ROOTBEER. EXPECT TO SPEND A MONTH OPTIMIZING TRIVIAL EXAMPLES.

FEEL FREE TO EMAIL ME FOR DISCUSSIONS BEFORE ATTEMPTING TO USE ROOTBEER.

An experienced GPU developer will look at existing code and find the places where control can
be transferred to the GPU. An application with optimal performance will have some places with
serial code and some places with parallel code on the GPU. At each place where a cut is made to
transfer control to the GPU, the job needs to be sized for the GPU.

For the best performance you should be using shared memory (an NVIDIA term). Shared memory is
essentially a software-managed cache. You want more threads per block, but this often requires
more shared memory. The [CUDA Occupancy Calculator](http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls)
shows that for the best occupancy you want more threads and less shared memory per block. There
is a tradeoff between thread count, shared memory size and register count, and all three are
configurable with Rootbeer.

## Programming

<b>Kernel Interface:</b> Your code that will run on the GPU implements the Kernel interface.
You send data to the GPU by adding a field to the object implementing `Kernel`; `gpuMethod` will access the data.

    package org.trifort.rootbeer.runtime;

    public interface Kernel {
      void gpuMethod();
    }

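The data fields do not have to be primitives or arrays; composite objects work too. Below is a minimal illustrative sketch (it is not one of the bundled examples, and the `Particle` class is made up for this illustration): a plain Java object is stored in a field of the kernel and its instance method runs on the GPU.

```java
package org.trifort.rootbeer.examples.sketch;

import org.trifort.rootbeer.runtime.Kernel;

//an ordinary Java class; its fields and methods can be used on the GPU
class Particle {
  public float position;
  public float velocity;

  public Particle(float position, float velocity){
    this.position = position;
    this.velocity = velocity;
  }

  public void step(float dt){
    position += velocity * dt;
  }
}

public class ParticleKernel implements Kernel {

  private Particle particle;
  private float dt;

  public ParticleKernel(Particle particle, float dt){
    this.particle = particle;
    this.dt = dt;
  }

  public void gpuMethod(){
    //call an instance method of a composite object on the GPU
    particle.step(dt);
  }
}
```
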
### Simple Example
This simple example uses kernel lists and no thread config or context. Rootbeer will create a thread config and select the best device automatically. If you wish to use multiple GPUs, you need to pass in a Context.

<b>ScalarAddApp.java:</b>  
See the [example](https://github.com/pcpratts/rootbeer1/tree/master/examples/ScalarAddApp)

```java
package org.trifort.rootbeer.examples.scalaradd;

import java.util.List;
import java.util.ArrayList;
import org.trifort.rootbeer.runtime.Kernel;
import org.trifort.rootbeer.runtime.Rootbeer;
import org.trifort.rootbeer.runtime.util.Stopwatch;

public class ScalarAddApp {

  public void multArray(int[] array){
    List<Kernel> tasks = new ArrayList<Kernel>();
    for(int index = 0; index < array.length; ++index){
      tasks.add(new ScalarAddKernel(array, index));
    }

    Rootbeer rootbeer = new Rootbeer();
    rootbeer.run(tasks);
  }

  private void printArray(String message, int[] array){
    for(int i = 0; i < array.length; ++i){
      System.out.println(message+" array["+i+"]: "+array[i]);
    }
  }

  public static void main(String[] args){
    ScalarAddApp app = new ScalarAddApp();
    int length = 10;
    int[] array = new int[length];
    for(int index = 0; index < array.length; ++index){
      array[index] = index;
    }

    app.printArray("start", array);
    app.multArray(array);
    app.printArray("end", array);
  }
}
```

<b>ScalarAddKernel.java:</b>

```java
package org.trifort.rootbeer.examples.scalaradd;

import org.trifort.rootbeer.runtime.Kernel;

public class ScalarAddKernel implements Kernel {

  private int[] array;
  private int index;

  public ScalarAddKernel(int[] array, int index){
    this.array = array;
    this.index = index;
  }

  public void gpuMethod(){
    array[index] += 1;
  }
}
```

### High Performance Example - Batcher's Even Odd Sort
See the [example](https://github.com/pcpratts/rootbeer1/tree/master/examples/sort)  
See the [slides](http://trifort.org/ads/index.php/lecture/index/27/)  

<b>GPUSort.java</b>  

```java
package org.trifort.rootbeer.sort;

import org.trifort.rootbeer.runtime.Rootbeer;
import org.trifort.rootbeer.runtime.GpuDevice;
import org.trifort.rootbeer.runtime.Context;
import org.trifort.rootbeer.runtime.ThreadConfig;
import org.trifort.rootbeer.runtime.StatsRow;
import org.trifort.rootbeer.runtime.CacheConfig;
import java.util.List;
import java.util.Arrays;
import java.util.Random;

public class GPUSort {

  private int[] newArray(int size){
    int[] ret = new int[size];

    for(int i = 0; i < size; ++i){
      ret[i] = i;
    }
    return ret;
  }

  public void checkSorted(int[] array, int outerIndex){
    for(int index = 0; index < array.length; ++index){
      if(array[index] != index){
        for(int index2 = 0; index2 < array.length; ++index2){
          System.out.println("array["+index2+"]: "+array[index2]);
        }
        throw new RuntimeException("not sorted: "+outerIndex);
      }
    }
  }

  public void fisherYates(int[] array)
  {
    Random random = new Random();
    for (int i = array.length - 1; i > 0; i--){
      int index = random.nextInt(i + 1);
      int a = array[index];
      array[index] = array[i];
      array[i] = a;
    }
  }

  public void sort(){
    //should have at least 192 threads per SM
    int size = 2048;
    int sizeBy2 = size / 2;
    //int numMultiProcessors = 14;
    //int blocksPerMultiProcessor = 512;
    int numMultiProcessors = 2;
    int blocksPerMultiProcessor = 256;
    int outerCount = numMultiProcessors*blocksPerMultiProcessor;
    int[][] array = new int[outerCount][];
    for(int i = 0; i < outerCount; ++i){
      array[i] = newArray(size);
    }

    Rootbeer rootbeer = new Rootbeer();
    List<GpuDevice> devices = rootbeer.getDevices();
    GpuDevice device0 = devices.get(0);
    //create a context with 4212880 bytes of objectMemory.
    //you can leave the 4212880 out at first to use all
    //available GPU memory. after you run, you can call
    //context0.getRequiredMemory() to see what value to
    //enter here
    Context context0 = device0.createContext(4212880);
    //use more die area for shared memory instead of
    //cache. the shared memory is a software-defined
    //cache that, if programmed properly, can perform
    //better than the hardware cache. see the
    //[CUDA Occupancy Calculator](http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls)
    context0.setCacheConfig(CacheConfig.PREFER_SHARED);
    //wire the thread config for throughput mode. after
    //calling buildState, the book-keeping information
    //will be cached in the JNI driver
    context0.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);
    //configure to use kernel templates. rather than
    //using kernel lists where each thread has a Kernel
    //object, there is only one Kernel object (fewer memory copies).
    //when using kernel templates you need to differentiate
    //your data using thread/block indexes
    context0.setKernel(new GPUSortKernel(array));
    //cache the state and get ready for throughput mode
    context0.buildState();

    while(true){
      //randomize the array to be sorted
      for(int i = 0; i < outerCount; ++i){
        fisherYates(array[i]);
      }
      long gpuStart = System.currentTimeMillis();
      //run the cached throughput mode state.
      //the data now reachable from the single
      //GPUSortKernel is serialized to the GPU
      context0.run();
      long gpuStop = System.currentTimeMillis();
      long gpuTime = gpuStop - gpuStart;

      StatsRow row0 = context0.getStats();
      System.out.println("serialization_time: "+row0.getSerializationTime());
      System.out.println("execution_time: "+row0.getExecutionTime());
      System.out.println("deserialization_time: "+row0.getDeserializationTime());
      System.out.println("gpu_required_memory: "+context0.getRequiredMemory());
      System.out.println("gpu_time: "+gpuTime);

      for(int i = 0; i < outerCount; ++i){
        checkSorted(array[i], i);
        fisherYates(array[i]);
      }

      long cpuStart = System.currentTimeMillis();
      for(int i = 0; i < outerCount; ++i){
        Arrays.sort(array[i]);
      }
      long cpuStop = System.currentTimeMillis();
      long cpuTime = cpuStop - cpuStart;
      System.out.println("cpu_time: "+cpuTime);
      double ratio = (double) cpuTime / (double) gpuTime;
      System.out.println("ratio: "+ratio);
    }
    //context0.close();
  }

  public static void main(String[] args){
    GPUSort sorter = new GPUSort();
    while(true){
      sorter.sort();
    }
  }
}
```

<b>GPUSortKernel.java</b>

```java
package org.trifort.rootbeer.sort;

import org.trifort.rootbeer.runtime.Kernel;
import org.trifort.rootbeer.runtime.RootbeerGpu;


public class GPUSortKernel implements Kernel {

  private int[][] arrays;

  public GPUSortKernel(int[][] arrays){
    this.arrays = arrays;
  }

  @Override
  public void gpuMethod(){
    int[] array = arrays[RootbeerGpu.getBlockIdxx()];
    int index1a = RootbeerGpu.getThreadIdxx() << 1;
    int index1b = index1a + 1;
    int index2a = index1a - 1;
    int index2b = index1a;
    //the shared memory accessors take byte offsets, so each int index is shifted by 2 (multiplied by 4)
    int index1a_shared = index1a << 2;
    int index1b_shared = index1b << 2;
    int index2a_shared = index2a << 2;
    int index2b_shared = index2b << 2;

    RootbeerGpu.setSharedInteger(index1a_shared, array[index1a]);
    RootbeerGpu.setSharedInteger(index1b_shared, array[index1b]);
    //outer pass
    int arrayLength = array.length >> 1;
    for(int i = 0; i < arrayLength; ++i){
      int value1 = RootbeerGpu.getSharedInteger(index1a_shared);
      int value2 = RootbeerGpu.getSharedInteger(index1b_shared);
      int shared_value = value1;
      if(value2 < value1){
        shared_value = value2;
        RootbeerGpu.setSharedInteger(index1a_shared, value2);
        RootbeerGpu.setSharedInteger(index1b_shared, value1);
      }
      RootbeerGpu.syncthreads();
      if(index2a >= 0){
        value1 = RootbeerGpu.getSharedInteger(index2a_shared);
        //value2 = RootbeerGpu.getSharedInteger(index2b_shared);
        value2 = shared_value;
        if(value2 < value1){
          RootbeerGpu.setSharedInteger(index2a_shared, value2);
          RootbeerGpu.setSharedInteger(index2b_shared, value1);
        }
      }
      RootbeerGpu.syncthreads();
    }
    array[index1a] = RootbeerGpu.getSharedInteger(index1a_shared);
    array[index1b] = RootbeerGpu.getSharedInteger(index1b_shared);
  }
}
```

### Compiling Rootbeer Enabled Projects
1. Download the latest Rootbeer.jar from the releases.
2. Program using the Kernel, Rootbeer, GpuDevice and Context classes.
3. Compile your program normally with javac.
4. Pack all of the classes used into a single jar using [pack](https://github.com/pcpratts/pack/).
5. Compile with Rootbeer to enable the GPU:
   `java -Xmx8g -jar Rootbeer.jar App.jar App-GPU.jar`

### Building Rootbeer from Source

1. Clone the GitHub repo to `rootbeer1/`
2. `cd rootbeer1/`
3. `ant jar`
4. `./pack-rootbeer` (Linux) or `./pack-rootbeer.bat` (Windows)
5. Use the `Rootbeer.jar` (not `dist/Rootbeer1.jar`)

### Command Line Options

* `-runeasytests` = run the test suite to see if things are working
* `-runtest` = run a specific test case
* `-printdeviceinfo` = print out information about your GPU
* `-maxrregcount` = passed to the CUDA compiler to limit the register count
* `-noarraychecks` = remove array out-of-bounds checks once you get your application to work
* `-nodoubles` = tell Rootbeer there are no doubles so it can compile for older versions of CUDA
* `-norecursion` = tell Rootbeer there is no recursion so it can compile for older versions of CUDA
* `-noexceptions` = remove exception checking
* `-keepmains` = keep main methods
* `-shared-mem-size` = specify the shared memory size
* `-32bit` = compile for 32-bit
* `-64bit` = compile for 64-bit (if you are on a 64-bit machine you will usually want just this)
* `-computecapability` = specify the compute capability {sm_11,sm_12,sm_20,sm_21,sm_30,sm_35} (default ALL)

Once you get started, you will find you want to use a combination of `-maxrregcount`, `-shared-mem-size` and the thread count sent to the GPU to control occupancy.

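As a rough sketch of how these knobs fit together (the numbers and the `MyKernel`/`data` names below are placeholders, not recommendations): register pressure is capped at compile time with `-maxrregcount`, shared memory is reserved with `-shared-mem-size`, and the thread count per block is chosen at runtime with `setThreadConfig`, as in the GPUSort example above.

```java
//illustrative fragment only: the runtime thread configuration to pair with the
//compile-time flags (-maxrregcount, -shared-mem-size) when tuning occupancy
Rootbeer rootbeer = new Rootbeer();
GpuDevice device = rootbeer.getDevices().get(0);
Context context = device.createContext();           //or createContext(bytes)
context.setCacheConfig(CacheConfig.PREFER_SHARED);  //favor shared memory over the hardware cache
int threadsPerBlock = 256;  //fewer threads per block leaves more shared memory and registers per thread
int blockCount = 512;
context.setThreadConfig(threadsPerBlock, blockCount, threadsPerBlock * blockCount);
context.setKernel(new MyKernel(data));              //MyKernel and data are hypothetical
context.buildState();
context.run();
```
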

### Debugging

You can use System.out.println in a limited way while on the GPU. Printing in Java requires StringBuilder support to concatenate strings/integers/etc. Rootbeer has a custom StringBuilder runtime (written with great improvements from Martin Illecker) that allows most normal printlns to work.

Since you are running on a parallel GPU, it is often useful to print from a single thread:

```java
public void gpuMethod(){
  if(RootbeerGpu.getThreadIdxx() == 0 && RootbeerGpu.getBlockIdxx() == 0){
    System.out.println("hello world");
  }
}
```

Once you are done debugging, you can get a performance improvement by disabling exceptions and array bounds checks (see the command line options above).

### Multi-GPUs (untested)

```java
List<GpuDevice> devices = rootbeer.getDevices();
GpuDevice device0 = devices.get(0);
GpuDevice device1 = devices.get(1);

Context context0 = device0.createContext(4212880);
Context context1 = device1.createContext(4212880);

context0.setCacheConfig(CacheConfig.PREFER_SHARED);
context1.setCacheConfig(CacheConfig.PREFER_SHARED);

context0.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);
context1.setThreadConfig(sizeBy2, outerCount, outerCount * sizeBy2);

//array0 and array1 hold the data for each device (as in the GPUSort example)
context0.setKernel(new GPUSortKernel(array0));
context1.setKernel(new GPUSortKernel(array1));

context0.buildState();
context1.buildState();

while(true){
  //run on two GPUs without blocking the current thread
  GpuFuture future0 = context0.runAsync();
  GpuFuture future1 = context1.runAsync();
  future0.take();
  future1.take();
}
```

### RootbeerGpu Builtins (compile directly to CUDA statements)

```java
public class RootbeerGpu {
    //returns true if on the GPU
    public static boolean isOnGpu();

    //returns blockIdx.x * blockDim.x + threadIdx.x
    public static int getThreadId();

    //returns threadIdx.x
    public static int getThreadIdxx();

    //returns blockIdx.x
    public static int getBlockIdxx();

    //returns blockDim.x
    public static int getBlockDimx();

    //returns gridDim.x
    public static long getGridDimx();

    //__syncthreads
    public static void syncthreads();

    //__threadfence
    public static void threadfence();

    //__threadfence_block
    public static void threadfenceBlock();

    //__threadfence_system
    public static void threadfenceSystem();

    //given an object, returns the long handle
    //in GPU memory
    public static long getRef(Object obj);

    //get/set byte in shared memory. requires 1 byte.
    //index is the byte offset into shared memory
    public static byte getSharedByte(int index);
    public static void setSharedByte(int index, byte value);

    //get/set char in shared memory. requires 2 bytes.
    //index is the byte offset into shared memory
    public static char getSharedChar(int index);
    public static void setSharedChar(int index, char value);

    //get/set boolean in shared memory. requires 1 byte.
    //index is the byte offset into shared memory
    public static boolean getSharedBoolean(int index);
    public static void setSharedBoolean(int index, boolean value);

    //get/set short in shared memory. requires 2 bytes.
    //index is the byte offset into shared memory
    public static short getSharedShort(int index);
    public static void setSharedShort(int index, short value);

    //get/set integer in shared memory. requires 4 bytes.
    //index is the byte offset into shared memory
    public static int getSharedInteger(int index);
    public static void setSharedInteger(int index, int value);

    //get/set long in shared memory. requires 8 bytes.
    //index is the byte offset into shared memory
    public static long getSharedLong(int index);
    public static void setSharedLong(int index, long value);

    //get/set float in shared memory. requires 4 bytes.
    //index is the byte offset into shared memory
    public static float getSharedFloat(int index);
    public static void setSharedFloat(int index, float value);

    //get/set double in shared memory. requires 8 bytes.
    //index is the byte offset into shared memory
    public static double getSharedDouble(int index);
    public static void setSharedDouble(int index, double value);

    //atomically add value to array at index
    public static void atomicAddGlobal(int[] array, int index, int value);
    public static void atomicAddGlobal(long[] array, int index, long value);
    public static void atomicAddGlobal(float[] array, int index, float value);

    //atomically subtract value from array at index
    public static void atomicSubGlobal(int[] array, int index, int value);

    //atomically exchange value at index in array. the old value is returned
    public static int atomicExchGlobal(int[] array, int index, int value);
    public static long atomicExchGlobal(long[] array, int index, long value);
    public static float atomicExchGlobal(float[] array, int index, float value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes the minimum of old and val,
    //and stores the result back to memory at the same address.
    //These three operations are performed in one atomic transaction.
    //The function returns old."
    public static int atomicMinGlobal(int[] array, int index, int value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes the maximum of old and val,
    //and stores the result back to memory at the same address.
    //These three operations are performed in one atomic transaction.
    //The function returns old."
    public static int atomicMaxGlobal(int[] array, int index, int value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes (old == compare ? val : old),
    //and stores the result back to memory at the same address.
    //These three operations are performed in one atomic transaction. The function
    //returns old (Compare And Swap)."
    public static int atomicCASGlobal(int[] array, int index, int compare, int value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes (old & val), and stores the
    //result back to memory at the same address.
    //These three operations are performed in one atomic transaction.
    //The function returns old."
    public static int atomicAndGlobal(int[] array, int index, int value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes (old | val), and stores the
    //result back to memory at the same address.
    //These three operations are performed in one atomic transaction.
    //The function returns old."
    public static int atomicOrGlobal(int[] array, int index, int value);

    //from the CUDA programming guide: "reads the 32-bit word old located at the
    //address address in global memory, computes (old ^ val), and stores the
    //result back to memory at the same address.
    //These three operations are performed in one atomic transaction.
    //The function returns old."
    public static int atomicXorGlobal(int[] array, int index, int value);
}
```
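
As a usage sketch of these builtins (this is not one of the bundled examples; the class and package names are made up), the kernel below combines the shared memory accessors, `syncthreads` and `atomicAddGlobal` to sum an int array: each block reduces its slice in shared memory, then thread 0 adds the block total to a one-element result array. It assumes a kernel-template setup like the GPUSort example, a power-of-two thread count per block, and enough shared memory reserved (see `-shared-mem-size`) for one int per thread.

```java
package org.trifort.rootbeer.examples.sketch;

import org.trifort.rootbeer.runtime.Kernel;
import org.trifort.rootbeer.runtime.RootbeerGpu;

public class BlockSumKernel implements Kernel {

  private int[] input;  //one element per GPU thread
  private int[] total;  //single-element global accumulator

  public BlockSumKernel(int[] input, int[] total){
    this.input = input;
    this.total = total;
  }

  @Override
  public void gpuMethod(){
    int tid = RootbeerGpu.getThreadIdxx();
    int blockSize = RootbeerGpu.getBlockDimx();

    //stage one element per thread in shared memory.
    //shared memory indexes are byte offsets, so an int slot is tid * 4
    RootbeerGpu.setSharedInteger(tid << 2, input[RootbeerGpu.getThreadId()]);
    RootbeerGpu.syncthreads();

    //tree reduction within the block (assumes blockSize is a power of two)
    for(int stride = blockSize >> 1; stride > 0; stride >>= 1){
      if(tid < stride){
        int left = RootbeerGpu.getSharedInteger(tid << 2);
        int right = RootbeerGpu.getSharedInteger((tid + stride) << 2);
        RootbeerGpu.setSharedInteger(tid << 2, left + right);
      }
      RootbeerGpu.syncthreads();
    }

    //thread 0 publishes the block total with a global atomic add
    if(tid == 0){
      RootbeerGpu.atomicAddGlobal(total, 0, RootbeerGpu.getSharedInteger(0));
    }
  }
}
```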

### Viewing Code Generation

CUDA code is generated and placed in `~/.rootbeer/generated.cu`.

You can use this to find out the register and shared memory usage:

    $ /usr/local/cuda/bin/nvcc --ptxas-options=-v -arch sm_20 ~/.rootbeer/generated.cu

### CUDA Setup

You need to have the CUDA Toolkit and the CUDA Driver installed to use Rootbeer.
Download them from http://www.nvidia.com/content/cuda/cuda-downloads.html

### License

Rootbeer is licensed under the MIT license. If you use Rootbeer for any reason, please
star the repository and email me your usage and comments. I am preparing my dissertation
now.

### Examples

See [here](https://github.com/pcpratts/rootbeer1/tree/master/examples) for a variety of
examples.

### Consulting

GPU consulting is available for Rootbeer and CUDA. Please email pcpratts@trifort.org

### Author

Phil Pratt-Szeliga  
http://trifort.org/
565