Wednesday, August 19, 2015

Just got my Nexus 6

Just got my Nexus 6! The Adreno (GPU) driver for RenderScript is present in '/vendor/lib' and is called '/vendor/lib/libRSDriver_adreno.so'!

When I run my DCT 4x4 test (see below) I see the message 'Successfully loaded runtime: libRSDriver_adreno.so' -- a good sign.

#pragma version(1)
#pragma rs java_package_name(org.jcodec.codecs.h264.rs)
#pragma rs_fp_relaxed

rs_allocation input_alloc;
rs_allocation output_alloc;

// Horizontal pass of the block-wise forward 4x4 h.264 approximated DCT.
// The per-cell input 'coeff' is unused: the kernel reads the whole 4-sample row it needs via rsGetElementAt.
ushort __attribute__((kernel)) fdcth_264_hor(uchar coeff, uint32_t x, uint32_t y)
{
    uint8_t off_x = x & 0x3;
    uint8_t blk_x = x & ~0x3;

    const static int COEFFS[][4] = {
        {1, 1, 1, 1}, {2, 1, -1, -2}, {1, -1, -1, 1}, {1, -2, 2, -1}
    };

    ushort res =
        rsGetElementAt_uchar(input_alloc, blk_x    , y) * COEFFS[off_x][0] +
        rsGetElementAt_uchar(input_alloc, blk_x + 1, y) * COEFFS[off_x][1] +
        rsGetElementAt_uchar(input_alloc, blk_x + 2, y) * COEFFS[off_x][2] +
        rsGetElementAt_uchar(input_alloc, blk_x + 3, y) * COEFFS[off_x][3];

    return res;
}

// Vertical pass of the block-wise forward 4x4 h.264 approximated DCT,
// reading the horizontal-pass results from output_alloc.
ushort __attribute__((kernel)) fdcth_264_vert(ushort coeff, uint32_t x, uint32_t y)
{
    uint8_t off_y = y & 0x3;
    uint8_t blk_y = y & ~0x3;

    const static int COEFFS[][4] = {
        {1, 1, 1, 1}, {2, 1, -1, -2}, {1, -1, -1, 1}, {1, -2, 2, -1}
    };

    ushort res =
        rsGetElementAt_ushort(output_alloc, x, blk_y    ) * COEFFS[off_y][0] +
        rsGetElementAt_ushort(output_alloc, x, blk_y + 1) * COEFFS[off_y][1] +
        rsGetElementAt_ushort(output_alloc, x, blk_y + 2) * COEFFS[off_y][2] +
        rsGetElementAt_ushort(output_alloc, x, blk_y + 3) * COEFFS[off_y][3];

    return res;
}

It gives me 105fps on a 1920x1080 frame! Unfortunately it's all coming from the CPU: Trepn tells me the GPU is not loaded at all, while both CPUs are at 100%.

Looking in Logcat I found a couple of error messages: apparently Adreno couldn't execute my kernels and fell back to the CPU.

08-19 22:34:21.235  10160-10178/com.example.android.basicrenderscript E/Adreno-RS﹕ : ERROR: Address not found for fdcth_264_vert.COEFFS
08-19 22:34:21.235  10160-10178/com.example.android.basicrenderscript W/Adreno-RS﹕ : ERROR: rsdQueryGlobals returned -30

It turns out my kernels couldn't run on the GPU because I had used a non-static, non-const array for COEFFS (the code above is already corrected). When I made the array 'const static' the errors from Adreno went away, the GPU is now loaded at 50%, and I'm getting around 65fps on 1920x1080 video. Great success!
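For reference, the difference is just the storage qualifiers on the coefficient table; the 'before' version is my reconstruction of what the broken declaration roughly looked like:

// Fails on the Adreno GPU driver: a plain local (non-static, non-const) array
int COEFFS[][4] = {
    {1, 1, 1, 1}, {2, 1, -1, -2}, {1, -1, -1, 1}, {1, -2, 2, -1}
};

// Works on the GPU: the same table declared 'const static'
const static int COEFFS[][4] = {
    {1, 1, 1, 1}, {2, 1, -1, -2}, {1, -1, -1, 1}, {1, -2, 2, -1}
};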

Just for reference -- the Moto X (first gen) still gives me 16fps out of its 100% loaded dual-core CPU.

Tuesday, August 18, 2015

Wavefront processing in RenderScript

In H.264, I-macroblocks have dependencies on neighboring macroblocks for intra prediction. In other words, the macroblock at (mb_x, mb_y) can only be encoded once the macroblocks at (mb_x - 1, mb_y) [A], (mb_x, mb_y - 1) [B], (mb_x + 1, mb_y - 1) [C] and (mb_x - 1, mb_y - 1) [D] have been encoded.

RenderScript doesn't guarantee any scan order, i.e. we cannot rely on any of the top or left macroblocks being done before we start processing the current one.

The solution to this is wavefront processing (see below). The idea is to schedule macroblock processing in waves: each wave processes the spots that are ready to be processed. In the image below each wave is represented by a different fill color of the rectangles, and I highlighted one particular wave with red borders.

But how do we make RenderScript process the image only at the specified locations? The answer I figured out is to use 'fake allocations' of the right size, purely so that RenderScript iterates the number of times we want. Let's call these 'fake allocations' iterators.

So for the first wave we'll use a 1x1 'fake allocation' (iterator), for the second wave a 1x1 as well; for the third and 4th waves a 2x1 iterator, and so forth (all the iterators are displayed to the right of the grid in the picture below, in the matching wave color).
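To double-check those iterator sizes, here is a tiny helper in the same C99 dialect as the kernels ('wave_size', 'mb_w' and 'mb_h' are names I'm making up for illustration); it counts how many macroblocks of an mb_w x mb_h grid belong to wave 'w', i.e. how long the 1xN iterator for that wave has to be:

// Number of macroblocks in wave 'w' on an mb_w x mb_h macroblock grid.
// A macroblock (mb_x, mb_y) belongs to wave w when mb_x + 2 * mb_y == w.
static int wave_size(int w, int mb_w, int mb_h) {
    int n = 0;
    for (int mb_y = 0; mb_y < mb_h; mb_y++) {
        int mb_x = w - 2 * mb_y;
        if (mb_x >= 0 && mb_x < mb_w)
            n++;
    }
    return n;
}

For a reasonably large grid this gives 1, 1, 2, 2, 3, 3, ... macroblocks for waves 0, 1, 2, 3, 4, 5, ..., which matches the picture.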

The question now is how to derive a macroblock position (mb_x, mb_y) within a wave from the iterator position (it_x, it_y) within the iterator. This is totally task-dependent, but in this case we'll use the following technique:

  • fill the iterator with the current wave number;
  • mb_x = wave_number - it_x * 2;
  • mb_y = it_x;
Note that the actual pixel region to process, (pic_x, pic_y, blk_w, blk_h), is (mb_x << 4, mb_y << 4, 16, 16); see the kernel sketch below.
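To make this concrete, here is a minimal sketch of what the per-wave kernel could look like (not my actual encoder code; 'encode_macroblock' is a hypothetical placeholder for the real intra encoding work, and the kernel would live in the same .rs script as the other kernels):

// The 1xN iterator allocation has one cell per macroblock of the current wave;
// Java fills every cell with the wave number before launching the kernel.
void __attribute__((kernel)) encode_wave(int32_t wave_number, uint32_t x)
{
    uint32_t it_x = x;                       // position within the iterator
    uint32_t mb_y = it_x;
    uint32_t mb_x = wave_number - it_x * 2;  // the iterator is sized so this is always in range

    // top-left corner of the 16x16 pixel region covered by this macroblock
    uint32_t pic_x = mb_x << 4;
    uint32_t pic_y = mb_y << 4;

    // encode_macroblock(pic_x, pic_y);      // hypothetical: predict, transform, reconstruct
}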

PS: I even think it would be better to schedule the waves from within the RenderScript code itself; this way we don't have to go back to Java too many times.

Monday, August 17, 2015

Starting to work on RenderScript h.264 encoder implementation

Recently found out about RenderScript, which apparently lets you implement highly parallel native computations that can run either on the CPU or the GPU of the phone.

Already thinking of implementing an h.264 encoder in RenderScript. Things like the 4x4 DCT and motion search are massively parallel tasks, and they take the majority of the time during encoding.

The only problem I am seeing at this point is I-macroblock encoding, since these use intra-frame prediction, meaning the left and top macroblocks have to be fully encoded and reconstructed before one can even start encoding the current I-macroblock. Though this will be a problem in I-frames, I don't see it as a big problem in P-frames. The plan is to first encode everything as P-macroblocks and then selectively replace those over a threshold with I-macroblocks. The hope is that the number of I-macroblocks will not be too large.

Did some initial testing on my Moto X (first gen) and it turns out this one doesn't have a RenderScript GPU driver at all, so all my tests are on the CPU for now. I have come up with initial kernels for the 4x4 DCT and diamond motion search. The 4x4 DCT gives me 20fps on 1920x1080 video and the diamond motion search gives me 20fps on 640x480 video. Not so great so far, but way better than Java.
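The diamond search kernel is along these lines (a simplified sketch of the idea, not my exact code: 'cur_frame', 'ref_frame', 'frame_w' and 'frame_h' are assumed globals set from Java, and the output allocation holds one short2 motion vector per 16x16 macroblock):

#pragma version(1)
#pragma rs java_package_name(org.jcodec.codecs.h264.rs)
#pragma rs_fp_relaxed

rs_allocation cur_frame;   // luma plane of the current frame
rs_allocation ref_frame;   // luma plane of the reference frame
int32_t frame_w;
int32_t frame_h;

// Sum of absolute differences between the 16x16 macroblock at (mx, my)
// and the reference block displaced by the candidate vector (dx, dy).
static uint32_t sad16x16(int32_t mx, int32_t my, int32_t dx, int32_t dy)
{
    uint32_t sad = 0;
    for (int j = 0; j < 16; j++) {
        for (int i = 0; i < 16; i++) {
            int32_t rx = mx + i + dx;
            int32_t ry = my + j + dy;
            if (rx < 0) rx = 0; else if (rx >= frame_w) rx = frame_w - 1;
            if (ry < 0) ry = 0; else if (ry >= frame_h) ry = frame_h - 1;
            int32_t d = (int32_t)rsGetElementAt_uchar(cur_frame, mx + i, my + j)
                      - (int32_t)rsGetElementAt_uchar(ref_frame, rx, ry);
            sad += d < 0 ? -d : d;
        }
    }
    return sad;
}

// One output element per macroblock, so (x, y) index macroblocks, not pixels.
short2 __attribute__((kernel)) diamond_search(uint32_t x, uint32_t y)
{
    int32_t mx = x << 4, my = y << 4;
    const static int DX[4] = { 0, -1, 1, 0 };
    const static int DY[4] = { -1, 0, 0, 1 };

    int32_t best_x = 0, best_y = 0;
    uint32_t best_sad = sad16x16(mx, my, 0, 0);

    // Walk the diamond pattern until the center wins (with a bounded number of steps).
    for (int step = 0; step < 16; step++) {
        int32_t next_x = best_x, next_y = best_y;
        for (int k = 0; k < 4; k++) {
            uint32_t s = sad16x16(mx, my, best_x + DX[k], best_y + DY[k]);
            if (s < best_sad) {
                best_sad = s;
                next_x = best_x + DX[k];
                next_y = best_y + DY[k];
            }
        }
        if (next_x == best_x && next_y == best_y)
            break;
        best_x = next_x;
        best_y = next_y;
    }

    short2 mv;
    mv.x = (short)best_x;
    mv.y = (short)best_y;
    return mv;
}

One launch over the output allocation then covers every macroblock of the frame.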

Ordered a Nexus 6, since apparently the Nexus 5/6/10 do have an RS GPU driver. Waiting to get my hands on it.

PS: RenderScript drivers are supplied by the device manufacturer and are supposed to live in /system/lib/. By default there's the vanilla CPU-based driver (libRSDriver.so). Since my Moto X has a Snapdragon chipset the GPU is an Adreno (320 in my case), so the RS GPU driver would have been /system/lib/libRSDriver_adreno.so. Unfortunately it's not there.

Moving h.264 encoder to byte arrays instead of int arrays

Initially, when I started JCodec, it was easier to use int arrays since an integer covers the full 0...255 range. This pattern stuck around for too long in JCodec. Now there's plenty of code, and all the encoders/decoders use 4x the memory they are supposed to.

So now I am slowly starting to move classes to byte arrays. For this I will have the inputs shifted by -128, so that a sample value of 0 corresponds to the byte -128 (and 255 to 127). I am assuming I will need to check in many places to make sure things keep working fine.

The first component to change would be H264Encoder, since I figured this one is used the most.