Home Exam 1: Video encoding on ARM Cortex-A57 using ARM NEON instructions

Assignment

In this assignment, we will take advantage of the parallelization options available in a single ARM NEON-enabled core to accelerate c63 video encoding.

You are supposed to:

Additional details

The exam will be graded on how well you are able to take advantage of the ARM NEON instructions to solve the task.

You must program for Linux and use the inline ARM NEON intrinsics for GCC. Do not rely on compiler options to provide the speedup: compile with -O0 and -fno-tree-vectorize so that any measured speedup comes from your own code rather than from the compiler's auto-vectorization.
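NEON intrinsics become available by including <arm_neon.h> in your C files. As an illustrative example (the file name here is hypothetical; adapt the flags to the precode's Makefile), a compile line could look like this:

$ gcc -O0 -fno-tree-vectorize -c dsp.c -o dsp.o

If your toolchain targets 32-bit ARM rather than AArch64, you may also need -mfpu=neon to enable NEON code generation.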

You are not supposed to make the encoder multithreaded. Your implementation must be single-threaded and optimized to exploit the data parallelism available through NEON vector instructions.

The encoder must accept the test sequences in YUV format and generate the format that is understood by c63's own and unmodified decoder.

Start by profiling the encoder to see which parts of it are the bottlenecks. Remember that after optimizing one part of the code, more profiling might be needed to find new bottlenecks. One operation (e.g., motion vector search) may still be the heaviest function when you have optimized it as much as you can. If you cannot optimize one operation further, move on to another one.
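One way to profile (a suggestion only; any profiler you are comfortable with is fine) is GNU gprof: add -pg to both the compile and link flags, run the encoder once, and then read the flat profile that gprof produces from the generated gmon.out file:

$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv
$ gprof ./c63enc gmon.out | less

The functions at the top of the flat profile are your first optimization candidates.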

Based on your profiling, you should try to optimize different parts of the code, both (1) structurally and (2) with NEON instructions. There is no definite answer to which parts of the code you have to optimize, and there is also no definite answer to which instructions you have to use. Look for SIMD-friendly cases where the same operation is performed on a large number of similar data elements; a sketch of this pattern is given below. You are NOT supposed to change or replace any algorithms, only reimplement the existing algorithms using vector instructions.
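To make the pattern concrete, the following sketch computes the sum of absolute differences (SAD) between two 8x8 blocks of 8-bit pixels, the kind of operation that dominates motion vector search. The function name, signature and block size are illustrative and not taken from the precode; your own NEONized functions must match the precode's actual interfaces.

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative sketch: SAD between two 8x8 blocks of 8-bit pixels,
 * vectorized with NEON. 'stride' is the distance in bytes between the
 * start of two consecutive rows (typically the image width). */
static int sad_8x8_neon(const uint8_t *block1, const uint8_t *block2, int stride)
{
    uint16x8_t acc = vdupq_n_u16(0);
    int y;

    for (y = 0; y < 8; ++y)
    {
        uint8x8_t a = vld1_u8(block1 + y * stride); /* load 8 pixels from each block */
        uint8x8_t b = vld1_u8(block2 + y * stride);
        acc = vabal_u8(acc, a, b);                  /* acc += |a - b|, widened to 16 bits */
    }

    /* Reduce the eight 16-bit partial sums to a single scalar. */
    uint32x4_t sum32 = vpaddlq_u16(acc);
    uint64x2_t sum64 = vpaddlq_u32(sum32);
    return (int)(vgetq_lane_u64(sum64, 0) + vgetq_lane_u64(sum64, 1));
}

Note how a single NEON instruction here processes eight pixels at once; the same line of thinking applies to the other hot loops you identify.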

You should write a report detailing your profiling results, the instructions you used, and your changes to the precode. The report should also detail and explain the performance results, both positive and negative (in research it is also important to learn what not to do). If you find several alternatives for solving a problem, or if you tried several dead ends before solving a problem successfully, discuss them in your report.

 

Codec63

Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It provides a variety of the parallelization opportunities that exist in modern codecs, without the complexity of those full-fledged codecs. It is not by itself compliant with any standard, so the precode contains both an example encoder and a decoder (which converts an encoded file back to YUV).

C63's inter-frame prediction encodes every macroblock independently, whether or not it uses a motion vector. When a motion vector is used, it refers to the previous frame; the residual is stored in the same manner as for any other macroblock, and the motion vector itself is stored right before the encoded residual.

If no motion vector is used, the macroblocks are encoded according to the JPEG standard [1] and stored in the output file. An illustrative overview of the steps involved during JPEG encoding can be found at Wikipedia [2].

The c63 codec is very basic and exhibits behavior that you would not accept from a standard encoder, in particular the Huffman tables and the unconditional use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must still search for motion vectors, and you must write code that can potentially use the whole motion vector search range (hard-coded to 16 in the precode).

The video scenario is live streaming. You should not have an encoder pipeline of more than 3 frames. In addition, you should not use parallelization techniques that severely degrade the video quality.

It is your task to optimize the c63 encoder. As mentioned above, you should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms may provide large speedups, but they distract from the main goal of this home exam, which is to identify and implement parallelization using vector instructions.

Two test sequences in YUV format are available in the /opt/cipr directory on the lab machines:

These should be used as input to the provided c63 encoder, and can be used to test your implementations.

 

Precode

The precode consists of:

The precode is written in C, and you should also write your solution in C. You are not required to touch the decoder or c63pred. We recommend that you keep two separate repositories: one in which you modify the encoder, and one unmodified copy that you use to test your implementation, as illustrated below.
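For example (all paths and file names here are hypothetical), you could encode the same input with both versions, decode both outputs with the unmodified decoder, and compare PSNR and file sizes:

$ ./modified/c63enc -w 352 -h 288 -o /tmp/mod.c63 foreman.yuv
$ ./reference/c63enc -w 352 -h 288 -o /tmp/ref.c63 foreman.yuv
$ ./reference/c63dec /tmp/mod.c63 /tmp/mod.yuv
$ ./reference/c63dec /tmp/ref.c63 /tmp/ref.yuv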

The precode can be downloaded from a Git repository here: 

git clone https://bitbucket.org/mpg_code/inf5063-codec63.git

You must login to the Jetson TX1 devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about how to access the kits can be found in the ARM FAQ.

You are free to adapt, modify or completely rewrite the provided encoder to take full advantage of the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation or DCT/iDCT. You are also not allowed to paste/reuse any other pre-written code in your implementation.

Some command usage examples:

To encode the foreman test sequence
$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv

To decode the test sequence
$ ./c63dec /tmp/test.c63 /tmp/test.yuv

To dump the prediction buffer (used to test motion estimation)
$ ./c63pred /tmp/test.c63 /tmp/test.yuv

To play the decoded sequence with VLC
$ vlc --rawvid-width 352 --rawvid-height 288 /tmp/test.yuv

 

Evaluation

In evaluation, we will consider and give points for (most important first):

  1. At least two bottleneck algorithmic functions in the source code have been NEONized. (*)
    • Document the bottleneck and the effect of your optimization.
  2. A program that works (on the Jetson TX1 provided)
    • Runs to completion. (**)
    • Encodes tractor (1080p) correctly.
  3. Effect of the Parallelization (SIMDification)
    • Most costly algorithms identified and NEONized
    • NEON instructions are used effectively.
  4. Good documentation
    • Readable, well-commented code.
    • Optimization steps and performance results
    • Comparison of / reflection about alternative approaches
    • Complete and well-presented document
  5. Output video has a similar or better PSNR and file size compared to the reference encoder's output (the standard PSNR definition is recalled below).
  6. Bonus points for non-obvious optimizations

(*) Automatic fail if this is not fulfilled. (**) We do not debug code before testing. There will not be any points for correctness and effectiveness if this is not fulfilled.
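For reference, PSNR here is the standard measure (nothing c63-specific): for 8-bit video,

PSNR = 10 * log10(255^2 / MSE)

where MSE is the mean squared error between a decoded frame and the corresponding frame of the original YUV input. Compare both PSNR and output file size against what the unmodified reference encoder produces for the same input.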

Machine Setup

The Jetson TX1 devkits are situated at Simula Research Laboratory. Machine names and how to access them can be found in the ARM FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.

Contact inf5063@ifi.uio.no if you have problems logging in.

 

Formal information

The deadline for handing in your assignment is Monday, September 26th at 15:00:00.

Deliver your code and report (as PDF) at https://devilry.ifi.uio.no/. Submit the poster (as PDF) to inf5063@ifi.uio.no.

The groups should also prepare a poster (2 x A3 pages) and a quick 2-minute talk (without slides) in which you pitch your poster to the class on September 28th. Name the poster after your group name, and e-mail the poster to inf5063@ifi.uio.no no later than noon (12:00) on September 27th. We will then print the poster for you.

For questions and course related chatter, we have created a Slack space: https://inf5063.slack.com

There will be a prize for best poster/presentation (awarded by an independent panel and independent of the grade).

Please check the ARM FAQ page for updates and frequently asked questions.

For questions please contact:
inf5063@ifi.uio.no

 

[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example