Home Exam 2: Video Encoding on ARMv8 CPUs using ARM NEON Instructions
Assignment
In this assignment, we will leverage the parallelization options available on a single ARMv8 NEON-enabled core to accelerate Codec63 (c63) video encoding.
You are supposed to:
- Profile and analyze the encoder, and write a short Design Review (max 1 x A4 page) that the group will present on Tuesday, March 24th.
- Optimize the c63 encoder using ARM NEON vector instructions.
- Create a poster (PowerPoint or PDF slide) that the group will present on Tuesday, April 14th.
- Write a short report where you describe which optimizations you have implemented and discuss your results. You should not describe other possible or planned optimizations that you did not test.
Additional details
The exam will be graded on how well you can use ARM NEON instructions to solve the task.
You must program for Linux and use the inline ARM NEON intrinsics for GCC. Do not rely on compiler options to provide auto-vectorization (use -O1 and -fno-tree-vectorize to quality-assure your code optimizations).
You are not supposed to make the encoder multi-threaded. Your implementation should be single-threaded and optimized to use the parallelism available through NEON vector instructions. The encoder must accept the test sequences in YUV format and generate the format understood by c63's unmodified decoder.
Start by profiling the encoder to identify which parts are bottlenecks. Remember, additional profiling may be needed to identify new bottlenecks after optimizing a single piece of code. One operation (e.g., motion vector search) may still be the most important function when optimized as much as possible. If you cannot optimize one operation further, move on to another.
Based on your profiling, you should optimize different parts of the code, (1) structurally and (2) with NEON instructions. There is no definite answer to which sections of the code you have to optimize, and there is no definite answer to which instructions you must use. Look for SIMD-friendly cases where the same operation needs to execute on many similar data elements. You are NOT supposed to change or replace any algorithms. Only reimplement the algorithms using vector instructions.
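As an illustration of such a SIMD-friendly case, the sum of absolute differences at the heart of motion estimation applies the same operation to every pixel of a block, which maps directly onto NEON's widening absolute-difference-accumulate. This is a sketch, not precode: the function name, 8x8 block size, and stride parameter are assumptions, and a scalar fallback is included so it also compiles off-target:

```c
#include <stdint.h>
#include <stdlib.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Sketch: sum of absolute differences over one 8x8 block, the inner
   operation of motion vector search. Names and layout are illustrative. */
int sad_8x8(const uint8_t *block, const uint8_t *ref, int stride)
{
#ifdef __ARM_NEON
    uint16x8_t acc = vdupq_n_u16(0);
    for (int y = 0; y < 8; ++y) {
        uint8x8_t b = vld1_u8(block + y * stride);  /* 8 pixels at once */
        uint8x8_t r = vld1_u8(ref + y * stride);
        acc = vabal_u8(acc, b, r);  /* acc += |b - r|, widened to u16 */
    }
    return vaddvq_u16(acc);         /* horizontal sum (AArch64 only) */
#else
    /* Scalar fallback so the sketch also builds on non-NEON hosts. */
    int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += abs(block[y * stride + x] - ref[y * stride + x]);
    return sad;
#endif
}
```

Note how one `vabal_u8` replaces eight subtract/abs/add iterations; the u16 accumulator cannot overflow here because 8 rows of at most 255 per lane stays below 65535.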
Write a report detailing your profiling results, the instructions you used, and your changes to the precode. The report should also detail and explain both positive and negative performance results (in research, it is also essential to learn what not to do). If you found several alternatives for solving a problem, or tried several dead ends before succeeding, discuss them in your report.
Codec63
Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It provides a variety of parallelization opportunities found in modern codecs without the complexity of full-fledged codecs. It is not compliant with any standards on its own, so the precode includes both an encoder and a decoder (which converts an encoded file back to YUV).
C63's inter-frame prediction encodes each macroblock independently, with or without a motion vector. If a macroblock uses a motion vector, the vector refers to the previous frame, the residual is stored in the same manner as an ordinary macroblock, and the motion vector itself is stored right before the encoded residual.
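In code terms, the residual of a predicted macroblock is the signed element-wise difference between the current block and its motion-compensated prediction, and the decoder reconstructs by adding it back. The sketch below is illustrative (function names and the flat-buffer layout are assumptions, not taken from the precode):

```c
#include <stdint.h>

/* Sketch: residual = current pixels minus motion-compensated
   prediction. It must be signed, hence int16_t. Names illustrative. */
void compute_residual(const uint8_t *curr, const uint8_t *pred,
                      int16_t *residual, int n)
{
    for (int i = 0; i < n; ++i)
        residual[i] = (int16_t)curr[i] - (int16_t)pred[i];
}

/* Decoder side: prediction plus residual reconstructs the block,
   clamped back to the 8-bit pixel range. */
void reconstruct(const uint8_t *pred, const int16_t *residual,
                 uint8_t *out, int n)
{
    for (int i = 0; i < n; ++i) {
        int v = pred[i] + residual[i];
        out[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}
```

This per-pixel subtract is another SIMD-friendly pattern: the same operation over long runs of independent pixels.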
The macroblocks are encoded according to the JPEG standard [1] and stored in the output file if no motion vector is used. An illustrative overview of the steps involved during JPEG encoding can be found in Wikipedia [2].
The video scenario is live streaming. You should not have an encoder pipeline of more than three frames. Also, you should avoid parallelization techniques that severely degrade video quality.
The c63 is basic and exhibits behavior you would not expect from a standard encoder. This concerns, in particular, the Huffman tables and the use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that potentially uses the whole motion vector search range (hard-coded to 16 in the precode).
Your task is to optimize the c63 encoder. As mentioned above, you should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms may provide substantial speedups. Still, they distract from the primary goal of this home exam: identifying and implementing parallelization using vector instructions.
Two test sequences in YUV format are available in the /mnt/sdcard directory on the Tegra-machines:
- foreman (352x288), CIF
- tractor (1920x1080), 1080p
These should be used as input to the provided c63 encoder and to test your implementations.
Precode
The precode consists of the following:
- The reference c63 encoder (c63enc).
- Decoder for c63 (c63dec).
- The command c63pred (which extracts the prediction buffer for debugging purposes).
The precode is written in C, and you should also write your solution in C. You are not required to touch the decoder (c63dec) or c63pred. We recommend that you have two separate repositories: one where you modify the encoder and one unmodified, which you use to test your implementation.
You can download the precode from a Git repository here:
git clone https://github.com/griwodz/in5050-codec63.git
You must log in to the Jetson AGX Xavier devkit assigned to your group for this assignment. You should have received an email from the course administrators about which kits to use. Information about accessing the devkit can be found in the ARM FAQ.
You are free to adapt, modify, or rewrite the provided encoder to fully leverage the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation, or DCT/iDCT. You are also not allowed to paste/reuse any other pre-written code in your implementation.
Some command usage examples:
To encode the foreman test sequence:
$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv
To decode the test sequence:
$ ./c63dec /tmp/test.c63 /tmp/test.yuv
To dump the prediction buffer (used to test motion estimation):
$ ./c63pred /tmp/test.c63 /tmp/test.yuv
Playback of raw YUV videos:
$ mplayer /tmp/test.yuv -demuxer rawvideo -rawvideo w=352:h=288
Report
You must write the results as a technical report of no more than four pages in ACM format (double column). The report should serve as a guide to the code modifications you have made and the resulting performance changes.
Evaluation
In the evaluation, we will consider and give points for (most important first):
- Motion Estimation & DCT/iDCT algorithmic functions in the source code have been NEONized.
- Document the bottleneck and the effect of your optimization.
- A program that works (on the Jetson AGX Xavier provided)
- The program runs to completion. (*)
- Encodes the foreman (CIF) video correctly.
- The output video has a similar quality and file size to the unmodified encoder.
- Readable, well-commented code
- Effect of the Parallelization (SIMDification) and the correct use of the architecture
- Most costly algorithms identified and NEONized
- NEON instructions are used effectively
- If multiple CPU cores are used for processing, is memory access and core placement optimized?
- Bonus points for non-obvious optimizations
- The quality of the report that accompanies the code
- Clear and structured report of the performance changes caused by your modifications to the precode
- References to the relevant parts of the accompanying program code (to aid the reviewer of the submitted assignment)
- Graphical presentation of the optimization steps and performance results (plots of performance changes)
- Comparison of / reflection about the alternative approaches tried out by your group
(*) We do not debug code before testing. There will be no points for video correctness if the code does not work.
Machine Setup
The Jetson AGX Xavier devkits are at IFI. Machine names and how to access them are available in the ARM FAQ. If you have reported your group to the course administration, you should have been assigned to a devkit and provided with a username and a password.
Contact in5050@ifi.uio.no if you have problems logging in.
Formal information
The deadline for handing in your assignment:
- Design: Tuesday, March 24th at 10:00
- Code: Tuesday, April 14th at 23:59
- Report: Friday, April 17th at 23:59
Deliver your code and?report (as PDF) at https://devilry.ifi.uio.no/.
Submit the design review and poster (as PDF) to in5050@ifi.uio.no.
The groups should prepare a poster (PowerPoint or PDF slide) and a quick 5-minute talk to pitch the implementation to the class on April 14th. Name the poster with your group name and email it to in5050@ifi.uio.no. There will be a prize for the best poster/presentation (awarded by an independent panel, regardless of grade).
For questions and course-related chatter, we have created a Mattermost space: https://mattermost.uio.no/ifi-undervisning/channels/in5050
Please check the ARM FAQ page and the course FAQ for updates.
For questions, please contact:
in5050@ifi.uio.no