IN5050 - Programming heterogeneous multi-core architectures

Home Exam 1: Video Encoding on NVIDIA GPUs using CUDA framework

In this assignment, you will use the computing power available on a graphics processor to accelerate video encoding.

You are supposed to:

Profile and analyze the encoder, and write a short Design Review (max 1 A4 page) for the group to present.
Optimize the motion estimation, motion compensation, and DCT/iDCT part of the c63 encoder using CUDA.
Write a short report in which you describe the optimizations you have implemented and discuss your results. Do not describe optimizations that you only considered or planned but did not test.
Create a poster (to show on screen) and participate in the poster session on March 3rd.

Codec63

Codec63 is a modified variant of Motion JPEG that supports inter-frame prediction. It is not compliant with any standard, so the precode contains an example of an encoder and a decoder (which converts an encoded file back to YUV). C63's inter-frame prediction works by deciding for every macroblock independently whether it uses a motion vector or not. If a motion vector is used, it refers to the previous frame.
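To make the role of the motion vector concrete, the following C sketch computes the residual of one block against the previous frame. It is not taken from the precode: the function name `block_residual_8x8`, the 8x8 block size, and the single-plane layout with a row stride are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the c63 precode.
 * Computes the residual of the 8x8 block at (bx, by) in the current frame:
 * current sample minus the sample in the previous (reference) frame,
 * displaced by the motion vector (mv_x, mv_y).
 * Both frames are single planes of 8-bit samples with `stride` bytes per row.
 * The caller must ensure the displaced block lies inside the reference plane. */
void block_residual_8x8(const uint8_t *cur, const uint8_t *ref, int stride,
                        int bx, int by, int mv_x, int mv_y,
                        int16_t residual[64])
{
    assert(stride >= 8);
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            int c = cur[(by + y) * stride + (bx + x)];
            int p = ref[(by + y + mv_y) * stride + (bx + x + mv_x)];
            residual[y * 8 + x] = (int16_t)(c - p);
        }
    }
}
```

A good motion vector makes most residual values small, which is what lets the later encoding steps compress the block well.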

Macroblocks are encoded according to the JPEG standard [1] if no motion vector is used, and the resulting data is stored in the output file. If a motion vector is used, the residual is encoded and stored in the same manner, and the motion vector itself is stored immediately before the encoded residual. An illustrative overview of the steps involved in JPEG encoding can be found on Wikipedia [2].
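The transform step of JPEG-style encoding can be illustrated with a naive 8x8 forward DCT. This is a textbook DCT-II with the normalization used in the JPEG standard, written as a sketch. The names `dct_8x8_naive` and `cos_pi16` are hypothetical, the precode's own DCT may be organized differently, and quantization and entropy coding are left out.

```c
#include <assert.h>
#include <stdint.h>

/* cos(k * pi / 16) for any k >= 0, looked up from nine base values
 * using the symmetries of the cosine (the sequence has period 32 in k). */
static double cos_pi16(int k)
{
    static const double base[9] = {
        1.0,
        0.98078528040323044913, 0.92387953251128675613,
        0.83146961230254523708, 0.70710678118654752440,
        0.55557023301960222474, 0.38268343236508977173,
        0.19509032201612826785, 0.0
    };
    assert(k >= 0);
    k %= 32;
    if (k > 16)
        k = 32 - k;                             /* cos(2*pi - t) = cos(t)  */
    return (k <= 8) ? base[k] : -base[16 - k];  /* cos(pi - t)   = -cos(t) */
}

/* Naive forward 2D DCT-II of one 8x8 block (row-major input and output):
 * F(u,v) = 1/4 * C(u) * C(v) * sum_x sum_y f(x,y)
 *          * cos((2x+1)*u*pi/16) * cos((2y+1)*v*pi/16),
 * with C(0) = 1/sqrt(2) and C(k) = 1 otherwise. */
void dct_8x8_naive(const int16_t in[64], float out[64])
{
    const double inv_sqrt2 = 0.70710678118654752440;
    for (int v = 0; v < 8; ++v) {
        for (int u = 0; u < 8; ++u) {
            double sum = 0.0;
            for (int y = 0; y < 8; ++y)
                for (int x = 0; x < 8; ++x)
                    sum += in[y * 8 + x]
                         * cos_pi16((2 * x + 1) * u)
                         * cos_pi16((2 * y + 1) * v);
            double cu = (u == 0) ? inv_sqrt2 : 1.0;
            double cv = (v == 0) ? inv_sqrt2 : 1.0;
            out[v * 8 + u] = (float)(0.25 * cu * cv * sum);
        }
    }
}
```

Each of the 64 output coefficients depends only on the input block, so the coefficients (and the blocks themselves) can be computed independently of one another.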

It is your task to optimize the c63 encoder using the CUDA framework.

The c63 encoder is very basic and shows behavior that you wouldn't accept from a standard encoder. This concerns, in particular, the Huffman tables and the unconditional use of motion vectors in non-I-frames. You should not modify these Huffman tables. You can decide to use conditional motion vectors, but you must search for motion vectors, and you must write code that potentially uses the whole motion vector search range (hard-coded to 16 in the precode).
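The requirement to cover the whole search range can be pictured as an exhaustive block search. The sketch below assumes 8x8 blocks, the sum of absolute differences (SAD) as the matching cost, and a search window clamped to the frame borders. These choices and the names `sad_8x8` and `full_search_8x8` are illustrative assumptions, not a copy of the precode.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two 8x8 blocks that live in planes
 * with the same row stride. */
static int sad_8x8(const uint8_t *a, const uint8_t *b, int stride)
{
    int sad = 0;
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            sad += abs(a[y * stride + x] - b[y * stride + x]);
    return sad;
}

/* Exhaustive search for the block at (bx, by) of `cur` inside `ref`.
 * Tries every displacement within +/- range pixels that keeps the candidate
 * block inside the frame, writes the best motion vector to (*mv_x, *mv_y),
 * and returns its SAD. The first candidate with the lowest SAD wins ties. */
int full_search_8x8(const uint8_t *cur, const uint8_t *ref,
                    int width, int height, int bx, int by, int range,
                    int *mv_x, int *mv_y)
{
    assert(bx >= 0 && by >= 0 && bx + 8 <= width && by + 8 <= height);

    int left   = (bx - range < 0) ? 0 : bx - range;
    int top    = (by - range < 0) ? 0 : by - range;
    int right  = (bx + range > width - 8) ? width - 8 : bx + range;
    int bottom = (by + range > height - 8) ? height - 8 : by + range;

    int best = INT_MAX;
    *mv_x = 0;
    *mv_y = 0;
    for (int y = top; y <= bottom; ++y) {
        for (int x = left; x <= right; ++x) {
            int sad = sad_8x8(cur + by * width + bx, ref + y * width + x, width);
            if (sad < best) {
                best = sad;
                *mv_x = x - bx;
                *mv_y = y - by;
            }
        }
    }
    return best;
}
```

With `range` set to 16, this sketch evaluates up to 33 x 33 candidate positions per block, and each candidate's SAD can be computed without knowing any other candidate's result.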

The video scenario is live streaming. You should not have an encoder pipeline of more than three frames. Also, you should avoid parallelization techniques that severely degrade video quality.

You should not replace the algorithms that you find in c63. Alternative motion vector search algorithms and DCT encoding algorithms provide considerable speedup potential. Still, they distract from this home exam's primary goal: identifying and implementing parallelization options.

Two test sequences in YUV format are available in the /mnt/sdcard directory on the lab machines:

foreman (352x288) CIF
tractor (1920x1080) 1080p

These should be used as input to the provided c63 encoder to test your implementations.
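When reading the raw input, it helps to know how many bytes one frame occupies. The sketch below assumes planar YUV 4:2:0 with 8-bit samples, which is an assumption about the test files rather than something stated above. `yuv420_frame_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame with 8-bit samples (assumed format):
 * one full-resolution luma plane plus two chroma planes that are
 * subsampled by two in both directions (rounded up for odd sizes). */
size_t yuv420_frame_bytes(int width, int height)
{
    assert(width > 0 && height > 0);
    size_t luma   = (size_t)width * (size_t)height;
    size_t chroma = (size_t)((width + 1) / 2) * (size_t)((height + 1) / 2);
    return luma + 2 * chroma;
}
```

Under that assumption, one foreman frame is 152064 bytes and one tractor frame is 3110400 bytes, so dividing the file size by these values gives the number of frames.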

Precode

The precode consists of the reference c63 code, including:

an encoder
a decoder
the command c63pred (which extracts the prediction buffer for debugging purposes)

The precode is written in C. You are not required to touch the decoder or c63pred.

The precode can be downloaded from a Git repository here (use the CUDA branch):

git clone https://github.com/griwodz/in5050-codec63.git

You must log in to the lab machines assigned to your group for this assignment. You should have received an email from the course administrators about user accounts. Information about how to access the machines can be found in the GPU FAQ. Using a CUDA GPU on a private computer for this assignment is also possible, but be aware that a discrete GPU does not have the shared memory architecture found on the Tegra.

You can adapt, modify, or rewrite the provided encoder to fully leverage the target architecture. You are, however, not allowed to replace the algorithms for Motion Estimation, Motion Compensation, or DCT/iDCT. You are not allowed to paste any other pre-written code into your implementation. You are also not allowed to post any code from the home exam online.

Start by profiling the encoder to identify which parts are bottlenecks. Remember, more profiling might be needed to identify new bottlenecks after optimizing a single code section.
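Besides a profiler, simple manual timers around encoder stages can confirm where time goes before and after an optimization. The following sketch uses the C11 function `timespec_get` for wall-clock time. The type `stage_timer` and the helper names are hypothetical and not part of the precode.

```c
#include <assert.h>
#include <time.h>

/* Current wall-clock time in seconds (C11 timespec_get, calendar time). */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Accumulated time and call count for one encoder stage. */
typedef struct {
    double total_seconds;
    long calls;
} stage_timer;

/* Adds one measured interval [start, end] to the stage's running total. */
void stage_timer_add(stage_timer *t, double start, double end)
{
    assert(t != NULL);
    t->total_seconds += end - start;
    t->calls += 1;
}
```

A stage is measured by calling `now_seconds()` before and after it and passing both values to `stage_timer_add`. When the stage launches GPU kernels, remember that kernel launches are asynchronous, so the device must be synchronized before the end time is taken.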

Some usage examples:

To encode the foreman test sequence:

$ ./c63enc -w 352 -h 288 -o /tmp/test.c63 foreman.yuv

To decode a sequence:

$ ./c63dec /tmp/test.c63 /tmp/test.yuv

To dump the prediction buffer (used to test motion estimation):

$ ./c63pred /tmp/test.c63 /tmp/test.yuv

To play back a raw YUV file:

$ mplayer -demuxer rawvideo -rawvideo w=352:h=288

Evaluation

Write a short report where you discuss your results. The exam will be graded on how well you can use the GPU architecture to solve the task.

In the evaluation, we will consider (in order) the following:

Motion Estimation, Motion Compensation, and DCT/iDCT algorithmic functions in the source code have been offloaded to the GPU.
- Document the bottleneck and the effect of your optimization.
A program that works (on the lab machines)
- Runs to completion. (*)
- Encodes foreman and tractor correctly.
- The output video has a quality and file size similar to those of the reference encoder's output.
- Readable, well-commented code
Effect of the GPU offload
- Understanding the GPU architecture.
  - Minimizing the overhead of moving data between the CPU and GPU.
  - The correctness of memory use on the GPU (memory types, bank conflicts) and GPU code optimization regarding branching.
- Bonus points can be given for non-obvious optimizations.
The quality of the report that accompanies the code
- Clear and structured description of the performance changes caused by your modifications to the precode
- References to the relevant parts of the accompanying code (to aid the reviewer of the submitted assignment)
- Graphical presentation of the optimization steps and performance results (plots of performance changes)
- Comparison of / reflection about the alternative approaches tried out by your group.

(*) We do not debug code before testing; correctness and effectiveness are not evaluated if this is not fulfilled.
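One way to check that the output has a similar quality to the reference encoder is to decode both files and compare each against the original input. The sketch below computes the mean squared error (MSE) of two 8-bit sample buffers. For 8-bit video, PSNR follows as 10 * log10(255^2 / MSE). The function name `plane_mse` is hypothetical, and this check is a suggestion rather than part of the precode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean squared error between two buffers of n 8-bit samples,
 * for example the luma planes of an original and a decoded frame. */
double plane_mse(const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    return sum / (double)n;
}
```

An MSE of zero means the buffers are identical. Comparing the MSE of your encoder's output with that of the reference encoder, both measured against the original input, shows whether an optimization has changed the video quality.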

Report

You must write the results as a technical report of no more than four pages in ACM format (the ACM format is also available in Overleaf). The report should serve as a guide to the code modifications you have made and the resulting performance changes.

Machine Setup

The lab machines with CUDA GPUs are at IFI. Machine names and access to them are available in the GPU FAQ. If you have reported your group to the course administration, you should have been assigned a user account and provided with a username and a password.

Contact in5050@ifi.uio.no if you have problems logging in.

Formal Information

The deadline for handing in your assignment:

Design: Tuesday, February 10th at 10:00
Code: Tuesday, March 3rd at 23:59
Report: Friday, March 6th at 15:00

Deliver your code (as ZIP, TGZ, etc.) and report (as PDF) to https://devilry.ifi.uio.no/.

Submit the design review and poster (as PDF) to in5050@ifi.uio.no.

The groups should also prepare a poster (to show on-screen during the lecture) and a 5-minute talk for the class on March 3rd. There will be a prize for the best poster/presentation (awarded by an independent panel and independent of the grade).

For questions and course-related chatter, we have created a Mattermost space: https://mattermost.uio.no/ifi-undervisning/channels/in5050

Please check the GPU FAQ page for updates, and the FAQ.

For questions, please contact:

in5050@ifi.uio.no

[1] http://www.w3.org/Graphics/JPEG/itu-t81.pdf

[2] http://en.wikipedia.org/wiki/JPEG#JPEG_codec_example