Digital Amphitheatre

Tiling Architecture

Each attendee in the amphitheater participates by unicasting video to their closest tiling agent, located via an anycast address. The agent, in turn, tiles together all the video streams it receives, multicasting the result to the group. All participants join the group, receiving and displaying the combined audience video. The panel members and speaker send directly to the multicast group, thus avoiding the tiling.

Spatial Tiling

The tiling agents combine video streams from several sources into a single merged stream, generating one high-bandwidth stream in place of the individual lower-bandwidth video streams.

The Spatial Tiling Process

To illustrate the spatial tiling operation, consider the example in the figure below: three streams of video are spatially tiled and represented as a single frame. The tiled frame consists of the three frames tiled side by side, with each of the frames being completely represented. The meta-data for each of the frames, in this case the frame size and block coordinates, are adjusted accordingly. Frame size reflects the total frame size and block coordinates are transposed to the correct location.

Spatial Image Tiling

It is important that spatial tiling does not add additional delay to the video stream. Tiling agents only parse and deconstruct the incoming video streams into smaller building blocks, whilst maintaining their relevant meta-data: no decompression is done in the tiling agent. To maintain independence between incoming and outgoing frame-rates, two sets of buffers are maintained per stream. The tiled frame is constructed at given intervals (determined by the outgoing frame rate) from the output buffers. New incoming frames are copied from the incoming buffer to the output buffers, once they are received in full.
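The two-buffer arrangement can be sketched as follows. This is a minimal illustration rather than the agent's actual code: the class and function names, the `end_of_frame` flag (which could correspond to the RTP marker bit), and the choice to replace the output buffer wholesale are all assumptions.

```python
class StreamBuffers:
    """Two buffers for one incoming stream (hypothetical names)."""

    def __init__(self):
        self.incoming = {}  # blocks of the frame still being received
        self.output = {}    # blocks of the last fully received frame

    def on_block(self, coord, data, end_of_frame):
        """Store one parsed block; publish the frame once it is complete."""
        self.incoming[coord] = data
        if end_of_frame:
            # Copy only complete frames, so the tiler never sees a partial one.
            self.output = dict(self.incoming)
            self.incoming.clear()


def build_tiled_frame(streams):
    """Called once per outgoing frame interval; reads output buffers only."""
    return {sid: dict(buf.output) for sid, buf in streams.items()}


a = StreamBuffers()
a.on_block((0, 0), b"x", end_of_frame=False)
partial = build_tiled_frame({"a": a})   # frame not complete yet
a.on_block((1, 0), b"y", end_of_frame=True)
complete = build_tiled_frame({"a": a})  # both blocks now visible
```

Because the tiler reads only the output buffers, the outgoing frame rate depends on how often `build_tiled_frame` is called, not on when the sources happen to deliver frames.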

Although it is theoretically possible to tile an unlimited number of streams up to the network MTU, we have restricted the tiling to 15 video streams. This restriction allows us to use the built-in mixer functionality of RTP/RTCP, since an RTP packet can carry the contributing source identifiers for up to 15 different sources. The input streams can be tiled in any geometry requested: for 15 streams the agent can generate a 4x4 square (where the last square will be empty), a 5x3 rectangle, or a single row or column of 15.
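As an illustration of the geometry choice, the sketch below computes the grid for a requested number of columns. The function name `grid_layout` and the row-major placement of tiles are assumptions made for the example; the text does not say in which order the agent fills the grid.

```python
import math

MAX_TILES = 15  # limit stated in the text (RTP contributing source count)


def grid_layout(n_streams, columns):
    """Return (rows, empty_cells, positions) for a row-major grid."""
    if not 1 <= n_streams <= MAX_TILES:
        raise ValueError("between 1 and 15 streams can be tiled")
    if columns < 1:
        raise ValueError("columns must be positive")
    rows = math.ceil(n_streams / columns)
    # positions[i] is the (column, row) cell of the i-th stream.
    positions = [(i % columns, i // columns) for i in range(n_streams)]
    return rows, rows * columns - n_streams, positions


for columns in (15, 5, 4, 1):
    rows, empty, _ = grid_layout(15, columns)
    print(f"{columns} columns x {rows} rows, {empty} empty")
```

With 15 streams and 4 columns the grid has 4 rows and one empty cell, matching the 4x4 case described above.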

In our current implementation, the spatial tiling agents support two video representations: high-bandwidth raw video in component YUV form with conditional replenishment (YUVCR), and H.261 using only intra-frame compression. At the receiver, any YUVCR decoder can receive and display a tiled YUVCR stream. For H.261, however, we have added a tiled H.261 decoder, H.261t, to the video conferencing tool vic in order to receive and decode a tiled H.261 stream. This was necessary because the tiled stream no longer complies with standard H.261: both the H.261 syntax and the RTP payload header have been slightly modified.

Tiling YUVCR

The simple structure of YUVCR is well suited to spatial tiling. Each video frame is divided into macro blocks of 16x16 pixels, represented in planar YUV format with 4:2:0 color subsampling. The conditional replenishment algorithm ensures that only updated video macro blocks are transmitted, which provides some reduction in the data rate of what is otherwise raw video. Unlike H.261, no meta-data related to the frame is carried in the frames themselves. The size of the video frames and the macro block coordinates are carried in an RTP payload header. For example, the figure below displays the RTP payload for frame 2 of our tiling example, where macro blocks (2,2) and (3,3) of the frame have been updated and must be transmitted. The RTP payload contains first the height and width of the video frame (80x64), followed by the coordinates and data of the two updated macro blocks. This structure not only allows a high degree of flexibility in terms of frame sizes, up to 4096 pixels in each axis (the numbers are actually stored in multiples of 8), but also lends itself very well to spatial tiling.

Tiling YUVCR

Tiling YUVCR frames essentially consists of manipulating the size of the video frame in the RTP payload header and the coordinates of each 16x16 macro block so that they reflect the position of the macro block in the new tiled frame. As the information in the RTP payload is sufficient for tiled YUVCR, the payload format does not need to change, and neither does the YUVCR decoder: at the receiving side it renders and displays the tiled frame in the usual manner.
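The coordinate adjustment can be sketched in a few lines of Python. This is a minimal illustration, not the agent's code: the `Payload` type, the `tile_side_by_side` function, and the reading of the 80x64 example as 80 pixels wide by 64 high are assumptions, and block data is treated as opaque bytes.

```python
from dataclasses import dataclass

MB = 16  # macro block edge in pixels


@dataclass
class Payload:
    """Hypothetical model of a YUVCR payload: frame size plus updated blocks."""
    width: int    # frame width in pixels
    height: int   # frame height in pixels
    blocks: list  # [(x, y, data)], x and y in macro block units


def tile_side_by_side(payloads):
    """Merge payloads into one, placing the frames left to right."""
    x_offset = 0  # running offset in macro block units
    blocks = []
    for p in payloads:
        for x, y, data in p.blocks:
            # Only the coordinates change; the block data is copied as is.
            blocks.append((x + x_offset, y, data))
        x_offset += p.width // MB
    width = x_offset * MB
    height = max(p.height for p in payloads)
    return Payload(width, height, blocks)


frames = [
    Payload(80, 64, []),
    Payload(80, 64, [(2, 2, b"..."), (3, 3, b"...")]),  # frame 2 of the example
    Payload(80, 64, []),
]
tiled = tile_side_by_side(frames)
print(tiled.width, tiled.height)             # 240 64
print([(x, y) for x, y, _ in tiled.blocks])  # [(7, 2), (8, 3)]
```

Under these assumptions each 80-pixel-wide frame is five macro blocks wide, so the blocks of the second frame shift right by five columns while their rows stay the same.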

Tiling H.261

Tiling H.261 frames is, in essence, the same as tiling YUVCR frames, since H.261 also divides each video frame into 16x16 macro blocks, which are the smallest building blocks that the spatial tiling agents (STAs) process. However, tiling H.261 is somewhat complicated by the intricate header system used to describe a frame and by the use of Huffman encoding. The figure below shows the syntax diagram for an H.261 video frame. As the diagram makes clear, extracting a macro block requires manipulating non-byte-aligned bit values and variable length fields. Following the solid lines in the macro block layer of the diagram produces the headers for intra-frame H.261.

H.261 syntax

An H.261 video frame consists of three layers: a picture layer, a group of blocks (GOB) layer and a macro block (MB) layer. Each GOB is divided into 33 macro blocks, arranged in a 3x11 matrix. H.261 supports two scanning formats: CIF and QCIF. A CIF frame contains 12 GOBs numbered consecutively from 1 to 12, whereas a QCIF frame contains 3 GOBs numbered 1, 3 and 5.

H.261 image size

In the tiled H.261 frame, GOBs are numbered consecutively from 1 to Nx3 + Mx12, where N is the number of QCIF frames and M is the number of CIF frames tiled (currently N and M cannot both be non-zero; tiling of mixed CIF and QCIF frames is planned for a future version). Since the standard H.261 GOB header allocates only 4 bits to the GOB Number (GN) field, it is necessary to extend this field to 8 bits in the tiled frame, so as to accommodate up to 15 CIF frames (180 GOBs). GOBs are therefore renumbered within the STAs prior to packetization. No changes are made to macro block headers: each macro block header is copied to the tiled frame as is. The tiled frame is preceded by a single picture header.
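The renumbering can be illustrated with a short sketch. The function name `renumber_gob` and the assumption that tiles are numbered in order (tile 0 first, then tile 1, and so on) are ours; the text only states that the tiled GOB numbers run consecutively from 1.

```python
# GOB numbers used by each source format, as given in the text.
SOURCE_GOBS = {
    "QCIF": (1, 3, 5),
    "CIF": tuple(range(1, 13)),
}


def renumber_gob(tile_index, gob_number, fmt):
    """Map a source GOB number to its number in the tiled frame.

    tile_index is 0-based; tiles are assumed to be numbered in order.
    """
    gobs = SOURCE_GOBS[fmt]
    position = gobs.index(gob_number)  # ValueError if not valid for fmt
    tiled = tile_index * len(gobs) + position + 1
    if tiled > 0xFF:
        raise ValueError("does not fit the 8 bit GN field")
    return tiled


print(renumber_gob(0, 5, "QCIF"))    # 3
print(renumber_gob(14, 5, "QCIF"))   # 45
print(renumber_gob(14, 12, "CIF"))   # 180
```

The last value shows why 4 bits are not enough: fifteen CIF frames need GOB numbers up to 180, which fits in 8 bits but not in 4.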

Overall, the primary changes made to the tiled H.261 frame are replacing the individual picture headers with a single picture header and extending the GN field to 8 bits. For a tiled frame of 15, using a single picture header saves 56 bytes (14 picture headers at 4 bytes each), and the 4 bit increase in the GN field adds 12 bits per QCIF frame, or 22.5 bytes for a tiled frame of 15. All in all, 15 tiled QCIF frames save 33.5 bytes when compared with the non-tiled frames. For CIF frames the additional 4 bits per GOB add up to 6 bytes per frame, or 90 bytes for 15 frames. Therefore, when 15 CIF frames are tiled, the tiled frame is 34 bytes larger than the individual non-tiled frames. However, this increase is ameliorated by the reduction in the number of packets and per packet overhead (40 bytes for each IP/UDP/RTP header) once the frame is packetized.
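The byte counts above follow from simple arithmetic, reproduced here as a check. The function name is ours; the constants are the ones quoted in the text.

```python
PICTURE_HEADER_BYTES = 4  # per picture header, as stated in the text
EXTRA_GN_BITS = 4         # GN field grows from 4 to 8 bits


def tiled_size_change_bytes(n_frames, gobs_per_frame):
    """Bytes added (positive) or saved (negative) by tiling n_frames."""
    saved = (n_frames - 1) * PICTURE_HEADER_BYTES          # one header remains
    added = n_frames * gobs_per_frame * EXTRA_GN_BITS / 8  # wider GN fields
    return added - saved


qcif_change = tiled_size_change_bytes(15, 3)   # QCIF: 3 GOBs per frame
cif_change = tiled_size_change_bytes(15, 12)   # CIF: 12 GOBs per frame
print(qcif_change)  # -33.5
print(cif_change)   # 34.0
```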

The increase in the size of the GOB number field must also be reflected in the standard RTP payload for H.261, which only allocates 4 bits for the GOBN. We define a new RTP payload header for tiled H.261 where GOBN is extended to 8 bits. To maintain the 4 byte size of the payload header, for intra-frame H.261, we have eliminated the two motion vectors from the payload header (since our implementation does not currently support motion vectors).
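For illustration, the sketch below packs a 4 byte header with an 8 bit GOBN and no motion vector fields. The field widths for SBIT, EBIT, I, V, MBAP and QUANT are those of the standard H.261 RTP payload header; the field order and the placement of the 6 bits left over as trailing padding are assumptions, since the text does not give the exact H.261t layout.

```python
# (name, width in bits); order and trailing padding are assumed.
TILED_FIELDS = (
    ("sbit", 3), ("ebit", 3), ("i", 1), ("v", 1),
    ("gobn", 8),   # extended from 4 bits
    ("mbap", 5), ("quant", 5),
)


def pack_tiled_h261_header(**values):
    """Pack the fields into a 4 byte big-endian header."""
    word = 0
    used = 0
    for name, width in TILED_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
        used += width
    word <<= 32 - used  # remaining 6 bits left as zero padding
    return word.to_bytes(4, "big")


header = pack_tiled_h261_header(i=1, gobn=180)
print(header.hex())  # 02b40000
```

The fields shown use 26 of the 32 bits, so removing the two 5 bit motion vector fields more than pays for the 4 extra GOBN bits.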

The H.261t decoder also needs the requested formation of the tiled frame: is it, for example, a row of 15 or a block of 3x5? This information is signaled out of band to the STA, and the number of tiled frames is extracted from the number of contributing sources, which allows the H.261t decoder to deduce the height and width of the tiled H.261t frame.

Performance of Tiling Agents

The main goal of our performance measurements was to answer the following question: What benefits are gained by using our STAs? To identify potential performance gains we decided to measure and quantify: (1) bandwidth, in bits per second (bps); (2) packets per second (pps); and (3) the total number of streams the end system is capable of decoding and rendering (N). We compared the value of these variables in a conferencing session with and without the use of STAs. The test material comprised 15 YUVCR streams and 15 H.261 streams, all recorded at 8fps and 2 minutes long.

The receiving system was what was considered an average user grade system at the time: a 550 MHz Pentium III machine with 256 MB of memory, running Red Hat Linux 7. The STAs were initially run on a somewhat lower grade system, a 400 MHz Pentium II with 64 MB of memory running FreeBSD 3.4, which was adequate in other respects but had insufficient memory to run multiple STAs. A more powerful system, with 512 MB of memory, was used to host the tiling agents during the tests we report. Work is underway to reduce the memory footprint of the tiling agents, since they are otherwise not very compute intensive and require only a few percent of CPU time.

In our initial set of trials, we measured bps and pps. To do so, first we streamed the 15 test video streams individually to vic. Next, we ran the test video through our STA with the output frame rate set to 8fps. To measure the bitrate, bps, and packet rate, pps, we instrumented vic such that it logged these variables, along with other decoding statistics, to a file.

The figure below shows the reduction in pps, which is due to the aggregation of smaller packets. The STA generates a single large frame, so there are fewer 'half empty' packets in the resulting stream. In the tiled YUVCR stream pps is reduced by approximately 13%, and in the tiled H.261 stream by approximately 35%. The greater reduction for H.261 in our trials results from the more flexible nature of H.261 and its compression scheme. H.261 UDP packets range in size from 48 bytes to 1024 bytes. A YUVCR packet, on the other hand, can only hold one, two or three macro blocks, which results in UDP packet sizes of 400, 786 and 1172 bytes. There are therefore fewer ways to aggregate packets for the tiled YUVCR stream whilst remaining within the 1500 byte Ethernet MTU, which results in a smaller reduction in pps.

Change in packet rate due to tiling
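The three YUVCR packet sizes quoted above can be reproduced with a short calculation. A 16x16 macro block in 4:2:0 format occupies 384 bytes, assuming 8 bit samples. The split of the remaining bytes into 14 fixed bytes per packet and 2 bytes per block is an inference from the quoted sizes, not something the text states.

```python
MB_BYTES = 16 * 16 + 2 * (8 * 8)  # luma plus two subsampled chroma blocks = 384
PER_BLOCK_EXTRA = 2               # inferred: per-block coordinate bytes
FIXED_BYTES = 14                  # inferred: fixed bytes per packet
MTU = 1500                        # Ethernet MTU in bytes


def yuvcr_packet_size(n_blocks):
    """Packet size in bytes for n_blocks macro blocks, under the inferred split."""
    return FIXED_BYTES + n_blocks * (MB_BYTES + PER_BLOCK_EXTRA)


sizes = [yuvcr_packet_size(n) for n in range(1, 5)]
max_blocks = max(n for n in range(1, 5) if yuvcr_packet_size(n) <= MTU)
print(sizes)       # [400, 786, 1172, 1558]
print(max_blocks)  # 3
```

A fourth macro block would push the packet past the MTU, which is why a YUVCR packet holds at most three and why the tiling agent has so few ways to fill packets more efficiently.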

In terms of bandwidth, bps, the tiled YUVCR stream is reduced by an average of 3% and the H.261 stream by 4%. Although bandwidth is reduced over the duration of the test runs, the graphs reveal that this is not the case on a per minute basis, as in some instances the bps of the separate streams appears to be less than that of the tiled stream. This is in part due to synchronization differences between the separate streams and the tiled stream, and in part due to measurement artifacts resulting from the averaging process. We also note that the low reduction in bandwidth is to be expected. In these tests the STA reproduces the input video streams exactly as they come in, without any temporal or spatial down sampling. Both the input and output frame rates of the STAs are 8 fps, and the STAs more or less copy each incoming frame to the outgoing tiled frame. There are some instances where incoming frames are not pushed out by the STA due to synchronization variability between incoming and outgoing frame rates, which can result in a dropped frame. However, as our data shows, this does not happen often. The existing reduction in bps is mainly a reflection of the reduced pps and lower packet overhead.

Change in data rate due to tiling

Finally, we turned our attention to the performance of the end-system, quantifying the number of streams that can be supported with the aid of the tiling agents. Our decoder maintains statistics on the number of packets correctly decoded and on packets discarded due to late arrival or lack of rendering time. We used these statistics to measure the maximum number of streams, N, our end-system could receive without loss, both with and without the STAs. We did this by incrementally increasing the number of individual streams until the end-system reached saturation. For the H.261 video streams, the system could decode and render up to 55 individual video streams without loss, at which point CPU utilization was at 100%. When receiving tiled H.261, the system could receive 6 tiled streams of 15 and an additional stream of 2 tiles, comprising a total of 97 individual streams; the untiled limit of 55 streams is 43% lower than this figure. For YUVCR, our end-system was able to receive 42 individual streams without loss, whereas with tiling it could process 3 fully tiled streams and one tiled stream of 12, a total of 57 streams.

These numbers clearly demonstrate the reduction of workload on the end-system due to the spatial tiling process. For our H.261 streams the end-system is capable of receiving almost twice as many video streams once they are tiled, and for YUVCR the number of streams is increased by about 25% (measured against the tiled total). The greater increase in N for H.261 is in part a reflection of the greater reduction in pps for the H.261 streams. In fact, in all our tests H.261 fared better than YUVCR. Although YUVCR requires no processing time, its higher data rates and inflexible packetizing scheme seem to make it an unsuitable candidate for spatial tiling.

This leads us to conclude that a primary load on end-systems is per packet interrupt processing rather than the computational complexity of the decoding process. Spatial tiling is therefore more amenable to relatively highly compressed video streams, where the average packet size is significantly smaller than the network MTU. Having a significant number of half-full packets gives the STAs more leverage in reducing the overall packet rate.

Cumulative distribution of packet sizes

Other Considerations

Our work has focussed on tiling intra-frame video only, for ease of implementation and to demonstrate the concept. Tiling inter-frame codecs can pose interesting challenges in terms of synchronization. Of course, in the best case scenario where the keyframes and intermediate frames of all incoming video streams are in sync, the STA will simply produce either keyframes or intermediate frames, based on the last frame it has received from all streams. However, it is unlikely that the incoming streams will always be in sync; furthermore, they may simply have different keyframe distributions. So what is an STA to do?

There are two possible solutions when a tiling agent is confronted with a mixture of keyframes and intermediate frames. In the first solution, the STA simply produces two video frames at once: one frame tiled from the latest keyframes and another from the latest intermediate frames. Although this is a simple solution, the frame rate of the tiled stream can become rather erratic, and could in the worst case be double the intended rate. The second solution avoids this pitfall, but is far more computationally intensive: the tiling agent retains at all times the latest frame for each video stream, that is, the last keyframe plus any updates since. When any number of actual keyframes must be tiled, the STA tiles the keyframes plus the update frames that it maintains for all streams.

Of course, each technique has its own trade-offs. The first solution increases the frame rate of the outgoing tiled video stream, although it does not increase bit-rate. Conversely, the second solution increases bit-rate but maintains the frame rate of the outgoing stream, and is more resilient to packet loss. It also imposes additional computational overhead and buffering requirements on the tiling agent. The appropriateness of each solution is application dependent.
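The data flow of the two solutions can be sketched as follows. This is a conceptual illustration only, since the work described here tiles intra-frame video: the names `tile_separately`, `StreamHistory` and `tile_merged` are hypothetical, frames are opaque values, and no actual codec manipulation is modelled.

```python
def tile_separately(latest):
    """First solution: emit up to two tiled frames, one per frame type.

    latest maps stream id -> (kind, payload), kind being "key" or "inter".
    """
    out = []
    for kind in ("key", "inter"):
        tiles = {sid: p for sid, (k, p) in latest.items() if k == kind}
        if tiles:
            out.append((kind, tiles))
    return out


class StreamHistory:
    """State kept per stream by the second solution."""

    def __init__(self):
        self.keyframe = None  # last keyframe received
        self.updates = []     # every intermediate frame since then

    def push(self, kind, payload):
        if kind == "key":
            self.keyframe, self.updates = payload, []
        else:
            self.updates.append(payload)


def tile_merged(latest, histories):
    """Second solution: always emit exactly one tiled frame."""
    if all(kind == "inter" for kind, _ in latest.values()):
        return ("inter", {sid: p for sid, (_, p) in latest.items()})
    # A keyframe is present: send every stream as keyframe plus its updates.
    return ("key", {sid: (h.keyframe, tuple(h.updates))
                    for sid, h in histories.items()})


histories = {"a": StreamHistory(), "b": StreamHistory()}
histories["a"].push("key", "K1")
histories["b"].push("key", "K0")
histories["b"].push("inter", "u3")
latest = {"a": ("key", "K1"), "b": ("inter", "u3")}

two_frames = tile_separately(latest)
one_frame = tile_merged(latest, histories)
```

The first function returns two frames for the mixed case, which is where the doubled frame rate comes from; the second returns one, at the cost of retransmitting stream b's keyframe and keeping its history in memory.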