FAQ about multimedia

VCN (Video, Compression, Networking) Glossary
This is a collection of often used and misused technical terms regarding video, compression and networking.
Many sources contributed to this list.


If you wish to contribute, correct any mistake or just send your comments and impressions please contact :


Luigi.Filippini@crs4.it
ATV (Advanced TV)
Although sometimes used interchangeably, advanced and high-definition television (HDTV) are not one and the same. Advanced television (ATV) would distribute wide-screen television signals with resolution substantially better than current systems. It requires changes to current emission regulations, including transmission standards. In addition, ATV would offer at least two-channel, CD-quality audio.
A:B:C notation
The a:b:c notation for sampling ratios, as found in the CCIR-601 specifications, has the following meaning :
4:2:2 means 2:1 horizontal downsampling, no vertical downsampling. (Think "4 Y samples for every 2 Cb and 2 Cr samples in a scanline".)
4:1:1 *ought* to mean 4:1 horizontal downsampling, no vertical. (Think "4 Y samples for every 1 Cb and 1 Cr sample in a scanline".) But it is often misused to mean the same as:
4:2:0 means 2:1 horizontal and 2:1 vertical downsampling. (Think "I want some of whatever these guys were taking.")
Not only is this notation not internally consistent, but it is incapable of being extended to represent any unusual sampling ratios, e.g. different ratios for the Cb and Cr channels.
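As a rough illustration (the lookup table and helper below are hypothetical, not part of any standard), the usual interpretation of the common labels can be written down directly:

  # Hypothetical lookup: (horizontal, vertical) subsampling of Cb and Cr
  # relative to Y, as implied by the common a:b:c labels discussed above.
  SUBSAMPLING = {
      "4:4:4": (1, 1),   # no chroma subsampling
      "4:2:2": (2, 1),   # 2:1 horizontal, no vertical
      "4:1:1": (4, 1),   # 4:1 horizontal, no vertical
      "4:2:0": (2, 2),   # 2:1 horizontal and 2:1 vertical
  }

  def chroma_plane_size(width, height, label):
      """Dimensions of one chroma plane (Cb or Cr) for a given luma size."""
      h, v = SUBSAMPLING[label]
      return width // h, height // v

  # e.g. chroma_plane_size(720, 480, "4:2:0") -> (360, 240)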
Arithmetic Coding
Perhaps the major drawback to each of the Huffman encoding techniques is their poor performance when processing texts where one symbol has a probability of occurrence approaching unity. Although the entropy associated with such symbols is extremely low, each symbol must still be encoded as a discrete value.
Arithmetic coding removes this restriction by representing messages as intervals of the real numbers between 0 and 1. Initially, the range of values for coding a text is the entire interval [0, 1]. As encoding proceeds, this range narrows while the number of bits required to represent it expands. Frequently occurring characters reduce the range less than characters occurring infrequently, and thus add fewer bits to the length of an encoded message.
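A minimal sketch of the interval-narrowing idea follows (it assumes a fixed, known symbol model given as probabilities; practical coders adapt the model and emit bits incrementally rather than returning one exact fraction):

  from fractions import Fraction

  # Minimal arithmetic-coding sketch: the message is represented by a single
  # number inside a progressively narrowed sub-interval of [0, 1).
  def arithmetic_encode(message, probs):
      # Build cumulative sub-intervals of [0, 1) for each symbol.
      cum, ranges = Fraction(0), {}
      for sym, p in probs.items():
          ranges[sym] = (cum, cum + p)
          cum += p

      low, high = Fraction(0), Fraction(1)
      for sym in message:
          span = high - low
          sym_low, sym_high = ranges[sym]
          high = low + span * sym_high   # a frequent symbol narrows the
          low = low + span * sym_low     # interval less than a rare one
      return (low + high) / 2            # any value inside [low, high) will do

  # A skewed source: each 'a' adds only a fraction of a bit to the code length.
  code = arithmetic_encode("aaaab", {"a": Fraction(9, 10), "b": Fraction(1, 10)})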

ATM
ATM (Asynchronous Transfer Mode) is a switching/transmission technique where data is transmitted in small, fixed-size cells (5 byte header, 48 byte payload). The cells lend themselves both to the time-division-multiplexing characteristics of the transmission media, and the packet switching characteristics desired of data networks. At each switching node, the ATM header identifies a virtual path or virtual circuit that the cell contains data for, enabling the switch to forward the cell to the correct next-hop trunk. The virtual path is set up through the involved switches when two endpoints wish to communicate. This type of switching can be implemented in hardware, almost essential when trunk speeds range from 45 Mb/s to 1 Gb/s.
B-Y R-Y
The human visual system has much less acuity for spatial variation of colour than for brightness. Rather than conveying RGB, it is advantageous to convey luma in one channel, and colour information that has had luma removed in the two other channels. In an analog system, the two colour channels can have less bandwidth, typically one-third that of luma. In a digital system each of the two colour channels can have considerably less data rate (or data capacity) than luma.
Green dominates the luma channel: about 59% of the luma signal comprises green information. Therefore it is sensible, and advantageous for signal-to-noise reasons, to base the two colour channels on blue and red. The simplest way to remove luma from each of these is to subtract it to form the difference between a primary colour and luma. Hence, the basic video colour-difference pair is (B-Y), (R-Y) [pronounced "B minus Y, R minus Y"].

The (B-Y) signal reaches its extreme values at blue (R=0, G=0, B=1; Y=0.114; B-Y=+0.886) and at yellow (R=1, G=1, B=0; Y=0.886; B-Y=-0.886). Similarly, the extrema of (R-Y), +-0.701, occur at red and cyan. These are inconvenient values for both digital and analog systems. The colour spaces YPbPr, YCbCr, PhotoYCC and YUV are simply scaled versions of (Y, B-Y, R-Y) that place the extrema of the colour difference channels at more convenient values.

Bridge
Bridges are devices that connect similar and dissimilar LANs at the data link layer (OSI layer 2), regardless of the physical layer protocols or media being used. Bridges require that the networks have consistent addressing schemes and packet frame sizes. Current introductions have been termed learning bridges since they are capable of updating node address (tracking) tables as well as overseeing the transmission of data between two Ethernet LANs.
Brouter
Brouters are bridge/router hybrid devices that offer the best capabilities of both devices in one unit. Brouters are actually bridges capable of intelligent routing and therefore are used as generic components to integrate workgroup networks. The bridge function filters information that remains internal to the network and is capable of supporting multiple higher-level protocols at once.
The router component maps out the optimal paths for the movement of data from one point on the network to another. Since the brouter can handle the functions of both bridges and routers, as well as bypass the need for the translation across application protocols with gateways, the device offers significant cost reductions in network development and integration.

CCITT
Comité Consultatif International Télégraphique et Téléphonique (International Telegraph and Telephone Consultative Committee). A committee of the International Telecommunication Union responsible for making technical recommendations about telephone and data communication systems for PTTs and suppliers. Plenary sessions are held every four years to adopt new standards.
CD-DA
CD-DA (Compact Disc-Digital Audio) discs are standard music CDs. CD-DA begat CD-ROM when people realized that you could store a whole bunch of computer data on a 12cm optical disc (650 MB). CD-ROM drives are simply another kind of digital storage media for computers, albeit read-only. They are peripherals just like hard disks and floppy drives. (Incidentally, the convention is that when referring to magnetic media, it is spelled disk. Optical media like CDs, LaserDisc, and all the other formats are spelled disc.)
CD-I
CD-I means Compact Disc Interactive. It is meant to provide a standard platform for mass consumer interactive multimedia applications. So it is more akin to CD-DA, in that it is a full specification for both the data/code and standalone playback hardware: a CD-I player has a CPU, RAM, ROM, OS, and audio/video/(MPEG) decoders built into it. Portable players add an LCD screen and speakers/phonejacks. It has limited motion video and still image compression capabilities. It was announced in 1986, and was in beta test by Spring 1989.
This is a consumer electronics format that uses the optical disc in combination with a computer to provide a home entertainment system that delivers music, graphics, text, animation, and video in the living room. Unlike a CD-ROM drive, a CD-I player is a standalone system that requires no external computer. It plugs directly into a TV and stereo system and comes with a remote control to allow the user to interact with software programs sold on discs. It looks and feels much like a CD player except that you get images as well as music out of it and you can actively control what happens. In fact, it is a CD-DA player and all of your standard music CDs will play on a CD-I player; there is just no video in that case.

For a CD-I disc, there may be as few as 1 or as many as 99 data tracks. The sector size in the data tracks of a CD-I disc is approximately 2 kbytes. Sectors are randomly accessible, and, in the case of CD-I, sectors can be multiplexed in up to 16 channels for audio and 32 channels for all other data types. For audio these channels are equivalent to having 16 parallel audio data channels instantly accessible during the playing of a disc.

If you want information about Philips CD-I products, you can call these numbers:

 US: Consumer hotline:     800-845-7301
     For nearest store:    800-223-7772
     Developers hotline:   800-234-5484
 UK: Philips CD-I hotline: 0800-885-885
Some useful references about CD-I are :
 "Discovering CD-I" available for $45 from:
  "Discovering CD-I" Microware Systems Corporation 1900 NW 114th Street Des Moines,
  IA 50325-7077 1-800-475-9000 Other books by Philips IMS and published by Addison
  Wesley: "Introducing CD-I" ISBN 0-201-62748-5 "The CD-I Production Handbook"
  ISBN 0-201-62750-7 "The CD-I Design Handbook" ISBN 0-201-62749-3
CD-ROM
CD-ROM means "Compact Disc Read Only Memory". A CD-ROM is physically identical to a Digital Audio Compact Disc used in a CD player, but the bits recorded on it are interpreted as computer data instead of music. You need to buy a CD-ROM Drive and attach it to your computer in order to use CD-ROMs.
A CD-ROM has several advantages over other forms of data storage, and a few disadvantages. A CD-ROM can hold about 650 megabytes of data, the equivalent of thousands of floppy disks. CD-ROMs are not damaged by magnetic fields or the X-rays in airport scanners. The data on a CD-ROM can be accessed much faster than a tape, but CD-ROMs are 10 to 20 times slower than hard disks.

You cannot write to a CD-ROM. You buy a disc with the data already recorded on it. There are thousands of titles available.

CD-XA
CD-XA is a CD-ROM extension being designed to support digital audio and still images.
Announced in August 1988 by Microsoft, Philips, and Sony, the CD-ROM XA (for Extended Architecture) format incorporates audio from the CD-I format. It is consistent with ISO 9660 (the volume and file structure of CD-ROM), is an application extension of the Yellow Book, and draws on the Green Book.

CD-XA defines another way of formatting sectors on a CD-ROM, including headers in the sectors that describe the type (audio, video, data) and some additional info (markers, resolution in case of a video or audio sector, file numbers, etc).

The data written on a CD-XA can still be in ISO9660 file system format and therefore be readable by MSCDEX and Unix CD-ROM file system translators. A CD-I player can also read CD-XA discs even if its own `Green Book' file system only resembles ISO9660 and isn't fully compatible. However, when a disc is inserted in a CD-I player, the player tries to load an executable application from the CD-XA, normally some 68000 application in the /CDI directory. Its name is stored in the disc's primary volume descriptor. CD-XA bridge discs, like Kodak's PhotoCDs, do have such an application; ordinary CD-XA discs don't.

A CD-XA drive is a CD-ROM drive with some of the compressed audio capabilities found in a CD-I player (called ADPCM). This allows interleaving of audio and other data so that an XA drive can play audio and display pictures (or other things) simultaneously. There is special hardware in an XA drive controller to handle the audio playback. This format came from a desire to inject some of the features of CD-I back into the professional market.

Cell Compression (from Sun Microsystem Inc.)
Cell is a compression technique developed by SMI. The compression algorithms, the bit-stream definition, and the decompression algorithms are open; that is, Sun will tell anybody who is interested about them. Cell compression is similar to MPEG and H.261 in that there is a lot of room for value-add on the compressor end. Getting the highest quality image from a given bit count at a reasonable amount of compute is an art. In addition, the bit-stream completely defines the compression format and what the decoder must do, so there is less art in the decoder.
There are two flavors of Cell: the original, called Cell or CellA, and a newer flavor called CellB. CellA is designed for use-many-times video, where one does not mind that the encoder runs at less than real time. For example, CD-ROM playback, distance learning, video help for applications. CellB is designed for use-once video, where the encoder must run at real-time (interactive) rates. For example, video mail and video conferencing.

Both flavors of Cell use the same basic technique of representing each 4x4 pixel block with a 16-bit bitmask and two 8-bit vector quantized codebook indices. This produces a compression of 12-1 (or 8-1) since each 16 pixel block is represented by 32 bits (16-bit mask, and two 8-bit codebook indices). In both flavors, further compression is accomplished by checking current blocks against the spatially equivalent block in the previous frame. If the new block is "close enough" to the old block, the block is coded as a skip code. Consecutive skip codes are run-length encoded for further compression. Modifying the definition of close enough allows one to trade off quality and compression rates. Both versions of Cell typically compress video images down to about 0.75 to 0.5 bits/pixel.

Both flavors have many similar steps in the early part of compression. For each 4x4 block, the compressor calculates the average luma of the 16 pixels. It then partitions the pixels into two groups, those whose luma is above the average and those whose luma is below the average. The compressor sets the 16-bit bitmask based on which pixels are in each partition. The compressor then calculates a color to represent each partition.

In Cell, the compressor calculates an average color of each partition, then does a vector quantization against the Cell codebook (which is just a color-map). The encoded block is the 16-bit mask and the two 8-bit colormap indices. The compressor maintains statistics about how much error each codebook entry is responsible for and how many times each codebook entry is used. It uses these numbers to adaptively refine the codebook on each frame. Changed codebooks are sent in the bitstream.

In CellB, the compressor calculates the average luma for each partition and the average chroma for the entire block. This gives two colors [Y_lo, Cb_ave, Cr_ave] and [Y_hi, Cb_ave, Cr_ave]. The pair [Y_lo, Y_hi] is vector quantized against the Y/Y codebook and the pair [Cb_ave, Cr_ave] is vector quantized against the Cr/Cb codebook. Here the encoded block is the 16-bit mask and the two 8-bit VQ indices. Both of CellB's codebooks are fixed. This allows both the compressor and decompressor to run at high-speed by using table lookups. Both codebooks are designed with the human visual system in mind. They are not just uniform partition of the Y/Y or Cr/Cb space. Each codebook has fewer than 256 entries.
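A sketch of that shared first step for one 4x4 block follows (this is a hypothetical helper, not Sun's code; `block` is assumed to be a list of 16 luma values in raster order, and the codebook lookups are omitted):

  # Partition a 4x4 block around its average luma and build the 16-bit mask.
  def partition_block(block):
      avg = sum(block) / 16.0
      mask, hi, lo = 0, [], []
      for i, y in enumerate(block):
          if y > avg:
              mask |= 1 << i          # bit set: pixel is in the "above average" group
              hi.append(y)
          else:
              lo.append(y)
      y_hi = sum(hi) / len(hi) if hi else avg
      y_lo = sum(lo) / len(lo) if lo else avg
      # CellB would now vector-quantize (y_lo, y_hi) against the fixed Y/Y codebook
      # and the block's average (Cb, Cr) against the Cr/Cb codebook.
      return mask, y_lo, y_hi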

Cell (or CellA) is supported in XIL 1.0 from SMI. It is part of Solaris 2.2. CellB is supported in XIL 1.1 from SMI. It will be part of Solaris 2.3 when that becomes available. Complete bitstream definitions for both flavors of cell are in the XIL 1.1 programmer's guide. There is some discussion of the CellA bitstream in the XIL 1.0 programmer's guide.

CellB was used for the SMI Scott McNealy holiday broadcast, where he talked to the company in real time over the Sun Wide Area Network. This broadcast reached from Tokyo, Japan, to Munich, Germany, with over 3000 known viewers.

CIF
Common Image Format. The standardization of the structure of the samples that represent the picture information of a single frame in digital HDTV, independent of frame rate and sync/blank structure.
The uncompressed bit rate for transmitting CIF at 29.97 frames/sec is 36.45 Mbit/sec.
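As a rough check of that figure (assuming the 352 x 288 luma raster of H.261 CIF, 4:2:0 chroma giving 1.5 samples per pixel, and 8 bits per sample):

 352 * 288 luma samples * 1.5 (4:2:0 chroma) * 8 bits * 29.97 frames/sec = about 36.46 Mbit/sec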

DPCM (Differential Pulse Code Modulation)
Differential pulse code modulation (DPCM) is a source coding scheme that was developed for encoding sources with memory.
The reason for using the DPCM structure is that for most sources of practical interest, the variance of the prediction error is substantially smaller than that of the source.
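A minimal sketch of the idea, assuming the simplest possible predictor (the previously reconstructed sample) and a uniform quantizer; practical DPCM systems use more elaborate predictors:

  # Minimal DPCM sketch: predict each sample from the previous reconstructed
  # sample, quantize the prediction error, and transmit only the error codes.
  def dpcm_encode(samples, step=4):
      prev, codes = 0, []
      for s in samples:
          e = s - prev                     # prediction error (small variance)
          q = int(round(e / step))         # quantized error index to transmit
          codes.append(q)
          prev = prev + q * step           # track the decoder's reconstruction
      return codes

  def dpcm_decode(codes, step=4):
      prev, out = 0, []
      for q in codes:
          prev = prev + q * step
          out.append(prev)
      return out

  # dpcm_decode(dpcm_encode([100, 102, 105, 104])) reproduces the input to
  # within one quantizer step.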

DVI (Digital Video Interactive)
Digital Video Interactive (DVI) technology brings television to the microcomputer. DVI's concept is simple: information is digitized and stored on a random-access device such as a hard disk or a CD-ROM, and is accessed by a computer. DVI requires extensive compression and real-time decompression of images. Until recently this capability was missing. DVI enables new applications. For example, a DVI CD-ROM disk on twentieth-century artists might consist of 20 minutes of motion video; 1,000 high-res still images, each with a minute of audio; and 50,000 pages of text. DVI uses the YUV system, which is also used by the European PAL color television system. The Y channel encodes luminance and the U and V channels encode chrominance. For DVI, we subsample 4-to-1 both vertically and horizontally in U and V, so that each of these components requires only 1/16 the information of the Y component. This provides a compression from the 24-bit RGB space of the original to 9-bit YUV space.
The DVI concept originated in 1983 in the inventive environment of the David Sarnoff Research Center in Princeton, New Jersey, then also known as RCA Laboratories. The ongoing research and development of television since the early days of the Laboratories was extending into the digital domain, with work on digital tuners, and digital image processing algorithms that could be reduced to cost-effective hardware for mass-market consumer television.

EACEM
European Association of Consumer Electronics Manufacturers
EDTV
Extended [or Enhanced] Definition Television. A television system that offers picture quality substantially improved over conventional 525-line or 625-line receivers, by employing techniques at the transmitter and at the receiver that are transparent to (and cause no visible quality degradation to) existing 525-line or 625-line receivers. One example of EDTV is the improved separation of luminance and colour components by pre-combing the signals prior to transmission, using techniques that have been suggested by Faroudja, Central Dynamics and Dr William Glenn.
Entropy
Entropy, the average amount of information represented by a symbol in a message, is a function of the model used to produce that message and can be reduced by increasing the complexity of the model so that it better reflects the actual distribution of source symbols in the original message.
Entropy is a measure of the information contained in a message; it is the lower bound for lossless compression.
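For instance, the first-order entropy of a message under a simple independent-symbol model can be computed directly (a sketch; richer models give lower entropy and therefore tighter bounds):

  import math
  from collections import Counter

  # First-order entropy in bits per symbol, using the symbol frequencies of the
  # message itself as the model.
  def entropy_bits_per_symbol(message):
      counts = Counter(message)
      n = len(message)
      return -sum((c / n) * math.log2(c / n) for c in counts.values())

  # entropy_bits_per_symbol("aaaaaaab") is about 0.54, so no lossless coder
  # using this model can average fewer than ~0.54 bits per symbol.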

ESAC
Economics and Statistics Advisory Committee
ESPRIT
European Strategic Programme for Research and Development in Information Technology
ETSI
European Telecommunications Standards Institute
FFT
Fast Fourier Transform
Gateway
Gateways provide functional bridges between networks by receiving protocol transactions on a layer-by-layer basis from one protocol (SNA) and transforming them into comparable functions for the other protocol (OSI). In short, the gateway provides a connection with protocol translation between networks that use different protocols. Interestingly enough, gateways, unlike the bridge, do not require that the networks have consistent addressing schemes and packet frame sizes. Most proprietary gateways (such as IBM SNA gateways) provide protocol converter functions up through layer six of the OSI, while OSI gateways perform protocol translations up through OSI layer seven.
H.261
Recognizing the need for providing ubiquitous video services using the Integrated Services Digital Network (ISDN), CCITT (International Telegraph and Telephone Consultative Committee) Study Group XV established a Specialist Group on Coding for Visual Telephony in 1984 with the objective of recommending a video coding standard for transmission at m x 384 kbit/s (m=1,2,..., 5). Later in the study period after new discoveries in video coding techniques, it became clear that a single standard, p x 64 kbit/s (p = 1,2,..., 30), can cover the entire ISDN channel capacity. After more than five years of intensive deliberation, CCITT Recommendation H.261, Video Codec for Audiovisual Services at p x 64 kbit/s, was completed and approved in December 1990. A slightly modified version of this Recommendation was also adopted for use in North America.
The intended applications of this international standard are for videophone and videoconferencing. Therefore, the recommended video coding algorithm has to be able to operate in real time with minimum delay. For p = 1 or 2, due to severely limited available bit rate, only desktop face-to-face visual communication (often referred to as videophone) is appropriate. For p>=6, due to the additional available bit rate, more complex pictures can be transmitted with better quality. This is, therefore, more suitable for videoconferencing.

HDTV
High-Definition Television. A television system with approximately twice the horizontal and twice the vertical resolution of current 525-line and 625-line systems, component colour coding (e.g. RGB or YCbCr), a picture aspect ratio of 16:9, and a frame rate of at least 24 Hz. Currently there are a number of proposed HDTV standards, including HD-MAC, HiVision and others.
Hybrid Coder
In the archetypal hybrid coder, an estimate of the next frame to be processed is formed from the current frame and the difference is then encoded by some purely intraframe mechanism. In recent years, the most attention has been paid to the motion-compensated DCT coder where the estimate is formed by a two-dimensional warp of the previous frame and the difference is encoded using a block transform (the Discrete Cosine Transform).
This system is the basis for international standards for videotelephony, is used for some HDTV demonstrations, and is the prototype from which MPEG was designed. Its utility has been demonstrated for video sequences: the motion-compensated prediction removes much of the temporal redundancy, and the DCT concentrates the remaining energy into a small number of transform coefficients that can be quantized and compactly represented.

The key feature of this coder is the presence of a complete decoder within it. The difference between the current frame as represented at the receiver and the incoming frame is processed. In the basic design, therefore, the receiver must track the transmitter precisely: the decoder at the receiver and the decoder at the transmitter must match. The system is sensitive to channel errors and does not permit random access. However, it is on the order of three to four times as efficient as one that uses no prediction.

In practice, this coder is modified to suit specific applications. The standard telephony model uses a forced update of the decoded frame so that channel errors do not propagate. When a participant enters the conversation late or alternates between image sources, residual errors die out and a clear image is obtained after a few frames. Similar techniques are used in versions of this coder being developed for direct satellite television broadcasting.

Huffman Coding
For a given character distribution, by assigning short codes to frequently occurring characters and longer codes to infrequently occurring characters, Huffman's minimum redundancy encoding minimizes the average number of bytes required to represent the characters in a text.
Static Huffman encoding uses a fixed set of codes, based on a representative sample of data, for processing texts. Although encoding is achieved in a single pass, the data on which the compression is based may bear little resemblance to the actual text being compressed.

Dynamic Huffman encoding, on the other hand, reads each text twice; once to determine the frequency distribution of the characters in the text and once to encode the data. The codes used for compression are computed on the basis of the statistics gathered during the first pass with compressed texts being prefixed by a copy of the Huffman encoding table for use with the decoding process.

By using a single-pass technique, where each character is encoded on the basis of the preceding characters in a text, Gallager's adaptive Huffman encoding avoids many of the problems associated with either the static or dynamic method.
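A minimal static Huffman code builder, for illustration (here the frequencies are taken from the text itself, as in the two-pass scheme above; edge cases such as a one-symbol text are ignored):

  import heapq
  from collections import Counter

  # Build a Huffman code by repeatedly merging the two least frequent subtrees;
  # symbols merged earlier (the rarer ones) end up with longer codes.
  def huffman_codes(text):
      counts = Counter(text)
      heap = [[w, i, [(sym, "")]] for i, (sym, w) in enumerate(counts.items())]
      heapq.heapify(heap)
      tiebreak = len(heap)
      while len(heap) > 1:
          lo = heapq.heappop(heap)
          hi = heapq.heappop(heap)
          merged = [(s, "0" + c) for s, c in lo[2]] + [(s, "1" + c) for s, c in hi[2]]
          heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
          tiebreak += 1
      return dict(heap[0][2])

  # huffman_codes("aaaabbc") -> {'c': '00', 'b': '01', 'a': '1'}
  # (one valid assignment: the frequent 'a' gets a 1-bit code)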

IDTV
Improved Definition Television. A television system that offers picture quality substantially improved over conventional receivers, for signals originated in standard 525-line or 625-line format, by processing that involves the use of field store and/or frame store (memory) techniques at the receiver. One example is the use of field or frame memory to implement de-interlacing at the receiver in order to reduce interline twitter compared to that of an interlaced display. IDTV techniques are implemented entirely at the receiver and involve no change to picture origination equipment and no change to emission standards.
IEC
International Electrotechnical Commission. A standardisation body at the same level as ISO.
Interactive videodisc
Interactive video-disc is another video related technology, using an analog approach. It has been available since the early 1980s, and is supplied in the U.S. primarily by Pioneer, Sony, and IBM.
ISDN
ISDN stands for "Integrated Services Digital Networks", and it's a CCITT term for a relatively new telecommunications service package. ISDN is basically the telephone network turned all-digital end to end, using existing switches and wiring (for the most part) upgraded so that the basic call is a 64 kbps end-to-end channel, with bit-diddling as needed (but not when not needed!). Packet and maybe frame modes are thrown in for good measure, too, in some places. It's offered by local telephone companies, but most readily in Australia, France, Japan, and Singapore, with the UK and Germany somewhat behind, and USA availability rather spotty.
A Basic Rate Interface (BRI) is two 64K bearer ("B") channels and a single delta ("D") channel. The B channels are used for voice or data, and the D channel is used for signaling and/or X.25 packet networking. This is the variety most likely to be found in residential service. Another flavor of ISDN is the Primary Rate Interface (PRI). Inside the US, this consists of 24 channels, usually divided into 23 B channels and 1 D channel, and runs over the same physical interface as T1. Outside the US, PRI has 31 user channels, usually divided into 30 B channels and 1 D channel. It is typically used for connections such as one between a PBX and a CO or IXC.

Letter-box
A television system that limits the recording or transmission of useful picture information to about three-quarters of the available vertical picture height of the distribution format (e.g. 525-line) in order to offer program material that has a wide picture aspect ratio.
Luma (Y)
Video originates with linear-light (tristimulus) RGB primary components, conventionally contained in the range 0 (black) to +1 (white). From the RGB triple, three gamma-corrected primary signals are computed; each is essentially the 0.45-power of the corresponding tristimulus value, similar to a square-root function.
In a practical system such as a television camera, however, in order to minimize noise in the dark regions of the picture it is necessary to limit the slope (gain) of the curve near black. It is now standard to limit gain to 4.5 below a tristimulus value of +0.018, and to stretch the remainder of the curve to place the Y-intercept at -0.099 in order to maintain function and tangent continuity at the breakpoint:

 Rgamma = 1.099 * pow(R, 0.45) - 0.099
 Ggamma = 1.099 * pow(G, 0.45) - 0.099
 Bgamma = 1.099 * pow(B, 0.45) - 0.099
Luma is then computed as a weighted sum of the gamma-corrected primaries:
 Y = 0.299*Rgamma + 0.587*Ggamma + 0.114*Bgamma
The three coefficients in this equation correspond to the sensitivity of human vision to each of the RGB primaries standardized for video. For example, the low value of the blue coefficient is a consequence of saturated blue colours being perceived as having low brightness.
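In code, the transfer function and luma weighting above can be transcribed directly (a sketch using the 601 luma coefficients given here; as noted below, HDTV uses different coefficients):

  # Transcription of the gamma-correction curve and luma weighting given above.
  def gamma_correct(v):
      # v is a linear-light tristimulus value in the range 0..1
      return 4.5 * v if v < 0.018 else 1.099 * (v ** 0.45) - 0.099

  def luma(r, g, b):
      return 0.299 * gamma_correct(r) + 0.587 * gamma_correct(g) + 0.114 * gamma_correct(b)

  # luma(1, 1, 1) -> 1.0 (white); luma(0, 0, 1) -> 0.114 (saturated blue, dim)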
The luma coefficients are also a function of the white point (or chromaticity of reference white). Computer users commonly have a white point with a colour temperature in the range of 9300 K, which contains twice as much blue as the daylight reference CIE D65 used in television. This is reflected in pictures and monitors that look too blue.

Although television primaries have changed over the years since the adoption of the NTSC standard in 1953, the coefficients of the luma equation for 525 and 625 line video have remained unchanged. For HDTV, the primaries are different and the luma coefficients have been standardized with somewhat different values.


Lempel-Ziv Welch (LZW) Compression
Algorithm used by the Unix compress command to reduce the size of files, e.g. for archival or transmission. The algorithm relies on repetition of byte sequences (strings) in its input. It maintains a table mapping input strings to their associated output codes. The table initially contains mappings for all possible strings of length one. Input is taken one byte at a time to find the longest initial string present in the table. The code for that string is output and then the string is extended with one more input byte, b. A new entry is added to the table mapping the extended string to the next unused code (obtained by incrementing a counter). The process repeats, starting from byte b. The number of bits in an output code, and hence the maximum number of entries in the table, is usually fixed and once this limit is reached, no more entries are added.
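A minimal sketch of the compressor side follows (illustrative only; the real compress utility also caps the output code width, as mentioned above, and packs the codes into a bit stream):

  # Minimal LZW compressor: emit the table code for the longest match seen so
  # far, then add the match extended by one byte as a new table entry.
  def lzw_compress(data: bytes):
      table = {bytes([i]): i for i in range(256)}   # all length-one strings
      next_code = 256
      out, s = [], b""
      for byte in data:
          candidate = s + bytes([byte])
          if candidate in table:
              s = candidate                          # keep extending the match
          else:
              out.append(table[s])                   # emit code for longest match
              table[candidate] = next_code           # add the extended string
              next_code += 1
              s = bytes([byte])                      # restart from this byte
      if s:
          out.append(table[s])
      return out

  # Repetitive input compresses well:
  # lzw_compress(b"abababab") -> [97, 98, 256, 258, 98]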
Model-Based Coder
Communicating a higher-level model of the image than pixels is an active area of research. The idea is to have the transmitter and receiver agree on the basic model for the image; the transmitter then sends parameters to manipulate this model in lieu of picture elements themselves. Model-based decoders are similar to computer graphics rendering programs.
The model-based coder trades generality for extreme efficiency in its restricted domain. Better rendering and extending of the domain are research themes.

Modem (Modulator/demodulator)
An electronic device for converting between serial data (typically RS-232) from a computer and an audio signal suitable for transmission over telephone lines. The audio signal is usually composed of silence (no data) or one of two frequencies representing 0 and 1. Modems are distinguished primarily by the baud rates they support which can range from 75 baud up to 19200 and beyond.
Data to the computer is sometimes at a lower rate than data from the computer, on the assumption that the user cannot type more than a few characters per second. Various data compression and error correction algorithms are required to support the highest speeds. Other optional features are auto-dial (auto-call) and auto-answer, which allow the computer to initiate and accept calls without human intervention.

NAB
National Association of Broadcasters
NHK
Nippon Hoso Kyokai, the principal Japanese broadcaster
NTSC (National Television System Committee)
USA video standard with image format 4:3, 525 lines, 60 Hz and 4 MHz video bandwidth with a total 6 MHz of video channel width. NTSC uses YIQ.
OSI
The Open Systems Interconnection Reference Model was formally initiated by the International Organization for Standardization (ISO) in March, 1977, in response to the international need for an open set of communications standards. OSI's objectives are:
To provide an architectural reference point for developing standardized procedures
To allow inter-networking between networks of the same type
To serve as a common framework for the development of services and protocols consistent with the OSI model
To expedite the offering of interoperable, multi-vendor products and services
The model is similar in structure to that of SNA. It consists of seven architectural layers: the physical layer; the data link layer, the network layer; the transport layer; the session layer; the presentation layer; the application layer.
The physical and data link layers provide the same functions as their SNA counterparts (physical control and data link control layers). The network layer selects routing services, segments blocks and messages, and provides error detection, recovery, and notification.

The transport layer controls point-to-point information interchange, data packet size determination and transfer, and the connection/disconnection of session entities.

The session layer serves to organize and synchronize the application process dialog between presentation entities, manage the exchange of data (normal and expedited) during the session, and monitor the establishment/release of transport connections as requested by session entities.

The presentation layer is responsible for the meaningful display of information to application entities.

More specifically, the presentation layer identifies and negotiates the choice of communications transfer syntax and the subsequent data conversion or transformation as required. The application layer affords the interfacing of application processes to system interconnection facilities to assist with information exchange. The application layer is also responsible for the management of application processes including initialization, maintenance and termination of communications, allocation of costs and resources, prevention of deadlocks, and transmission security.

PAL (Phase Alternating Line)
European video standard with image format 4:3, 625 lines, 50 Hz and 4 MHz video bandwidth with a total 8 MHz of video channel width. PAL uses YUV.
QCIF
Quarter Common Source Intermediate Format (1/4 CIF, i.e. 176*144)
The uncompressed bit rate for transmitting QCIF at 29.97 frames/sec is 9.115 Mbit/s.

Region Coding
Region Coding has received attention because of the ease with which it can be decoded and the fact that a coder of this type is used in Intel's Digital Video Interactive system (DVI), the only commercially available system designed expressly for low-cost, low-bandwidth multimedia video. Its operation is relatively simple. The basic design is due to Kunt.
Envision a decoder that can reproduce certain image primitives well. A typical set might consist of rectangular areas of constant color, smooth shaded patches and some textures. The image is analyzed into regions that can be expressed in terms of these primitives. The analysis is usually performed using a tree-structured decomposition where each part of the image is successively divided into smaller regions until a patch that meets either the bandwidth constraints or the quality desired can be fitted. Only the tree description and the parameters for each leaf need then be transmitted. Since the decoder is optimized for the reconstruction of these primitives, it is relatively simple to build.

To account for image data that does not encode easily using the available primitives, actual image data can also be encoded and transmitted, but this is not as efficient as fitting a patch.

This coder can also be combined with prediction (as it is in DVI), and the predicted difference image can then be region coded. A key element in the encoding operation is a region growing step where adjacent image patches that are distinct leaves of the tree are combined into a single patch. This approach has been considered highly asymmetric in that significantly more processing is required for encoding/analysis than for decoding. It is harder to grow a tree than to climb one.

While hardware implementations of the hybrid DCT coder have been built for extremely low bandwidth teleconferencing and for HDTV, there is no hardware for a region coder. However, such an assessment is deceptive since much of the processing used in DVI compression is in the motion predictor, a function common to both methods. In fact, all compression schemes are asymmetric, the difference is a matter of degree rather than one of essentials.

Repeater
Repeaters are transparent devices used to interconnect segments of an extended network with identical protocols and speeds at the physical layer (OSI layer 1). An example of a repeater connection would be the linkage of two carrier sense multiple access/collision detection (CSMA/CD) segments within a network.
Router
Routers connect networks at OSI layer 3. Routers interpret packet contents according to specified protocol sets, serving to connect networks with the same protocols (DECnet to DECnet, TCP/IP (Transmission Control Protocol/Internet Protocol) to TCP/IP). Routers are protocol-dependent; therefore, one router is needed for each protocol used by the network. Routers are also responsible for the determination of the best path for data packets by routing them around failed segments of the network.
SECAM (Séquentiel Couleur à Mémoire)
European video standard with image format 4:3, 625 lines, 50 Hz and 6 MHz video bandwidth with a total 8 MHz of video channel width.
SMPTE
SMPTE is the Society of Motion Picture and Television Engineers. There is an SMPTE time code standard (hr:min:sec:frame) used to identify video frames.
SNA
Systems Network Architecture (SNA) entered the market in 1974 as a hierarchical, single-host network structure. Since then, SNA has developed steadily in two directions. The first direction involved tying together mainframes and unintelligent terminals in a master-to-slave relationship. The second direction transformed the SNA architecture to support a cooperative-processing environment, whereby remote terminals link up with mainframes as well as each other in a peer-to-peer relationship (termed Low Entry Networking (LEN) by IBM). LEN depends on the implementation of two protocols: Logical Unit 6.2, also known as APPC, and Physical Unit 2.1, which affords point-to-point connectivity between peer nodes without requiring host computer control.
The SNA model is concerned with both logical and physical units. Logical units (LUs) serve as points of access by which users can utilize the network. LUs can be viewed as terminals that provide users access to application programs and other services on the network. Physical units (PUs), like LUs, are not defined within the SNA architecture but, instead, are representations of the devices and communication links of the network.

Standard bodies
Every country has a national standards body where experts from industry and universities develop standards for all kinds of engineering problems. Among them are, for instance,
 ANSI   American National Standards Institute           USA
 DIN    Deutsches Institut fuer Normung                 Germany
 BSI    British Standards Institution                   United Kingdom
 AFNOR  Association francaise de normalisation          France
 UNI    Ente Nazionale Italiano di Unificazione         Italy
 NNI    Nederlands Normalisatie-instituut               Netherlands
 SAA    Standards Australia                             Australia
 SANZ   Standards Association of New Zealand            New Zealand
 NSF    Norges Standardiseringsforbund                  Norway
 DS     Dansk Standard                                  Denmark
and about 80 others.
The International Organization for Standardization, ISO, in Geneva is the head organization of all these national standardization bodies. Together with the International Electrotechnical Commission, IEC, ISO concentrates its efforts on harmonizing national standards all over the world. The results of these activities are published as ISO standards. Among them are, for instance, the metric system of units, international stationery sizes, all kinds of bolts and nuts, rules for technical drawings, electrical connectors, security regulations, computer protocols, file formats, bicycle components, ID cards, programming languages, International Standard Book Numbers (ISBN), ... Over 10,000 ISO standards have been published so far, and you surely come into contact each day with a lot of things that conform to ISO standards you have never heard of. By the way, "ISO" is not an acronym for the organization in any language. It's a wordplay based on the English/French initials and the Greek-derived prefix "iso-" meaning "same".

Within ISO, ISO/IEC Joint Technical Committee 1 (JTC1) deals with information technology.

The International Telecommunication Union, ITU, is the United Nations specialized agency dealing with telecommunications. At present there are 164 member countries. One of its bodies is the International Telegraph and Telephone Consultative Committee, CCITT. A Plenary Assembly of the CCITT, which takes place every few years, draws up a list of 'Questions' about possible improvements in international electronic communication. In Study Groups, experts from different countries develop 'Recommendations' which are published after they have been adopted. Especially relevant to computing are the V series of recommendations on modems (e.g. V.32, V.42), the X series on data networks and OSI (e.g. X.25, X.400), the I and Q series that define ISDN, the Z series that defines specification and programming languages (SDL, CHILL), the T series on text communication (teletext, fax, videotext, ODA) and the H series on digital sound and video encoding.

Since 1961, the European Computer Manufacturers Association, ECMA, has been a forum for data processing experts where agreements have been prepared and submitted for standardization to ISO, CCITT and other standards organizations.

Sub band coding
Sub-band coding for images has roots in work done in the 1950s by Bedford and on Mixed Highs image compression done by Kretzmer in 1954. Schreiber and Buckley explored general two channel coding of still pictures where the low spatial frequency channel was coarsely sampled and finely quantized and the high spatial frequency channel was finely sampled and coarsely quantized. More recently, Karlsson and Vetterli have extended this to multiple subbands. Adelson et al. have shown how a recursive subdivision called a pyramid decomposition can be used both for compression and other useful image processing tasks.
A pure sub-band coder performs a set of filtering operations on an image to divide it into spectral components. Usually, the result of the analysis phase is a set of sub-images, each of which represents some region in spatial or spatio-temporal frequency space. For example, in a still image, there might be a small sub-image that represents the low-frequency components of the input picture and that is directly viewable as either a minified or blurred copy of the original. To this are added successively higher spectral bands that contain the edge information necessary to reproduce the sharpness of the original at successively larger scales. As with the DCT coder, to which it is related, much of the image energy is concentrated in the lowest frequency band.

For equal visual quality, each band need not be represented with the same signal-to-noise ratio; this is the basis for sub-band coder compression. In many coders, some bands are eliminated entirely, and others are often compressed with a vector or lattice quantizer. Succeedingly higher frequency bands are more coarsely quantized, analogous to the truncation of the high frequency coefficients of the DCT. A sub-band decomposition can be the intraframe coder in a predictive loop, thus minimizing the basic distinctions between DCT-based hybrid coders and their alternatives.

T1Q1.5
The T1Q1.5 Video Teleconferencing/Video Telephony (VTC/VT) ANSI Subworking Group (SWG) was formed to draft a performance standard for digital video. Important questions were asked relating to the digital video performance characteristics of video teleconferencing/video telephony :
Is it possible to measure motion artifacts with VTC/VT digital transport?
If it can be done by objective measurements, can they be matched to subjective tests?
Is it possible to correlate the objective measurements of analog and digital performance specification?
The VTC/VT Subworking Group's goal is to answer these questions. This has become the first step in the process of constructing the performance standard.
Trellis Coding
Trellis coding is a source coding technique that has resulted in numerous publications and some very effective source codes. Unfortunately, the computational burden of these codes is tremendous and grows exponentially with the encoding rate.
A trellis is a transition diagram for a finite state machine that takes time into account. Populating a trellis means specifying output symbols for each branch; specifying an initial state then yields a set of allowable output sequences.

A trellis coder is defined as follows: given a trellis populated with symbols from an output alphabet and an input sequence x of length n, a trellis coder outputs the sequence of bits corresponding to the allowable output sequence that maximizes the SNR of the encoding of x.

X.25
A standard networking protocol suite approved by the CCITT and ISO. This protocol suite defines standard physical, link, and networking layers (OSI layers 1 through 3). X.25 networks are in use throughout the world.
X.400
The set of CCITT communications standards covering mail services provided by data networks.
YCC (Kodak PhotoCD [tm])
Kodak's PhotoYCC colour space (for PhotoCD) is similar to YCbCr, except that Y is coded with lots of headroom and no footroom, and the scaling of Cb and Cr is different from that of Rec. 601-1 in order to accommodate a wider colour gamut:
 Y_8bit  = (255/1.402) * Y
 C1_8bit = 156 + 111.40 * (Bgamma - Y)
 C2_8bit = 137 + 135.64 * (Rgamma - Y)
The C1 and C2 components are subsequently subsampled by factors of two horizontally and vertically, but that subsampling should be considered a feature of the compression process and not of the colour space.
YCbCr
The international standard CCIR-601-1 specifies eight-bit digital coding for component video, with black at luma code 16 and white at luma code 235, and chroma in eight-bit two's complement form centred on 128 with an excursion of 224. This coding has a slightly smaller excursion for luma than for chroma: luma has 219 risers compared to 224 for Cb and Cr. The notation CbCr distinguishes this set from PbPr where the luma and chroma excursions are identical.
For Rec. 601-1 coding in eight bits per component,

 Y_8b  = 16  + 219 * Y
 Cb_8b = 128 + 112 * (0.5/0.886) * (Bgamma - Y)
 Cr_8b = 128 + 112 * (0.5/0.701) * (Rgamma - Y)
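A direct transcription of these equations (assuming gamma-corrected R, G, B inputs in the range 0..1 and the 601 luma coefficients):

  # Rec. 601-1 eight-bit coding of one pixel, from gamma-corrected R, G, B in 0..1.
  def ycbcr_8bit(r, g, b):
      y = 0.299 * r + 0.587 * g + 0.114 * b
      y8  = 16  + round(219 * y)
      cb8 = 128 + round(112 * (0.5 / 0.886) * (b - y))
      cr8 = 128 + round(112 * (0.5 / 0.701) * (r - y))
      return y8, cb8, cr8

  # ycbcr_8bit(1, 1, 1) -> (235, 128, 128); ycbcr_8bit(0, 0, 0) -> (16, 128, 128)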
Some computer applications place black at luma code 0 and white at luma code 255. In this case, the scaling and offsets above can be changed accordingly, although broadcast-quality video requires the accommodation for headroom and footroom provided in the CCIR-601-1 equations.
CCIR Rec. 601-1 calls for two-to-one horizontal subsampling of Cb and Cr, to achieve 2/3 the data rate of RGB with virtually no perceptible penalty. This is denoted 4:2:2. A few digital video systems have utilized horizontal subsampling by a factor of four, denoted 4:1:1. JPEG and MPEG normally subsample Cb and Cr two-to-one horizontally and also two-to-one vertically, to get 1/2 the data rate of RGB. No standard nomenclature has been adopted to describe vertical subsampling. To get good results using subsampling you should not just drop and replicate pixels, but implement proper decimation and interpolation filters.

YCbCr coding is employed by D-1 component digital video equipment.

YPbPr
If three components are to be conveyed in three separate channels with identical unity excursions, then the Pb and Pr colour difference components are used:


 Pb = (0.5/0.886) * (Bgamma - Y)
 Pr = (0.5/0.701) * (Rgamma - Y)
These scale factors limit the excursion of EACH colour difference component to -0.5 .. +0.5 with respect to unity Y excursion: 0.886 is just unity less the luma coefficient of blue. In the analog domain Y is usually 0 mV (black) to 700 mV (white), and Pb and Pr are usually +- 350 mV.
YPbPr is part of the CCIR Rec. 709 HDTV standard, although different luma coefficients are used, and it is denoted E'Pb and E'Pr with subscript arrangement too complicated to be written here.

YPbPr is employed by component analog video equipment such as M-II and BetaCam; Pb and Pr bandwidth is half that of luma.

YIQ
The U and V signals above must be carried with equal bandwidth, albeit less than that of luma. However, the human visual system has less spatial acuity for magenta-green transitions than it does for red-cyan. Thus, if signals I and Q are formed from a 33 degree rotation of U and V respectively [sic], the Q signal can be more severely filtered than I (to about 600 kHz, compared to about 1.3 MHz) without being perceptible to a viewer at typical TV viewing distance. YIQ is equivalent to YUV with a 33 degree rotation and an axis flip in the UV plane. The first edition of W.K. Pratt "Digital Image Processing", and presumably other authors that follow that bible, has a matrix that erroneously omits the axis flip; the second edition corrects the error.
Since an analog NTSC decoder has no way of knowing whether the encoder was encoding YUV or YIQ, it cannot detect whether the encoder was running at 0 degree or 33 degree phase. In analog usage the terms YUV and YIQ are often used somewhat interchangeably. YIQ was important in the early days of NTSC but most broadcasting equipment now encodes equiband U and V.
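The rotation relating the two sets can be written out explicitly (a sketch using the conventional matrices; U and V here are as defined in the YUV entry below):

  import math

  # I/Q from U/V: a 33 degree rotation (with the axis exchange that gives the
  # conventional I and Q definitions).  Q may then be low-pass filtered more
  # severely than I, as described above.
  def uv_to_iq(u, v):
      s, c = math.sin(math.radians(33)), math.cos(math.radians(33))
      i = -u * s + v * c
      q =  u * c + v * s
      return i, q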

The D-2 composite digital DVTR (and the associated interface standard) conveys NTSC modulated on the YIQ axes in the 525-line version and PAL modulated on the YUV axes in the 625-line version.

YUV
In composite NTSC, PAL or S-video systems, it is necessary to scale (B-Y) and (R-Y) so that the composite NTSC or PAL signal (luma plus modulated chroma) is contained within the range -1/3 to +4/3. These limits reflect the capability of the composite signal recording or transmission channel. The scale factors are obtained by two simultaneous equations involving both B-Y and R-Y, because the limits of the composite excursion are reached at combinations of B-Y and R-Y that are intermediate to primary colours. The scale factors are as follows:
 U = 0.493 * (B - Y)
 V = 0.877 * (R - Y)
U and V components are typically modulated into a chroma component:
 C = U*cos(t) + V*sin(t)
where t represents the ~3.58 MHz NTSC colour sub-carrier. PAL coding is similar, except that the V component switches Phase on Alternate Lines (+-1), and the sub-carrier is at a different frequency, about 4.43 MHz.
It is conventional for an NTSC luma signal in a composite environment (NTSC or S-video) to have 7.5% setup :

 Y_setup = (3/40) + (37/40) * Y
A PAL signal has zero setup.
The two signals Y (or Y_setup) and C can be conveyed separately across an S-video interface, or Y and C can be combined (encoded) into composite NTSC or PAL:

 NTSC = Y_setup + C
 PAL  = Y + C
U and V are only appropriate for composite transmission as 1-wire NTSC or PAL, or 2-wire S-video. The UV scaling (or the IQ set, described below) is incorrect when the signal is conveyed as three separate components. Certain component video equipment has connectors labelled YUV that in fact convey YPbPr signals.
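Putting the pieces of this entry together for a single sample instant (a sketch; it ignores sync, blanking and the band-limiting of the chroma components, and assumes gamma-corrected R, G, B inputs in the range 0..1):

  import math

  # One composite NTSC sample value at time t (seconds), built from the
  # equations above: luma with 7.5% setup plus sub-carrier-modulated chroma.
  def ntsc_composite(r, g, b, t):
      y = 0.299 * r + 0.587 * g + 0.114 * b
      u = 0.493 * (b - y)
      v = 0.877 * (r - y)
      fsc = 3.579545e6                          # NTSC colour sub-carrier, Hz
      phase = 2 * math.pi * fsc * t
      c = u * math.cos(phase) + v * math.sin(phase)
      y_setup = 3 / 40 + (37 / 40) * y          # 7.5% setup
      return y_setup + c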


--------------------------------------------------------------------------------

The following is a list of persons whose material contributed to the creation of this list: Andrew Davidson, Tom Lane, Charles A. Poynton, Lee Westover, and many, many others...


--------------------------------------------------------------------------------
You might also want to look at the following documents :
comp.dsp.faq
MPEG-2 FAQ
What is MPEG-2 ?

At a meeting hosted in New York by Columbia University, the Moving Picture Experts Group (MPEG) completed definition of MPEG-2 Video, MPEG-2 Audio, and MPEG-2 Systems. MPEG therefore confirmed that it is on schedule to produce, by November 1993, Committee Drafts of all three parts of the MPEG-2 Standard, for balloting by its member countries.

To ensure that a harmonized solution to the widest range of applications is achieved, MPEG, an ISO/IEC working group designated ISO/IEC JTC1/SC29/WG11, is working jointly with the ITU-TS Study Group 15 Experts Group for ATM Video Coding. MPEG also collaborates with representatives from other parts of ITU-TS, and from EBU, ITU-RS, SMPTE, and the North American HDTV community.

Why MPEG-2 ?

MPEG-1 was optimized for CD-ROM or applications at about 1.5 Mbit/sec. Video was strictly non-interlaced (i.e. progressive). The international co-operation had worked so well for MPEG-1 that the committee began to address applications at broadcast TV sample rates, using the CCIR-601 recommendation (720 samples/line by 480 lines per frame by 30 frames per second, or about 15.2 million samples/sec including chroma) as the reference.

Unfortunately, today's TV scanning pattern is interlaced. This introduces a duality in block coding: do local redundancy areas (blocks) exist exclusively in a field or a frame... (or a particle or wave) ? The answer of course is that some blocks are one or the other at different times, depending on motion activity.

The additional man-years of experimentation and implementation between MPEG-1 and MPEG-2 improved the method of block-based transform coding.

What are the typical MPEG-2 bitrates and picture quality ?

Here are some examples of typical frame sizes in bits :

                              Picture type
                              I        P        B       Average
 MPEG-1 SIF @ 1.15 Mbit/sec   150,000  50,000   20,000   38,000
 MPEG-2 601 @ 4.00 Mbit/sec   400,000  200,000  80,000  130,000

 Note: parameters assume Test Model encoding, an I frame distance of 15
 (N = 15), and a P frame distance of 3 (M = 3).
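As a quick consistency check (assuming roughly 30 frames/sec), the average frame sizes above multiply out to approximately the quoted channel rates:

 38,000 bits/frame * 30 frames/sec  = about 1.14 Mbit/sec (vs. the 1.15 Mbit/sec MPEG-1 channel)
 130,000 bits/frame * 30 frames/sec = about 3.9 Mbit/sec  (vs. the 4.00 Mbit/sec MPEG-2 channel)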
Of course with scene changes and more advanced encoder models found in any real-world implementation, these numbers can be very different.
When will an MPEG-2 decoder chip be available ?

Several chips will be sampling in late 1993. For reasons of economy and scale in the cable TV application, all are single-chip (not including DRAM and host CPU/controller) implementations. They are:

 SGS-Thomson STi-3500: first MPEG-2 chip on the market; multi-tap binary
   horizontal sample rate converter; pan & scanning support for 16:9; requires
   an external, dedicated microcontroller (8 bit); 8-bit data bus, no serial
   data bus.
 LSI Logic L64112 successor (pin compatible): serial bus, 15 Mbit coded
   throughput; a smaller pin-count version is due soon.
 C-Cube CL-950 successor (?)
In 1994, we can look forward to:
 Pioneer: single-chip MPEG-2 successor to the CD-1100 MPEG-1 chip set.
 IBM: single-chip decoder.
Where will we see MPEG in everyday life ?

Just about wherever you see video today.

DBS (Direct Broadcast Satellite). The Hughes/USSB DBS service will use MPEG-2 video and audio. Thomson has exclusive rights to manufacture the decoding boxes for the first 18 months of operation. No doubt Thomson's STi-3500 MPEG-2 video decoder chip will be featured.
Hughes/USSB DBS will begin service in North America in April 1994. Two satellites at 101 degrees West will share the power requirements of 120 Watts per 27 MHz transponder. Multi-source channel rate control methods will be employed to optimally allocate bits between several programs on one data carrier. An average of 150 channels are planned.
CATV (Cable Television). Despite conflicting options, the cable industry has more or less settled on MPEG-2 video. Audio is less than settled. For example, General Instruments (the largest U.S. consumer cable set-top box manufacturer) have announced the planned use of the Dolby AC-3 audio algorithm.
The General Instruments DigiCipher I video syntax is similar to MPEG-2 syntax but uses smaller macroblock predictions and no B-frames. The DigiCipher II specification will include modes to support both the GI and full MPEG-2 Video Main Profile syntax. Services such as HBO will upgrade to DigiCipher II in 1994.
HDTV. The U.S. Grand Alliance, a consortium of companies that formerly competed for the U.S. terrestrial HDTV standard, have already agreed to use the MPEG-2 Video and Systems syntax---including B-pictures. Both interlaced (1440 x 960 x 30 Hz) and progressive (1280 x 720 x 60 Hz) modes will be supported. The Alliance must then settle upon a modulation (QAM, VSB, OFDM), convolution (MS or Viterbi), and error correction (RSPC, RSFC) specification.
In September 1993, a consortium of 85 European companies signed an agreement to fund a project known as Digital Video Broadcasting (DVB), which will develop a standard for cable and terrestrial transmission by the end of 1994. The scheme will use MPEG-2. This consortium has put the final nail in the coffin of the D-MAC scheme for gradual migration towards an all-digital, HDTV consumer transmission standard. The only remaining analog or digital-analog hybrid system left in the world is NHK's MUSE (which will probably be axed in a few years).
What did MPEG-2 add to MPEG-1 in terms of syntax/algorithm ?

Here is a brief summary:

Sequence layer:
More aspect ratios. A minor, yet necessary part of the syntax.

Horizontal and vertical dimensions are now required to be a multiple of 16 in frame coded pictures, and the vertical dimension must be a multiple of 32 in field coded pictures.

4:2:2 and 4:4:4 macroblocks were added in the Next profiles.
Syntax can now signal frame sizes as large as 16383 x 16383.

Syntax signals source video type (NTSC, PAL, SECAM, MAC, component) to help post-processing and display.

Source video color primaries (709, 170M, 240M, D65, etc.) and opto-electronic transfer characteristics (709, 624-4M, 170M etc.) can be indicated.

Four scalable modes [see scalable section below]


Picture layer:
All MPEG-2 motion vectors are half-pel accuracy.

DC precision can be user-selected as 8, 9, 10, or 11 bits.

Concealment motion vectors were added to I-pictures in order to increase robustness from bit errors since I pictures are the most critical and sensitive in a group of pictures.

A non-linear macroblock quantization factor that provides a wider step size range (0.5 to 56) than in MPEG-1 (1 to 32).

New Intra-VLC table for dct_next_coefficient (AC run-level events) that is more geared towards I-frame probability distribution. EOB is 4 bits. The old tables are still included.

Alternate scanning pattern that (supposedly) improves entropy coding performance over the original Zig-Zag scan used in H.261, JPEG, and MPEG-1. The extra scanning pattern is geared towards interlaced video.

Syntax to signal the 3:2 pulldown process (repeat_first_field flag)

Syntax flag to signal chrominance post processing type (4:2:0 to 4:2:2 upsampling conversion)

Progressive and interlaced frame coding

Syntax to signal source composite video characteristics useful in post-processing operations. (v-axis, field sequence, sub_carrier, phase, burst_amplitude, etc.)

Pan & scanning syntax that tells decoder how to, for example, window a 4:3 image within a wider 16:9 aspect ratio image. Vertical pan offset has 1/16th pixel accuracy.
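
To make the pan & scan numbers concrete, here is a tiny, purely illustrative calculation (the specific offsets are made up, not taken from the standard): a 4:3 window inside a 720-sample line carrying a 16:9 picture is 540 samples wide, and a pan offset can be expressed in the 1/16th-sample units the syntax provides.

  # Illustrative pan & scan arithmetic (hypothetical numbers).
  full_width = 720                              # samples per 16:9 coded line
  window = full_width * (4 / 3) / (16 / 9)      # width of the 4:3 window: 540 samples
  pan_offset_samples = 12.25                    # made-up pan position
  pan_offset_sixteenths = round(pan_offset_samples * 16)
  print(window, pan_offset_sixteenths)          # 540.0 196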


Macroblock layer:
Macroblock stuffing is now illegal in MPEG-2 (hurray!!)

Two line modes (interlaced and progressive) for DCT operation.

Now only one run-level escape code (24 bits) instead of the single (20-bit) and double (28-bit) escapes in MPEG-1.

Improved mismatch control in quantization over the original oddification method in MPEG-1. Now specifies adding or subtracting one to the 63rd AC coefficient depending on parity of summed quantized coefficients.

Many additional prediction modes (16x8 MC, field MC, Dual Prime) and, correspondingly, macroblock modes.

Overall, MPEG-2's greatest compression improvements over MPEG-1 are: prediction modes, Intra VLC table, DC precision, non-linear macroblock quant. Implementation improvements, well,.. uh... macroblock stuffing was eliminated.
What are the scalable modes of MPEG-2 ?

Scalable video is permitted only in the Main+ and Next profiles. Currently, there are four scalable modes in the MPEG-2 toolkit. These modes break MPEG-2 video into different layers (base, middle, and high layers), mostly for purposes of prioritizing video data. For example, the high priority channel (bitstream) can be coded with a combination of extra error correction information and a decreased bit error rate (i.e. higher carrier-to-noise ratio or signal strength) than the lower priority channel.

Another purpose of scalability is complexity division. For example, in HDTV, the high priority bitstream (720 x 480) can be decoded under noise conditions where the lower priority (1440 x 960) cannot. This is "graceful" degradation. By the same division, however, a standard TV set need only decode the 720 x 480 channel, thus requiring a less expensive decoder than a TV set wishing to display 1440 x 960. This is simulcasting.

A brief summary of the MPEG-2 video scalability modes:

Spatial Scalability
Useful in simulcasting, and for feasible software decoding of the lower resolution, base layer. This spatial domain method codes a base layer at lower sampling dimensions (i.e. "resolution") than the upper layers. The upsampled reconstructed lower (base) layers are then used as prediction for the higher layers.

Data Partitioning
Similar to JPEG's frequency progressive mode, only the slice layer indicates the maximum number of block transform coefficients contained in the particular bitstream (known as the "priority break point"). Data partitioning is a frequency domain method that breaks the block of 64 quantized transform coefficients into two bitstreams. The first, higher priority bitstream contains the more critical lower frequency coefficients and side information (such as DC values and motion vectors). The second, lower priority bitstream carries higher frequency AC data.

SNR Scalability
Similar to the point transform in JPEG, SNR scalability is a spatial domain method where channels are coded at identical sample rates, but with differing picture quality (through quantization step sizes). The higher priority bitstream contains base layer data that can be added to a lower priority refinement layer to construct a higher quality picture.

Temporal Scalability
A temporal domain method useful in, e.g., stereoscopic video. The first, higher priority bitstream codes video at a lower frame rate, and the intermediate frames can be coded in a second bitstream using the first bitstream's reconstruction as prediction. In stereoscopic vision, for example, the left video channel can be predicted from the right channel.

Other scalability modes were experimented with in MPEG-2 video (such as Frequency Scalability), but were eventually dropped in favor of methods that demonstrated similar quality and greater simplicity.
What is the TM rate control and adaptive quantization technique ?

The Test Model was not by any stretch of the imagination meant to be the show-stopping, best set of algorithms. It was designed to exercise the syntax, verify proposals, and test the *relative* performance of proposals in a way that could be duplicated by co-experimenters in a timely fashion. Otherwise there would be more endless debates about model interpretation than actual time spent in verification. [The MPEG-2 Test Model is frozen as v5b.] The MPEG-2 Test Model (TM) rate control method offers a dramatic improvement over the Simulation Model (SM) method used for MPEG-1. TM's improvements are due to more sophisticated pre-analysis and post-analysis routines. Rate control and adaptive quantization are divided into three steps:

Bit Allocation
In Complexity Estimation, the global complexity measures assign relative weights to each picture type. These weights (Xi, Xp, Xb) are reflected by the typical coded frame size of I, P, and B pictures (see typical frame size section). I pictures are assigned the largest weight since they have the greatest stability factor in an image sequence. B pictures are assigned the smallest weight since B data does not propagate into other frames through the prediction process.

Picture Target Setting allocates target bits for a frame based on the frame type and the remaining number of frames of that same type in the Group of Pictures (GOP).

Rate Control
Rate control attempts to adjust bit allocation if there is significant difference between the target bits (anticipated bits) and actual coded bits for a block of data.

Adaptive Quantization
Recomputes macroblock quantization factor according to activity of block against the normalized activity of the frame.

The effect of this step is to roughly assign a constant number of bits per macroblock (this results in more perceptually uniform picture quality).
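
As a rough, non-normative illustration of the three steps above, here is a small Python sketch in the spirit of the TM-style formulas (the constants Kp and Kb and the example numbers are assumptions for illustration, not the Test Model text):

  # Sketch of TM-style bit allocation and adaptive quantization (illustrative).

  def picture_target(R, Ni, Np, Nb, Xi, Xp, Xb, Kp=1.0, Kb=1.4):
      """Target bits for the next I picture, given the remaining GOP bit
      budget R, remaining picture counts (Ni, Np, Nb) and the global
      complexity measures (Xi, Xp, Xb)."""
      return R / (Ni + Np * Xp / (Xi * Kp) + Nb * Xb / (Xi * Kb))

  def adaptive_mquant(base_q, act, avg_act):
      """Modulate the macroblock quantizer by normalized spatial activity."""
      n_act = (2.0 * act + avg_act) / (act + 2.0 * avg_act)   # roughly 0.5 .. 2.0
      return max(1, min(31, round(base_q * n_act)))

  # Example: 1 I, 4 P and 10 B pictures left in the GOP, 400,000 bits to spend.
  print(picture_target(R=400_000, Ni=1, Np=4, Nb=10, Xi=160_000, Xp=60_000, Xb=42_000))
  print(adaptive_mquant(base_q=8, act=900.0, avg_act=400.0))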

What is MPEG-2 VIDEO ?

MPEG is developing the MPEG-2 Video Standard, which specifies the coded bit stream for high-quality digital video. As a compatible extension, MPEG-2 Video builds on the completed MPEG-1 Video Standard (ISO/IEC IS 11172-2), by supporting interlaced video formats and a number of other advanced features, including features to support HDTV.

As a generic International Standard, MPEG-2 Video is being defined in terms of extensible profiles, each of which will support the features needed by an important class of applications. At the March MPEG meeting in Sydney, the MPEG-2 Main Profile was defined to support digital video transmission in the range of about 2 to 15 Mbits/sec over cable, satellite, and other broadcast channels, as well as for Digital Storage Media (DSM) and other communications applications. Building on this success at the New York meeting, MPEG experts from participating countries in Asia, Australia, Europe, and North America further defined parameters of the Main Profile and Simple Profile suitable for supporting HDTV formats.

MPEG experts also extended the features of the Main Profile by defining a hierarchical/scalable profile. This profile aims to support applications such as compatible terrestrial TV/HDTV, packet-network video systems, backward-compatibility with existing standards (MPEG-1 and H.261), and other applications for which multi-level coding is required. For example, such a system could give the consumer the option of using either a small portable receiver to decode standard definition TV, or a larger fixed receiver to decode HDTV from the same broadcast signal.

The technical definition of MPEG-2 Video has been completed. This was a critical milestone, and shows that MPEG-2 Video is on schedule for a Committee Draft in November 1993.

What are MPEG-2 VIDEO Main Profile and Main Level ?

MPEG-2 Video Main Level is analogous to MPEG-1's CPB, with sampling limits at CCIR-601 parameters (720 x 480 x 30 Hz). Profiles limit syntax (i.e. algorithms), whereas Levels limit parameters (sample rates, frame dimensions, coded bitrates, etc.). Together, Video Main Profile and Main Level (abbreviated as MP@ML) normalize complexity within feasible limits of 1994 VLSI technology (0.5 micron), yet still meet the needs of the majority of application users.

 Level       Max. sampling       Pixels/sec   Max. bitrate   Significance
             dimensions (fps)
 ---------   -----------------   ----------   ------------   --------------------------
 Low         352 x 240 x 30        3.05 M       4 Mb/s       CIF, consumer tape equiv.
 Main        720 x 480 x 30       10.40 M      15 Mb/s       CCIR 601, studio TV
 High 1440   1440 x 1152 x 30     47.00 M      60 Mb/s       4x 601, consumer HDTV
 High        1920 x 1080 x 30     62.70 M      80 Mb/s       production SMPTE 240M std

 Note 1: pixel rate and luminance (Y) sample rate are equivalent.
 Note 2: Low Level is similar to MPEG-1's Constrained Parameters Bitstreams.

 Profile   Comments
 -------   -----------------------------------------------------------
 Simple    Same as Main, only without B-pictures. Intended for software
           applications, perhaps CATV.
 Main      Most decoder chips, CATV, satellite. 95% of users.
 Main+     Main with Spatial and SNR scalability.
 Next      Main+ with 4:2:2 macroblocks.

 Level       Simple     Main             Main+              Next
 ---------   --------   --------------   ----------------   ------------
 High        illegal                     illegal            4:2:2 chroma
 High-1440   illegal                     with spatial       4:2:2 chroma
                                         scalability
 Main                   90% of users     Main with SNR      4:2:2 chroma
                                         scalability
 Low         illegal    Main with SNR    illegal
                        scalability

 [Subject to change at whim of MPEG Requirements sub-group]
At what bitrates is MPEG-2 video optimal ?

The Test subgroup has defined a few examples :

 "Sweet spot" sampling
  dimensions and bit rates for MPEG-2:

 Dimensions        Coded rate   Comments
 ---------------   ----------   -------------------------------------------
 352x480x24 Hz     2 Mbit/sec   Half horizontal 601. Looks almost NTSC
 (progressive)                  broadcast quality, and is a good (better)
                                substitute for VHS. Intended for film src.
 544x480x30 Hz     4 Mbit/sec   PAL broadcast quality (nearly full capture
 (interlaced)                   of 5.4 MHz luminance carrier). Also 4:3
                                image dimensions windowed within 720
                                sample/line 16:9 aspect ratio via pan&scan.
 704x480x30 Hz     6 Mbit/sec   Full CCIR 601 sampling dimensions.
 (interlaced)

 [these numbers subject to change at whim of MPEG Test subgroup]
How does MPEG video really compare to TV, VHS, laserdisc ?

VHS picture quality can be achieved for source film video at about 1 million bits per second (with proprietary encoding methods). It is very difficult to objectively compare MPEG to VHS. The response curve of VHS places -3 dB at around 2 MHz of analog luminance bandwidth (equivalent to 200 samples/line). VHS chroma is considerably less dense in the horizontal direction than MPEG source video (compare 80 samples/ line to 176!). From a sampling density perspective, VHS is superior only in the vertical direction (480 lines compared to 240)... but when taking into account interfield magnetic tape crosstalk and the TV monitor Kell factor, not by all that much. VHS is prone to timing errors (which can be improved with time base correctors), whereas digital video is fully discretized. Pre-recorded VHS is typically recorded at very high duplication speeds (5 to 15 times real time playback), which leads to further shortfalls for the format that has been with us since 1977.

Broadcast NTSC quality can be approximated at about 3 Mbit/sec, and PAL quality at about 4 Mbit/sec. Of course, sports sequences with complex spatial-temporal activity need more like 5 and 6 Mbit/sec, respectively.

Laserdisc is a tough one to compare. Disc is composite video (NTSC or PAL) with up to 425 TVL (or 567 samples/line) response. Thus it could be said laserdisc has 567 x 480 x 30 Hz "resolution". The carrier-to-noise ratio is typically better than 48 dB. Timing is excellent. Yet some of the clean characteristics of laserdisc can be achieved at 1.15 Mbit/sec (SIF rates), especially for those areas of medium detail (low spatial activity) in the presence of uniform motion. This is why some people say MPEG-1 video at 1.15 Mbit/sec looks almost as good as Laserdisc or Super VHS.

Regardless of the above figures, those clever proprietary encoding algorithms can push these bitrates even lower.

Why does film do so well with MPEG ?

Several reasons, really:

The frame rate is 24 Hz (instead of 30 Hz) which is a savings of some 20%.
The film source video is inherently progressive. Hence no fussy interlaced spectral frequencies.
The pre-digital source was severely oversampled (compare 352 x 240 SIF to 35 millimeter film at, say, 3000 x 2000 samples). This can result in a very high quality signal, whereas most video cameras do not oversample, especially in the vertical direction.
Finally, the spatial and temporal modulation transfer function (MTF) characteristics (motion blur, etc.) of film are more amenable to the transform and quantization methods of MPEG.
What are some pre-processing enhancements ?


Adaptive de-interlacing:
This method maps interlaced video from a higher sampling rate (e.g 720 x 480) into a lower rate, progressive format (352 x 240). The most basic algorithm measures the variance between two fields, and if the variance is small enough, uses an average of both fields to form a frame macroblock. Otherwise, a field area from one field (of the same parity) is selected. More clever algorithms are much more complex than this, and may involve median filtering, and multirate/ multidimensional tools.
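
A minimal sketch of the basic field-variance test described above (the threshold is an arbitrary illustrative value; real de-interlacers are far more elaborate):

  def deinterlace_block(field_even, field_odd, threshold=50.0):
      """field_even / field_odd: equal-length lists of co-sited pixel values
      from the two fields. Returns one progressive block of samples."""
      n = len(field_even)
      variance = sum((a - b) ** 2 for a, b in zip(field_even, field_odd)) / n
      if variance < threshold:
          # Little inter-field motion: average the two fields.
          return [(a + b) / 2 for a, b in zip(field_even, field_odd)]
      # Otherwise keep a single field (same parity) to avoid motion artifacts.
      return list(field_even)

  print(deinterlace_block([10, 12, 11, 13], [11, 12, 10, 14]))   # averaged
  print(deinterlace_block([10, 12, 11, 13], [90, 85, 80, 95]))   # one field kept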


Pre-anti-aliasing and Pre-blockiness reduction:
A common method in still image coding is to pre-smooth the image before compression encoding. For example, if pre-analysis of a frame indicates that serious artifacts will arise if the picture were to be coded in the current condition, a pre-anti-aliasing filter can be applied. This can be as simple as having a smoothing severity proportional to the image activity. The pre-filter can be global (same smoothing factor for whole image) or locally adaptive. More complex methods will use multirate/multidimensional tools again.

The basic idea of multidimensional/multirate pre-processing is to supply source video whose resolution (sampling density) is greater than the target coding and reconstruction sample rates. This follows the basic principles of oversampling, as found in A/D converters.
Most detail is contained in the lower harmonics anyway. Sharp cut-off filters are not widely practiced, so the "320 x 480 potential" of VHS is never truly realized.
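
The "smoothing severity proportional to the image activity" idea mentioned above can be sketched very simply; the activity measure and the activity-to-strength mapping below are illustrative assumptions only:

  def adaptive_presmooth(line):
      """1-D three-tap smoother whose strength grows with local activity."""
      out = list(line)
      for i in range(1, len(line) - 1):
          activity = (abs(line[i] - line[i - 1]) + abs(line[i] - line[i + 1])) / 2
          alpha = min(0.5, activity / 100.0)     # 0 = no smoothing, 0.5 = strong
          out[i] = (1 - alpha) * line[i] + alpha * (line[i - 1] + line[i + 1]) / 2
      return out

  print(adaptive_presmooth([20, 22, 21, 120, 23, 22, 20]))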

Why use "advanced" pre-filtering techniques ?

Think of the DCT and quantizer as an A/D converter. Think of the pre-filter as the required anti-alias prefilter found before every A/D. The big difference of course is that the DCT quantizer assigns a varying number of bits per sample (transform coefficient).

Judging from the normalized activity measured in the pre-analysis stage of video encoding, and the target buffer size status, you have a fairly good idea of how many bits can be spared for the target macroblock, for instance.

Other pre-filtering techniques mostly take into account: texture patterns, masking, edges, and motion activity. Many additional advanced techniques can be applied at the different layers of video encoding (picture, slice, macroblock, block, etc.).

What are some advanced encoding methods ?


Quantizer feedback
[Thomson patent]


Horizontal variance

Motion vector cost:
This is true for any syntax element, really. Signalling a macroblock quantization factor or a large motion vector differential can cost more than making up the difference with extra quantized DFD (prediction error) bits. The optimum can be found with, for example, a Lagrangian process. In summary, in any compression system with side information there is an optimum point between signalling overhead (e.g. prediction parameters) and prediction error.
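
A toy sketch of that Lagrangian trade-off: choose the motion vector minimizing D + lambda * R, where R includes the cost of signalling the vector differential. The rate model and lambda below are made-up stand-ins, not any standard's tables:

  def mv_bits(dx, dy):
      # Crude stand-in for a VLC: larger differentials cost more bits.
      return 1 + 2 * (abs(dx) + abs(dy))

  def choose_vector(candidates, predicted_mv, lam=4.0):
      """candidates: list of (dx, dy, sad) where sad is the prediction error."""
      best = None
      for dx, dy, sad in candidates:
          rate = mv_bits(dx - predicted_mv[0], dy - predicted_mv[1])
          cost = sad + lam * rate
          if best is None or cost < best[0]:
              best = (cost, (dx, dy))
      return best[1]

  cands = [(0, 0, 950), (3, 1, 700), (14, -9, 640)]
  print(choose_vector(cands, predicted_mv=(2, 0)))   # (3, 1): the big vector loses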


Liberal Interpretations of the Forward DCT
Borrowing from the concept that the DCT is simply a filter bank, a technique that seems to be gaining popularity is basis vector shaping. Usually this is combined with the quantization stage since the two are tied closely together in a rate-distortion sense. The idea is to use basis vector shaping as a cheap alternative to pre-filtering, by combining the more desirable data-adaptive properties of pre-filtering/pre-processing into the transformation process... yet still reconstruct a picture in the decoder, using the standard IDCT, that looks reasonably like the source. Some more clever schemes will apply windowing. [Warning: watch out for eigenimage/basis vector orthogonality.]


Frequency-domain enhancements:
Enhancements are applied after the DCT (and possibly quantization) stage to the transform coefficients. This borrows from the concept: if you don't like the (quantized) transformed results, simply reshape them into something you do like.


Temporal spreading of quantization error:
This method is similar to the original intent behind color subcarrier phase alternation by field in the NTSC analog TV standard: for stationary areas, noise does not "hang" in one location, but dances about the image over time to give a more uniform effect. Distribution makes it more difficult for the eye to "catch on" to trouble spots (due to the latent temporal response curve of human vision). Simple encoder models tend to do this naturally but will not solve all situations.


Look-ahead and adaptive frame cycle structures:
Scene changes


Post-processing
(Non-linear) interpolation methods (Wu-Gersho), convex hull projections, some ICASSP '93 papers, etc.


Conformance vs. post-processing:
Post-processing makes judging decoder output for conformance testing near impossible.

It is easy to spot encoders that do not employ any advanced encoding techniques: reconstructed video usually contains ringing around edges, color bleeding, and lots of noise.
What is MPEG-2 AUDIO ?

MPEG is developing the MPEG-2 Audio Standard for low bitrate coding of multichannel audio. MPEG-2 Audio coding will supply up to five full bandwidth channels (left, right, center, and two surround channels), plus an additional low frequency enhancement channel, and/or up to seven commentary/multilingual channels. The MPEG-2 Audio Standard will also extend the stereo and mono coding of the MPEG-1 Audio Standard (ISO/IEC IS 11172-3) to half sampling-rates (16 kHz, 22.05 kHz, and 24 kHz), for improved quality for bitrates at or below 64 kbits/s, per channel.

MPEG produced an updated version of the MPEG-2 Audio Working Draft, and is on track for achieving a Committee Draft specification by the November MPEG meeting.

The MPEG-2 Audio multichannel coding Standard will provide backward-compatibility with the existing MPEG-1 Audio Standard (ISO/IEC IS 11172-3). Together with ITU-RS, MPEG is organizing formal subjective testing of the proposed MPEG-2 multichannel audio codecs and up to three non-backward-compatible (NBC) codecs. The NBC codecs are included in order to determine whether an NBC mode should be introduced as an addendum to the standard. If the results show clear evidence that an NBC mode improves the performance, a formal call for NBC proposals will be issued by MPEG, with a view to incorporate these features in the audio syntax.

MPEG-2 audio attempts to maintain as much compatibility with MPEG-1 audio syntax as possible, while adding discrete surround-sound channels to the original MPEG-1 limit of 2 channels (Left, Right or matrix center and difference). The main channels (Left, Right) in MPEG-2 audio will remain backwards compatible, whereas new coding methods and syntax will be used for the surround channels.

A total of 5.1 channels are included, consisting of the two main channels (L, R), two side/rear channels, a center channel, and a 100 Hz special effects channel (hence the ".1" in "5.1").

At this time, non-backwards compatible (NBC) schemes are being considered as an amendment to the MPEG-2 audio standard. One such popular system is Dolby AC-3.

What is MPEG-2 SYSTEMS ?

MPEG is developing the MPEG-2 Systems Standard to specify coding formats for multiplexing audio, video, and other data into a form suitable for transmission or storage. There are two data stream formats defined: the Transport Stream, which can carry multiple programs simultaneously, and which is optimized for use in applications where data loss may be likely, and the Program stream, which is optimized for multimedia applications, for performing systems processing in software, and for MPEG-1 compatibility.

Both streams are designed to support a large number of known and anticipated applications, and they retain a significant amount of flexibility such as may be required for such applications, while providing interoperability between different device implementations. The Transport Stream is well suited for transmission of digital television and video telephony over fiber, satellite, cable, ISDN, ATM, and other networks, and also for storage on digital video tape and other devices. It is expected to find widespread use for such applications in the very near future.

The Program Stream is similar to the MPEG-1 Systems standard (ISO/IEC 11172-1). It includes extensions to support new and future applications. Both the Transport Stream and Program Stream are built on a common Packetized Elementary Stream packet structure, facilitating common video and audio decoder implementations and stream type conversions. This is well-suited for use over a wide variety of networks with ATM/AAL and alternative transports. In New York, MPEG completed definitions of the features, syntax, and semantics of the Transport and Program Streams, enabling product designers to proceed. Among other items, the Transport Stream packet length was fixed at 188 bytes, including the 4-byte header. This length is suited for use with ATM networks, as well as a wide variety of other transmission and storage systems.
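
As a small illustration of that fixed 188-byte packet structure, here is a sketch of parsing the 4-byte Transport Stream header (the field layout follows the MPEG-2 Systems syntax; the test packet bytes are made up):

  def parse_ts_header(packet):
      assert len(packet) == 188 and packet[0] == 0x47, "lost sync"
      b1, b2, b3 = packet[1], packet[2], packet[3]
      return {
          "transport_error":    bool(b1 & 0x80),
          "payload_unit_start": bool(b1 & 0x40),
          "pid":                ((b1 & 0x1F) << 8) | b2,
          "scrambling":         (b3 >> 6) & 0x03,
          "adaptation_field":   (b3 >> 4) & 0x03,
          "continuity_counter": b3 & 0x0F,
      }

  pkt = bytes([0x47, 0x40, 0x11, 0x17]) + bytes(184)   # hypothetical packet
  print(parse_ts_header(pkt))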

What about the Grand Alliance ?

The Grand Alliance was formed in May 1993 by seven organizations (AT&T, GI, MIT, Philips, Sarnoff, Thomson, Zenith) to evaluate technologies and to decide on key elements that will be at the heart of the best of the best HDTV system.

The video compression and transport technologies selected by the Grand Alliance are based on the proposed MPEG-2 standards. The scanning formats selected are focused primarily on computer-friendly progressive scanning, while offering an interlaced mode important to some broadcasters.

They have already agreed to use the MPEG-2 Video and Systems syntax, including B-pictures. Both interlaced (1440 x 960 x 30 Hz) and progressive (1280 x 720 x 60 Hz) modes will be supported. The Alliance must then settle upon a modulation (QAM, VSB, OFDM), convolution (MS or Viterbi), and error correction (RSPC, RSFC) specification.

The audio technology selected is a six-channel, compact-disc-quality digital surround sound system. The last major technical decision, the broadcast and cable transmission subsystem, is expected in early 1994 following testing of competing technologies.
MPEG-1 VIDEO
How does MPEG-1 VIDEO work ?

First off, it starts with a relatively low resolution video sequence (possibly decimated from the original) of about 352 by 240 pixels at 30 frames/s (US--different numbers for Europe), but original high (CD) quality audio. The images are in color, but converted to YUV space, and the two chrominance channels (U and V) are decimated further to 176 by 120 pixels. It turns out that you can get away with a lot less resolution in those channels and not notice it, at least in "natural" (not computer generated) images.

The basic scheme is to predict motion from frame to frame in the temporal direction, and then to use DCT's (discrete cosine transforms) to organize the redundancy in the spatial directions. The DCT's are done on 8x8 blocks, and the motion prediction is done in the luminance (Y) channel on 16x16 blocks. In other words, given the 16x16 block in the current frame that you are trying to code, you look for a close match to that block in a previous or future frame (there are backward prediction modes where later frames are sent first to allow interpolating between frames). The DCT coefficients (of either the actual data, or the difference between this block and the close match) are quantized, which means that you divide them by some value to drop bits off the bottom end. Hopefully, many of the coefficients will then end up being zero. The quantization can change for every "macroblock" (a macroblock is 16x16 of Y and the corresponding 8x8's in both U and V). The results of all of this, which include the DCT coefficients, the motion vectors, and the quantization parameters (and other stuff), are Huffman coded using fixed tables. The DCT coefficients have a special Huffman table that is two-dimensional in that one code specifies a run-length of zeros and the non-zero value that ended the run. Also, the motion vectors and the DC DCT components are DPCM coded (each is subtracted from the previous one).
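
A very small sketch of the block-matching idea just described: find the 16x16 area in the reference frame with the smallest sum of absolute differences (SAD), then code the difference block. The +/- 8 search range and the toy frames are simplifying assumptions:

  import random

  def sad(cur, ref, cx, cy, rx, ry, n=16):
      return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
                 for j in range(n) for i in range(n))

  def best_vector(cur, ref, cx, cy, search=8, n=16):
      best = (None, None)
      for dy in range(-search, search + 1):
          for dx in range(-search, search + 1):
              rx, ry = cx + dx, cy + dy
              if 0 <= rx <= len(ref[0]) - n and 0 <= ry <= len(ref) - n:
                  d = sad(cur, ref, cx, cy, rx, ry, n)
                  if best[0] is None or d < best[0]:
                      best = (d, (dx, dy))
      return best[1]      # the motion vector; the residual block is then DCT coded

  random.seed(1)
  ref = [[random.randrange(256) for _ in range(32)] for _ in range(32)]
  cur = [[ref[y][(x + 3) % 32] for x in range(32)] for y in range(32)]   # shifted copy
  print(best_vector(cur, ref, cx=8, cy=8))    # -> (3, 0)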

There are three types of coded frames. There are I or intra frames. They are simply a frame coded as a still image, not using any past history. You have to start somewhere. Then there are P or predicted frames. They are predicted from the most recently reconstructed I or P frame. (I'm describing this from the point of view of the decompressor.) Each macroblock in a P frame can either come with a vector and difference DCT coefficients for a close match in the last I or P, or it can just be "intra" coded (like in the I frames) if there was no good match.

Lastly, there are B or bidirectional frames. They are predicted from the closest two I or P frames, one in the past and one in the future. You search for matching blocks in those frames, and try three different things to see which works best. (Now I have the point of view of the compressor, just to confuse you.) You try using the forward vector, the backward vector, and you try averaging the two blocks from the future and past frames, and subtracting that from the block being coded. If none of those work well, you can intra- code the block.

The sequence of decoded frames usually goes like:

 IBBPBBPBBPBBIBBPBBPB...
 
Where there are 12 frames from I to I (for US and Japan anyway.) This is based on a random access requirement that you need a starting point at least once every 0.4 seconds or so. The ratio of P's to B's is based on experience. Of course, for the decoder to work, you have to send that first P *before* the first two B's, so the compressed data stream ends up looking like:
 0xx312645...
 
where those are frame numbers. xx might be nothing (if this is the true starting point), or it might be the B's of frames -2 and -1 if we're in the middle of the stream somewhere.
You have to decode the I, then decode the P, keep both of those in memory, and then decode the two B's. You probably display the I while you're decoding the P, and display the B's as you're decoding them, and then display the P as you're decoding the next P, and so on.
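
The display-order to coded-order shuffle can be sketched in a few lines (simplified: every I or P anchor is emitted before the B pictures that reference it, matching the "0xx312645" example above):

  def coded_order(display):
      out, pending_b = [], []
      for idx, ptype in enumerate(display):
          if ptype in "IP":
              out.append(idx)         # send the anchor first...
              out.extend(pending_b)   # ...then the B's that lean on it
              pending_b = []
          else:
              pending_b.append(idx)
      return out + pending_b

  print(coded_order("IBBPBBP"))       # [0, 3, 1, 2, 6, 4, 5]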

What do B-frames buy you ?

Since bi-directional macroblock predictions are an average of two macroblocks, noise is reduced at low bit rates. At nominal MPEG-1 video (352 x 240 x 30, 1.15 Mbit/sec) rates, it is said that B-frames improve SNR by as much as 2 dB (a 0.5 dB gain is usually considered worthwhile in MPEG). However, at higher bit rates, B-frames become less useful since they inherently do not contribute to the progressive refinement of an image sequence (i.e. they are not used as prediction by subsequent coded frames). Regardless, B-frames are still politically controversial.

Why do some people hate B-frames ?

Computational complexity, bandwidth, delay, and picture buffer size are the four B-frame Pet Peeves. Computational complexity is increased since some macroblock modes require averaging between two macroblocks. Worst case, memory bandwidth is increased an extra 16 MByte/s (601 rate) for this extra prediction. An extra picture buffer is needed to store the future prediction reference (bi-directionality). Finally, extra delay is introduced in encoding since the frame used for backwards prediction needs to be transmitted to the decoder before the intermediate B-pictures can be decoded and displayed.

Cable television (e.g. General Instruments) has been particularly averse to B-frames since the extra picture buffer pushes the decoder DRAM memory requirements past the magic 8-Mbit (1 Mbyte) threshold into the realm of 16 Mbits (2 MByte) for CCIR-601 frames (704 x 480), yet not for lowly 352 x 480. However, cable does not realize that DRAM does not come in convenient high-volume (low cost) 8-Mbit packages as 16-Mbit does. In a few years, the cost difference between 16 Mbit and 8 Mbit will become insignificant compared to the gain in compression. For the time being, cable boxes will start with 8-Mbit and allow future drop-in upgrades to 16-Mbit. The early market success of B-frames seems to have been determined by a fire at a Japanese chemical plant.

Can motion vectors be used to measure object velocity ?

Motion vector information cannot be reliably used as a means of determining object velocity unless the encoder model specifically set out to do so. First, encoder models that optimize picture quality form vectors that typically minimize prediction error and, consequently, the vectors often do not represent true object translation. Standards converters that re-sample one frame rate to another (as in NTSC to PAL) use different methods (field coding, edge detection, et al.) that are not concerned with optimizing SNR vs. bitrate. Secondly, motion vectors are not transmitted for all macroblocks anyway.

How do you code interlaced video with MPEG-1 syntax ?

Two methods can be applied to interlaced video that maintain syntactic compatibility with MPEG-1 (which was originally designed for progressive frames only). In the field concatenation method, the encoder model can carefully construct predictions and prediction errors that realize good compression but maintain field integrity (distinction between adjacent fields of opposite parity). Some pre-processing techniques can also be applied to the interlaced source video that would, e.g., lessen sharp vertical frequencies. This technique is not efficient, of course. On the other hand, if the original source was progressive (e.g. film), then it is fairly straightforward to convert the interlaced source to a progressive format before encoding. (MPEG-2 would then only offer superior performance through greater DC block precision, non-linear mquant, intra VLC, etc.) Reconstructed frames are re-interlaced in the decoder Display process.

The second syntactically compatible method codes fields separately. Picture types are keyed to motion activity to aid efficiency of prediction.

Where did they get 352x240 ?

That derives from the CCIR-601 digital television standard which is used by professional digital video equipment. It is (in the US) 720 by 243 by 60 fields (not frames) per second, where the fields are interlaced when displayed. (It is important to note though that fields are actually acquired and displayed a 60th of a second apart.) The chrominance channels are 360 by 243 by 60 fields a second, again interlaced. This degree of chrominance decimation (2:1 in the horizontal direction) is called 4:2:2. The source input format for MPEG I, called SIF, is CCIR-601 decimated by 2:1 in the horizontal direction, 2:1 in the time direction, and an additional 2:1 in the chrominance vertical direction. And some lines are cut off to make sure things divide by 8 or 16 where needed. For 50 Hz display standards (PAL, SECAM) change the number of lines in a field from 243 or 240 to 288, and change the display rate to 50 fields/s or 25 frames/s. Similarly, change the 120 lines in the decimated chrominance channels to 144 lines. Since 288*50 is exactly equal to 240*60, the two formats have the same source data rate.

Can MPEG-1 encode higher sample rates than 352x240x30 ?

Yes. The MPEG-1 syntax permits sampling dimensions as high as 4095 x 4095 x 60 frames per second. The MPEG most people think of as "MPEG-1" is actually a kind of subset known as Constrained Parameters Bitstream (CPB).

What are Constrained Parameters Bitstreams ?

CPB are a limited set of sampling and bitrate parameters designed to normalize computational complexity, buffer size, and memory bandwidth while still addressing the widest possible range of applications. CPB limits video to 396 macroblocks (101,376 pixels) per frame if the frame rate is less than or equal to 25 fps (frames per second), and 330 macroblocks (84,480 pixels) per frame if the frame rate is less than or equal to 30 fps. Therefore, MPEG video is typically coded at SIF dimensions (352 x 240 x 30 fps or 352 x 288 x 25 fps).

The total maximum sampling rate is 3.8 Ms/s (million samples/sec) including chroma. The coded video rate is limited to 1.856 Mbit/sec. In industrial practice, the bitrate is the most often waived parameter of CPB, with rates as high as 6 Mbit/sec in use.
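
A quick, informal check of the macroblock limits quoted above (not a full conformance test):

  def within_cpb(width, height, fps):
      mbs = ((width + 15) // 16) * ((height + 15) // 16)
      limit = 396 if fps <= 25 else 330
      return fps <= 30 and mbs <= limit

  print(within_cpb(352, 288, 25))   # True  (PAL SIF: 396 macroblocks)
  print(within_cpb(352, 240, 30))   # True  (NTSC SIF: 330 macroblocks)
  print(within_cpb(416, 240, 24))   # True  (the case discussed a few answers below)
  print(within_cpb(704, 480, 30))   # False (CCIR-601 sized frames)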

Why are Constrained Parameters Bitstreams so important?

It is an optimum point that allows (just barely) cost effective VLSI implementations in 1992 technology (0.8 microns). It also implies a nominal guarantee of interoperability for decoders and encoders. MPEG devices which are not capable of meeting SIF rates are not canonically considered to be true MPEG.

Are there ways of getting around Constrained Parameters Bitstreams for SIF class applications and decoder ?

Yes, some. Remember that CPB limits frames to 396 macroblocks (as in 352 x 288 SIF frames). 416 x 240 x 24 Hz sampling rates are still within the constraints, but this only aids NTSC (240 lines/field) displays. Deviating from 352 samples/line could throw off many decoder implementations that have limited horizontal sample rate conversion modes. Due to chip die size constraints (most chips barely pack in the necessary features), many decoders use simple doubling, e.g. 352 to 704 samples/line via binary taps which are simple shift-and-add operations. Future MPEG decoders will have arbitrary sample rate converters on-chip. Also remember that the 1.86 Mbit/sec limit is often ignored in real life.

How much does it compress ?

As mentioned before, audio CD data rates are about 1.5 Mbits/s. You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality. (So they say. For the most part it's true, but every once in a while a weird thing might happen that you'll notice. However the effect is very small, and it takes a listener trained to notice these particular types of effects.) That's about 6:1 compression. So, a CD MPEG I stream would have about 1.25 MBits/s left for video. The number I usually see though is 1.15 MBits/s (maybe you need the rest for the system data stream). You can then calculate the video compression ratio from the numbers here to be about 26:1. If you step back and think about that, it's little short of a miracle. Of course, it's lossy compression, but it can be pretty hard sometimes to see the loss, if you're comparing the SIF original to the SIF decompressed. There is, however, a very noticeable loss if you're coming from CCIR-601 and have to decimate to SIF, but that's another matter. I'm not counting that in the 26:1.
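
Spelled out, the arithmetic behind that ~26:1 figure looks like this:

  y = 352 * 240              # luminance samples per SIF frame
  c = 2 * 176 * 120          # the two decimated chrominance channels
  raw = (y + c) * 8 * 30     # 8 bits/sample at 30 frames/s -> ~30.4 Mbit/s
  coded = 1.15e6             # video bits/s left in a 1.5 Mbit/s CD stream
  print(raw / 1e6, raw / coded)   # ~30.4 Mbit/s uncompressed, ~26:1 ratio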

The standard also provides for other bit rates ranging from 32Kbits/s for a single channel, up to 448 Kbits/s for stereo.

MPEG-1 AUDIO
Is the same video compression applied to audio ?

Definitely no. The eye and the ear, even if they are only a few centimeters apart, work very differently. The ear has a much higher dynamic range and resolution. It can pick out more details but it is slower than the eye.

The MPEG committee chose to recommend 3 compression methods and named them Audio Layer I, II and III. Layer I is the simplest, a sub-band coder with a psychoacoustic model (you'll get the details of this stuff further on). Layer II adds more advanced bit allocation techniques and greater accuracy. Layer III adds a hybrid filterbank and non-uniform quantization. Layers I, II and III give increasing quality/compression ratios with increasing complexity and demands on processing power.

The reason for recommending 3 methods was partly that the testers felt that none of the coders was 100% transparent to all material, and partly that the best coder (Layer III) was so computationally heavy that it would seriously impact the acceptance of the standard.

The specs say that a valid Layer III decoder shall be able to decode any Layer I, II or III MPEG Audio stream. A Layer II decoder shall be able to decode Layer I and Layer II streams. I would not worry too much about Layer III. Layer II is where it's happening, and the info in this FAQ is mainly about this coder.

How does MPEG-1 AUDIO work ?

Well, first you need to know how sound is stored in a computer. Sound is pressure differences in air. When picked up by a microphone and fed through an amplifier this becomes voltage levels. The voltage is sampled by the computer a number of times per second. For CD-audio quality you need to sample 44100 times per second and each sample has a resolution of 16 bits. In stereo this gives you about 1.4 Mbit per second, and you can probably see the need for compression.

To compress audio MPEG tries to remove the irrelevant parts of the signal and the redundant parts of the signal. Parts of the sound that we do not hear can be thrown away. To do this MPEG Audio uses psychoacoustic principles.

How good is MPEG-1 AUDIO compression ?

MPEG can compress to a bitstream of 32kbit/s to 384kbit/s (Layer II). A raw PCM audio bitstream is about 705kbit/s so this gives a max. compression ratio of about 22. Normal compression ratio is more like 1:6 or 1:7. If you think that this is not much please remember that unlike video we are talking about no perceivable quality loss here. 96kbit/s is considered transparent for most practical purposes. This means that you will not notice any difference between the original and the compressed signal for rock'n roll or popular music. For more demanding stuff like piano concerts and such you will need to go up to 128kbit/s.

How does MPEG-1 AUDIO achieve this compression ratio ?

Well, with audio you basically have two alternatives. Either you sample less often or you sample with less resolution (less than 16 bits per sample). If you want quality you can't do much with the sample frequency. Humans can hear sounds with frequencies from about 20 Hz to 20 kHz. According to the Nyquist theorem you must sample at least two times the highest frequency you want to reproduce. Allowing for imperfect filters, a 44.1 kHz sampling rate is a fair minimum. So you either set out to prove the Nyquist theorem is wrong or go to work on reducing the resolution. The MPEG committee chose the latter.

Now, the real reason for using 16 bits is to get a good signal-to-noise (s/n) ratio. The noise we're talking about here is quantization noise from the digitizing process. For each bit you add, you get 6 dB better s/n. (To the ear, 6 dB corresponds to a doubling of the sound level.) CD-audio achieves about 90 dB s/n. This matches the dynamic range of the ear fairly well. That is, you will not hear any noise coming from the system itself (well, there are still some people arguing about that, but let's not worry about them for the moment). So what happens when you sample to 8 bit resolution ? You get a very noticeable noise floor in your recording. You can easily hear this in silent moments in the music or between words or sentences if your recording is a human voice. Waitaminnit. You don't notice any noise in loud passages, right? This is the masking effect and is the key to MPEG Audio coding. Stuff like the masking effect belongs to a science called psychoacoustics that deals with the way the human brain perceives sound. And MPEG uses psychoacoustic principles when it does its thing.

Explain the masking effect

Say you have a strong tone with a frequency of 1000 Hz. You also have a tone nearby of, say, 1100 Hz. This second tone is 18 dB lower. You are not going to hear this second tone. It is completely masked by the first 1000 Hz tone. As a matter of fact, any relatively weak sounds near a strong sound are masked. If you introduce another tone at 2000 Hz, also 18 dB below the first 1000 Hz tone, you will hear this. You will have to turn down the 2000 Hz tone to something like 45 dB below the 1000 Hz tone before it will be masked by the first tone. So the further you get from a sound the less masking effect it has. The masking effect means that you can raise the noise floor around a strong sound because the noise will be masked anyway. And raising the noise floor is the same as using less bits, and using less bits is the same as compression.

Let's now try to explain how the MPEG Audio coder goes about its thing. It divides the frequency spectrum (20 Hz to 20 kHz) into 32 sub-bands. Each sub-band holds a little slice of the audio spectrum. Say, in the upper region of sub-band 8, a 1000 Hz tone with a level of 60 dB is present. OK, the coder calculates the masking effect of this sound and finds that there is a masking threshold for the entire 8th sub-band (all sounds w. a frequency...) 35 dB below this tone. The acceptable s/n ratio is thus 60 - 35 = 25 dB. That equals 4 bit resolution. In addition there are masking effects on bands 9-13 and on bands 5-7, the effect decreasing with the distance from band 8. In a real-life situation you have sounds in most bands and the masking effects are additive. In addition the coder considers the sensitivity of the ear for various frequencies. The ear is a lot less sensitive in the high and low frequencies. Peak sensitivity is around 2-4 kHz, the same region that the human voice occupies.
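
The worked example above, redone as arithmetic (the roughly-6-dB-per-bit rule of thumb is the only assumption):

  tone_level_db  = 60.0
  masking_margin = 35.0                              # threshold sits 35 dB below the tone
  required_snr   = tone_level_db - masking_margin    # 25 dB must survive quantization
  bits_needed    = round(required_snr / 6.0)         # ~6 dB per bit -> about 4 bits
  print(required_snr, bits_needed)                   # 25.0 4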

The sub-bands should match the ear, that is, each sub-band should consist of frequencies that have the same psychoacoustic properties. In MPEG Layer II, each sub-band is 625 Hz wide. It would have been better if the sub-bands were narrower in the low frequency range and wider in the high frequency range. To do this you need complex filters. To keep the filters simple they chose to add an FFT in parallel with the filtering and use the spectral components from the FFT as additional information to the coder. This way you get higher resolution in the low frequencies where the ear is more sensitive.

But there is more to it. We have explained concurrent masking, but the masking effect also occurs before and after a strong sound (pre- and postmasking), if there is a significant (30 - 40 dB) shift in level. The reason is believed to be that the brain needs some processing time. Premasking is only about 2 to 5 ms; postmasking can last up to 100 ms. Other bit-reduction techniques involve considering tonal and non-tonal components of the sound. For a stereo signal you have a lot of redundancy between channels. The last step before formatting is Huffman coding.

The coder calculates masking effects by an iterative process until it runs out of time. It is up to the implementor to spend bits in the least obtrusive fashion. For Layer II the coder works on about 24 ms of sound (1152 samples at 48 kHz sampling) at a time. For some material this time-window can be a problem. This is normally in a situation with transients, where there are large differences in sound level over the window. The masking is calculated on the strongest sound and the weak parts will drown in quantization noise. This is perceived as a noise-echo by the ear. Layer III addresses this problem specifically.

What is the hardware demand ?

According to my information, Layer III needs about 20 MIPS per channel for real-time coding. This means a real fast DSP. Layer II on the other hand needs only a simple DSP like, for example, the AD2015, which can be had for a few dollars. The process is asymmetrical: much more processing is needed on the coding side. A decoder could be made to work without hardware assistance on a decent computer.

Who is using MPEG-1 AUDIO?

Philips uses MPEG for their new digital video CD's. They say they will start shipping movies and music videos on CD's for their CD-I player by the end of this year. MPEG is accepted by Eureka-147. That means that when digital radio broadcasts starts in Europe a couple of years from now, you will receive MPEG coded audio.

Which sampling frequencies are used ?

You can have 48 kHz (used in professional sound equipment), 44.1 kHz (used in consumer equipment like CD-audio) or 32 kHz (used in some communications equipment).

How many audio channels?

MPEG I allows for two audio channels. These can be either single (mono), dual (two mono channels), stereo or joint stereo (intensity stereo or m/s stereo). In normal (l/r) stereo one channel carries the left audio signal and one channel carries the right audio signal. In m/s stereo one channel carries the sum signal (l+r) and the other the difference (l-r) signal. In intensity stereo the high frequency part of the signal (above 2 kHz) is combined. The stereo image is preserved but only the temporal envelope is transmitted. In addition MPEG allows for pre-emphasis, copyright marks and original/copy marks. MPEG II allows for several channels in the same stream.

Where can I get more details about MPEG audio ?

There is no description of the coder in the specs. The specs describe the bitstream in great detail and suggest psychoacoustic models.

MPEG-1 SYSTEMS
What about MPEG-1 SYSTEMS ?

The MPEG Systems committee completed and approved for release the technical specification for combining a plurality of coded audio and video streams into a single data stream. The specification provides fully synchronized audio and video and facilitates storage in, and possible further transmission of, the combined information through a variety of digital media.

This systems coding includes necessary and sufficient information in the bit stream to provide the system-level functions of synchronization of decoded audio and video, initial and continuous management of coded data buffers to prevent overflow and underflow, random access start-up, and absolute time identification. The coding layer specifies a multiplex data format that allows multiplexing of multiple simultaneous audio and video streams as well as privately defined data streams.

The basic principle of MPEG System coding is the use of time stamps which specify the decoding and display time of audio and video and the time of reception of the multiplexed coded data at the decoder, all in terms of a single 90kHz system clock. This method allows a great deal of flexibility in such areas as decoder design, the number of streams, multiplex packet lengths, video picture rates, audio sample rates, coded data rates, digital storage medium or network performance. It also provides flexibility in selecting which entity is the master time base, while guaranteeing that synchronization and buffer management are maintained. Variable data rate operation is supported. A reference model of a decoder system is specified which provides limits for the ranges of parameters available to encoders and provides requirements for decoders.
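
A minimal sketch of that single 90 kHz clock: presentation times are just integer counts of 90 kHz ticks (the NTSC frame rate below is only an example):

  CLOCK = 90_000                      # system clock ticks per second

  def to_pts(seconds):
      return round(seconds * CLOCK)

  def frame_pts(frame_index, frame_rate=30000 / 1001):
      return to_pts(frame_index / frame_rate)

  print(to_pts(1.0))      # 90000 ticks in one second
  print(frame_pts(1))     # 3003 ticks between NTSC frames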

Some optional sets of constraints provide a framework for common industry acceptance of certain key parameters for use by decoder designs and information providers. While the MPEG Systems specification is included in the current work item of MPEG, it is designed for compatibility with future extensions to audio, video and hypermedia coding, and a wide variety of bitrates.
The MPEG standard
This is a collection of frequently asked questions about the MPEG compression standard. It is organized as hypertext in HTML format to be easily extensible and upgradable.

Many sources contributed to this list.

If you wish to contribute, correct any mistake or just send your comments and impressions please contact :

Luigi.Filippini@crs4.it

What is MPEG ?

MPEG (Moving Pictures Experts Group) is a group of people that meet under ISO (the International Organization for Standardization) to generate standards for digital video (sequences of images in time) and audio compression. In particular, they define a compressed bit stream, which implicitly defines a decompressor. However, the compression algorithms are up to the individual manufacturers, and that is where proprietary advantage is obtained within the scope of a publicly available international standard. MPEG meets roughly four times a year for roughly a week each time. In between meetings, a great deal of work is done by the members, so it doesn't all happen at the meetings. The work is organized and planned at the meetings. MPEG itself is a nickname. The official name is: ISO/IEC JTC1 SC29 WG11.

 ISO:  International Organization for Standardization
 IEC:  International Electro-technical Commission
 JTC1: Joint Technical Committee 1
 SC29: Sub-committee 29
 WG11: Work Group 11 (moving pictures with... uh, audio)

Does it have anything to do with JPEG ?

Well, it sounds the same, and they are part of the same subcommittee of ISO along with JBIG and MHEG, and they usually meet at the same place at the same time. However, they are different sets of people with few or no common individual members, and they have different charters and requirements.

JPEG is for still image compression. JBIG is for binary image compression (like faxes), and MHEG is for multi-media data standards (like integrating stills, video, audio, text, etc.).

The most fundamental difference between MPEG and JPEG is MPEG's use of block-based motion compensated prediction (MCP), a general method falling into the temporal DPCM category.

The second most fundamental difference is in the target application. JPEG adopts a general purpose philosophy: independence from color space (up to 255 components per frame) and quantization tables for each component. Extended modes in JPEG include two sample precisions (8 and 12 bit sample accuracy), combinations of frequency progressive, spatially progressive, and amplitude progressive scanning modes. Color independence is made possible thanks to down-loadable Huffman tables.

Since MPEG is targeted for a set of specific applications, there is only one color space (4:2:0 YCbCr), one sample precision (8 bits), and one scanning mode (sequential). Luminance and chrominance share quantization tables. The range of sampling dimensions is more limited as well. MPEG adds adaptive quantization at the macroblock (16 x 16 pixel area) layer. This permits both smoother bit rate control and more perceptually uniform quantization throughout the picture and image sequence. Adaptive quantization is part of the JPEG-2 charter. MPEG variable length coding tables are non-downloadable, and are therefore optimized for a limited range of compression ratios appropriate for the target applications.

The local spatial decorrelation methods in MPEG and JPEG are very similar. Picture data is block transform coded with the two-dimensional orthonormal 8x8 DCT. The resulting 63 AC transform coefficients are mapped in a zig-zag pattern to statistically increase the runs of zeros. Coefficients of the vector are then uniformly scalar quantized, run-length coded, and finally the run-length symbols are variable length coded using a canonical (JPEG) or modified Huffman (MPEG) scheme. Global frame redundancy is reduced by 1-D DPCM of the block DC coefficients, followed by quantization and variable length entropy coding.

 Frame --MCP--> 8x8 spatial block --DCT--> 8x8 frequency block --ZZ--> zig-zag scan
       --Q--> quantization --RLC--> run-length coding --VLC--> variable length coding
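
The ZZ / Q / RLC stages of that chain, sketched for one 8x8 block of DCT coefficients (uniform quantization and toy (run, level) tokens only; the actual VLC tables are not reproduced here):

  def zigzag_order(n=8):
      order = []
      for s in range(2 * n - 1):
          diag = [(y, s - y) for y in range(n) if 0 <= s - y < n]
          if s % 2 == 0:
              diag.reverse()             # even diagonals run bottom-left to top-right
          order.extend(diag)
      return order

  def run_level(block, qstep=16):
      scan = [block[y][x] for y, x in zigzag_order()]
      levels = [round(v / qstep) for v in scan]
      tokens, run = [], 0
      for lvl in levels[1:]:             # AC coefficients only; DC is DPCM coded
          if lvl == 0:
              run += 1
          else:
              tokens.append((run, lvl))
              run = 0
      tokens.append("EOB")
      return levels[0], tokens

  block = [[0] * 8 for _ in range(8)]
  block[0][0], block[0][1], block[2][0] = 240, 48, -35
  print(run_level(block))                # (15, [(0, 3), (1, -2), 'EOB'])
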
The similarities have made it possible for the development of hard-wired silicon that can code both standards. Even microcoded architectures can better optimize through hardwired instruction primitives or functional blocks. There are many additional minor differences. They include:
DCT and quantization precision in MPEG is 9-bits since the macroblock difference operation expands the 8-bit signal precision by one bit.
Quantization in MPEG-1 forces quantized coefficients to become odd values (oddification).
JPEG run-length coding produces run-size tokens (run of zeros, non-zero coefficient magnitude) whereas MPEG produces fully concatenated run-level tokens that do not require magnitude differential bits.
DC values in MPEG-1 are limited to 8-bit precision (a constant step size of 8), whereas JPEG DC precision can occupy all possible 11-bits. MPEG-2, however, re-introduced extra DC precision.
How do MPEG and H.261 differ ?

H.261 was targeted for teleconferencing applications where motion is naturally more limited. Motion vectors are restricted to a range of +/- 15 pixels. Accuracy is reduced since H.261 motion vectors are restricted to integer-pel accuracy. Other syntactic differences include: no B-pictures, different quantization method.

H.261 is also known as P*64. "P" is an integer number meant to represent multiples of 64kbit/sec. In the end, this nomenclature probably won't be used as many services other than video will adopt the philosophy of arbitrary B channel (64kbit) bitrate scalability.

Is H.261 the de facto teleconferencing standard ?

Not exactly. To date, about seventy percent of the industrial teleconferencing hardware market is controlled by PictureTel of Mass. The second largest market controller is Compression Labs of Silicon Valley. PictureTel hardware includes compatibility with H.261 as a lowest common denominator, but when in communication with other PictureTel hardware, it can switch to a mode superior at low bit rates (less than 300kbits/sec). In fact, over 2/3 of all teleconferencing is done at two-times switched 56 channel (~P = 2) bandwidth. Long distance ISDN ain't cheap. In each direction, video and audio are coded at an aggregate of 112 kbits/sec (2*56 kbits/sec).

The PictureTel proprietary compression algorithm is acknowledged to be a combination of spatial pyramid, lattice vector quantizer, and an unidentified entropy coding method. Motion compensation is considerably more refined and sophisticated than the 16x16 integer-pel block method specified in H.261.

The Compression Labs proprietary algorithm also offers significant improvement over H.261 when linked to other CLI hardware.

Currently, ITU-TS (International Telecommunication Union--Telecommunication Standardization Sector), formerly CCITT, is quietly defining an improvement to H.261 with the participation of industry vendors.

What is the reasoning behind MPEG syntax symbols ?

Here are some of the Whys and Wherefores of MPEG symbols:

Start codes
These 32-bit byte-aligned codes provide a mechanism for cheaply searching coded bitstreams for commencement of various layers of video without having to actually parse or decode. Start codes also provide a mechanism for resynchronization in the presence of bit errors.
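
A sketch of that cheap search: scan the byte stream for the byte-aligned 00 00 01 prefix without parsing anything else (the buffer below is made up):

  def find_start_codes(data):
      hits = []
      i = 0
      while i + 3 < len(data):
          if data[i] == 0 and data[i + 1] == 0 and data[i + 2] == 1:
              hits.append((i, data[i + 3]))    # (byte offset, start code value)
              i += 4
          else:
              i += 1
      return hits

  buf = bytes([0, 0, 1, 0xB3, 0x12, 0x34, 0, 0, 1, 0x00])
  print(find_start_codes(buf))    # [(0, 179), (6, 0)]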


Coded block pattern
(CBP --not to be confused with Constrained Parameters!) When the frame prediction is particularly good, the displaced frame difference (DFD, or prediction error) tends to be small, often with entire block energy being reduced to zero after quantization. This usually happens only at low bit rates. Coded block patterns prevent the need for transmitting EOB symbols in those zero coded blocks.


DCT_coefficient_first
Each intra coded block has a DC coefficient. Inter coded blocks (prediction error or DFD) naturally do not since the prediction error is the first derivative of the video signal. With coded block patterns signalling all possible non-coded block patterns, the dct_coef_first mechanism assigns a different meaning to the VLC codeword that would otherwise represent EOB as the first coefficient.


End of Block
Saves unnecessary run-length codes. At optimal bitrates, there tend to be few AC coefficients, concentrated in the early stages of the zig-zag vector. In MPEG-1, the 2-bit length of EOB implies that there is an average of only 3 or 4 non-zero AC coefficients per block. In MPEG-2 Intra (I) pictures, with a 4-bit EOB code, this number is between 9 and 16 coefficients. Since EOB is required for all coded blocks, its absence can signal that a syntax error has occurred in the bitstream.


Macroblock stuffing
A genuine pain for VLSI implementations, macroblock stuffing was introduced to maintain smoother, constant bitrate control in MPEG-1. However, with normalized complexity measures and buffer management performed on an a priori (pre-frame, pre-slice, and pre-macroblock) basis in the MPEG-2 encoder test model, the need for such localized smoothing evaporated. Stuffing can be achieved through virtually unlimited slice start code padding if required. A good rule of thumb: if you find yourself often using stuffing more than once per slice, you probably don't have a very good rate control algorithm. Anyway, macroblock stuffing is now illegal in MPEG-2.


MPEG's modified Huffman VLC tables
The VLC tables in MPEG are not Huffman tables in the true sense of Huffman coding, but are more like the tables used in Group 3 fax. They are entropy constrained, that is, non-downloadable and optimized for a limited range of bit rates (sweet spots). With the exception of a few codewords, the larger tables were carried over from the H.261 standard of 1990. MPEG-2 added an "Intra table". Note that the dct_coefficient tables assume positive/negative coefficient pmf symmetry.

How would you explain MPEG to the data compression expert ?


MPEG video is a block-based video scheme
Local decorrelations via DCT-Q-VLC hybrid
Dead-zone quantizer (see the sketch after this list)
DFD: quantized prediction error
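As a rough sketch of the dead-zone quantizer idea from the list above (generic C, not the exact MPEG-1 arithmetic, which also folds in the quantizer matrix and quantizer_scale):

  /* Generic dead-zone quantizer sketch.  Truncation toward zero makes
   * the zero bin twice as wide as the other bins, so small DFD
   * coefficients collapse to zero and cost no bits. */
  int deadzone_quantize(int coeff, int step)
  {
      return coeff / step;          /* C integer division truncates toward zero */
  }

  int deadzone_dequantize(int level, int step)
  {
      return level * step;          /* real decoders bias the reconstruction
                                       toward the centre of the bin */
  }
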
What are the implementation requirements ?

MPEG pushes the limit of economical VLSI technology (but you get what you pay for in terms of picture quality or compaction efficiency)

  Video decoder profile   Typical decoder        Total DRAM   DRAM bus width @ speed
                          transistor count
  ----------------------  ---------------------  -----------  ----------------------
  MPEG-1 CPB              0.4 to 0.75 million    4 Mbit       16 bits @ 80 ns
  MPEG-1 601              0.8 to 1.1 million     16 Mbit      64 bits @ 80 ns
  MPEG-2 MP@ML            0.9 to 1.5 million     16 Mbit      64 bits @ 80 ns
  MPEG-2 MP@High1440      2 to 3 million         64 Mbit      N/A
70 or 80 ns DRAM speed is a measure of the shortest period in which words can be transferred across the bus. In the case of MPEG-1 SIF, 80 ns implies (1/80 ns)(16 bits), or about 25 MByte/sec of bandwidth. Lack of cheap memory (DRAM) utilization is where the original DVI algorithm made a costly mistake: it required expensive VRAM/SRAM chips (a static RAM cell requires 6 transistors, compared to 1 transistor for DRAM). Fast page mode DRAM (which has lower throughput than SRAM and requires near-contiguous address mapping) is viable for MPEG due almost exclusively to the block nature of the algorithm and syntax (DRAM memory locations are broken into rows and columns).
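As a back-of-the-envelope check of that figure (a small C sketch using only the numbers quoted above):

  /* Back-of-the-envelope check: a 16-bit bus moving one word every 80 ns. */
  #include <stdio.h>

  int main(void)
  {
      double cycle_s  = 80e-9;                 /* 80 ns per transfer     */
      int    bus_bits = 16;                    /* MPEG-1 SIF decoder bus */
      double bytes_per_s = (1.0 / cycle_s) * (bus_bits / 8.0);
      printf("%.0f MByte/s\n", bytes_per_s / 1e6);   /* prints 25 MByte/s */
      return 0;
  }
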
How do I join MPEG ?

You don't join MPEG. You have to participate in ISO as part of a national delegation. How you get to be part of the national delegation is up to each nation. I only know the U.S., where you have to attend the corresponding ANSI meetings to be able to attend the ISO meetings. Your company or institution has to be willing to sink some bucks into travel since, naturally, these meetings are held all over the world. (For example, Paris, Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de Janeiro, London, etc.)

What is the evolution of standard documents ?

In chronological order:

The Proposal Stage
Voting members ballot on the creation of a new standards project.


The Preparatory Stage
Project Leader manages the development of a Working Draft.


The Committee Stage
Consensus is achieved on a Committee Draft.


The Approval Stage
National bodies vote on a Draft International Standard.


The Publication Stage
ISO publishes the International Standard.

How do I get the documents ?

MPEG is a draft ISO standard. Its exact name is ISO CD 11172. The draft consists of three parts: Systems, Video, and Audio. The Systems part (11172-1) deals with synchronization and multiplexing of audio-visual information, while the Video (11172-2) and Audio (11172-3) parts address the video and audio compression techniques respectively. Part 4, Conformance Testing, is currently a CD. You may order it from your national standards body (e.g. ANSI in the USA) or buy it from other companies such as:

  ISO Sales
  Case Postale 56
  CH-1211 Geneve 20
  Switzerland

  ANSI, Attn: Sales
  11 West 42nd Street
  New York, NY 10036
  phone 212-642-4900

  Phillips Business Information
  7811 Montrose Rd
  Potomac, MD 20854
  phone +1 301 424-3338, (800) OMNICOM
  fax +1 301 309-3847

  Global Engineering Documents
  For inquiries within the US:
    1990 M Street NW, Suite 400
    Washington, DC 20036
    800-854-7179 (Voice)
    202-331-0960 (Fax)
  For inquiries from outside the US:
    2805 McGaw Avenue
    Irvine, CA 92714
    +1-714-261-1455

  Beuth Verlag
  Postfach 1145
  W-1000 Berlin 30
  Germany
What are the important themes of MPEG ?


Application specific. MPEG does not solve everybody's application needs, but offers a syntax that is a good solution for most. MPEG does not, for example, decorrelate energies situated 1/256th of a pixel between a non-linear combination of 1000 frames. The syntax was designed to occupy an optimum between cost and quality ... in other words, between computational complexity (VLSI area, memory size and bandwidth) and compaction (compression) efficiency.
The DCT and Huffman algorithms are some of the least significant aspects of the standard, and yet somehow receive the most press coverage.
In the encoding algorithm, you can do what you want as long as the bitstreams produced are compliant. There is a huge difference in picture quality between, for example, the test model and real-world proprietary implementations of encoding.
How do you tell a MPEG-1 bitstream from a MPEG-2 bitstream ?

All MPEG-2 bitstreams must have certain extension headers that *immediately* follow MPEG-1 headers. At the highest layer, for example, the MPEG-1 style sequence_header() is followed by sequence_extension() which is exclusive to MPEG-2. Some extension headers are specific to MPEG-2 profiles. For example, sequence_scalable_extension() is not allowed in Main Profile.

A simple program need only scan the coded bitstream for byte-aligned start codes to determine whether the stream is MPEG-1 or MPEG-2.
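For instance, a minimal C sketch (not a robust parser; it ignores user data and MPEG-1 extension data, and the function name is invented) that looks for the first sequence header start code, 0x000001B3, and reports MPEG-2 if the very next start code is the extension start code, 0x000001B5:

  /* Minimal sketch: decide whether a video elementary stream looks like
   * MPEG-1 or MPEG-2 by checking whether the first sequence header
   * (0x000001B3) is immediately followed by a sequence_extension
   * (0x000001B5), as MPEG-2 requires. */
  #include <stddef.h>

  int looks_like_mpeg2(const unsigned char *buf, size_t len)
  {
      for (size_t i = 0; i + 3 < len; i++) {
          if (buf[i] == 0 && buf[i+1] == 0 && buf[i+2] == 1 && buf[i+3] == 0xB3) {
              /* Found sequence_header_code; find the next start code. */
              for (size_t j = i + 4; j + 3 < len; j++) {
                  if (buf[j] == 0 && buf[j+1] == 0 && buf[j+2] == 1)
                      return buf[j+3] == 0xB5;   /* extension => MPEG-2 */
              }
              return 0;   /* ran out of data: assume MPEG-1 */
          }
      }
      return 0;           /* no sequence header found */
  }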

What is the precision of MPEG samples ?

By definition, MPEG samples have no more and no less than 8-bit uniform sample precision (256 quantization levels). For luminance (which is unsigned) data, black corresponds to level 0 and white to level 255. However, in CCIR Recommendation 601 chromaticity, levels 0 through 14 and 236 through 255 are reserved for blanking signal excursions. MPEG currently has no such clipped excursion restrictions.

What is the best compression ratio for MPEG ?

The MPEG sweet spot is about 1.2 bits/pel intra and 0.35 bits/pel inter. Experimentation has shown that intra-frame coding with the familiar DCT-Quantization-Entropy hybrid algorithm achieves optimal performance at an average of about 1.2 bits/sample, or roughly a 6:1 compression ratio. Below this point, artifacts become noticeable.
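As a quick sanity check of those figures (a trivial C sketch; the 8-bit source sample precision is the one defined elsewhere in this FAQ):

  /* Rough check: 8-bit source samples coded at about 1.2 bits each. */
  #include <stdio.h>

  int main(void)
  {
      double source_bits_per_pel = 8.0;   /* uniform 8-bit MPEG samples */
      double intra_bits_per_pel  = 1.2;   /* quoted intra sweet spot    */
      printf("intra ratio: %.1f:1\n",
             source_bits_per_pel / intra_bits_per_pel);  /* 6.7:1, i.e. "about 6:1" */
      return 0;
  }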

What about MPEG artifacts ?

If the encoder did its job properly, and the user specified a proper balance between sample rate and bitrate, there shouldn't be any visible artifacts. However, in sub-optimal systems, you can look for:


  - Gibbs phenomenon/ringing/aliasing (too few AC bits, not enough pre-filtering)
  - Blockiness (not considering your neighbors before quantizing)
  - Posterization (too few DC bits)
  - Checkerboards (DCT eigenimages as a result of too few AC coefficients)
  - Color bleeding (not considering color in encoder cost model)
Are there single chip MPEG encoders ?

Yes, the C-Cube CL-4000 is the only single-chip, real-time encoder that can process true MPEG-1 SIF rate video.

  - Single chip for +/- 15 pel motion estimation at SIF rates (352x240x30 Hz)
  - Two chips for +/- 32 pel at SIF rates (hierarchical)
  - 5 or 6 chips for MPEG-2 at CCIR-601 rates (704 x 480 x 30 Hz)
  - Highly microcoded architecture. Can code both H.261 and JPEG.
  - Implements high picture quality microcode programs.
  - [more details from CICC'93 and HotChips '93 conferences to be included]

IBM and SGS-Thomson plan to introduce more hard-wired, multi-chip solutions in 1994.

What about MPEG-1 decoder chips ?

By implication of MPEG-2 Conformance requirements, all MPEG-2 decoders are required to decode MPEG-1 bitstreams as well. These chips, however, are strictly MPEG-1:


  - C-Cube CL-450: SIF rates. Single-chip. Has on-board CPU.
  - SGS-Thomson 3400: SIF rates. Single-chip. Hardwired.
  - Motorola MCD250: SIF rates. Single-chip.
  - LSI 641172: CCIR-601 rates. Single-chip. Systems packet decoder on-chip.
What about audio chips ?

To date, only Layer I and Layer II have been implemented in dedicated (ASIC) silicon:

  - Motorola MCD260

  - Texas Instruments TI 320AV110
      hardwired (with systems parsing)
      operates in free format (arbitrary sample rate)
      120-pin PQFP package
      serial data port
      part of technology exchange with C-Cube

  - LSI Logic L64111
      hardwired w/CPU, with on-chip systems parsing
      serial data port
      100-pin PQFP

  - GCA/ASCII ?

  - Crystal Semiconductor CS4920
      on-chip, 2-channel 16-bit digital-to-analog converter (DAC)
      16 MIPS, 24-bit DSP
      programmable clock manager
      44-pin PLCC package
      programmable architecture: can, for example, download Layer II MPEG-1 audio or Dolby AC-2
      $38 each in large quantities

  - Dolby AC-3: MPEG NY disclosure claimed to be less computationally intensive

  - Zoran, GI working on own DSP-like dedicated chips
Will there be an MPEG video tape format ?

There is a consortium of companies (Philips, JVC, Sony, Matsushita, et al.) developing a metal-particle-based 6 millimeter consumer digital video tape format. It will initially use more JPEG-like independent frame compression for cheap encoding of source analog (NTSC, PAL) video. The consequence, of course, is less efficient use of bandwidth (roughly 25 Mbit/sec for the same quality achieved at 6 Mbit/sec with MPEG). Pre-compressed video from broadcast sources will be directly recorded to tape and "passed through" as a coded bitstream to the video decompression "box" upon playback.

Is so-and-so really MPEG compliant ?

At the very least, there are two areas of conformance/compliance in MPEG: 1. compliant bitstreams, and 2. compliant decoders. Technically speaking, video bitstreams consisting entirely of I-frames (such as those generated by Xing software) are syntactically compliant with the MPEG specification. The I-frame sequence is simply a subset of the full syntax. Compliant bitstreams must obey the range limits (e.g. motion vectors limited to +/-128, frame sizes, frame rates, etc.) and syntax rules (e.g. all slices must commence and terminate with a non-skipped macroblock, no gaps between slices, etc.).

Decoders, however, cannot escape true conformance. For example, a decoder that cannot decode P or B frames is *not* legal MPEG. Likewise, full arithmetic precision must be obeyed before any decoder can be called "MPEG compliant." The IDCT, inverse quantizer, and motion compensated prediction must meet the specification requirements... which are fairly rigid (e.g. no more than 1 least significant bit of error between reference and test decoders). Real-time conformance is more complicated to measure than arithmetic precision, but it is reasonable to expect that decoders that skip frames on reasonable bitstreams are not likely to be considered compliant.

What are some journals on related MPEG topics ?


  - IEEE Multimedia [first edition Spring 1994]
  - IEEE Transactions on Consumer Electronics
  - IEEE Transactions on Broadcasting
  - IEEE Transactions on Circuits and Systems for Video Technology
  - Advanced Electronic Imaging
  - Electronic Engineering Times (EE Times)
  - IEEE Int'l Conference on Acoustics, Speech, and Signal Processing (ICASSP)
  - International Broadcasting Convention (IBC)
  - Society of Motion Picture and Television Engineers (SMPTE)
  - SPIE conference on Visual Communications and Image Processing
  - SPIE conference on Video Compression for Personal Computers (to be held Feb 1994 in San Jose)
What performance should I expect from MPEG boards ?

The OptiVision product, along with products from Optibase and Scientific Atlanta, does real-time compression and storage to disk. The cheap video boards, at best, can only do 30 fps with about 160 x 120 windows. Nobody can do 352 x 240 in real time without the right hardware. The SA product is about $30K list and the Optibase somewhere around $20K for the board set.

Optivision also offers a board that does the MPEG conversion off line. Even this is costly (about $2,000) if you want it done in any decent time frame.

If you believe that $20,000 is high, AT&T at the Western Cable Show 1993 demonstrated a real time MPEG-2 compression system at $90,000.

The market for these real time systems is very real; it is the satellite uplink and cable television market. Nominal compression ratios are running about 200:1 for MPEG-1 in the Optibase product. For broadcast quality, compression ratios are lower. Even here, you have to be careful. 200:1 really means "take a 640 x 480 image, sub-sample it to 320x240 (throwing out data to get 4:1 compression), then compress it 50:1 doing MPEG".
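To make that arithmetic explicit, here is a small C sketch; the 24-bit RGB source depth and 30 frame/sec rate are assumptions added for concreteness, not figures from the vendor claims:

  /* Sketch of the arithmetic behind a "200:1" marketing figure, using
   * the decomposition given above.  The source bit depth cancels out of
   * the ratio but is shown for concreteness. */
  #include <stdio.h>

  int main(void)
  {
      double src_pixels = 640.0 * 480.0;
      double sub_pixels = 320.0 * 240.0;
      double subsample_ratio = src_pixels / sub_pixels;   /* 4:1 from sub-sampling */
      double mpeg_ratio      = 50.0;                      /* claimed coder ratio   */
      printf("overall ratio: %.0f:1\n", subsample_ratio * mpeg_ratio);  /* 200:1 */

      /* At 30 frames/sec and an assumed 24 bits/pixel: */
      double src_rate = src_pixels * 24.0 * 30.0;         /* ~221 Mbit/s */
      printf("source %.0f Mbit/s -> coded %.1f Mbit/s\n",
             src_rate / 1e6, src_rate / 1e6 / 200.0);     /* ~1.1 Mbit/s */
      return 0;
  }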

FrameRate Labs is about ready to release a board that does 640 x 240 real-time capture and storage to disk without any compression or dropped frames; it will compress offline. This is brute force but far cheaper than a $20,000 solution. If you need real time all day long, talk to Scientific-Atlanta, Optibase or OptiVision. If you need real time for a brief period with dropped frames, use the low-end boards like Video Spigot, etc. If you need real time for a brief period without loss of data, FrameRate Labs might have a solution.

The low-end board manufacturers label their products real-time 30 fps and, separately, claim to be able to capture a 640 x 480 image, but they never make both claims in the same sentence.

Are there any MPEG FTP or WWW sites ?

There are now many anonymous FTP sites with MPEG programs or movies. A site archiving most of the public domain programs and documents about the MPEG standard (and also other compression techniques) may be found at ftp.crs4.it
