G.723.1 Audio Codec Solution

G.723.1 specifies a coded representation that can be used for compressing the speech or other audio signal component of multimedia services at a very low bit rate. In the design of this coder, the principal application considered was very low bit rate visual telephony as part of the overall H.324 family of standards.

G.723.1 has two bit rates associated with it: 5.3 and 6.3 kbit/s. The higher rate gives greater quality; the lower rate gives good quality while providing system designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder, and it is possible to switch between them at any 30 ms frame boundary. Variable rate operation, using discontinuous transmission and noise fill during non-speech intervals, is also available as an option.
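
As an illustration of how the two rates coexist in one bitstream, the sketch below classifies a received frame by the two least-significant bits of its first octet, following the packing convention used for G.723.1 in the RTP payload format (RFC 3551). The type and function names are illustrative only.

    /* Sketch: distinguishing G.723.1 frame types by the two LSBs of the
     * first octet, as in the RTP payload packing of RFC 3551.  Sizes are
     * in octets per 30 ms frame. */
    #include <stddef.h>

    typedef enum { G723_RATE_63, G723_RATE_53, G723_SID, G723_RESERVED } g723_frame_type;

    static g723_frame_type g723_classify(unsigned char first_octet, size_t *octets)
    {
        switch (first_octet & 0x03) {
        case 0:  *octets = 24; return G723_RATE_63;   /* 6.3 kbit/s active speech */
        case 1:  *octets = 20; return G723_RATE_53;   /* 5.3 kbit/s active speech */
        case 2:  *octets = 4;  return G723_SID;       /* comfort-noise descriptor */
        default: *octets = 0;  return G723_RESERVED;
        }
    }

A depacketizer can walk a buffer of concatenated frames by applying this to the first octet of each frame and advancing by the returned size.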

The G.723.1 coder encodes speech or other audio signals in 30 ms frames. In addition, there is a look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms (a short numeric check follows the list below). All additional delays in the implementation and operation of this coder are due to:

actual time spent processing the data in the encoder and decoder;
transmission time on the communication link;
additional buffering delay for the multiplexing protocol.
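
The delay figures above follow directly from the sample counts; here is a minimal check assuming only the 8 kHz sampling rate and the frame and look-ahead lengths quoted in the text.

    /* Back-of-the-envelope check of the frame and delay figures quoted above. */
    #include <stdio.h>

    int main(void)
    {
        const double fs_hz = 8000.0;          /* sampling rate */
        const int frame_samples = 240;        /* 30 ms frame */
        const int lookahead_samples = 60;     /* 7.5 ms look-ahead */

        double frame_ms     = 1000.0 * frame_samples / fs_hz;      /* 30.0 */
        double lookahead_ms = 1000.0 * lookahead_samples / fs_hz;  /* 7.5  */

        printf("algorithmic delay = %.1f ms\n", frame_ms + lookahead_ms); /* 37.5 */
        return 0;
    }
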
The G.723.1 coder is designed to operate on a digital signal obtained by first performing telephone bandwidth filtering (Recommendation G.712) of the analogue input, then sampling at 8000 Hz and converting to 16-bit linear PCM as the input to the encoder. The output of the decoder should be converted back to analogue by similar means. Other input/output characteristics, such as those specified by Recommendation G.711 for 64 kbit/s PCM data, should be converted to 16-bit linear PCM before encoding, or from 16-bit linear PCM to the appropriate format after decoding.
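
As a sketch of that input conversion step, the following is the widely used table-free expansion of a G.711 mu-law octet to a 16-bit-range linear sample; the corresponding compression is applied on the output side, and A-law expansion is analogous but uses a different segment encoding. The function name is illustrative.

    /* Minimal sketch: expanding one G.711 mu-law octet to linear PCM before
     * it is fed to the encoder.  The result spans roughly +/-32124, i.e. it
     * is already on a 16-bit scale. */
    static short ulaw_to_linear16(unsigned char u)
    {
        u = (unsigned char)~u;                 /* mu-law bytes are stored inverted */
        int exponent = (u & 0x70) >> 4;        /* 3-bit segment number */
        int mantissa = u & 0x0F;               /* 4-bit step within the segment */
        int magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
        return (short)((u & 0x80) ? -magnitude : magnitude);
    }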

The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of 240 samples each, equal to 30 ms at an 8 kHz sampling rate. Each block is first high-pass filtered to remove the DC component and then divided into four subframes of 60 samples each. For every subframe, a 10th-order Linear Prediction Coder (LPC) filter is computed using the unprocessed input signal. The LPC filter for the last subframe is quantized using a Predictive Split Vector Quantizer (PSVQ). The unquantized LPC coefficients are used to construct the short-term perceptual weighting filter, which is used to filter the entire frame and obtain the perceptually weighted speech signal.
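
A minimal floating-point sketch of the per-subframe LPC analysis: autocorrelation of a windowed analysis block around the subframe followed by the Levinson-Durbin recursion. The windowing, fixed-point arithmetic, bandwidth expansion and LSP conversion of the real coder are all omitted, and the function name and use of double precision are illustrative.

    #define LPC_ORDER 10

    /* x: windowed input samples, n: length of the analysis block,
     * a: output LPC coefficients a[1..LPC_ORDER] with a[0] == 1, so that
     * A(z) = 1 + a[1]z^-1 + ... + a[10]z^-10 is the prediction error filter. */
    static void lpc_analysis(const double *x, int n, double a[LPC_ORDER + 1])
    {
        double r[LPC_ORDER + 1];
        for (int k = 0; k <= LPC_ORDER; k++) {        /* autocorrelation */
            r[k] = 0.0;
            for (int i = k; i < n; i++)
                r[k] += x[i] * x[i - k];
        }

        double err = r[0];
        a[0] = 1.0;
        for (int i = 1; i <= LPC_ORDER; i++)
            a[i] = 0.0;

        for (int i = 1; i <= LPC_ORDER; i++) {        /* Levinson-Durbin recursion */
            double acc = r[i];
            for (int j = 1; j < i; j++)
                acc += a[j] * r[i - j];
            double k = -acc / err;                     /* reflection coefficient */

            double tmp[LPC_ORDER + 1];
            for (int j = 1; j < i; j++)
                tmp[j] = a[j] + k * a[i - j];
            for (int j = 1; j < i; j++)
                a[j] = tmp[j];
            a[i] = k;

            err *= (1.0 - k * k);
            if (err <= 0.0)                            /* numerically singular block */
                break;
        }
    }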

For every two subframes (120 samples), the open-loop pitch period, L_OL, is computed from the perceptually weighted speech signal. The pitch period is searched in the range from 18 to 142 samples. From this point on, the speech is processed on a 60-sample-per-subframe basis.
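
A sketch of that open-loop search, assuming the weighted speech and its history are available as a plain array; the real coder adds rules that favour smaller lags when scores are close, which are omitted here, and the names are illustrative.

    #define PITCH_MIN 18
    #define PITCH_MAX 142
    #define PITCH_BLK 120

    /* w points at the current 120-sample block of weighted speech;
     * history back to w[-PITCH_MAX] must be valid. */
    static int open_loop_pitch(const double *w)
    {
        int best_lag = PITCH_MIN;
        double best_score = -1e30;

        for (int lag = PITCH_MIN; lag <= PITCH_MAX; lag++) {
            double corr = 0.0, energy = 1e-9;          /* small bias avoids divide-by-zero */
            for (int n = 0; n < PITCH_BLK; n++) {
                corr   += w[n] * w[n - lag];
                energy += w[n - lag] * w[n - lag];
            }
            double score = (corr > 0.0) ? corr * corr / energy : 0.0;
            if (score > best_score) {                  /* keep the best-matching lag */
                best_score = score;
                best_lag = lag;
            }
        }
        return best_lag;
    }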

Using the estimated pitch period computed previously, a harmonic noise shaping filter is constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter, and the harmonic noise shaping filter is used to create an impulse response. The impulse response is then used for further computations.
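
The combined impulse response can be pictured as the result of pushing a unit impulse through the three filters in cascade. The sketch below does exactly that using hypothetical per-sample filter callbacks; the real coder computes the same 60-sample response with fixed-point filter recursions.

    #define SUBFRAME_LEN 60

    /* Hypothetical signature: a filter that consumes one input sample and
     * returns one output sample, carrying its memory in 'state'. */
    typedef double (*sample_filter_fn)(double in, void *state);

    static void combined_impulse_response(sample_filter_fn synth,  void *synth_st,
                                          sample_filter_fn weight, void *weight_st,
                                          sample_filter_fn harm,   void *harm_st,
                                          double h[SUBFRAME_LEN])
    {
        for (int n = 0; n < SUBFRAME_LEN; n++) {
            double x = (n == 0) ? 1.0 : 0.0;           /* unit impulse */
            h[n] = harm(weight(synth(x, synth_st), weight_st), harm_st);
        }
    }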

Using the open-loop pitch estimate L_OL and the impulse response, a closed-loop pitch predictor is computed. A fifth-order pitch predictor is used. The pitch period is computed as a small differential value around the open-loop pitch estimate, and the contribution of the pitch predictor is then subtracted from the initial target vector. Both the pitch period and the differential value are transmitted to the decoder.
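
A simplified view of that closed-loop refinement: each candidate lag near L_OL forms an adaptive-codebook vector from the past excitation, the vector is convolved with the combined impulse response, and the lag that best matches the target is kept. The fifth-order gain vector quantization of the real coder is not shown, and the names and search-window parameter are illustrative.

    #define SUBFRAME_LEN 60

    /* exc points at the start of the current subframe's excitation buffer
     * (history exc[-lag] must be valid); target and h are SUBFRAME_LEN long. */
    static int closed_loop_lag(const double *exc, const double *target,
                               const double *h, int lag_ol, int delta)
    {
        int best_lag = lag_ol;
        double best_score = -1e30;

        for (int lag = lag_ol - delta; lag <= lag_ol + delta; lag++) {
            double y[SUBFRAME_LEN];                    /* filtered codebook vector */
            for (int n = 0; n < SUBFRAME_LEN; n++) {
                y[n] = 0.0;
                for (int k = 0; k <= n; k++) {
                    int idx = k - lag;
                    while (idx >= 0)                   /* periodic extension for short lags */
                        idx -= lag;
                    y[n] += exc[idx] * h[n - k];
                }
            }
            double corr = 0.0, energy = 1e-9;
            for (int n = 0; n < SUBFRAME_LEN; n++) {
                corr   += target[n] * y[n];
                energy += y[n] * y[n];
            }
            double score = corr * corr / energy;
            if (score > best_score) {
                best_score = score;
                best_lag = lag;
            }
        }
        return best_lag;
    }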

Finally, the non-periodic component of the excitation is approximated. For the high bit rate, Multi-pulse Maximum Likelihood Quantization (MP-MLQ) excitation is used; for the low bit rate, Algebraic-Code-Excited Linear Prediction (ACELP) excitation is used.

The G.723.1 decoder also operates on a frame-by-frame basis. First the quantized LPC indices are decoded, and the decoder constructs the LPC synthesis filter. For every subframe, both the adaptive codebook and fixed codebook excitations are decoded and fed to the synthesis filter. The adaptive postfilter consists of a formant postfilter and a forward-backward pitch postfilter: the excitation signal is input to the pitch postfilter, whose output drives the synthesis filter, and the synthesis filter output is then input to the formant postfilter. A gain scaling unit maintains the energy at the level of the formant postfilter's input.
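
The heart of the reconstruction is the decoded excitation driving the 10th-order all-pole synthesis filter; a minimal sketch follows, with postfiltering and gain scaling omitted and names illustrative.

    #define LPC_ORDER    10
    #define SUBFRAME_LEN 60

    /* a[1..LPC_ORDER] are the decoded LPC coefficients (a[0] == 1);
     * mem holds the last LPC_ORDER output samples and is updated in place. */
    static void lpc_synthesis(const double a[LPC_ORDER + 1],
                              const double exc[SUBFRAME_LEN],
                              double mem[LPC_ORDER],
                              double out[SUBFRAME_LEN])
    {
        for (int n = 0; n < SUBFRAME_LEN; n++) {
            double acc = exc[n];                       /* excitation sample */
            for (int k = 1; k <= LPC_ORDER; k++) {
                double past = (n - k >= 0) ? out[n - k] : mem[LPC_ORDER + n - k];
                acc -= a[k] * past;                    /* 1/A(z) recursion */
            }
            out[n] = acc;
        }
        for (int k = 0; k < LPC_ORDER; k++)            /* carry filter memory forward */
            mem[k] = out[SUBFRAME_LEN - LPC_ORDER + k];
    }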

Applications:
Wi-Fi phones (VoWLAN)
Wireless GPRS/EDGE systems
Personal Communications
Wideband IP telephony
Audio and Video Conferencing
Features:
Full and half duplex modes of operation.
Passes ITU test vectors.
Common compressed speech frame stream interface to support systems with multiple speech coders (G.729, G.728, G.726, etc.).
Optimized for high performance on leading edge DSP architectures.
Multi-tasking environment compatible.
Configurations:
DAA interface using linear codec at 8.0 kHz sample rate.
Direct interface to 8.0 kHz PCM data stream (A-law or mu-law).
North American/International Telephony (including caller ID) support available.
Simultaneous DTMF detector operation available (typically fewer than 150 hits on the Bellcore test tape).
MF tone detectors, general purpose programmable tone detectors/generators available.
Line echo cancellation (G.165 & G.168 compliant) available.
Where multiple speech coders (G.729, G.728, G.726, etc.) are available, coder selection can occur at run time.
Data/facsimile/voice discrimination available.
Various startup procedures available (V.8 and V.8bis).
Multiple ports can be executed on a single DSP.
Example Resource Requirements (ADSP-2181):
Encoder at 5.3 kbit/s requires 18 MIPS
Encoder at 6.3 kbit/s requires 26 MIPS
Decoder (5.3 or 6.3 kbit/s) requires 2 MIPS
Source: https://www.cnblogs.com/shengshuai/p/g7231coder.html