Wyner-Ziv video coding for wireless lightweight multimedia applications
© Deligiannis et al; licensee Springer. 2012
Received: 10 October 2011
Accepted: 14 March 2012
Published: 14 March 2012
Wireless video communications offer promising opportunities involving commercial applications on a grand scale as well as highly specialized niche markets. In this regard, the design of efficient video coding systems that meet key requirements such as low power, mobility and low complexity is a challenging problem. The solution can be found in fundamental information theoretic results, which gave rise to the distributed video coding (DVC) paradigm, under which lightweight video encoding schemes can be engineered. This article presents a new hash-based DVC architecture incorporating a novel motion-compensated multi-hypothesis prediction technique. The presented method is able to adapt to the regional variations in temporal correlation in a frame. The proposed codec enables scalable Wyner-Ziv video coding and provides state-of-the-art distributed video compression performance. The key novelty of this article is the expansion of the application domain of DVC from conventional video material to medical imaging. Wireless capsule endoscopy in particular, which is essentially wireless video recording in a pill, is shown to be an important application field. The low complexity encoding characteristics, the ability of the novel motion-compensated multi-hypothesis prediction technique to adapt to regional degrees of temporal correlation (which is of crucial importance in the context of endoscopic video content), and the high compression performance make the proposed distributed video codec a strong candidate for future lightweight (medical) imaging applications.
Traditional video coding architectures, like the H.26x recommendations, mainly target broadcast applications, where video content is distributed to multiple users, and focus on optimizing the compression performance. The source redundancy is exploited at the encoder by means of predictive coding. In this way, traditional video coding implies joint encoding and decoding of video. Namely, the encoder produces a prediction of the source and then codes the difference between the source and its prediction. Motion-compensated prediction in particular, a key algorithm to achieve high compression performance by removing the temporal correlation between successive frames in a sequence, is very effective but computationally demanding.
The need for highly efficient video compression architectures maintaining lightweight encoding remains challenging in the context of wireless video capturing devices that have only modest computational capacity or operate on limited battery life. The solution to reduce the encoding complexity can be found in the fundamentals of information theory, which constitute an original coding perspective, known as distributed source coding (DSC). The latter stems from the theory of Slepian and Wolf on lossless separate encoding and joint decoding of correlated sources. Subsequently, Wyner and Ziv extended the DSC problem to the lossy case, deriving the rate distortion function with side information at the decoder. Driven by these principles, the distributed, also known as Wyner-Ziv, video coding paradigm has arisen [4, 5].
Unlike traditional video coding, in distributed video coding (DVC), the source redundancies are exploited at the decoder side, implying separate encoding and joint decoding. Specifically, a prediction of the source, named side information, is generated at the decoder by using the already decoded information. By expressing the statistical dependency between the source and the side information in the form of a virtual correlation channel, e.g. [4–8], compression can be achieved by transmitting parity or syndrome bits of a channel code, which are used to decode the source with the aid of the side information. Hence, computationally expensive tasks, like motion estimation, could be relocated to the decoder, allowing for a flexible sharing of the computational complexity between the encoder and the decoder and enabling the design of lightweight encoding architectures.
DVC has been recognized as a potential strategic component for a wide range of lightweight video encoding applications, including visual sensor networks and wireless low-power surveillance [9, 10]. A unique application of particular interest in this article is wireless capsule endoscopy. Conventional endoscopy, like colonoscopy or gastroscopy, has proven to be an indispensable tool in the diagnosis and remedy of various diseases of the digestive tract. Significant advances in miniaturization have led to the emergence of wireless capsule endoscopy. At the size of a large pill, a wireless capsule endoscope comprises a light source, an integrated chip video camera, a radio telemetry transmitter and a limited lifespan battery. The small-scale nature of the recording device imposes severe constraints on the required video coding technology, in terms of computational complexity, operating time, and power consumption. Moreover, since the recorded video is used for medical diagnosis, high-quality decoded video at an efficient compression ratio is of paramount importance.
Generating high-quality side information plays a vital role in the compression performance of a DVC system. In contrast to traditional predictive coding, in DVC the original frame is not available during motion estimation, since this is performed at the decoder. Producing accurate motion-compensated predictions at the decoder for a wide range of video content, while at the same time constraining the encoding complexity and guaranteeing high compression performance, poses a major challenge. This problem becomes even more intricate in the largely unexplored application of DVC in wireless capsule endoscopy, in which the recorded video material contains extremely irregular motion, due to low frame acquisition rates and the erratic movement of the capsule along the gastrointestinal tract. Towards tackling this challenge, this study presents a novel hash-based DVC architecture.
First and foremost, this study paves the road for the application of DVC systems in lightweight medical imaging, where the proposed codec achieves high compression efficiency with the additional benefit of low computational encoding complexity. Second, the proposed Wyner-Ziv video codec incorporates a novel motion-compensated multi-hypothesis prediction scheme that supports online tuning to the spatial variations in temporal correlation in a frame by obtaining information from the coded hash in case temporal prediction is unreliable. Third, this article includes a thorough experimental evaluation of the proposed hash-based DVC scheme on (i) conventional test sequences, (ii) numerous traditional endoscopic sequences, and (iii) wireless capsule endoscopic video content. The experimental results show that the proposed DVC outperforms alternative DVC schemes, including DISCOVER, an alternative hash-based DVC and our previous study, as well as conventional codecs, namely, Motion JPEG and H.264/AVC Intra. Fourth, this article incorporates a detailed analysis of the encoding complexity and buffer size requirements of the proposed system.
The rest of the article is structured as follows. Section 2 covers an overview of Slepian-Wolf and Wyner-Ziv coding and their instantiation in DVC. Section 3 describes two application scenarios, both relevant to DVC in general and the proposed video codec in particular. Our novel DVC codec is explained in Section 4 and experimentally evaluated in Section 5, using conventional test sequences as well as endoscopic test video. Section 6 draws the conclusions of this study.
Consider the compression of two correlated, discrete, independent and identically distributed (i.i.d.) random sources X and Y. According to Shannon's source coding theory, the achievable lower rate bound for lossless joint encoding and decoding is given by the joint entropy $H(X, Y)$ of the sources. Slepian and Wolf studied the lossless compression scenario in which the sources are independently encoded and jointly decoded. According to their theory, the achievable rate region for decoding X and Y with an arbitrarily small error probability is given by $R_X \ge H(X|Y)$, $R_Y \ge H(Y|X)$, $R_X + R_Y \ge H(X, Y)$, where $H(X|Y)$ and $H(Y|X)$ are the conditional entropies of the considered sources, and $R_X$, $R_Y$ are the respective rates at which the sources X and Y are coded. In other words, the Slepian-Wolf theorem states that even when correlated sources are encoded independently, a total rate close to the joint entropy suffices to achieve lossless compression.
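As an illustration of the rate region above, the following minimal sketch evaluates the three bounds for a toy case that is not taken from the article: X is a uniform binary source and Y equals X passed through a binary symmetric channel with a given crossover probability. The function names are hypothetical and chosen only for this example.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def slepian_wolf_bounds(crossover: float) -> dict:
    """Bounds for the assumed toy model: X ~ Bernoulli(0.5), Y = X xor N,
    with N ~ Bernoulli(crossover) independent of X."""
    h_cond = binary_entropy(crossover)  # H(X|Y) = H(Y|X) for this model
    h_joint = 1.0 + h_cond              # H(X, Y) = H(X) + H(Y|X), H(X) = 1 bit
    return {"R_X_min": h_cond, "R_Y_min": h_cond, "R_sum_min": h_joint}


def in_rate_region(r_x: float, r_y: float, crossover: float) -> bool:
    """Check the three Slepian-Wolf inequalities for a rate pair."""
    b = slepian_wolf_bounds(crossover)
    return (r_x >= b["R_X_min"]
            and r_y >= b["R_Y_min"]
            and r_x + r_y >= b["R_sum_min"])


if __name__ == "__main__":
    print(slepian_wolf_bounds(0.1))
    print(in_rate_region(1.0, 0.6, 0.1))
```

In this toy model, coding Y at its full entropy of 1 bit leaves a requirement of only H(X|Y) bits for X, which is the corner point of the region that DVC with decoder side information operates at.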
The proof of the Slepian-Wolf theorem relies on a random binning argument, in which the employed code generation is asymptotic and non-constructive. Wyner pointed out the strong relation between random binning and channel coding, suggesting the use of linear channel codes as a practical solution for Slepian-Wolf coding. Wyner's methodology was recently used by Pradhan and Ramchandran in the context of practical Slepian-Wolf code design based on conventional channel codes like block and trellis codes. In the particular case of binary symmetric correlation between the sources, Wyner's scheme can be extended to state-of-the-art binary linear codes, such as Turbo codes [5, 17] and low-density parity-check (LDPC) codes, approaching the Slepian-Wolf limit. A turbo scheme with structured component codes has also been used, while other schemes sent parity bits instead of syndrome bits. Although breaking the close link with channel coding, characterized by syndromes and coset codes, the latter solutions offer inherent robustness against transmission errors.
Wyner-Ziv coding refers to the problem of lossy compression with decoder side information. Suppose X and Y are two statistically dependent i.i.d. random sources, where X is independently encoded and decoded using Y as side information. The reconstructed source $\hat{X}$ has an expected distortion $D = E[d(X, \hat{X})]$. According to the Wyner-Ziv theorem, a rate loss is sustained when the encoder is ignorant of the side information, namely $R^{WZ}_{X|Y}(D) - R_{X|Y}(D) \ge 0$, where $R^{WZ}_{X|Y}(D)$ is the Wyner-Ziv rate and $R_{X|Y}(D)$ is the rate when the side information is available to the encoder as well. However, Wyner and Ziv further showed that equality holds for the quadratic Gaussian case, namely the case where X and Y are jointly Gaussian and a mean-square distortion metric d(•,•) is used.
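For the quadratic Gaussian case mentioned above, the rate-distortion function has a well-known closed form. The expression below is a standard result rather than a formula quoted from this article; the symbols $\sigma_X^2$ (variance of X) and $\rho$ (correlation coefficient between X and Y) are introduced here only for illustration.

```latex
% Quadratic Gaussian case: no rate loss relative to encoder-side information.
\[
  R^{WZ}_{X|Y}(D) \;=\; R_{X|Y}(D)
  \;=\; \max\!\left\{ 0,\; \tfrac{1}{2}\log_2 \frac{\sigma^2_{X|Y}}{D} \right\},
  \qquad
  \sigma^2_{X|Y} \;=\; \sigma_X^2\,(1-\rho^2).
\]
```

The stronger the correlation between the source and the side information (the closer $|\rho|$ is to 1), the smaller the conditional variance and hence the lower the rate needed for a target distortion D.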
Initial practical Wyner-Ziv code design focused on finding good nested codes among lattice and trellis-based codes for the quadratic Gaussian case. However, as the dimensionality increases, lattice source codes approach the source coding limit much faster than lattice channel codes approach capacity. This observation has induced the second wave of Wyner-Ziv code design, which is based on nested lattice codes followed by binning. The third practical approach to Wyner-Ziv coding considers non-nested quantization followed by efficient binning, realized by a high-dimensional channel code. Other constructions in the literature propose turbo-trellis Wyner-Ziv codes, in which trellis coded quantization is concatenated with a Turbo or an LDPC code.
One of the applications of DSC that has received a substantial amount of research attention is DVC. Apart from providing low-complexity encoding solutions for video, Wyner-Ziv coding has been shown to provide error resilient video coding by means of distributed joint-source channel coding or systematic forward error protection. Moreover, layered Wyner-Ziv code constructions support scalable video coding.
An early practical DVC implementation was the PRISM codec, combining Bose-Chaudhuri-Hocquenghem channel codes with efficient entropy coding and performing block-based joint decoding and motion estimation. An additional cyclic redundancy check (CRC) was sent to the decoder to select between many decoded versions of a block, each version in fact corresponding to a different motion vector. An alternative DVC architecture implemented Wyner-Ziv coding as quantization followed by turbo coding, using a feedback channel to enable decoder-driven optimal rate control. In this architecture, side information was generated at the decoder using motion-compensated interpolation (MCI). The architecture was further improved upon, resulting in the DISCOVER codec, which included superior MCI through block-based bidirectional motion estimation and compensation combined with spatial smoothing. The DISCOVER codec is a well-established reference in DVC, delivering state-of-the-art compression performance.
In sequences with highly irregular motion content, blind motion estimation at the decoder, by means of MCI for example, fails to deliver adequate prediction quality. One technique to overcome this problem is to perform hash-based motion estimation at the decoder. Aaron et al. proposed a hash code consisting of a coarsely sub-sampled and quantized version of each block in a Wyner-Ziv frame. The encoder performed a block-based decision whether to transmit the hash. For the blocks for which a hash code was sent, hash-based motion estimation was carried out at the decoder, while for the rest of the blocks, for which no hash was sent, the co-located block in the previous reconstructed frame was used as side information. Several hash generation approaches, either in the pixel or in the transform domain, have also been investigated, showing that hash information formed by a quantized selection of low-frequency DCT bands per block outperformed the other methods. In another scheme, a block-based selection, based on the current frame to be coded and its future and past frames in hierarchical order, was performed at the encoder. Blocks for which MCI was foreseen to fail were low-quality H.264/AVC Intra encoded and transmitted to the decoder to assist MCI. The residual frame, given by the difference between all reconstructed intra coded blocks or the central luminance value (for non-hash blocks) and the corresponding blocks in the Wyner-Ziv frame, was formed and Wyner-Ziv encoded. In our previous study, we introduced a hash-based DVC, where the auxiliary information conveyed to the decoder comprised a number of most significant bit-planes of the original Wyner-Ziv frames. Such a bit-plane-based hash facilitates accurate decoder-side motion estimation and advanced probabilistic motion compensation. Transform-domain Wyner-Ziv encoding was applied on the remaining least significant bit-planes, defined as the difference of the original frame and the hash.
Hash-based motion estimation has also been combined with side information refinement to further improve the compression performance at the expense of minimal structural decoding delay.
Driven by the requirements of niche applications like wireless capsule endoscopy, this study proposes a novel hash-based DVC architecture introducing the following novelties. First, in contrast to our previous DVC architectures [30, 31], which employed a bit-plane hash, the presented system generates the hash as a downscaled and subsequently conventionally intra coded version of the original frames. Second, unlike our previous studies [30–32], the hash is exploited in the design of a novel motion-compensated multi-hypothesis prediction scheme, which is able to adapt to the regional variations in temporal correlation in a frame by extracting information from the hash when temporal prediction is untrustworthy. Compared to alternative techniques in the literature, i.e., [12, 13, 26, 27], the proposed methodology delivers superior performance under strenuous conditions, namely, when irregular motion content is encountered as in, for example, endoscopic video material, where gastrointestinal contractions can generate severe morphological distortions in conjunction with extreme camera panning. Third, the way the hash is constructed and utilized to generate side information in the proposed codec also differs from the approaches in [28, 29]. Fourth, in contrast to alternative hash-based DVC systems [12, 31], the proposed architecture codes the entire frames using powerful channel codes instead of coding only the difference between the original frames and the hash. Fifth, unlike existing works in the literature, this article experimentally shows the state-of-the-art compression performance of the proposed DVC not only on conventional test sequences, but also on traditional and wireless capsule endoscopic video content, while low-cost encoding is guaranteed.
Wyner-Ziv video coding can be a key component to realize many-to-many video streaming over wireless networks. Such a setting demands optimal video streams, tailored to specific requirements in terms of quality, frame-rate, resolution, and computational capabilities imposed by a set of recorders and receivers. Consider a network of wireless visual sensors that is deployed to monitor specific scenes, providing security and surveillance. The acquired information is gathered by a central node for decoding and processing. Wireless network surveillance applications are characterized by a wide variety of scene content, ranging from complex motion sequences, e.g., crowd or traffic monitoring, to surveillance of scenes mostly devoid of significant motion, e.g., fire and home monitoring.
In such scenarios, wireless visual sensors are understood to be cheap, battery powered and modest in terms of complexity. In this concept, Wyner-Ziv video coding facilitates communications from the sensors to the central base station, by maintaining low computational requirements at the recording sensor, while simultaneously ensuring fast, highly efficient, and scalable coding. From a complementary perspective, a conventional predictive video coding format with low-complexity decoding characteristics provides a broadcast oriented one-to-many video stream for further dissemination from the base station. Such a video communication scenario centralizes the computational complexity in the fixed network infrastructure, which would be responsible for transcoding the Wyner-Ziv video coding streams to a conventional format.
Focussing on the video coding technology part, it is apparent that wireless endoscopy is subjected to severe constraints in terms of available computational capacity and power consumption. Contemporary capsule video chips employ conventional coding schemes operating in a low-complexity, intra-frame mode, i.e., Motion JPEG, or even no compression at all. Current capsule endoscopic video systems operate at modest frame resolutions, e.g., 256 × 256 pixels, and frame rates, e.g., 2-5 Hz, on a battery life time of approximately 7 h. Future generations of capsule endoscopes are intended to transmit at increased resolution, frame rate, and battery life time and will therefore require efficient video compression at a computational cost as low as possible. In addition, a video coding solution supporting temporal scalability has an attractive edge, enabling increased focus during the relevant stages of the capsule's bodily journey. DVC is a strong candidate to fulfil the technical demands imposed by wireless capsule endoscopy, offering low-cost encoding, scalability, and high compression efficiency.
Every incoming frame is categorized as a key or a Wyner-Ziv frame, denoted by K and W, respectively, so as to construct groups of pictures (GOP) of the form KW ...W. The key frames are coded separately using a conventional intra codec, e.g., H.264/AVC intra or Motion JPEG. The Wyner-Ziv frames on the other hand are encoded in two stages. For every Wyner-Ziv frame, the encoder first generates and codes a hash, which will assist the decoder during the motion estimation process. In the second stage, every Wyner-Ziv frame undergoes a discrete cosine transform (DCT) and is subsequently coded in the transform domain using powerful channel codes, thus generating a Wyner-Ziv bit stream.
Our Wyner-Ziv video encoder creates an efficient hash that consists of a low-quality version of the downsized original Wyner-Ziv frames. In contrast to our previous hash-based DVC architectures [30, 31], where the dimensions of the hash were equal to the dimensions of the original input frames, coding a hash based on the downsampled Wyner-Ziv frames reduces the computational complexity. In particular, every Wyner-Ziv frame undergoes a down-scaling operation by a factor d ∈ ℤ+. To limit the involved operations, straightforward downsampling is applied. Forgoing a low-pass filter to bandlimit the signal prior to downsampling runs the risk of introducing undesirable aliasing artefacts. However, experimental experience has shown that the impact on the overall rate-distortion (RD) performance of the entire system does not outweigh the computational complexity incurred by the use of state-of-the-art downsampling filters, e.g., Lanczos filters.
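The straightforward downsampling described above amounts to keeping every d-th sample in each dimension, with no pre-filtering. A minimal sketch follows, assuming a frame stored as a list of rows; the function name `downsample` is hypothetical.

```python
def downsample(frame, d):
    """Keep every d-th row and every d-th column of a frame.

    No low-pass filter is applied beforehand, so aliasing is possible;
    this mirrors the low-complexity choice described in the text.
    """
    if d < 1:
        raise ValueError("the down-scaling factor d must be a positive integer")
    return [row[::d] for row in frame[::d]]


if __name__ == "__main__":
    # A 4 x 4 toy frame whose sample values encode their position.
    frame = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(downsample(frame, 2))
```

The encoder-side cost is limited to memory reads, which is the reason this choice suits a complexity-constrained device.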
After the dimensions of the original Wyner-Ziv frames have been reduced, the result is coded using a conventional intra video codec, exploiting spatial correlation within the hash frame only. The quality at which the hash is coded has been selected experimentally and constitutes a trade-off between (i) obtaining a constant quality of the decoded frames, which is of particular interest in medical applications, (ii) achieving high RD performance for the proposed system and (iii) maintaining a low hash rate overhead. We notice that constraining the hash overhead comes with the additional benefit of minimizing the hash encoding complexity. At the same time, the hash quality must remain sufficient so that the accuracy of the hash-based motion estimation at the decoder is not compromised and so that pixels in the hash itself can serve as predictors. Afterwards, the resulting hash bit stream is multiplexed with the key frame bit stream and sent to the decoder.
We wish to highlight that, apart from assisting motion estimation at the decoder as in contemporary hash-based systems, the proposed hash code is designed to also act as a candidate predictor for pixels for which the temporal correlation is low. This feature is of particular significance when difficult-to-capture endoscopic video content is coded. To this end, the presented hash generation approach was chosen over existing methods in which the hash consists of a number of most significant Wyner-Ziv frame bit-planes [30, 31], of coarsely sub-sampled and quantized versions of blocks, or of quantized low frequency DCT bands in the Wyner-Ziv frames.
Furthermore, we note that, in contrast to other hash-based DVC solutions [12, 28], the proposed architecture avoids block-based decisions on the transmission of the hash at the encoder side. Although this can increase the hash rate overhead when easy-to-predict motion content is coded, it comes at the benefit of constraining the encoding complexity, in the sense that the encoder is not burdened by expensive block-based comparisons or memory requirements necessary for such mode decision. An additional key advantage of the presented hash code is that it facilitates accurate side information creation using pixel-based multi-hypothesis compensation at the decoder, as explained in Section 4.2.2. In this way, the presented hash code enhances the RD performance of the proposed system especially for irregular motion content, e.g., endoscopic video material.
In addition to the coded hash, a Wyner-Ziv layer is created for every Wyner-Ziv frame, providing efficient compression and scalable coding. In line with earlier DVC architectures, the Wyner-Ziv frames are first transformed with a 4 × 4 integer approximation of the DCT and the obtained coefficients are subsequently assembled in frequency bands. Each DCT band is independently quantized using a collection of predefined quantization matrices (QMs), where the DC and the AC bands are quantized with a uniform and a double-deadzone scalar quantizer, respectively. The quantized symbols are translated into binary codewords and passed to an LDPC Accumulate (LDPCA) encoder, assuming the role of Slepian-Wolf encoder.
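The quantization and bit-plane extraction steps can be sketched as follows. The step sizes, the truncation-based form of the deadzone quantizer, and all function names are assumptions made for illustration; the article's actual quantization matrices are not reproduced here.

```python
def quantize_uniform(value, step):
    """Uniform scalar quantizer for a non-negative DC coefficient."""
    return int(value // step)


def quantize_deadzone(value, step):
    """Scalar quantizer with a double-width zero bin for AC coefficients.

    Truncating towards zero maps every value in (-step, step) to index 0,
    so the zero bin is twice as wide as the other bins.
    """
    sign = -1 if value < 0 else 1
    return sign * int(abs(value) // step)


def bit_planes(symbols, num_bits):
    """Split non-negative quantization indices into bit-planes, MSB first.

    Each returned list holds one bit per coefficient of the band, which is
    the unit handed to the Slepian-Wolf encoder.
    """
    return [[(s >> shift) & 1 for s in symbols]
            for shift in range(num_bits - 1, -1, -1)]


if __name__ == "__main__":
    dc_band = [130, 47, 255]
    indices = [quantize_uniform(v, 16) for v in dc_band]
    print(indices, bit_planes(indices, 4))
    print([quantize_deadzone(v, 4) for v in (-7.9, -3.9, 3.9, 9.0)])
```

Signed AC indices would need an additional sign convention before bit-plane extraction; that detail is left out of this sketch.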
The LDPCA encoder realizes Slepian and Wolf's random binning argument through linear channel code syndrome binning. In detail, let b be a binary M-tuple containing a bit-plane of a coded DCT band β of a Wyner-Ziv frame, where M is the number of coefficients in the band. To compress b, the encoder employs an (M, k) LDPC channel code C constructed by its generator matrix. The corresponding parity-check matrix of C is H, of size (M − k) × M. Thereafter, the encoder forms the syndrome vector as $s = bH^T$. In order to achieve various puncturing rates, the LDPC syndrome-based scheme is concatenated with an accumulator. Namely, the derived syndrome bits s are in turn mod-2 accumulated, producing the accumulated syndrome tuple α. The encoder stores the accumulated syndrome bits in a buffer and transmits them incrementally upon the decoder's request using a feedback channel, as explained in Section 4.2.3. Note that contemporary wireless (implantable) sensors, including capsule endoscopes, support bidirectional communication [33, 37, 38]. That is, a feedback channel from the decoder to the encoder is a viable solution for the pursued applications. The effect of the employed feedback channel on the decoding delay, and in turn on the buffer requirements at the encoder of a wireless capsule endoscope, is studied in Section 5.3.
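The syndrome formation and accumulation steps can be illustrated on a toy code. The sketch below uses a small dense parity-check matrix of a (7, 4) Hamming code purely as a stand-in; actual LDPCA codes use large sparse matrices, and the function names are hypothetical.

```python
def syndrome(bits, parity_check):
    """Compute s = b H^T over GF(2): one syndrome bit per row of H."""
    return [sum(h & b for h, b in zip(row, bits)) % 2 for row in parity_check]


def accumulate(syndrome_bits):
    """Mod-2 running sum of the syndrome bits (the accumulator stage)."""
    acc = 0
    out = []
    for s in syndrome_bits:
        acc ^= s
        out.append(acc)
    return out


if __name__ == "__main__":
    # Toy stand-in: parity-check matrix of a (7, 4) Hamming code (assumption).
    H = [
        [1, 0, 1, 0, 1, 0, 1],
        [0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1, 1, 1],
    ]
    b = [1, 0, 1, 1, 0, 0, 1]  # one bit-plane of M = 7 coefficients
    s = syndrome(b, H)
    print(s, accumulate(s))
```

Here 7 source bits are represented by 3 syndrome bits; the decoder resolves the remaining ambiguity with the side information.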
Note that the focus of this study is to successfully target various lightweight applications by improving the compression efficiency of Wyner-Ziv video coding while maintaining low computational cost at the encoder. Hence, in order to accurately evaluate the impact of the proposed techniques on the RD performance, the proposed system employs LDPCA codes, which are also used in the state-of-the-art codecs of [13, 26]. Observe that for distributed compression under a noiseless transmission scenario, the syndrome-based Slepian-Wolf scheme is optimal since it can achieve the information theoretical bound with the shortest channel codeword length. Nevertheless, in order to address distributed joint source-channel coding (DJSCC) in a noisy transmission scenario, the parity-based Slepian-Wolf scheme needs to be deployed. In the latter, parity-check bits are employed to indicate the Slepian-Wolf bins, thereby achieving equivalent Slepian-Wolf compression performance at the cost of an increased codeword length.
It is important to mention that, in contrast to other hash-driven Wyner-Ziv schemes operating in the transform domain, e.g., [12, 31], the presented Wyner-Ziv encoder encodes the entire original Wyner-Ziv frame, instead of coding the difference between the original frame and the reconstructed hash. The motivation for this decision is twofold. The first reason stems from the nature of the hash. Namely, coding the difference between the Wyner-Ziv frame and the reconstructed hash would require decoding and interpolating the hash at the encoder, an operation which is computationally demanding and would pose an additional strain on the encoder's memory demands. Second, compressing the entire Wyner-Ziv frame with linear channel codes enables the extension of the scheme to the DJSCC case, thereby providing error-resilience for the entire Wyner-Ziv frame if a parity-based Slepian-Wolf approach is followed.
The main components of the presented DVC architecture's decoding process are treated separately, namely dealing with the hash, side information generation and Wyner-Ziv decoding. The decoder first conventionally intra decodes the key frame bit stream and stores the reconstructed frame in the reference frame buffer. In the following phase, the hash is handled, which is detailed next.
The hash bit-stream is decoded with the appropriate conventional intra codec. The reconstructed hash is then upscaled to the original Wyner-Ziv frame's resolution. The ideal upscaling process consists of upsampling followed by ideal interpolation filtering. The ideal interpolation filter is a perfect low-pass filter with gain d and cut-off frequency π/d without transition band. However, such a filter corresponds to an infinite-length impulse response $h_{\mathrm{ideal}}$, to be precise, a sinc function $h_{\mathrm{ideal}}(n) = \mathrm{sinc}(n/d)$ where n ∈ ℤ+, which cannot be implemented in practice.
In practice, the ideal filter is approximated by a windowed sinc function of finite length. Such an interpolation filter is known in the literature as a Lanczos3 filter. The resulting filter taps are normalized to obtain unit DC gain, while the input samples are preserved by the upscaling process since h0(n) = 1.
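A one-dimensional sketch of Lanczos3 upscaling is given below. It uses the commonly cited Lanczos kernel with three lobes, sinc(x)·sinc(x/3) for |x| < 3, which is assumed here rather than taken from the article. The per-output weight normalization and the border handling by index clamping are likewise illustrative choices, and the function names are hypothetical.

```python
import math


def lanczos3(x):
    """Lanczos kernel with three lobes: sinc(x) * sinc(x / 3) for |x| < 3."""
    if x == 0:
        return 1.0
    if abs(x) >= 3:
        return 0.0
    px = math.pi * x
    return 3.0 * math.sin(px) * math.sin(px / 3.0) / (px * px)


def upscale_1d(samples, d):
    """Upscale a 1-D signal by an integer factor d with Lanczos3 interpolation.

    Output sample m sits at input position m / d, so every d-th output
    coincides with an input sample. Weights are renormalized per output
    sample (unit DC gain); out-of-range indices are clamped to the border.
    """
    n = len(samples)
    out = []
    for m in range(n * d):
        x = m / d
        base = math.floor(x)
        acc = 0.0
        weight_sum = 0.0
        for k in range(base - 2, base + 4):  # the six taps within |x - k| < 3
            w = lanczos3(x - k)
            acc += w * samples[min(max(k, 0), n - 1)]
            weight_sum += w
        out.append(acc / weight_sum)
    return out


if __name__ == "__main__":
    print(upscale_1d([10.0, 20.0, 30.0, 40.0], 2))
```

A two-dimensional frame can be upscaled separably by applying the same routine first along rows and then along columns.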
After the hash has been restored to the same frame size as the original Wyner-Ziv frames, it is used to perform decoder-side motion estimation. The quality of the side information is an important factor in the overall compression performance of any Wyner-Ziv codec, since the higher the quality, the less channel code rate is required for Wyner-Ziv decoding. The proposed side information generation algorithm performs bidirectional overlapped block motion estimation (OBME) using the available hash information and a past and a future reconstructed Wyner-Ziv and/or key frame as references.
Temporal prediction is carried out using a hierarchical frame organization, similar to the prediction structures used in [5, 12, 26]. It is important to note that, in contrast to our previous study, in which motion estimation was based on bit-planes, this study follows a different approach regarding the nature of the hash as well as the block matching process. Before motion estimation is initiated, the reference frames are preprocessed. Specifically, to improve the consistency of the resulting motion vectors, the reference frames are first subjected to the same downsampling and interpolation operation as the hash.
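One common way to realize a hierarchical frame organization is a dyadic scheme in which the frame midway between two already decoded frames is decoded next, using those two frames as past and future references. The sketch below illustrates that scheme under the assumption of a midpoint split; the article's exact prediction structure may differ, and the function name is hypothetical.

```python
def hierarchical_order(left, right):
    """List (frame, past_ref, future_ref) triples for the frames strictly
    between two already decoded frames, in decoding order.

    Assumes a dyadic scheme: the midpoint frame is decoded first, then each
    half is processed recursively with the new frame as a reference.
    """
    if right - left < 2:
        return []
    mid = (left + right) // 2
    return ([(mid, left, right)]
            + hierarchical_order(left, mid)
            + hierarchical_order(mid, right))


if __name__ == "__main__":
    # GOP of size 4: frames 0 and 4 are key frames, 1-3 are Wyner-Ziv frames.
    for frame, past, future in hierarchical_order(0, 4):
        print(f"decode W{frame} from references {past} and {future}")
```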
where p visits all the co-located pixel positions in the two blocks being compared. The motion search is executed at integer-pel accuracy and the obtained motion field is extrapolated to the original reference frames $R_k$. By construction, every pixel $Y(p)$, $p = (p_1, p_2)$, in the side information frame Y is located inside a number of overlapping blocks, identified by $u_n = (u_{n,1}, u_{n,2})$. After the execution of the OBME, a temporal predictor block for every such block has been identified in one reference frame. As a result, each pixel $Y(p)$ in the side information frame has a number of associated temporal predictors, one in each of these predictor blocks.
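The block matching step can be sketched as an integer-pel full search that minimizes the sum of absolute differences (SAD) between a block of the upsampled hash and candidate blocks of a reference frame. The block size, search range, and function names below are assumptions for illustration, and the overlapping of blocks is omitted for brevity.

```python
def sad(block_a, block_b):
    """Sum of absolute differences over co-located pixels of two blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_motion_vector(hash_frame, ref_frame, top, left, size, search_range):
    """Integer-pel full search; returns (min_sad, (dy, dx)) or None."""
    target = get_block(hash_frame, top, left, size)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block falls outside the reference frame
            cost = sad(target, get_block(ref_frame, t, l, size))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


if __name__ == "__main__":
    ref = [[r * 10 + c for c in range(6)] for r in range(6)]
    # Toy hash frame: the reference content displaced by one pixel.
    hashed = [[(r + 1) * 10 + (c + 1) for c in range(6)] for r in range(6)]
    print(best_motion_vector(hashed, ref, 2, 2, 2, 2))
```

The minimum SAD returned alongside the motion vector is the quantity that a reliability test on the vector can later be based on.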
However, some temporal predictors may stem from rather unreliable motion vectors. Especially when the input sequence was recorded at low frame rates or when the motion content is highly irregular, as might be the case in endoscopic sequences, temporal prediction is not the preferred method for all blocks at all times. Therefore, to avoid quality degradation of the side information due to untrustworthy predictors, all obtained motion vectors are subjected to a reliability screening. Namely, when the SAD, based on which the motion vector associated with a temporal predictor was determined, is not smaller than a certain threshold T, the motion vector and the associated temporal predictor are labeled as unreliable. In this case, the temporal predictor for the side information pixel Y(p) is replaced by the co-located pixel of Y(p) in the upsampled hash frame. In other words, when motion estimation is considered not to be trusted, the hash itself is assumed to convey more dependable information. This feature of OBME is referred to as hash-predictor-selection (HPS).
where the count term denotes the number of predictors for pixel $Y(p)$, and each predictor is the temporal predictor when it is reliable or the co-located pixel in the upsampled hash when it is unreliable. The derived multi-hypothesis motion field is employed in an analogous manner to estimate the chroma components of the side information frame from the chroma components of the reference frames $R_k$ or the upsampled hash.
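The reliability screening and the per-pixel combination of predictors can be sketched as follows. The sketch assumes that the predictors of a pixel are combined by a plain average, which is an illustrative choice and may differ from the article's exact weighting; the function name and argument layout are hypothetical.

```python
def multi_hypothesis_pixel(temporal_candidates, hash_pixel, threshold):
    """Estimate one side-information pixel from several hypotheses.

    temporal_candidates: list of (predictor_value, block_sad) pairs, one per
        overlapping block that contains the pixel.
    hash_pixel: co-located pixel in the upsampled hash frame.
    threshold: SAD threshold T; a predictor whose SAD is not smaller than T
        is labeled unreliable and replaced by the hash pixel (HPS).
    """
    if not temporal_candidates:
        return float(hash_pixel)
    predictors = [value if block_sad < threshold else hash_pixel
                  for value, block_sad in temporal_candidates]
    # Assumption: plain averaging over all predictors of the pixel.
    return sum(predictors) / len(predictors)


if __name__ == "__main__":
    candidates = [(100, 4), (110, 50), (98, 8)]
    print(multi_hypothesis_pixel(candidates, hash_pixel=90, threshold=10))
```

When every motion vector covering a pixel is unreliable, the estimate reduces to the hash pixel itself, which is how the scheme adapts to regions of low temporal correlation.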
Thereafter, the estimated correlation channel statistics per coded DCT band bit-plane are interpreted into soft estimates, i.e., log-likelihood ratios (LLRs). These LLRs, which provide a priori information about the probability of each bit to be 0 or 1, are passed to the variable nodes of the LDPCA decoder. Then, the message passing algorithm  is used for iterative LDPC decoding, in which the received syndrome bits correspond to the check nodes on the bipartite graph.
where equality in (6) stems from p(b_l | y, b_1, ..., b_{l-1}) = p(b_1, ..., b_{l-1}, b_l | y) / p(b_1, ..., b_{l-1} | y). Hence, in (6) the numerator and the denominator are calculated by integrating the conditional probability density function of the correlation channel, i.e., f_{X|Y}(x|y), over the quantization bin indexed by b_1, ..., b_l.
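A minimal numerical sketch of this bit-plane LLR computation is given below. Several elements are assumptions made for illustration and are not taken from the text: the correlation channel is modeled as a Laplacian centered on the side information y with parameter alpha, the quantizer is uniform over [x_min, x_max) with natural binary bin indices, and bits are processed from the most significant one. Because the term p(b_1, ..., b_{l-1} | y) is common to both hypotheses, it cancels in the ratio.

```python
import math


def laplace_cdf(x, mean, alpha):
    """CDF of a Laplacian with pdf (alpha / 2) * exp(-alpha * |x - mean|)."""
    if x < mean:
        return 0.5 * math.exp(alpha * (x - mean))
    return 1.0 - 0.5 * math.exp(-alpha * (x - mean))


def bin_probability(lo, hi, y, alpha):
    """Integral of f_{X|Y}(x|y) over the interval [lo, hi)."""
    return laplace_cdf(hi, y, alpha) - laplace_cdf(lo, y, alpha)


def bitplane_llr(y, decoded_bits, x_min, x_max, alpha, eps=1e-12):
    """LLR of bit b_l given y and the already decoded bits b_1..b_{l-1}.

    Each decoded bit halves the current interval (0 -> lower half,
    1 -> upper half); the LLR compares the two halves that remain.
    A positive value favors b_l = 0.
    """
    lo, hi = x_min, x_max
    for bit in decoded_bits:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bit else (lo, mid)
    mid = 0.5 * (lo + hi)
    p0 = bin_probability(lo, mid, y, alpha)
    p1 = bin_probability(mid, hi, y, alpha)
    return math.log((p0 + eps) / (p1 + eps))
```

The small constant eps only guards against taking the logarithm of zero when y lies far from the current interval.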
Note that the LDPCA decoder achieves various rates by altering the decoding graph upon reception of an additional increment of the accumulated syndrome. Initially, the decoder receives a short syndrome based on an aggressive code and attempts to decode. If decoding fails, the encoder receives a request to augment the previously received syndrome with extra bits. The process loops until the syndrome is sufficient for successful decoding.
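The request loop can be sketched as follows. The actual LDPCA decoding step is abstracted behind a hypothetical callable `try_decode`, which is an assumption of this sketch and not part of any real library; only the control flow of the rate adaptation is shown.

```python
def decode_with_feedback(syndrome_chunks, try_decode):
    """Accumulate syndrome increments until decoding succeeds.

    syndrome_chunks: increments in the order the encoder would send them;
                     the first one is the initial short syndrome.
    try_decode:      hypothetical callable that takes the accumulated
                     syndrome and returns the decoded bits, or None.
    Returns (decoded_bits, number_of_feedback_requests).
    """
    received = []
    for requests, chunk in enumerate(syndrome_chunks):
        received.extend(chunk)
        decoded = try_decode(received)
        if decoded is not None:
            return decoded, requests
    raise RuntimeError("decoding failed even with the full syndrome")
```

The returned request count is the quantity that a feedback-constrained variant would cap per Wyner-Ziv frame.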
where q_L and q_H denote the lower and upper bounds of the quantization bin q. Finally, the inverse DCT provides the reconstructed frame Ŵ in the spatial domain. The reconstructed frame is now ready for display and is stored in the reference frame buffer, serving as a reference for future temporal prediction.
The experimental results have been divided into three distinct parts. Namely, first the proposed system is compared against a set of relevant alternative video coding solutions using traditional test sequences. The second part comprises the experimental validation of our system in the application of wireless capsule endoscopy, comparing its performance against coding solutions currently used for the compression of endoscopic video. The third part elaborates on the encoding complexity of the proposed architecture.
We begin by defining the configuration elements of the proposed system, which are common to both types of input video. The motion estimation algorithm was configured with an overlap step size ε = 4, the size of the overlapping blocks was set to B = 16, and the threshold was set to T = 400. The motion search was executed in an exhaustive manner at integer-pel accuracy within a search range of ±16 pixels. The downscaling factor to create the hash was fixed at d = 2.
Table: Employed quantization parameters for the key, the hash and the Wyner-Ziv frames as well as the resulting RSD for the entire sequence. Columns: RD point 1 (QM1), RD point 2 (QM4), RD point 3 (QM7), RD point 4 (QM8). Row: Key frame QP.
To further evaluate the performance of our proposed scheme, the coding results of  are included in Figure 4. The hash-based Wyner-Ziv video codec of  combines MCI with hash-driven motion estimation using low quality H.264/AVC Intra coded Wyner-Ziv blocks to generate side information. Even though the codec of  advances over DISCOVER, our proposed hash-based solution generally exhibits higher performance, bringing BD rate savings of 17.68% and 12.18% for Foreman and Soccer, respectively, at GOP8.
Lastly, the proposed DVC is compared with H.264/AVC Intra, which represents the low-complexity configuration of the state-of-the-art traditional coding paradigm. One can observe from Figure 5 that in low-motion sequences the proposed codec is superior to H.264/AVC Intra, bringing BD rate savings of up to 26.7% in Silent, GOP8. However, under difficult motion conditions, as in Ice or Soccer, H.264/AVC Intra is very efficient compared to DVC systems, which is in agreement with the results shown in Figures 4 and 5. We emphasize that the encoding complexity of H.264/AVC Intra is much higher than that of any of the presented DVC solutions, as discussed in Section 5.3.
A major contribution of this article is the assessment of Wyner-Ziv coding for endoscopic video data, characterized by its unique content. In the proposed codec, the quantization parameters of the Wyner-Ziv frames, the key frames, and the hash are meticulously selected so as to retain high and quasi-constant decoded frames' quality, as demanded by medical applications. Furthermore, in order to deliver high-quality decoding under the strenuous conditions of highly irregular motion content and low frame acquisition rates, the proposed codec employs a GOP size of 2.
Initially, in order to prove the potential of its application in contemporary wireless capsule endoscopic technology, the proposed codec has been appraised using four capsule endoscopic test video sequences visualizing diverse areas of the gastrointestinal tract. These sequences were extracted from extensive capsule endoscopic video material of two capsule examinations from two random volunteers [f] performed at the Gastroenterology Clinic of the Universitair Ziekenhuis Brussel, Belgium. In the aforementioned clinical examinations, the capsule acquisition rate was two frames per second with a frame resolution of 256 × 256 pixels. The obtained test video sequences [g] are termed "Capsule Test Video 1" to "Capsule Test Video 4" in the remainder of the article.
Figure 7 also evaluates the impact of the flexible scheme that enables the proposed OBME method to identify erroneous motion vectors and to replace the temporal predictor pixel with the decoded and interpolated hash. The results show that the proposed system with the HPS module remarkably advances over its equivalent that solely retains predictors from the reference frames. Specifically, in "Capsule Test Video 1" to "Capsule Test Video 4" adding the HPS functionality results in BD  rate improvements of 21.1, 16.02, 12.93, and 12.06%, respectively.
Future generations of capsule endoscopic technology aim at diminishing the quality difference with respect to conventional endoscopy by increasing the frame rate and resolution. Therefore, to confirm its capability under these conditions, the proposed Wyner-Ziv video codec is evaluated using conventional endoscopic video sequences monitoring diverse parts of the digestive tract of several patients. The endoscopic test video sequences considered in this experimental setting have a frame rate of 30 Hz and a frame resolution of 480 × 320 pixels. These endoscopic test video sequences are further referred to as "Endoscopic Test Video 1" to "Endoscopic Test Video 6". In this experiment, the proposed codec employs H.264/AVC Intra (Main profile) to code the key and the hash frames. Notice that the H.264/AVC Intra codec constitutes a recognized reference for medical video compression, e.g. .
The experimental results in Figure 10 show that, compared to H.264/AVC Intra, the proposed codec delivers BD rate savings of 4.1% in "Endoscopic Test Video 2". In "Endoscopic Test Video 1" and "Endoscopic Test Video 3" the proposed codec falls behind H.264/AVC Intra, incurring BD rate losses of 3.84% and 0.20%, respectively. Only in "Endoscopic Test Video 4", which comprises highly irregular motion, is the Bjøntegaard rate overhead notable, amounting to 15.68%. Notice that the benefit of the HPS functionality of the proposed codec is smaller for conventional endoscopic video than for the capsule endoscopic sequences. This is because the former sequences were recorded at a much higher frame rate and contain more temporal correlation. Nevertheless, in "Endoscopic Test Video 4" the HPS module brings BD rate savings of 4.63%.
Low-cost encoding is a key aspect of distributed video compression. During the evaluation of the DISCOVER  codec, it was shown that the Wyner-Ziv frames' encoding complexity is very low compared to the complexity associated with the intra encoding of the key frames. Therefore, the lower the number of key frames, i.e., the longer the GOP, the higher the gain in complexity reduction offered by DVC over H.264/AVC Intra frame coding. Execution time measurements under controlled conditions, as established by the DISCOVER group , have shown that our codec (using H.264/AVC Intra to code the hash and the key frames) brings a reduction in average encoding time of approximately 30, 50, and 60% for a GOP size of 2, 4, and 8, respectively, compared to H.264/AVC Intra.
In contrast to hash-less Wyner-Ziv codecs, e.g. , our proposed codec has a higher encoding complexity caused by the additional hash formation and coding. However, the hash-related complexity overhead is kept low, since the hash dimensions were reduced to one fourth of the original frame resolution prior to coarse H.264/AVC Intra frame coding. When compared to Motion JPEG, the proposed codec (although currently not optimized for speed) exhibits similar encoding time but offers superior compression performance. We remark that compared to DISCOVER or Motion JPEG, the proposed codec offers a significant reduction of the encoding rate for a given distortion level. Such a notable rate reduction induces an important decrease in power consumption by the transmission part of wireless video recording devices, e.g., wireless capsule endoscopes.
The proposed system links the encoder to the decoder via a feedback channel. Such a reverse channel implies that the encoder is forced to store Wyner-Ziv data in a buffer pending the decoder's directives. Based on our prior work , we analyze the buffer size requirements imposed on the presented system's encoder due to the decoding delay for the capsule endoscopy application scenario. Recall that the GOP size in this scenario is restricted to 2 frames (see Section 5.2). The prime factors determining the decoding delay are the frame acquisition period t_F, the time t_SI to generate a side information frame, the transmission time (time-of-flight) t_TOF between encoder and decoder, and the LDPC soft-input soft-output decoding time, denoted by t_SISO.
Continuing our analysis, the reported capsule and conventional endoscopic sequences were recorded using cameras with acquisition rates of 2 and 30 Hz, respectively, corresponding to acquisition periods t_F of 500 and 33.33 ms.
An estimate of the transmission time t_TOF through the body can be made by calculating the velocity of a uniform plane wave in a lossy medium , characterized by its dielectric properties, i.e., the conductivity and permittivity. These values can be calculated based on [46, 47] for a wide range of body tissues and frequencies. It can be verified that at a frequency of 433 MHz the velocity is always greater than 10% of the speed of light for all body tissue cases included in , leading to a time-of-flight t_TOF on the order of 15 ns through 0.5 m of tissue.
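As a rough illustration of this estimate, the sketch below evaluates the textbook phase velocity of a uniform plane wave in a lossy, non-magnetic medium, v = ω/β with β = ω·sqrt((με/2)·[sqrt(1 + (σ/(ωε))²) + 1]). The tissue values used in the usage note (relative permittivity 57, conductivity 0.8 S/m) are illustrative placeholders chosen for this sketch, not values quoted from the text or its references, and the function names are hypothetical.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m
MU0 = 4e-7 * math.pi     # vacuum permeability in H/m
C0 = 299792458.0         # speed of light in vacuum in m/s


def phase_velocity(freq_hz, eps_r, sigma):
    """Phase velocity (m/s) of a plane wave in a lossy, non-magnetic medium."""
    omega = 2.0 * math.pi * freq_hz
    eps = eps_r * EPS0
    loss_tangent = sigma / (omega * eps)
    beta = omega * math.sqrt(
        MU0 * eps / 2.0 * (math.sqrt(1.0 + loss_tangent ** 2) + 1.0))
    return omega / beta


def time_of_flight(distance_m, freq_hz, eps_r, sigma):
    """Propagation time (s) over distance_m at the phase velocity."""
    return distance_m / phase_velocity(freq_hz, eps_r, sigma)
```

With the placeholder values above at 433 MHz, the velocity comes out at roughly 13% of the speed of light and the time-of-flight over 0.5 m at roughly 13 ns, which agrees with the order of magnitude stated in the text.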
It is clear that the time t_SI to generate a side information frame is dominated by OBME. Fortunately, several VLSI designs for hardware implementation of block motion estimation have been proposed. Considering the state-of-the-art architecture of , full integer-pel motion search can be executed in 4ρ² + B − 1 cycles per macroblock (MB), where ρ and B are the search range and MB size, respectively. However, our presented scheme employs bidirectional OBME. Specifically, the total number of overlapping blocks per frame is (H·V)/ε², where H and V are the horizontal and vertical frame dimensions and ε is the overlap step size. Hence, based on the VLSI architecture in , the total number of cycles per frame is given by 2·[(4·ρ² + B − 1)·H·V]/ε², where the factor 2 stems from the bidirectionality. For a simplified decoding device with a single-core CPU running at 800 MHz at 1 DMIPS/MHz/core [h], instantiating the OBME parameters with ρ = 16, ε = 4, and B = 16 yields a delay of 10.63 and 24.9 ms per frame for the capsule endoscopic (H = V = 256) and the endoscopic (H = 480, V = 320) sequences, respectively.
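The arithmetic behind these two delay figures can be reproduced with the short sketch below. It assumes, as a simplification, that one cycle of the cited architecture corresponds to one clock cycle of the 800 MHz device; the function names are hypothetical.

```python
def obme_cycles_per_frame(width, height, search_range, block_size, overlap_step):
    """Cycle count 2 * (4 * rho^2 + B - 1) * H * V / eps^2 for bidirectional OBME."""
    cycles_per_block = 4 * search_range ** 2 + block_size - 1
    blocks_per_frame = (width * height) / overlap_step ** 2
    return 2 * cycles_per_block * blocks_per_frame


def obme_delay_ms(width, height, clock_hz=800e6,
                  search_range=16, block_size=16, overlap_step=4):
    """Side information delay in milliseconds at the given clock rate."""
    cycles = obme_cycles_per_frame(width, height, search_range,
                                   block_size, overlap_step)
    return 1e3 * cycles / clock_hz
```

Calling `obme_delay_ms(256, 256)` and `obme_delay_ms(480, 320)` gives approximately 10.64 ms and 24.94 ms, matching the figures quoted above up to rounding.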
Table: Average feedback channel requests per Wyner-Ziv frame for the capsule endoscopic video sequences. Columns: RD point 1, RD point 2, RD point 3, RD point 4. Rows: Capsule Test Video 1, Capsule Test Video 2, Capsule Test Video 3, Capsule Test Video 4.
Based on the above approximations, Equation (8) yields an estimated buffer size of L = 2 and L = 3 frames for the capsule (t_F = 500 ms) and the conventional endoscopic (t_F = 33.33 ms) sequences, respectively, thus confirming the applicability of the proposed scheme. However, the encoder buffer size can be reduced further. An elegant solution is to limit the feedback to a fixed number of requests per Wyner-Ziv frame, as proposed in our previous study , where we show that the loss of compression efficiency compared to unconstrained feedback is less than 5% when at most F = 5 requests per Wyner-Ziv frame are allowed. In addition, the structural latency induced by bidirectional temporal prediction could be reduced by employing unidirectional prediction.
Motivated by the strict prerequisites of wireless lightweight multimedia applications, such as wireless capsule endoscopy, this article has introduced a novel video codec based on the principles of Wyner-Ziv coding. The proposed codec maintains low encoding complexity and facilitates quality and temporal scalability. Intrinsically, the proposed codec achieves high compression performance by embracing a novel hash-driven motion estimation technique, which generates accurate side information at the decoder. The presented technique performs motion-compensated multi-hypothesis prediction, enabling adaptation to the regional variations in temporal correlation in a frame. Concrete experimentation using various conventional and endoscopic test video sequences has confirmed the superior compression performance of the proposed codec against several state-of-the-art traditional and Wyner-Ziv video codecs. In effect, in conventional and endoscopic test video material significant Bjøntegaard rate savings of up to 32.13 and 43.37% over the state-of-the-art have been obtained.
[a] This paper has been presented in part at the IEEE International Conference on Image Processing, Brussels, Belgium, September 2011.
[b] Motion JPEG is based on the JPEG coding standard and includes a file format that can handle multiple JPEG images. Unlike the Motion JPEG 2000 standard, no standard specification has been defined for Motion JPEG, and hence only proprietary solutions are available (e.g., support in Microsoft AVI files, the Apple QuickTime format, or the RFC 2435 specification that describes how Motion JPEG can be supported by an RTP stream).
[c] To simplify the presentation, the LDPC code is assumed systematic.
[d] The experimental results of DISCOVER have been obtained using the executable of the DISCOVER codec, which is available on the project's website.
[e] Given an RD point and a number of iterations, the process starts from a specific hash QP value (QP_hash = QP_key + 1), and calculates the total and the hash rate, and the resulting RSD of the decoded frames (both key and WZ frames). If the RSD is lower than a strict threshold, the QP and the rate values are stored; otherwise they are discarded. Next, the hash QP is increased and the algorithm continues until it reaches the given number of iterations. Out of the retained QPs, the one that minimizes the total rate is chosen as the best for the specific rate point. In case of equal total rates, the highest QP value is selected.
[f] These volunteers presented no evidence of gastrointestinal pathologies.
[g] These sequences were transformed to the YUV 4:2:0 format supported by the proposed codec.
[h] Nowadays more powerful processors exist that can be deployed in devices of the size of the decoder of a capsule endoscope. For instance, the Apple A5 processor of the iPhone 4S has an ARM Cortex-A9 MPCore 32-bit multicore processor at 800 MHz, 2.5 DMIPS/MHz/core.
[i] All these figures correspond to Soft-Input Soft-Output (SISO) decoders.
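The hash QP selection procedure outlined in the note on the quantization parameters can be sketched as follows. The encode-and-measure step is abstracted behind a hypothetical callable `evaluate`, and the QP is assumed to increase by one per iteration; both are assumptions of this sketch rather than details given in the text.

```python
def select_hash_qp(qp_key, iterations, rsd_threshold, evaluate):
    """Pick the hash QP for one RD point.

    evaluate: hypothetical callable mapping a hash QP to a pair
              (total_rate, rsd) measured on the decoded frames.
    Candidates whose RSD is not below rsd_threshold are discarded.
    Among the rest, the lowest total rate wins; ties go to the highest QP.
    Returns None if no candidate is retained.
    """
    retained = []
    for step in range(iterations):
        qp_hash = qp_key + 1 + step  # start at QP_key + 1, assumed step of 1
        total_rate, rsd = evaluate(qp_hash)
        if rsd < rsd_threshold:
            retained.append((total_rate, qp_hash))
    if not retained:
        return None
    best_rate = min(rate for rate, _ in retained)
    return max(qp for rate, qp in retained if rate == best_rate)
```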
AVC: advanced video coding
CPU: central processing unit
CRC: cyclic redundancy check
DCT: discrete cosine transform
DISCOVER: distributed coding for video services
DJSCC: distributed joint source-channel coding
DVC: distributed video coding
DSC: distributed source coding
GOP: group of pictures
JPEG: joint photographic experts group
LDPCA: low-density parity-check accumulate
LLR: log-likelihood ratio
OBME: overlapped block motion estimation
PRISM: power-efficient robust high-compression syndrome-based multimedia coding
PSNR: peak signal-to-noise ratio
RSD: relative standard deviation
QCIF: quarter common intermediate format
SAD: sum of the absolute differences
This study was supported by the FWO Flanders projects G.0391.07, G.0146.10 and the postdoctoral fellowship of Peter Schelkens. The authors would like to thank Prof. Dr. Daniel Urbain, head of the Gastroenterology clinic of the Universitair Ziekenhuis Brussel for providing numerous endoscopic video sequences.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.