We investigate deep learning for video compressive sensing within the scope of snapshot compressive imaging (SCI). In video SCI, multiple high-speed frames are modulated by different coding patterns and then a low-speed detector captures the integration of these modulated frames. In this manner, each captured measurement frame incorporates the information of all the coded frames, and reconstruction algorithms are then employed to recover the high-speed video. In this paper, we build a video SCI system using a digital micromirror device and develop both an end-to-end convolutional neural network (E2E-CNN) and a Plug-and-Play (PnP) framework with deep denoising priors to solve the inverse problem. We compare them with the iterative baseline algorithm GAP-TV and the state-of-the-art DeSCI on real data. For a fixed setup, a well-trained E2E-CNN can provide video-rate, high-quality reconstruction. The PnP deep denoising method can generate decent results without task-specific pre-training and is faster than conventional iterative algorithms. Considering speed, accuracy, and flexibility, the PnP deep denoising method may serve as a baseline in video SCI reconstruction. To conduct a quantitative analysis of these reconstruction algorithms, we further perform a comparison on synthetic data. We hope that this study contributes to the applications of SCI cameras in our daily life.
I. INTRODUCTION
Compressive sensing (CS)1,2 has inspired various compressive imaging systems that capture high-dimensional data, such as videos3–11 and hyperspectral images,12–17 in a snapshot fashion. For example, in video CS as shown in Fig. 1, the high-speed frames of a video are modulated at a higher speed than the capture rate of the camera. With knowledge of the modulation, multiple frames can be reconstructed from each single measurement. This type of technique is also termed snapshot compressive imaging (SCI).18 This paper focuses on video CS, also known as the video SCI problem.
SCI systems were originally proposed to capture high-dimensional data using low-dimensional detectors and then employ iterative algorithms to solve the resulting ill-posed inverse problem, which suffers from long running times. On the one hand, this hardware encoder plus software decoder regime is capable of providing us with more information about the scene. On the other hand, the long running time of the algorithms18 precludes wide application of SCI, since in some cases real-time visualization is desired. Thanks to recent advances in deep learning, fast end-to-end reconstruction has been demonstrated in computational imaging (CI).19,20 The deep learning approach first learns an approximate inverse function of the system forward model during training and then provides near-instantaneous reconstruction by directly estimating outputs from the input measurements. While this approach enjoys a speed advantage, it usually requires a deep model, a long training time, and a large amount of training data. Furthermore, it is less flexible than iteration-based algorithms because the model is trained for, and then works on, a system with fixed hyper-parameters such as image size, compression ratio, and coding patterns. Most recently, pre-trained denoising networks have been integrated into iteration-based algorithms, dubbed the Plug-and-Play (PnP) framework,21 to improve reconstruction speed and image quality. Although the speed of the PnP framework is not comparable to that of the end-to-end deep learning framework, its overall performance on quality, flexibility, ease-of-use, cost, and speed makes it a good baseline for SCI reconstruction.
This paper aims to validate and compare these two deep learning based (or related) regimes for video SCI reconstruction. For a fair comparison, we build a video SCI system and capture measurement data at different compression rates. Building on these data, we develop both end-to-end neural networks and PnP algorithms (with deep denoising priors) and compare their performance with the traditional optimization baseline algorithm, namely, GAP-TV.18,22 We hope this comprehensive study will provide guidance to researchers and engineers applying SCI cameras in our daily life.
A. Snapshot video compressive sensing
The sensing process of video SCI is shown in Fig. 1. A dynamic scene, modeled as a time series of two-dimensional (2D) images, passes through a dynamic aperture that applies timestamp-specific spatial coding. Specifically, each timestamp-specific spatial code is a random pattern (binary patterns are used in this paper: {0, 1}, with 0 denoting blocking the light and 1 denoting passing the light), and the spatial codes of any two timestamps are different and independent of each other. The coded frames after the aperture are then integrated over time on a camera, forming a compressed coded measurement. Given the coding pattern for each frame, the time series of the scene can be reconstructed from the compressed measurement through iterative algorithms or pre-trained convolutional neural networks (CNNs).
B. Deep learning for reconstruction
In general, the SCI problem we focus on stems from the field of computational imaging (CI).23 Different from traditional imaging, where the user acquires the desired signal directly, in CI the captured measurement may not be visually interpretable but encodes the signal through a carefully designed mechanism. As a result, reconstruction algorithms are required to recover the signal from the measurement. For SCI problems, the well-established algorithms include TwIST,24 GAP-TV,22 and GMM25,26 based algorithms, where different priors are used. Most recently, the DeSCI algorithm18 has achieved state-of-the-art results for video SCI. DeSCI applies the weighted nuclear norm minimization (WNNM)27 of nonlocal similar patches in the video frames within the alternating direction method of multipliers (ADMM)28 regime.
Inspired by the recent advances of deep learning on image restoration,29,30 researchers have started using deep learning in CI.19,20,31–38 A deep fully connected neural network was used for video CS in Ref. 32, and most recently, a deep tensor ADMM-net was proposed in Ref. 36 for the video SCI problem. A joint optimization and reconstruction network was trained in Ref. 39 for video CS. The coding patterns used in Ref. 32 are a repeated pattern of a small block, which is not practical in real imaging systems, and only simulation results were shown therein. The deep tensor ADMM-net36 employs the deep-unfolding technique,40,41 and limited results were shown based on the data in Refs. 5 and 6. In this paper, we build an end-to-end sensing and reconstruction regime using deep learning.
It is anticipated that an end-to-end convolutional neural network (E2E-CNN) can provide excellent results for CI given sufficient training data. Different from conventional iteration-based algorithms,18 which must carry out the full iterative computation for each measurement, an end-to-end CNN performs optimization only during the training phase and recovers images efficiently during the inference phase. Intuitively, the end-to-end network enables millisecond-level reconstruction for SCI problems. However, one drawback of the E2E-CNN, in addition to the training data and time it requires, is its lack of flexibility. Specifically, a new network has to be trained from scratch (or at least fine-tuned via transfer learning) to perform inference when the sensing matrices (or coding patterns) change, which severely limits the application of adaptive sensing.42
One promising solution to this trilemma, i.e., speed, accuracy, and flexibility, is to incorporate deep learning into optimization-based algorithms. In particular, deep denoising priors43 can be plugged into the ADMM28 (or another optimization) framework to construct the Plug-and-Play21 regime for SCI reconstruction. In this case, the denoising network can be pre-trained on regular images, and repeated training for each individual task is unnecessary.
C. Contributions and organizations of this paper
Bearing the above concerns in mind, this paper makes the following contributions:
Aiming at a fair comparison between different inversion algorithms on real data, we build a video SCI system using a digital micromirror device (DMD) as the high-speed spatial modulator,3,7,8 which provides refresh rates of up to 22 000 Hz; compression rates of 10, 20, 30, 40, and 50 are implemented, providing (reconstructed) video frame rates of 500, 1000, 1500, 2000, and 2500 frames per second (fps), respectively, for a fixed capturing (camera) frame rate of 50 fps.
We train an end-to-end CNN by applying residual learning44 to the encoder–decoder structure45 for fast SCI reconstruction.
We develop a PnP-ADMM algorithm by integrating the FFDNet43 and total variation (TV) denoising algorithms into the ADMM framework, dubbed PnP-TV-FFDNet.
We compare the results of E2E-CNN and PnP-TV-FFDNet with the state-of-the-art iterative optimization method DeSCI.18 We find that for a fixed setup, a well-trained E2E-CNN provides high-accuracy results within a short time. Without training data, PnP-TV-FFDNet provides decent results within a limited running time. DeSCI can still provide visually excellent results but suffers from a long running time. Therefore, PnP-TV-FFDNet offers a good trade-off among speed, accuracy, and flexibility and can be used as a new baseline in video SCI reconstruction.
The rest of this paper is organized as follows: Sec. II describes the optical setup of our video SCI system. Section III derives the mathematical model of video SCI and reviews the recent theoretical results. Section IV develops the E2E-CNN for SCI reconstruction, and the PnP algorithms are derived in Sec. V along with the convergence analysis. Extensive experimental results are presented in Sec. VI, and quantitative simulation results are shown in Sec. VII. Section VIII concludes the entire paper.
II. HARDWARE SETUP OF VIDEO SCI
The optical setup of our system is depicted in Fig. 2. A commercial photographic lens (L1) images the object onto an intermediate plane, where a DMD (Vialux, DLP7000, 768 × 1024 micromirrors, each a 13.7 µm square) applies spatial modulation to the high-speed image sequences. A 4f system consisting of a tube lens L2 (f = 150 mm) and an objective lens (4×, f = 45 mm, NA = 0.1) relays the modulated image to a camera (Basler, acA1300-200um) with 1024 × 1280 pixels, each a 4.8 µm square. The 4f system has a magnification of 0.3, yielding a 0.85:1 pixel mapping between the DMD and the camera. Random binary patterns of 2 × 2 DMD elements are displayed on the DMD, which, in our experience, perform better than 1 × 1 patterns. The camera operates at a fixed frame rate of 50 fps, whereas the DMD operates at various refresh rates between 500 and 2500 Hz, resulting in compression ratios (Cr) ranging from 10 to 50 (Cr = 10 denotes 10× compression, and so on). The camera and the DMD are triggered by a data acquisition board (NI, USB6341). The modulation patterns used in the reconstruction are pre-calibrated by illuminating the DMD with uniform light and capturing images of the DMD patterns with the camera.
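The calibration step amounts to a per-pixel flat-field correction followed by binarization. A minimal sketch in Python, assuming hypothetical arrays pattern_imgs (one camera frame per displayed pattern) and white_img (a frame with all mirrors on); the binarization threshold is an illustrative choice, not the paper's exact procedure:

```python
import numpy as np

def calibrate_masks(pattern_imgs, white_img, threshold=0.5):
    """Estimate binary modulation masks from calibration captures.

    pattern_imgs: (B, H, W) frames, one per displayed DMD pattern,
                  captured under uniform illumination.
    white_img:    (H, W) frame with all mirrors ON.
    """
    eps = 1e-6
    flat = pattern_imgs / (white_img[None, ...] + eps)  # flat-field correction
    return (flat > threshold).astype(np.float32)        # binarize to {0, 1}
```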
In practice, the two controlled states of a DMD micromirror are oriented at +12° and −12° relative to the array panel. In order to steer the light reflected from the DMD to the camera, the beam splitter (BS) is horizontally tilted by 12° (not illustrated in Fig. 2). Moreover, since the hinges of each micromirror lie along its diagonal, the mirrors rotate about an axis oriented at 45° to the array dimensions. To facilitate alignment, we make the hinge axis vertical by rotating the DMD by 45° (within the panel plane). Accordingly, the camera is also rotated by 45° to align its sensor array with the DMD array.
III. MATHEMATICAL MODEL OF VIDEO CS
Let (x, y, t) denote the transverse spatial and temporal dimensions of a dynamic scene f(x, y, t). Video SCI compresses the scene onto a two-dimensional (2D) monochrome camera with a limited frame rate, e.g., 50 fps used in our experiments. Let (x′, y′, t′) denote the measurement-space coordinates. The measurement formed on the detector plane can be represented as
$$y(x', y', t') = \iiint f(x, y, t)\, T(x, y, t)\, p\!\left(\frac{x' - x}{\Delta}, \frac{y' - y}{\Delta}\right) p_t\!\left(\frac{t' - t}{\Delta_t}\right) \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}t, \tag{1}$$
where T(x, y, t) denotes the temporal modulation introduced by the DMD; Δ and Δt denote the (square) pixel pitch and the integration time of the camera, respectively; and Nx, Ny, and Nt denote the numbers of samples in the two spatial dimensions and the temporal dimension, so that x′ ∈ {1, …, Nx}, y′ ∈ {1, …, Ny}, and t′ ∈ {1, …, Nt}. The spatial and temporal pixel sampling functions are given by p and pt, respectively.
The pixel sampling (first on the DMD and then on the camera) process discretizes the continuous projected signal. Such discretization can be represented by applying rectangular functions in all dimensions in (1). Consider that B high-speed frames $\{X_k\}_{k=1}^{B} \in \mathbb{R}^{n_x \times n_y}$ are modulated by the coding patterns $\{C_k\}_{k=1}^{B} \in \mathbb{R}^{n_x \times n_y}$, correspondingly (Fig. 1). The measurement $Y \in \mathbb{R}^{n_x \times n_y}$ is given by
$$Y = \sum_{k=1}^{B} X_k \odot C_k + G, \tag{2}$$
where ⊙ denotes the Hadamard (element-wise) product and G represents the noise. The B pixels (one per frame) at position (i, j), i = 1, …, nx; j = 1, …, ny, are collapsed to form one pixel in the snapshot measurement as
$$y_{i,j} = \sum_{k=1}^{B} c_{i,j,k}\, x_{i,j,k} + g_{i,j}. \tag{3}$$
Define
$$x = \left[x_1^{\top}, \dots, x_B^{\top}\right]^{\top}, \tag{4}$$
where xk = vec(Xk), and let Dk = diag(vec(Ck)) for k = 1, …, B, where vec(·) vectorizes the matrix inside by stacking its columns and diag(·) places each element of the input vector on the diagonal of a diagonal matrix. We thus have the vector formulation of the sensing process,
$$y = \Phi x + g, \tag{5}$$
where $\Phi \in \mathbb{R}^{n \times nB}$ is the sensing matrix with n = nxny, $x \in \mathbb{R}^{nB}$ is the desired signal, and $g \in \mathbb{R}^{n}$ again denotes the noise.
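The discrete forward model in Eqs. (2), (3), and (5) is simple to simulate. Below is a minimal NumPy sketch of the sensing process; the function name sci_forward and the noise parameter sigma are illustrative choices of ours, not part of the paper's codebase:

```python
import numpy as np

def sci_forward(X, C, sigma=0.0):
    """Simulate Eq. (2): collapse B coded frames into one snapshot.

    X: (B, nx, ny) high-speed frames; C: (B, nx, ny) coding patterns.
    Returns the (nx, ny) measurement Y = sum_k X_k * C_k + G.
    """
    Y = np.sum(X * C, axis=0)                      # Hadamard product, summed over time
    if sigma > 0:
        Y = Y + sigma * np.random.randn(*Y.shape)  # additive noise term G
    return Y

# Example: B = 8 frames of 64 x 64 pixels with random binary masks
B, nx, ny = 8, 64, 64
X = np.random.rand(B, nx, ny)
C = (np.random.rand(B, nx, ny) > 0.5).astype(np.float64)
Y = sci_forward(X, C, sigma=0.01)
```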
Unlike traditional CS, the sensing matrix considered here is not a dense matrix. In SCI, the matrix Φ in (5) has a very special structure and can be written as
$$\Phi = [D_1, \dots, D_B], \tag{6}$$
where $\{D_k\}_{k=1}^{B} \in \mathbb{R}^{n \times n}$ are diagonal matrices. Therefore, the compressive sampling rate in SCI is equal to 1/B.
It is worth noting that due to the special structure of Φ in (6), ΦΦ⊤ is a diagonal matrix. This fact will be useful for deriving the efficient algorithm in Sec. V for handling the massive data in SCI.
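This diagonal structure means ΦΦ⊤ never needs to be formed explicitly: its diagonal is simply the element-wise sum of the squared masks. A small sanity check, with illustrative variable names of our own choosing:

```python
import numpy as np

B, nx, ny = 4, 8, 8
C = (np.random.rand(B, nx, ny) > 0.5).astype(np.float64)  # binary masks

# diag(Phi Phi^T) = sum_k C_k^2, computed per pixel -- no n x nB matrix needed
phi_phiT_diag = np.sum(C**2, axis=0).ravel()

# Cross-check against the explicit construction on this small example
# (ravel() is row-major; the vectorization order is immaterial here)
Phi = np.hstack([np.diag(C[k].ravel()) for k in range(B)])  # n x nB
assert np.allclose(Phi @ Phi.T, np.diag(phi_phiT_diag))
```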
One natural question is whether it is theoretically possible to recover x from the measurement y defined in Eq. (5) for B > 1. Most recently, by using the compression-based compressive sensing regime,46 this has been addressed in Refs. 47 and 48 via the following theorem, where {f, g} denotes the encoder/decoder, respectively.
Details of the optimization and proof can be found in Ref. 48. Most importantly, Theorem 1 characterizes the performance of SCI recovery by connecting the parameters of the (compression/decompression) code, its rate r, and its distortion δ to the number of frames B and the reconstruction quality. This theoretical finding strongly encourages our algorithmic design using both deep learning and optimizations for SCI systems.
Recall the forward model of SCI in (5), where we can see that it is an ill-posed problem. To solve this problem, as shown in Fig. 3(a), one common way is to employ a prior, e.g., TV22 or sparsity.6,49 More complicated priors, such as the low rankness of similar patch groups, have also been utilized.18 While these iterative algorithms are usually time consuming, motivated by recent advances in deep learning, E2E-CNNs have been used for video SCI,32,36 as shown in Fig. 3(c). However, as mentioned before, these E2E-CNNs usually need a significant amount of training data and time. When the sensing matrix Φ changes, another network has to be re-trained, which again requires additional time and training data. To mitigate this, PnP frameworks have been proposed that plug pre-trained deep denoising priors into the optimization framework [Fig. 3(b)], a marriage of the two regimes that seeks a trade-off between time and flexibility.
In the following, we first build an E2E-CNN for SCI reconstruction and then derive the PnP framework by considering the hardware constraints of SCI. The performance of these methods is compared thoroughly on extensive real data captured by our video SCI system. For a quantitative comparison, we also conduct simulations on synthetic data.
IV. END-TO-END CNNS FOR VIDEO CS RECONSTRUCTION
In this section, we build an end-to-end CNN to perform the SCI reconstruction.
A. Network structure
As illustrated in Fig. 4, the E2E-CNN is based on a convolutional encoder–decoder architecture with residual connections.44 We place five residual blocks in both the encoder and decoder parts, which are connected by two convolution layers. The input first passes through one convolution layer for multi-dimensional feature extraction. Each convolution operation is followed by ReLU activation and batch normalization. In addition, the output of each encoder residual block is added to the input of the corresponding decoder residual block. Note that we use summation ⊕ instead of concatenation based on our experimental results. Furthermore, we use a long-span residual connection that adds the network input to the final reconstruction. We employ tanh as the activation function of the output layer to ensure the desired scale of the final output. Note that we use neither pooling nor upsampling in our network to avoid losing image details.50
To reduce the burden of learning the forward operator of the imaging system, we use Φ⊤(ΦΦ⊤)−1 as an approximate inverse operator and initialize the network input as Φ⊤(ΦΦ⊤)−1y, which has the same dimension and scale as x. Such a setting is considered reliable for CI problems51 and has been widely used in recent works.52,53
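Because ΦΦ⊤ is diagonal [Eq. (6)], this initialization reduces to a per-pixel normalization followed by mask-wise re-modulation, costing only a few element-wise operations. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def init_estimate(Y, C, eps=1e-6):
    """Compute Phi^T (Phi Phi^T)^{-1} y without ever building Phi.

    Y: (nx, ny) measurement; C: (B, nx, ny) coding patterns.
    Returns the (B, nx, ny) network input.
    """
    norm = np.sum(C**2, axis=0) + eps   # diag(Phi Phi^T), one value per pixel
    return C * (Y / norm)[None, ...]    # apply Phi^T mask-wise
```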
B. Datasets and implementation
For model training, we capture 30 high-speed motions of daily-life objects, e.g., boxes, letters, toys, etc., using a conventional high-speed camera. By scaling, cropping, and rotating, we generate 2400 videos as the ground truth x, each containing B frames of 512 × 512 pixels. The corresponding measurements y are generated by simulating the system's forward model using the calibrated coding patterns Φ.
For each training pair {Φ⊤(ΦΦ⊤)−1y, x}, we train the model by feeding Φ⊤(ΦΦ⊤)−1y into the network and using the Adam optimizer54 to minimize the objective function, a combination of the root mean square error (RMSE) and the multiscale structural similarity index (MS-SSIM)55 between the network output and the truth x. Specifically, the loss function of our model is
$$\mathcal{L} = \alpha\, \mathcal{L}_{\mathrm{RMSE}} + \beta\, \mathcal{L}_{\mathrm{MS\text{-}SSIM}}, \tag{8}$$
where $\hat{x}$ denotes the output of the network and both terms are evaluated between $\hat{x}$ and the truth x. A detailed description of this loss function can be found in Ref. 55. The parameters α and β are set to 1 and 0.1, respectively, in our model training.
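A minimal TensorFlow sketch of such a combined loss is given below. We read the MS-SSIM term as 1 − MS-SSIM (so that smaller is better); this reading, and the function name, are our assumptions rather than the paper's released code:

```python
import tensorflow as tf

def sci_loss(x_true, x_pred, alpha=1.0, beta=0.1):
    """Combined RMSE + MS-SSIM loss, one plausible reading of Eq. (8).

    x_true, x_pred: (batch, H, W, 1) tensors scaled to [0, 1];
    tf.image.ssim_multiscale needs H, W >= 176 with default settings.
    """
    rmse = tf.sqrt(tf.reduce_mean(tf.square(x_true - x_pred)))
    ms_ssim = tf.reduce_mean(tf.image.ssim_multiscale(x_true, x_pred, max_val=1.0))
    return alpha * rmse + beta * (1.0 - ms_ssim)
```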
To reduce the gap between the simulated and real data, shot noise in the real system is simulated by Poisson-sampling the synthesized measurements. Specifically, for each epoch, we use measurements of the same scene sets but with different shot-noise realizations to improve the robustness of the network. The training is implemented in TensorFlow on an NVIDIA GeForce GTX 1080Ti graphics processing unit (GPU). The learning rate is initially set to 0.01 and scaled by 0.8 every 30 epochs.
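Drawing a fresh Poisson realization per epoch can be done in a couple of lines; the photon-scale parameter below is hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

def add_shot_noise(Y, peak_photons=1000.0):
    """Resample a clean measurement with Poisson (shot-noise) statistics.

    peak_photons (hypothetical) sets the photon count at full scale;
    calling this once per epoch yields a new noise realization.
    """
    scale = peak_photons / (Y.max() + 1e-9)
    return np.random.poisson(Y * scale).astype(np.float32) / scale
```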
V. PLUG-AND-PLAY ALGORITHMS FOR VIDEO SCI RECONSTRUCTION
As mentioned before, the E2E-CNNs lack the flexibility for different tasks, e.g., different compression rates in this work. In the following, we develop the PnP algorithms for video SCI.
We aim to recover x from the captured measurement y and the coding patterns Φ. As in previous research, priors (here including both conventional and deep priors) are usually employed to solve this problem. Let R(x) denote the prior to be used; the reconstruction target is then formulated as
$$\hat{x} = \arg\min_{x} \frac{1}{2}\|y - \Phi x\|_2^2 + \tau R(x), \tag{9}$$
where τ is a parameter to balance the fidelity term (the ℓ2-norm) and the prior.
Hereby, as in Ref. 21, we employ the ADMM framework to solve the problem, and by introducing the variable z, Eq. (9) is now formulated as
$$(\hat{x}, \hat{z}) = \arg\min_{x, z} \frac{1}{2}\|y - \Phi x\|_2^2 + \tau R(z), \quad \text{subject to } x = z. \tag{10}$$
We use the superscript t to index the iteration number and introduce the scaled dual variable u. Equation (10) is solved by iterating over the following three sub-problems:
x sub-problem:
$$x^{t+1} = \arg\min_{x} \frac{1}{2}\|y - \Phi x\|_2^2 + \frac{\gamma}{2}\left\|x - \left(z^{t} - u^{t}\right)\right\|_2^2, \tag{11}$$
where γ > 0 is a balancing parameter.
z sub-problem:
$$z^{t+1} = \arg\min_{z} \tau R(z) + \frac{\gamma}{2}\left\|z - \left(x^{t+1} + u^{t}\right)\right\|_2^2. \tag{12}$$
u sub-problem:
$$u^{t+1} = u^{t} + \left(x^{t+1} - z^{t+1}\right). \tag{13}$$
Equation (11) is a quadratic form and has the closed-form solution
$$x^{t+1} = \left(\Phi^{\top}\Phi + \gamma I\right)^{-1}\left[\Phi^{\top} y + \gamma\left(z^{t} - u^{t}\right)\right]. \tag{14}$$
By utilizing the special structure of Φ in Eq. (6), this can be solved efficiently via element-wise operations rather than computing the huge matrix inversion. Specifically, let $\theta^{t} = z^{t} - u^{t}$; then
$$x^{t+1} = \theta^{t} + \Phi^{\top}\left[\frac{y - \Phi\theta^{t}}{\gamma + \mathrm{diag}\left(\Phi\Phi^{\top}\right)}\right], \tag{15}$$
where the division is performed element-wise ($[a]_i$ denoting the i-th element of a). Since $\Phi\theta^{t}$ and $\mathrm{diag}(\Phi\Phi^{\top})$ can each be computed in one shot, $x^{t+1}$ can be obtained efficiently.
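In mask form, Eq. (15) is a handful of element-wise operations per iteration. A minimal sketch (function name ours):

```python
import numpy as np

def x_update(y, C, z, u, gamma):
    """Closed-form x update of Eq. (15), exploiting the diagonal Phi Phi^T.

    y: (nx, ny) measurement; C, z, u: (B, nx, ny) masks and ADMM variables.
    """
    theta = z - u
    resid = y - np.sum(C * theta, axis=0)          # y - Phi theta
    denom = gamma + np.sum(C**2, axis=0)           # gamma + diag(Phi Phi^T)
    return theta + C * (resid / denom)[None, ...]  # theta + Phi^T [ ... ]
```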
A. Utilizing deep denoising priors
In order to embrace deep learning, deep denoising priors30 are employed to solve Eq. (12) without re-training the model. This saves training time and data and makes the algorithm flexible. In our work, after trying several deep denoisers, we found that FFDNet43 is efficient and leads to the best results.
In our experiments, we observed that using FFDNet alone leads to some undesired artifacts (especially at large compression rates), while a TV prior alone usually leads to noisy reconstructions. One reason is that FFDNet was trained on Gaussian noise, whereas in our video SCI, the noise distribution differs in each iteration. Motivated by this, we propose a joint denoising strategy as follows:
$$z_{\mathrm{TV}}^{t+1} = \mathcal{D}_{\mathrm{TV}}\left(x^{t+1} + u^{t}\right), \qquad z_{\mathrm{FFD}}^{t+1} = \mathcal{D}_{\mathrm{FFDNet}}\left(x^{t+1} + u^{t}, \sigma^{t}\right), \tag{16}$$
where σt is the estimated Gaussian noise level in each iteration. The final $z^{t+1}$ in each iteration is achieved by
$$z^{t+1} = \alpha\, z_{\mathrm{TV}}^{t+1} + (1 - \alpha)\, z_{\mathrm{FFD}}^{t+1}, \tag{17}$$
where 0 ≤ α ≤ 1 is a weighting parameter. This can be recognized as a complementary prior strategy. We further observed that α should decrease with the iterations, i.e., at the beginning the TV denoiser has a larger weight, while the FFDNet denoiser gains more weight as the reconstruction proceeds. We term this algorithm PnP-TV-FFDNet.
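Putting the three sub-problems and the joint prior together gives the loop sketched below. The TV step uses scikit-image's denoise_tv_chambolle; ffdnet stands for a hypothetical wrapper around a pre-trained FFDNet model, and the linear schedules for σt and α are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def pnp_tv_ffdnet(y, C, ffdnet, iters=50, gamma=0.01, sigma0=0.1):
    """Sketch of PnP-ADMM with the joint TV/FFDNet prior of Eqs. (16) and (17).

    y: (nx, ny) measurement; C: (B, nx, ny) masks;
    ffdnet(frames, sigma): hypothetical frame-wise FFDNet denoiser.
    """
    denom0 = np.sum(C**2, axis=0) + 1e-6                # diag(Phi Phi^T)
    x = C * (y / denom0)[None]                          # Phi^T (Phi Phi^T)^{-1} y
    z, u = x.copy(), np.zeros_like(x)
    for t in range(iters):
        theta = z - u                                   # x update, Eq. (15)
        resid = y - np.sum(C * theta, axis=0)
        x = theta + C * (resid / (gamma + denom0))[None]
        v = x + u
        sigma = sigma0 * (1 - t / iters)                # decaying noise estimate
        alpha = 1 - t / iters                           # TV weight decreases
        z_tv = np.stack([denoise_tv_chambolle(f, weight=0.1) for f in v])
        z = alpha * z_tv + (1 - alpha) * ffdnet(v, sigma)   # Eq. (17)
        u = u + (x - z)                                 # Eq. (13)
    return z
```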
B. Convergence analysis
PnP-ADMM has been proved to converge to a fixed point21 under the conditions of a bounded denoiser and a bounded gradient of the loss function. In our case, the loss function is
$$f(x) = \frac{1}{2}\|y - \Phi x\|_2^2. \tag{18}$$
In this paper, we assume that the denoisers are all bounded. Taking the hardware constraints of video SCI into consideration, we prove that the gradient of f(x) is bounded. The gradient of f(x) in SCI is
$$\nabla f(x) = \Phi^{\top}\Phi x - \Phi^{\top} y, \tag{19}$$
where Φ is the concatenation of diagonal blocks, of size n × nB, as in Eq. (6).
Φ⊤y is a non-negative constant vector since both the measurement y and the mask are non-negative in our system.
Now let us focus on Φ⊤Φx. Since
$$\Phi^{\top}\Phi = \begin{bmatrix} D_1 D_1 & \cdots & D_1 D_B \\ \vdots & \ddots & \vdots \\ D_B D_1 & \cdots & D_B D_B \end{bmatrix}, \tag{20}$$
due to this special structure, Φ⊤Φx is a weighted sum of the elements of x, and ∥Φ⊤Φx∥2 ≤ BCmax∥x∥2, where Cmax is the maximum value in the sensing matrix. Usually, the sensing matrix is normalized to [0, 1], which leads to Cmax = 1 and, therefore, ∥Φ⊤Φx∥2 ≤ B∥x∥2.
Thus, ∇f(x) is bounded.
Therefore, along with the bounded denoiser assumption, the PnP-ADMM may converge to a fixed point for video SCI reconstruction. Details about the proof can be found in Ref. 21. Our experimental results further show that PnP algorithms with TV and FFDNet converge well.
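The bound ∥Φ⊤Φx∥2 ≤ B∥x∥2 for [0, 1]-normalized masks is easy to check numerically on a small example (sizes chosen arbitrarily for illustration):

```python
import numpy as np

B, nx, ny = 4, 8, 8
C = (np.random.rand(B, nx, ny) > 0.5).astype(np.float64)   # C_max = 1
Phi = np.hstack([np.diag(C[k].ravel()) for k in range(B)])  # n x nB
x = np.random.rand(B * nx * ny)

# With masks in [0, 1], ||Phi^T Phi x||_2 <= B ||x||_2
lhs = np.linalg.norm(Phi.T @ (Phi @ x))
assert lhs <= B * np.linalg.norm(x) + 1e-9
```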
VI. EXPERIMENTAL RESULTS
In this section, we present extensive experimental results to verify our hardware setup and different algorithms.
A. Hardware and algorithms verification
We validate our system on two dynamic scenes: moving letters (“labs”) and falling dominoes. Compression ratios from 10 to 50 are adopted to achieve image frame rates from 500 to 2500 fps with a fixed measurement (camera) frame rate of 50 fps.
The measurements are shown in the left columns of Figs. 5 and 6. Due to limited GPU memory, we only trained the E2E-CNN models for Cr ≤ 30 (each with a different network). Training and reconstruction take 27 h and 70 ms for Cr = 10, 36 h and 90 ms for Cr = 20, and 43 h and 120 ms for Cr = 30, respectively. Although the number of training samples is the same, a larger Cr requires more memory since the data size grows, and thus a longer training time. The results of E2E-CNN are shown in the upper parts of Figs. 5 and 6, from which we observe clean and clear letters or domino boxes, in contrast to the noisy and blurred measurements, with different locations and/or shapes at different times (grids are added to help visualize the motion). We also observe that a higher Cr does provide higher temporal resolution at little cost in spatial resolution (or image quality).
For Cr ≥ 40, we use PnP-TV-FFDNet to perform the reconstruction, which takes 70 s and 88 s for Cr = 40 and 50, respectively. From the results, we observe that a higher Cr provides higher temporal resolution, which helps visualize finer movements, but the image quality degrades as Cr increases, becoming less uniform in intensity and less sharp at edges.
B. Results with different algorithms
On three dynamic scenes and under five different compression ratios, we now compare the performance of E2E-CNN and PnP-TV-FFDNet with three other iterative algorithms: GAP-TV,22 GAP-TV + deep denoising (GAP-TV + DD), and DeSCI (PnP-WNNM).18 The first scene is the falling dominoes, a relatively slow scene [Fig. 7 (Multimedia view)]; the second is the fast motion of a water balloon falling and bouncing [Fig. 8 (Multimedia view)]; the third is pendulum balls [Fig. 9 (Multimedia view)], which is fast yet has more detailed features. Table I lists the computation time of the different methods on a desktop with 16 CPU cores @ 3.2 GHz and an NVIDIA GTX 1080Ti GPU. For the iterative methods, the time is based on 50 iterations.
TABLE I. Running time of the algorithms under different compression ratios (Cr).

| Cr | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| GAP-TV (s) | 50 | 80 | 120 | 150 | 180 |
| GAP-TV + DD (s) | +0.2 | +0.5 | +0.7 | +1.0 | +1.2 |
| DeSCI (h) | 0.8 | 1.2 | 1.7 | 2.5 | 3.5 |
| PnP-TV-FFDNet (s) | 18 | 36 | 52 | 70 | 88 |
| E2E-CNN (training) | 0.1 s (20 h) | 0.16 s (25 h) | 0.2 s (28 h) | Out of memory | Out of memory |
In general, Figs. 7–9 (Multimedia view) and Table I reveal a trade-off among reconstruction speed, image quality, and flexibility. The E2E-CNN method provides the fastest reconstruction (tens of milliseconds) but is less flexible than the iterative algorithms; the pre-trained network works only with specific mask patterns, image size, and compression ratio. Although a generalized (also larger and deeper) network could be trained to adapt to various system parameters, this would require a huge GPU memory and is, thus, not practical for general users. DeSCI provides state-of-the-art image quality [e.g., compare the water balloon and the letters “LAB” in Fig. 8 (Multimedia view) in the dashed yellow rectangles], but hours of computation are required. GAP-TV consumes much less computation time (1–3 min), but its results appear noisy. However, we find that simply applying a fast (≤1.2 s) deep denoising (FFDNet) step after the GAP-TV iterations significantly reduces the noise. Among the iterative methods, PnP-TV-FFDNet is the fastest (0.2–1.5 min) because the deep denoising network is fast. Its image quality is similar to that of DeSCI for Cr ≤ 20. For Cr ≥ 30, artificial distortions [see the regions indicated by dashed red rectangles in Figs. 7–9 (Multimedia view)] start to degrade the image. To reduce these artifacts, the noise estimation parameter in the algorithm can be set to higher values, at the cost of smoothing out finer details.
Note that the results of E2E-CNN in Figs. 7–9 (Multimedia view) are inferred by the same trained networks (three networks for the three compression ratios) with the same training dataset. In most cases, E2E-CNN provides the best results on static scenes, regardless of the compression ratio. However, for fast-motion scenes with a high compression ratio, E2E-CNN is not good enough, e.g., the top part of the balloon in Fig. 8 (Multimedia view). One reason is that the motion of the balloon is faster than in the other two scenes. More training data, or a deeper network, may help the E2E-CNN. This again highlights the concern of using E2E-CNNs for video SCI, i.e., their lack of flexibility. By contrast, the other algorithms are more robust to different scenes and motions.
VII. SIMULATION RESULTS
Aiming at a quantitative comparison, in this section, we conduct simulations on five motion scenes (Kobe, Runner, Traffic, Aerial, and Crash) used in Refs. 18 and 36. In the simulation, every eight (Cr = 8) video frames of 256 × 256 pixels were modulated by eight different binary masks and then collapsed into a single measurement. For the E2E-CNN model, we generated 25 000 training pairs using the DAVIS56 dataset. Considering the small range of motion in the simulation data, we made minor adjustments to the network and training, i.e., removing the long-span residual connection between the input and output and increasing the weight of the MS-SSIM loss to 0.7.
We compare the proposed E2E-CNN and PnP-TV-FFDNet methods with GAP-TV and DeSCI by evaluating their peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM)57 in Table II. DeSCI performs best on all scenes except Aerial. E2E-CNN has a 0.66 dB higher PSNR on average than PnP-TV-FFDNet. Consistent with the real data, DeSCI provides the best results on average, gaining 2.4–3.1 dB in PSNR over the deep learning methods, again at the price of a long running time.
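For reference, the metrics in Table II can be reproduced frame by frame with scikit-image; the helper below assumes frame-wise averaging over the B reconstructed frames, which is our reading of the evaluation protocol:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(x_rec, x_true):
    """Average PSNR (dB) and SSIM over the B frames of one video.

    x_rec, x_true: (B, H, W) arrays scaled to [0, 1].
    """
    psnrs = [peak_signal_noise_ratio(t, r, data_range=1.0)
             for t, r in zip(x_true, x_rec)]
    ssims = [structural_similarity(t, r, data_range=1.0)
             for t, r in zip(x_true, x_rec)]
    return float(np.mean(psnrs)), float(np.mean(ssims))
```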
TABLE II. PSNR (dB), SSIM of different algorithms on the five simulated scenes (Cr = 8).

| Algorithm | Kobe | Traffic | Runner | Aerial | Crash | Average |
|---|---|---|---|---|---|---|
| GAP-TV | 26.46, 0.8848 | 20.89, 0.7148 | 28.52, 0.9092 | 25.05, 0.8281 | 24.82, 0.8383 | 25.15, 0.8350 |
| DeSCI | 33.25, 0.9518 | 28.71, 0.9205 | 38.48, 0.9693 | 25.33, 0.8603 | 27.04, 0.9094 | 30.56, 0.9223 |
| PnP-TV-FFDNet | 30.50, 0.9256 | 24.18, 0.8279 | 32.15, 0.9332 | 25.27, 0.8291 | 25.42, 0.8493 | 27.50, 0.8730 |
| E2E-CNN | 29.02, 0.8612 | 23.45, 0.8375 | 34.43, 0.9580 | 27.52, 0.8822 | 26.40, 0.8858 | 28.16, 0.8850 |
Figure 10 visualizes some representative frames, from which we can see that E2E-CNN performs better on static details (e.g., buildings in Aerial), while PnP-TV-FFDNet recovers clearer motions (e.g., feet in Runner) than E2E-CNN.
In summary, we obtain similar conclusions to the experimental data from these simulations. This further validates our end-to-end study.
VIII. CONCLUSIONS
We have built a video snapshot compressive imaging system using a digital micromirror device as the dynamic modulator and developed an end-to-end convolutional neural network and a plug-and-play algorithm that jointly uses TV and FFDNet denoising priors for reconstruction. Various plug-and-play algorithms using different priors have been compared with the developed methods on different dynamic scenes and compression ratios.
Given a pre-trained network for a specific system, the E2E-CNN method provides the fastest reconstruction and could achieve real-time reconstruction with faster GPUs available on the market. The PnP-TV-FFDNet algorithm is the fastest among the iteration-based algorithms due to its fast deep denoising priors and provides image quality similar to DeSCI (the state of the art in terms of image quality) at low compression ratios (≤20). Considering speed, quality, flexibility, cost, and ease-of-use, we believe that the PnP-TV-FFDNet method is a preferable baseline for video SCI reconstruction, although it starts to suffer from artifacts when Cr ≥ 30. These artifacts might be reduced in the future by employing more relevant deep denoising priors. For a fixed system, however, the E2E-CNN method would be preferable since it provides better image quality at much faster speed.
Due to the high capturing speed and the near real-time reconstruction (by the E2E-CNN method), we expect our end-to-end SCI system to find applications in traffic surveillance, sports photography, and autonomous vehicles. Regarding future work, we expect to exploit the correlation between video frames to train a video-wise deep denoising prior, rather than the frame-wise prior presented in this paper, for the proposed PnP framework to improve the reconstruction quality. For the E2E-CNN method, a flexible network compatible with varying image size and compression ratio is expected.
AUTHOR’S CONTRIBUTIONS
M.Q. and Z.M. contributed equally to this work.