Learning To Efficiently See In The Dark
This article describes the binarization of the network from the paper 'Learning to See in the Dark' [1] using the ABC-Net approach presented in 'Towards Accurate Binary Convolutional Neural Network' [2].
With the ever-growing number of papers published, the field of Computer Vision by Deep Learning is more relevant than ever. Many new solutions are presented every day, and models get bigger and more complex. Alongside computer vision, the market for smartphones and smart wearables is growing. It has been estimated that in 2020 around 1.4 trillion photos will be taken worldwide, of which 90.9% will originate from smartphones [4].
However, the difference in computing power between the two sectors is enormous. Deep learning models require powerful GPUs to train and evaluate, which smartphones do not offer. To bring the two sectors closer together, two paths are being followed: on the one hand, smartphones are equipped with more powerful hardware every year, so that they can offer more services; on the other hand, deep learning models are made more and more efficient, so that evaluation can be done on a wearable device.
In this article, we want to take a step in the direction of efficient neural networks. We combine methods from different active fields of study and present an efficient method to process low-light images. We showcase the implementation details of applying network binarization using the Accurate Binary Convolutional Network (ABC-Net) to the "Learning To See In The Dark" (LSID) [1] pipeline. In doing so, we want to find out whether the full-precision network can be approximated by binarization. We aim for limited performance loss compared to the full-precision model while driving up the efficiency. We also highlight the theoretical speedup that can be achieved and shed light on the limitations that we faced.
Click https://sfalkena.github.io/blogs/_pages/interactive_image.html or scan the QR code below to see an interactive comparison of the results in high resolution! Warning: the images are in high resolution, so the loading of images might be slow ;)
Techniques used
First, let us introduce the problem that we propose to tackle. This problem is presented by Chen et al. in "Learning to see in the dark" [1], where the authors take a raw image file (a short-exposure image that is grainy and noisy due to low lighting and high ISO values) and use a deep neural network to learn the image-processing pipeline for this low-light raw data. By working directly with raw sensor data, the authors intend to replace the standard imaging pipeline. To achieve this, they train a convolutional network on a new dataset of raw short-exposure low-light images with corresponding long-exposure images as the target reference or ground truth.
The pipeline suggested by the authors can be seen in the figure below. First, the raw image is unpacked, the black levels are subtracted after which the image is amplified. Finally, the image is fed through the CNN, and the output is an RGB image.
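As a rough sketch of this preprocessing for a Sony file: the black level of 512 and the 14-bit white level of 16383 are the values used in the reference LSID code, and `amplification_ratio` stands in for the exposure-time ratio between ground truth and input; treat the details as illustrative rather than as the exact implementation.

```python
import numpy as np
import rawpy

def preprocess_raw(path, amplification_ratio):
    # Read the raw Bayer mosaic from disk.
    raw = rawpy.imread(path)
    im = raw.raw_image_visible.astype(np.float32)

    # Subtract the black level and normalize to [0, 1]
    # (512 and 16383 are the values assumed for the Sony files in [1]).
    im = np.maximum(im - 512, 0) / (16383 - 512)

    # Pack the H x W Bayer mosaic into an (H/2) x (W/2) x 4 array,
    # one channel per Bayer position (exact RGGB order depends on the camera).
    packed = np.stack((im[0::2, 0::2],
                       im[0::2, 1::2],
                       im[1::2, 1::2],
                       im[1::2, 0::2]), axis=-1)

    # Amplify the short-exposure image before feeding it to the ConvNet.
    return packed * amplification_ratio
```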
For examples of the results achieved in the paper, see https://cchen156.github.io/SID.html for an interactive comparison.
To make the ConvNet in the figure shown above more efficient, binarization is applied to the ConvNet. For this, we follow the fundamental techniques of binary convolution using XNOR and bitcount operations by Rastegari et al. in "XNOR-Net" [5]. In applying binarization, we explore two schemes for implementing a binary network namely, XNOR-Net and ABC-Net. In the following section, we first explain the XNOR-Net, after which we dive into the ABC-Net implementation in detail.
For the interested reader, we have written a chapter where we explain the basics of Binary Networks in more detail. For the rest of the paper, we assume that the reader has knowledge of this chapter and knows the fundamentals of binarization.
XNOR-Net
The first binarization technique that we explored is the XNOR-Net implementation presented in [5]. In XNOR-Networks, both the filters and the input to the convolutional layers are binarized which allows these networks to approximate the convolutions primarily using binary operations. The efficiency and speedup can be achieved by leveraging the XNOR and bit-counting operations to estimate the convolutions.
We replace the LSID layers with the binarized convolutions that are described in detail in the ABC-Net section. Reading further into the implementation, we realised that ABC-Net is a generalisation of XNOR-Net: it offers better representational power by approximating the convolutions with multiple binarized weight and activation bases, while still allowing us to evaluate the performance of XNOR-Net by using only a single bit for both weights and activations. A rough sketch of the single-bit case is shown below.
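As a minimal sketch of that single-bit case, the XNOR-Net weight approximation replaces a real-valued filter by its sign plus one scaling factor per output filter [5]:

```python
import torch

def xnor_binarize_weights(weight):
    """Approximate a real-valued conv weight of shape (out, in, kh, kw)
    as alpha * sign(W), with one scaling factor per output filter, as in XNOR-Net [5]."""
    B = weight.sign()                                        # binary weights in {-1, +1}
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)   # per-filter L1 mean
    return alpha, B

# The convolution is then approximated as conv(x, alpha * B), which efficient
# kernels can evaluate with XNOR + bitcount operations.
```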
ABC-Net
To achieve higher accuracy than XNOR-Net, we also implemented the follow-up work by Lin et al. [2], who developed a new type of convolutional network called the Accurate Binary Convolutional Network, or ABC-Net. Explained very briefly, ABC-Net approximates the full-precision convolution using $M\cdot N$ binary convolutions from XNOR-Net, where $M$ is the number of bits used to approximate the weights and $N$ the number of bits used to approximate the activations.
Again, ABC-Net is explained in more detail in our chapter; below, we provide some intuition for how it works.
The ABC-Net approximates each real-valued weight tensor by using multiple binary values per weight, giving it better representational power. The approximation is achieved as follows: \begin{equation} W \approx \alpha_1\mathbf{B}_1 + \alpha_2\mathbf{B}_2 + ... + \alpha_M\mathbf{B}_M \end{equation}
Finding a good value for $\mathbf{B}_i$ and $\alpha_i$ comes down to the following optimization problem:
\begin{equation} \min\limits_{\mathbf{\alpha}, \mathbf{B}}J(\mathbf{\alpha}, \mathbf{B}) = ||\mathbf{w} - \mathbf{B} \mathbf{\alpha}||^2,\;\;\;\; s.t.\:\mathbf{B}_{ij}\in \{-1,+1\} \end{equation}where $\mathbf{B} = [vec(\mathbf{B}_1), vec(\mathbf{B}_2), ..., vec(\mathbf{B}_M)]$, $\mathbf{w} = vec(\mathbf{W})$ and $\mathbf{\alpha} = [\alpha_1, \alpha_2, ..., \alpha_M]^T$. Solving this problem directly is difficult, as it is then no longer possible to compute the derivative w.r.t. $\mathbf{W}$ using the Straight-Through Estimator [add reference]. Therefore the optimum is approximated by first calculating each $\mathbf{B}_i$ in the following way:
\begin{equation} \mathbf{B}_i = sign(\overline{\textbf{W}} + u_i\,std(\mathbf{W})), \;\;\; i = 1,2,...,M \end{equation}where $\overline{\textbf{W}} = \textbf{W} - mean(\textbf{W})$ and the $u_i$ are picked evenly over the range $[-std(\mathbf{W}), std(\mathbf{W})]$. This choice is motivated by the experimental observation that real-valued weights often look as if they are drawn from a Gaussian distribution.
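With the $\mathbf{B}_i$ fixed in this way, the minimization over $\mathbf{\alpha}$ reduces to an ordinary least-squares problem with the closed-form solution \begin{equation} \mathbf{\alpha} = (\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{w}, \end{equation}which is exactly what the `get_alphas` method in our implementation further below computes.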
Importantly, to take full advantage of the more efficient XNOR and bitcount operations for the convolutions, the activations need to be binarized in addition to the network weights. ABC-Net does this by using multiple bits per activation to get a better approximation of the original, full-precision activations. However, it is not desirable to binarize the activations in the same way as the weights, because the alpha values (and the beta values for the activations) would need to be calculated using linear regression. For the weights this is not a big problem, since the regression only has to be done when the weights are updated, which only happens during training; it thus has no impact on inference speed. This is not the case for the activations, since these vary with every input at test time.
The ABC-Net authors solve this by first applying batch normalization, so that the mean and standard deviation of the activations do not need to be calculated but can be assumed to be $0$ and $1$. They then replace the linear regression used to calculate alpha by a learnable parameter beta. This way there is no need to perform linear regression or to compute the mean and standard deviation during inference.
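To give a flavour of this idea, the snippet below sketches a multi-bit activation binarizer with fixed, evenly spaced shifts and one learnable scale per bit. This is our loose illustration rather than the exact ABC-Net formulation; the straight-through trick is used so that gradients can pass through the sign function.

```python
import torch
import torch.nn as nn

class LearnableBetaBinaryActivation(nn.Module):
    """Sketch: approximate an activation R (assumed normalized) as sum_n beta_n * sign(R + v_n),
    with fixed, evenly spaced shifts v_n and learnable scales beta_n."""
    def __init__(self, N=3, shift_range=1.0):
        super().__init__()
        self.register_buffer("shifts", torch.linspace(-shift_range, shift_range, N))
        self.betas = nn.Parameter(torch.full((N,), 1.0 / N))

    def forward(self, x):
        out = 0.0
        for v, beta in zip(self.shifts, self.betas):
            b = (x + v).sign()
            # Straight-through estimator: forward uses the binary value,
            # backward treats the sign as the identity.
            b = (b - (x + v)).detach() + (x + v)
            out = out + beta * b
        return out
```

In ABC-Net this kind of module would sit directly after a batch-normalization layer, which is what makes the zero-mean, unit-variance assumption behind the fixed shifts reasonable.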
Theoretical speedup
Before we dive into the speedup that would theoretically be possible, we need to understand more about the factors that influence this speedup. The speedup is achieved by binarizing the weights and activations of the convolution. For binarizing the weights, the full precision weights are approximated with $\boldsymbol{W} \approx \alpha_{1} \boldsymbol{B}_{1}+\alpha_{2} \boldsymbol{B}_{2}+\cdots+\alpha_{M} \boldsymbol{B}_{M}$. So $M$ BinConvs are used, as can be seen on the left part of the figure above. Furthermore, the activations $\boldsymbol{R}$ can also be binarized by approximating $\boldsymbol{R} \approx \beta_{1} \boldsymbol{A}_{1}+\beta_{2} \boldsymbol{A}_{2}+\cdots+\beta_{N}\boldsymbol{A}_{N}$.
According to [5], the speedup of XNOR-Net is the ratio between the number of operations carried out by a normal convolution and by the binarized XNOR convolution: $S=\frac{64 c N_{W}}{c N_{W} + 64}$, where $N_{W}=w h$ (the number of entries in the weights) and $c$ is the number of channels. For a justification of this equation, please refer to the original paper. The alert reader will notice that the speedup does not depend on the size of the input.
Because there are now $M\cdot N$ of these BinConvs, the speedup per convolutional layer decreases linearly with this number and becomes $S=\frac{64 c N_{W}}{M N (c N_{W} + 64)}$. To draw conclusions about the total speedup of the network, we have to look at the architecture, take the sum of all operations over all layers without BinConvs, and divide this by the number of operations needed when BinConvs are used. This gives the theoretical speedup of the entire network. In the results section, this number is shown for all the different networks used in the experiments.
Even when not binarizing the activations, some speedup can still be achieved. With binary weights there is no need to multiply the activation with the weight before adding it; it can immediately be added or subtracted. This speedup can be roughly estimated as 2, since only half of the operations need to be done (ignoring overhead such as adding the bias). The speedup is then given by $S \approx \frac{2}{M N}$ (with $N = 1$ when the activations are kept real-valued), since the speedup again decreases linearly with the number of times the BinConv is used. Note that when using a value larger than $1$ for either $M$ or $N$, all speedup is immediately lost; at that point, however, there is still the possibility of reduced memory usage.
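These formulas are easy to put into a small helper; the sketch below computes the per-layer speedup (the whole-network number reported later is then the ratio of summed operation counts over all layers):

```python
def binconv_speedup(c, kernel_h, kernel_w, M, N=None):
    """Theoretical per-layer speedup of an ABC-Net convolution versus a full-precision one.
    c: input channels, kernel_h/kernel_w: kernel size, M/N: number of binary bases
    for the weights/activations. N=None means the activations stay real-valued, in
    which case the gain only comes from replacing multiplications by additions."""
    n_w = kernel_h * kernel_w
    if N is None:
        return 2.0 / M                                        # S ~ 2 / (M * N) with N = 1
    return (64.0 * c * n_w) / (M * N * (c * n_w + 64.0))

# Example: a 3x3 convolution with 256 input channels and M = N = 1 (the XNOR-Net case)
# gives roughly 62x per layer; with M = N = 3 this drops by a factor of 9.
```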
Technical constraints
In the introduction, as the alert reader may have noticed, we stress that we aim to limit the performance loss rather than only aim for an increase in efficiency. The reason is that binarization is still a fairly new field of study: to convert the theoretical speedup of the network into practice, special hardware or software needs to be used. The speedup is realized by using XNOR and bitcount operations [5]. Intel researchers in [6] have compared running a binarized network on different architectures and point out that specialised hardware such as FPGAs has greater potential for running binarized networks. The platform that we use, PyTorch, currently does not offer the possibility to leverage this practical speedup, so we consider it out of the scope of this project.
Architecture
Below, the architecture of the binarized LSID network (ABCLSID) is shown. Some remarks about the architecture:
- The first and last layer are not binarized and use normal convolutions. The speedup from binarizing would be small here because of the low number of channels and the small kernel size.
- The rest of the architectural decisions (channels, kernel size, etc.) are kept the same as in the original LSID implementation. The layer ordering follows the ordering recommended by [5].
import math
import torch.nn as nn

# Assumes ABCConv2d (defined further below) and an analogous ABCConvTranspose2d are available.
class ABCLSID(nn.Module):
    def __init__(self, inchannel=4, block_size=2, M=3, N=None, binary_transposed_conv=True):
        super(ABCLSID, self).__init__()
        self.M = M
        self.N = N
        self.binary_transposed_conv = binary_transposed_conv
        self.block_size = block_size

        # First layer stays full precision (only 4 input channels).
        self.conv1_1 = nn.Conv2d(inchannel, 32, kernel_size=3, stride=1, padding=1, bias=True)
        self.lrelu = nn.LeakyReLU(negative_slope=0.2, inplace=True)
        self.conv1_2 = ABCConv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0, ceil_mode=True)

        # Encoder: binarized convolutions.
        self.conv2_1 = ABCConv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv2_2 = ABCConv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv3_1 = ABCConv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv3_2 = ABCConv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv4_1 = ABCConv2d(128, 256, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv4_2 = ABCConv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv5_1 = ABCConv2d(256, 512, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv5_2 = ABCConv2d(512, 512, kernel_size=3, stride=1, padding=1, bias=True, M=M)

        # Decoder: transposed convolutions can optionally stay full precision.
        if self.binary_transposed_conv:
            self.up6 = ABCConvTranspose2d(512, 256, 2, stride=2, bias=False, M=M)
        else:
            self.up6 = nn.ConvTranspose2d(512, 256, 2, stride=2, bias=False)
        self.conv6_1 = ABCConv2d(512, 256, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv6_2 = ABCConv2d(256, 256, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        if self.binary_transposed_conv:
            self.up7 = ABCConvTranspose2d(256, 128, 2, stride=2, bias=False, M=M)
        else:
            self.up7 = nn.ConvTranspose2d(256, 128, 2, stride=2, bias=False)
        self.conv7_1 = ABCConv2d(256, 128, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv7_2 = ABCConv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        if self.binary_transposed_conv:
            self.up8 = ABCConvTranspose2d(128, 64, 2, stride=2, bias=False, M=M)
        else:
            self.up8 = nn.ConvTranspose2d(128, 64, 2, stride=2, bias=False)
        self.conv8_1 = ABCConv2d(128, 64, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv8_2 = ABCConv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        if self.binary_transposed_conv:
            self.up9 = ABCConvTranspose2d(64, 32, 2, stride=2, bias=False, M=M)
        else:
            self.up9 = nn.ConvTranspose2d(64, 32, 2, stride=2, bias=False)
        self.conv9_1 = ABCConv2d(64, 32, kernel_size=3, stride=1, padding=1, bias=True, M=M)
        self.conv9_2 = ABCConv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=True, M=M)

        # Last layer stays full precision (1x1 kernel mapping to the packed RGB output).
        out_channel = 3 * self.block_size * self.block_size
        self.conv10 = nn.Conv2d(32, out_channel, kernel_size=1, stride=1, padding=0, bias=True)

        # He-style initialization (ABCConv2d subclasses nn.Conv2d, so it is covered as well).
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                m.bias.data.zero_()
            elif isinstance(m, nn.ConvTranspose2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
Datasets and data processing
In the section "Techniques used", we already mentioned briefly that the authors of [1] created a new dataset to train and evaluate the model. In this section, we want to explain a bit more about the importance of the dataset.
Let's start by explaining what kind of data we are dealing with. The authors have created a dataset of low-light images, shot with two cameras: a Sony $\alpha$7S II and a Fujifilm X-T2. The reason for using two different cameras has to do with the different ways in which the images are stored: the Sony uses a Bayer filter, whereas the Fujifilm uses an X-Trans filter. We leave the exploration of the X-Trans filter to the reader, as we have chosen to focus solely on the Sony camera for this project. The reason for this is that Sony is the leading company in the smartphone camera domain and holds 50.1% of the market share as of 2019. As mentioned above, the Sony images use a Bayer filter; we would suggest reading this blog to develop a better understanding of Bayer filters. The first image of our blog shows the input image in its Bayer form. The de-mosaicing step from the aforementioned blog is included in the ConvNet pipeline.
Going back to the dataset: it contains groups of images shot in exactly the same way but with different exposure times, meaning that we have bright, long-exposure ground-truth images together with dark, short-exposure images. To test the robustness of the network, input images with exposure times of $0.033s$, $0.04s$ and $0.1s$ have been captured.
The raw images have a size of 4256x2848 pixels. However, these images are still in Bayer form, which means that they need to be unpacked into 2128x1424 images with 4 colour channels. Additionally, these images are too big to process as a whole: performing backpropagation with an entire image as input would make our GPU (8 GiB) run out of memory. This, and the fact that the authors of LSID also train on patches, is why we train the network on patches of the images. The patch size used is 512x512 pixels.
Another problem was that training was very slow: at first it would have taken around 6 days to train the network. The reason for this was the way the images were loaded and processed. In each iteration, an image was loaded from disk, processed (by the Python rawpy module), and then a random patch was taken from that image. The bottleneck was the loading of the images from disk into memory. Even with 8 data-loading workers that simultaneously loaded images, it wasn't enough, and using more workers caused overhead elsewhere and hence did not improve performance. The authors overcame this problem by keeping all the data in memory. The problem for us, however, was that the dataset was 64 GB while we only had 16 GB of RAM.

We fixed this by preprocessing the data. We first loaded the images and did the processing, but then, instead of taking a random patch, we took 15 patches in a grid of 3 by 5. This was the smallest number of patches that together (with a bit of overlap) covered the entire image. We then stored these patches, so that during training we could directly load a patch instead of the entire image. This improved our training speed significantly: we could now train the network in just 22 hours, a considerable reduction in time. This method does have some disadvantages, because we always use the same set of patches instead of random ones, which could, in theory, affect the result after training. However, we still achieve results on the test set that are highly comparable to the original model, which led us to conclude that it probably does not matter much.

Another issue was that the raw images have a 14-bit encoding, which after processing is converted to 32-bit tensors; combined with the slight overlap between patches, this caused our newly processed training set to be 353 GB, which is not very practical to store. In the end, our epochs changed from taking a random patch from each of the 1865 images to using $15\cdot1865=27975$ fixed patches.
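As an illustration, the fixed 3 by 5 patch grid over the packed 2128x1424 images can be computed roughly as follows (a sketch; the exact overlap and storage format are implementation details):

```python
import numpy as np

def grid_patches(image, patch=512, rows=3, cols=5):
    """Cut an (H, W, C) packed image into a fixed rows x cols grid of patch x patch crops,
    spacing the start positions evenly so that neighbouring patches overlap slightly."""
    h, w = image.shape[:2]
    ys = np.linspace(0, h - patch, rows).astype(int)
    xs = np.linspace(0, w - patch, cols).astype(int)
    return [image[y:y + patch, x:x + patch] for y in ys for x in xs]

# For the 2128x1424 packed Sony images this yields the 15 patches per image (3 * 5)
# that we store to disk and sample from during training.
```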
Binary convolution
One problem we encountered during the project was that we did not have access to efficient implementations of the binary convolutions, meaning that we had to use the standard convolutions for 32-bit floating-point numbers. This meant that instead of getting a speedup as described in the section "Theoretical speedup", it took about $M\cdot N$ times longer to do a forward pass, since that is the number of convolutions we now had to do in each layer. This would again make training take too long. To remedy this, we slightly deviated from the procedure described in the ABC-Net paper: we still calculate the binary tensors and their multipliers in the same way, but instead of using them in separate convolutions, we use them to approximate the original weights again. In theory this should give the same results; in practice, both methods had small differences in their outputs, probably due to floating-point errors. These errors, however, were small enough not to make a real difference during training or in performance. In the code snippet below, we first have two functions implementing the shift operation as in equation 2 from [2].
import torch.nn as nn
from torch import Tensor
import torch
def shift_parameter_binarization(tensor, shift_parameters, empty_cache):
    # Mean-center the weights and binarize them once per shift parameter u_i
    # (equation 2 from [2]); the M binary tensors are concatenated along dim 0.
    if empty_cache:
        torch.cuda.empty_cache()
    return torch.cat(
        [((tensor - tensor.mean()) + shift_parameter).sign() for shift_parameter in shift_parameters])


def shift_parameter_binarization_activation(tensor, shift_parameters, empty_cache):
    # The same shift-and-sign operation, applied to the activations (N shifts instead of M).
    if empty_cache:
        torch.cuda.empty_cache()
    return torch.cat(
        [((tensor - tensor.mean()) + shift_parameter).sign() for shift_parameter in shift_parameters])


class ABCConv2d(nn.Conv2d):
    def __init__(self, in_channels, out_channels, kernel_size, M=3, N=None, estimated_weights=True, stride=1,
                 padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros'):
        super(ABCConv2d, self).__init__(in_channels, out_channels, kernel_size, stride=stride,
                                        padding=padding, dilation=dilation, groups=groups,
                                        bias=bias, padding_mode=padding_mode)
        self.M = M
        self.N = N
        self.use_estimated_tensors = estimated_weights

    # Calculates the alphas based on the binarized and real weights using linear regression
    def get_alphas(self, B):
        vectorized_B = B.view(self.M, -1).t()
        return vectorized_B.t().mm(vectorized_B).inverse().mm(vectorized_B.t()).mv(self.weight.data.view(-1))

    # Binarize the weights into M matrices
    def get_B(self):
        weights = self.weight.data
        return shift_parameter_binarization(weights, [(-1 + i * 2 / (self.M - 1)) * weights.std() for i in
                                                      range(self.M)], empty_cache=(not self.training))

    # Binarize the activations into N matrices and re-estimate them via linear regression
    def get_binary_activation(self, input):
        binary_input = shift_parameter_binarization_activation(input,
                                                               [(-1 + i * 2 / (self.N - 1)) * input.std() for i in
                                                                range(self.N)],
                                                               empty_cache=(not self.training))
        vectorized_binary_input = binary_input.view(self.N, -1).t()
        betas = vectorized_binary_input.t().mm(vectorized_binary_input).pinverse().mm(vectorized_binary_input.t()).mv(
            input.view(-1))
        estimated_binary_input = torch.mv(vectorized_binary_input, betas)
        return estimated_binary_input.view(input.shape)

    # Reassemble the approximated full-precision weights W ~ sum_i alpha_i * B_i
    def get_estimated_weights(self, B, alpha):
        vectorized_B = B.view(self.M, -1).t()
        estimated_weights = torch.mv(vectorized_B, alpha)
        return estimated_weights.view(self.weight.shape)

    def forward(self, x: Tensor) -> Tensor:
        B = self.get_B()
        alphas = self.get_alphas(B)
        if self.N:
            # Writing to .data keeps the graph intact, so the gradient passes straight through.
            x.data = self.get_binary_activation(x)
        if self.use_estimated_tensors:
            # Single convolution with the re-estimated weights (the bias is applied here).
            x = nn.functional.conv2d(x, self.get_estimated_weights(B, alphas), self.bias, self.stride,
                                     self.padding, self.dilation, self.groups)
        else:
            # M separate, bias-free convolutions with the binary bases, scaled by the alphas.
            B_split = B.view(self.M, *self.weight.shape)
            x = sum(alphas[i] * nn.functional.conv2d(x, B_split[i],
                                                     stride=self.stride,
                                                     padding=self.padding,
                                                     dilation=self.dilation,
                                                     groups=self.groups) for i in range(self.M))
            if self.bias is not None:
                x = x + self.bias[None, :, None, None]
        return x
Initialization
A very useful aspect of ABC-Net is the fact that it uses multiple bits to approximate the original real-valued weights. This means that if you have a trained model and you want to replace the normal convolutions with the approximated convolutions of ABC-Net, you can use the weights of the original model as the initialization of the new one. Some training is still needed to finetune the weights to the new architecture, but much less than with random initialization. Once we implemented this, we only trained each network for 1 epoch in our testing. This also helped with testing and debugging the network, since with binarization the linear regression has to be performed for each layer in every iteration, which made training 2 to 12 times slower depending on the number of bits used. Without this initialization, some networks would have needed to train for 12 days, which would not have been feasible for this project.
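Because ABCConv2d subclasses nn.Conv2d and the layer names are kept identical to the original model, this initialization essentially boils down to loading the pretrained state dict. A minimal sketch, where `lsid_pretrained.pth` is an assumed checkpoint file with matching parameter names:

```python
import torch

model = ABCLSID(M=3, N=None)

# Load the full-precision LSID weights; strict=False tolerates any keys that do not
# line up exactly (e.g. when binary transposed convolutions are swapped in).
state_dict = torch.load("lsid_pretrained.pth", map_location="cpu")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)

# From here only a short fine-tuning run (a single epoch in our experiments)
# is needed to adapt the weights to the binarized convolutions.
```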
Bias
The ABC-Net paper makes no mention of using biases in its convolutions. However, as biases are used by the LSID network, we did need to support them. Since biases only account for a small part of the memory used and the computation needed, we decided not to binarize the biases in the network.
Normalization
As mentioned in the section "ABC-Net", one of the things needed to be able to binarize the activations is normalization. In ABC-Net this is done using batch normalization. However, this was not possible in our case, since the GPU we used did not have enough memory. When only binarizing the weights, a batch size of 8 could be used, which is already very small for batch normalization to work well. When we also binarized the activations, even more memory was needed, since the binarized tensors still had to be stored as 32-bit floats; this meant that we could only use a batch size of 1. We therefore looked into three other types of normalization: instance normalization, group normalization and layer normalization. Unfortunately, all of these normalization techniques hurt the performance of the model, with very low PSNRs even before binarizing the activations. We speculate that this might have to do with the fact that we use the weights of the trained LSID model as initialization: adding normalization can cause the activations to change significantly, which means that the weights also need to adapt.
Activation Binarization
Since we could not get normalization to work, and the binarization of the activations in ABC-Net depends on it, we needed to find another way. In the end we chose to do it in the same way as the binarization of the weights: we pick the thresholds for binarization evenly distributed from minus the standard deviation to plus the standard deviation around the mean of the activation, after which the betas are calculated using linear regression. This cannot lead to good efficiency during inference, as the linear regression still has to be done at each layer for every input. We considered this an acceptable compromise, since without the efficient operators we could not show the actual efficiency benefits in practice anyway, and this approach should still give comparable results in terms of accuracy.
Training details
To improve the reproducibility of the project, we provide some details about our training process. We use the weights of the model trained by the authors as initialization. We then train the model for 1 epoch using the 27975 image patches of size 512 by 512. The learning rate we used was $10^{-4}$. We tested multiple combinations of $M$ and $N$ (real-valued, $1$, $3$ and $5$). A condensed sketch of the training loop is shown below.
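For completeness, here is that sketch. The Adam optimizer and $L_1$ loss follow the original LSID setup [1]; `patch_loader` is an assumed PyTorch DataLoader over the stored 512x512 patches, and the ground-truth patches are assumed to be prepared at the network's output resolution.

```python
import torch

model = ABCLSID(M=3, N=3).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.L1Loss()

model.train()
for dark_patch, bright_patch in patch_loader:  # assumed DataLoader over the stored patches
    optimizer.zero_grad()
    output = model(dark_patch.cuda())
    loss = criterion(output, bright_patch.cuda())
    loss.backward()
    optimizer.step()
```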
Results
The results are shown in the table below, and the visual comparison can be seen by running the code cell further down. When a cell in the table contains an "x", that specific part has not been binarized and thus contains full-precision values. As a reminder: $M$ is the number of bits used to approximate the weights, and $N$ is the number of bits used to approximate the activations. The theoretical speedup has been calculated per setting as described earlier in this blog.
| Model | M | N | PSNR | Theoretical speedup | Memory reduction |
|---|---|---|---|---|---|
| Results by [1], in TensorFlow | x | x | 28.88 | 1x | 1x |
| Results of [3], in PyTorch | x | x | 28.55 | 1x | 1x |
| ABCLSID | 3 | x | 27.6 | 0.66x | 10.67x |
| ABCLSID | 1 | x | 25.70 | 2x | 32x |
| ABCLSID | 1 | 1 | 20.21 | 55.11x | 32x |
| ABCLSID | 3 | 3 | 20.93 | 6.77x | 10.67x |
| ABCLSID | 5 | 3 | 21.19 | 4.08x | 6.4x |
| ABCLSID | 3 | 5 | 21.74 | 4.08x | 10.67x |
| ABCLSID | 5 | 5 | 21.86 | 2.46x | 6.4x |
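For reference, the PSNR values above are computed in the standard way; a minimal sketch, assuming the prediction and ground truth are tensors scaled to $[0, 1]$:

```python
import torch

def psnr(prediction, target, eps=1e-10):
    """Peak signal-to-noise ratio in dB for image tensors scaled to [0, 1]."""
    mse = torch.mean((prediction - target) ** 2)
    return 10.0 * torch.log10(1.0 / (mse + eps))
```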
from IPython.display import Image
from google.colab import widgets
from matplotlib import pylab
tb = widgets.TabBar(["M=1, N=x" ,"M=1, N=1", "M=3, N=x" ,"M=3, N=3" ,"M=5, N=3" ,"M=5, N=5"])
image_ids = ["1XZuoIFNmI65Wuhnv4lEVDMP5YvVfy2fV","1Ypb9LUoJt5V44qu9zHXobbNwoa4NOZcn","1ihb4AvT8Jjj8g5N0zfLrss7-Bms0Vr5g","1niA76Jwfp6cIdWmnv70omRFBtAaut1Ub","1CK_rFLYvjPqkw2j4RIng8grU6ovAATWc","1BN66SlR4UVNOn09txj-9RQKpC-aKtVPX" ]
for t, i in enumerate(tb):
    display(Image(url="https://docs.google.com/u/0/uc?id=" + image_ids[i], width=1000, height=700))
Discussion
Looking at the results in the table and the images, you can see that the network still performs quite well when only the weights are binarized: the output for $M = 3$ with real-valued activations is almost indistinguishable from the original model (even in the direct interactive comparison). Even with 1-bit weights $(M = 1)$ the results are still passable, considering that 32x less memory is needed. However, when we start to binarize the activations, the performance of the network drops significantly: even the network with 1-bit weights and real-valued activations performs much better than the network with 5-bit weights and 5-bit activations. We are not exactly sure about the reason for this. It could be that binarizing the activations simply causes too much loss of information for the network to function properly. For example, in the images with binary activations you can see that a lot of the colour is lost; especially for $N=1$, the image has almost become black and white.
One possible explanation for the loss of information is that, for $M = N = 3$, we noticed that the tensors of the estimated weights and activations only contained $4$ or $5$ unique values, whereas with 3 binary bases it should be possible to represent up to $8$ unique levels. For $M = N = 5$, it should be possible to represent $32$ different values, but we found that the tensors only contained $5$ or $6$ unique values.
One explanation for this could be that the shifting of parameters to form $\boldsymbol{B}$ is not done optimally. Right now the shifts, which set the thresholds for binarization, are uniformly distributed over a range from minus the standard deviation to plus the standard deviation, as described in the ABC-Net paper. However, since the distribution of weights often resembles a Gaussian distribution, it might be better to put the thresholds close to the mean closer together and the thresholds further from the mean further apart, to better match that distribution. In the interest of time, we were unable to investigate this further, but we expect that if more of the possible weight levels actually occur, this will lead to a smaller approximation error, which in turn helps to improve performance.
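A simple variant (which we did not try) would be to place the shifts at evenly spaced quantiles of a Gaussian fitted to the weights, rather than spacing them uniformly; a sketch of the idea:

```python
import torch

def gaussian_quantile_shifts(weights, M):
    """Place the M shift parameters u_i at evenly spaced quantiles of a normal
    distribution fitted to the (mean-centred) weights, instead of spacing them
    uniformly in [-std, +std]. Untested idea, included only as an illustration."""
    std = weights.std()
    probs = torch.linspace(1.0 / (M + 1), M / (M + 1.0), M)
    # Inverse CDF of N(0, std^2) at the chosen probabilities: thresholds end up
    # closer together near the mean and further apart in the tails.
    return std * torch.distributions.Normal(0.0, 1.0).icdf(probs)
```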
We deviated in one aspect from the original ABC-Net implementation by implementing the binarization of the activations in the same way as for the weights. This was done because we could not get the ABCLSID network to work with normalization, which probably interfered with our initialization. A direction that could be studied in the future is to take the original LSID network, add normalization to it, and train this network to match the performance of the original model. This new model could then be used to initialize the network when binarizing the activations, so that the network does not also have to adapt to the normalization. It would also allow the use of the learnable parameter beta, which could possibly find better threshold values than the ones we used and thereby improve the approximations and thus the performance of the network.
Another reason for the loss in performance could be that something simply goes wrong in our implementation of the activation binarization. For example, it is possible that when binarizing the activations the gradient is no longer calculated correctly, which could prevent the network from properly learning and adapting to the new structure. During some initial testing we have seen, for instance, that the training and validation error went up after the first epoch when training on a small subset of the training data.
It could also be that there is something wrong with the backpropagation calculation as described in ABC-Net. The definition in the paper, which we use, relies on the STE and is as follows:
$$\frac{\delta c}{\delta\textbf{W}} = \sum\limits_{m=1}^M \alpha_m \frac{\delta c}{\delta \textbf{B}_m}$$
However, according to the author of the ABC-Net implementation cow8/ABC-Net-pytorch (https://github.com/cow8/ABC-Net-pytorch), this is not correct. They state that the correct formulation is:
$$\frac{\delta c}{\delta\textbf{W}} = \frac{\delta c}{\delta\textbf{O}}\cdot \left( \sum\limits_{m=1}^M \left( \frac{\delta \textbf{O}}{\delta \alpha_m Conv(B_m, A)} \cdot \left( \frac{\delta\alpha_m}{\delta W} \cdot Conv(B_m, A) + \frac{\delta Conv(B_m, A)}{\delta W} \cdot \alpha_m \right) \right) \right) $$
We are not 100\% sure that this is indeed the case, especially since it comes not from a peer-reviewed paper but from a public GitHub repository containing an implementation that does not yet work. However, it might be interesting to explore this further in the future.
References
[1] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark”, CoRR, vol. abs/1805.01934, 2018. [Online]. Available: http://arxiv.org/abs/1805.01934
[2] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network”, CoRR, vol. abs/1711.11294, 2017. [Online]. Available: http://arxiv.org/abs/1711.11294
[3] Cydonia, "Learning to See in the Dark in PyTorch", 2018. [Online]. Available: https://github.com/cydonia999/Learning_to_See_in_the_Dark_PyTorch
[4] Mylio, 2020. [Online]. Available: https://focus.mylio.com/tech-today/how-many-photos-will-be-taken-in-2020
[5] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks”, CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[6] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC”, in 2016 International Conference on Field-Programmable Technology (FPT), 2016, pp. 77–84.
Want to see more reproductions of this and other papers? Check https://reproduced-papers.firebaseapp.com/