Image Colorization using Regression, Classification and GAN

Classic old photos were limited by the technology of their era and have often been passed down only in grayscale. Although black-and-white photos can carry a poetic quality of their own, colorized photos give viewers a stronger sense of immersion, and restoring black-and-white photos is a popular topic. We explore ways to colorize grayscale input images with plausible colors. Rather than training models to recover the original colors exactly, we aim for colorizations that people consider reasonable.


  • Adopted a ResNet structure in the regression and classification models to extract features from the grayscale images, then applied deconvolutional layers to upscale the extracted features. The regression model predicts the colors of the image directly, while the classification model picks a color for each pixel from a predicted color probability distribution.
  • Implemented the GAN's generator as a ResNet structure that outputs colored images from grayscale images. The GAN's discriminator is trained to output false for a generated (fake) colored image and true for a real colored image.


  • The dataset is drawn from the color images in the ImageNet ILSVRC2014 dataset. We randomly pick images, resize them to 128×128 pixels, and convert them to the LAB color space. Some of the randomly selected images are not very colorful: there is little difference between the original image and its grayscale conversion, so they are not helpful for training. To solve this problem, as demonstrated in the figure below, we apply Hasler and Süsstrunk's approach, calculating a colorfulness score for each image and keeping only the colorful images.
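
The colorfulness filter can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the project's actual code: it assumes the commonly used form of the Hasler and Süsstrunk metric (opponent channels rg = R − G and yb = 0.5(R + G) − B, combined as std + 0.3 · mean), and the names `colorfulness`, `keep_colorful`, and the `threshold` value are hypothetical.

```python
import math
from statistics import mean, pstdev

def colorfulness(pixels):
    """Colorfulness score for a sequence of (R, G, B) tuples in 0..255."""
    pixels = list(pixels)
    rg = [r - g for r, g, b in pixels]              # red-green opponent channel
    yb = [0.5 * (r + g) - b for r, g, b in pixels]  # yellow-blue opponent channel
    std_root = math.sqrt(pstdev(rg) ** 2 + pstdev(yb) ** 2)
    mean_root = math.sqrt(mean(rg) ** 2 + mean(yb) ** 2)
    return std_root + 0.3 * mean_root

def keep_colorful(images, threshold):
    """Return the names of images whose score reaches the (hypothetical) threshold."""
    return [name for name, px in images.items() if colorfulness(px) >= threshold]

# Toy data: a pure-gray image scores 0, a red/blue image scores high.
images = {
    "gray": [(100, 100, 100), (50, 50, 50)],
    "vivid": [(255, 0, 0), (0, 0, 255)],
}
selected = keep_colorful(images, threshold=10.0)
```

A gray image has R = G = B at every pixel, so both opponent channels are zero everywhere and the score is exactly 0, which is why such images are filtered out.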



  • Inspired by semantic segmentation, we use an encoder-decoder style architecture to tackle this regression problem. We first use several convolutional layers to extract semantic information from the input images and then apply deconvolutional layers to upscale the extracted information. Specifically, the beginning of our model is a ResNet-18, an image classification network with 18 layers and residual connections. We modify the first layer of the network to accept grayscale input images and cut the network off after the 6th set of layers. At the end of the network, the model predicts a two-element vector (the A and B channels) for each pixel of the image, as demonstrated in the figure below.
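
To see how the spatial resolution moves through such an encoder-decoder, the sketch below traces feature-map sizes for a 128×128 input. It rests on assumptions that the text does not confirm: that "the 6th set of layers" corresponds to the first six children of the standard ResNet-18 layout (conv1, bn1, relu, maxpool, layer1, layer2), and that the decoder uses three stride-2 transposed convolutions with kernel 4 and padding 1. The decoder channel counts (64, 32, 2) are hypothetical apart from the final 2, which matches the two AB outputs.

```python
def conv_out(n, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding)."""
    return (n - 1) * stride - 2 * padding + kernel

def trace_shapes(size=128):
    """Return (layer name, channels, spatial size) for each stage."""
    shapes = []
    size = conv_out(size, 7, 2, 3)          # conv1, modified to take 1 input channel
    shapes.append(("conv1", 64, size))      # bn1 and relu keep the shape
    size = conv_out(size, 3, 2, 1)
    shapes.append(("maxpool", 64, size))
    shapes.append(("layer1", 64, size))     # stride 1, shape unchanged
    size = conv_out(size, 3, 2, 1)          # first block of layer2 has stride 2
    shapes.append(("layer2", 128, size))
    # Assumed decoder: three stride-2 transposed convolutions back to full size.
    for i, channels in enumerate((64, 32, 2), start=1):
        size = deconv_out(size, 4, 2, 1)
        shapes.append((f"deconv{i}", channels, size))
    return shapes

for name, channels, size in trace_shapes(128):
    print(f"{name:8s} {channels:4d} x {size} x {size}")
```

Under these assumptions the encoder reduces the image to 128 feature maps of 16×16, and three doublings restore the 128×128 resolution with two output channels per pixel.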


  • We minimize the Euclidean (L2) error between the estimated AB channels and the ground truth. However, this loss function is problematic for colorization because the problem is multimodal: a single grayscale image may have several plausible colorizations. As a result, our model usually chooses desaturated colors, which are less likely to be "very wrong" than bright, vibrant colors.
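
A small worked example shows why a squared-error loss drifts toward gray. Suppose, purely for illustration, that a pixel has two equally plausible colors on opposite sides of the AB plane; the values below are made up. The prediction that minimizes the expected squared error is their mean, which sits at the gray point (0, 0).

```python
def l2_optimal(candidates):
    """Mean of the candidate (a, b) colors: the minimizer of expected squared error."""
    n = len(candidates)
    return (sum(a for a, _ in candidates) / n,
            sum(b for _, b in candidates) / n)

def expected_sq_error(pred, candidates):
    """Average squared Euclidean distance from pred to each equally likely candidate."""
    return sum((pred[0] - a) ** 2 + (pred[1] - b) ** 2
               for a, b in candidates) / len(candidates)

# Two hypothetical, equally plausible colors for the same gray pixel.
plausible = [(60.0, 40.0), (-60.0, -40.0)]

best = l2_optimal(plausible)                             # (0.0, 0.0): gray
err_gray = expected_sq_error(best, plausible)            # 5200.0
err_vivid = expected_sq_error(plausible[0], plausible)   # 10400.0
```

Committing to either vivid color doubles the expected error compared with predicting gray, so the loss rewards the desaturated answer even though it matches neither plausible color.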


  • We recast the regression method as classification to address the multimodality problem. We quantize the AB space of the LAB color space into 313 bins and assign every pixel a bin number between 0 and 312. The color prediction task is now a multinomial classification problem in which every gray pixel chooses its AB value from 313 classes. Inspired by the architecture proposed by Zhang et al., we use a single-stream, VGG-styled network with added depth, dilated convolutions, and deconvolutional layers. Each block has two or three convolutional layers followed by a Rectified Linear Unit (ReLU) and terminates in a Batch Normalization layer, as demonstrated in the figure below.
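
The mapping between AB values and class indices can be sketched as follows. The grid here is a toy 5×5 set of bin centers chosen only for illustration; the real model uses the 313 bins described above, and the helper names are hypothetical. Taking the most probable bin at decoding time is also an assumption, since the text does not say how a color is picked from the predicted distribution.

```python
import math

def make_grid(lo=-20, hi=20, step=10):
    """Toy bin centers on a regular AB grid (stand-in for the 313 real bins)."""
    return [(a, b) for a in range(lo, hi + 1, step)
                   for b in range(lo, hi + 1, step)]

def nearest_bin(ab, centers):
    """Class index of the bin center closest to the given (a, b) value."""
    return min(range(len(centers)), key=lambda q: math.dist(ab, centers[q]))

def decode_argmax(probs, centers):
    """Pick the AB value of the most probable bin (one possible decoding rule)."""
    return centers[max(range(len(probs)), key=probs.__getitem__)]

centers = make_grid()
label = nearest_bin((12.0, -7.0), centers)   # index of the bin at (10, -10)

# A made-up prediction that puts most of its mass on that bin.
probs = [0.01] * len(centers)
probs[label] = 0.76
color = decode_argmax(probs, centers)
```

Because the network outputs a distribution over bins rather than a single averaged value, it can commit to one saturated bin instead of the gray midpoint between several candidates.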


  • We use a multinomial cross-entropy loss as our objective function. Let Ẑ be the output of the CNN given an input image X. We transform every color image Y in the training set to its corresponding target Z. For every pixel Y_h,w of the original color image, we find the nearest quantized ab bin and represent Z_h,w as a one-hot vector. Since soft-encoding works well for training, we instead find the 5 nearest neighbors of Y_h,w using KNN and weight them according to their distance from the ground truth using a Gaussian kernel with σ = 5. Since the color distribution in ImageNet is concentrated around the gray line, we modify the standard cross-entropy loss, following Zhang et al., into

$$L(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_{q} Z_{h,w,q} \log \hat{Z}_{h,w,q}$$

where the color rebalancing term v(.) is used to rebalance the loss based on the rarity of the color class. This contributes towards getting more vibrant and saturated colors in the output.
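
The soft encoding and the class-rebalanced cross-entropy described here can be sketched for a single pixel as follows. This is a sketch under stated assumptions: the bin centers are a toy 5×5 grid rather than the 313 real bins, the per-class weights are made-up numbers (the text does not say how v(.) is computed), and v is looked up at the most heavily weighted bin of the target, as in Zhang et al.

```python
import math

# Toy bin centers (hypothetical stand-in for the 313 quantized AB bins).
TOY_CENTERS = [(a, b) for a in range(-20, 21, 10) for b in range(-20, 21, 10)]

def soft_encode(ab, centers, k=5, sigma=5.0):
    """Spread a ground-truth (a, b) over its k nearest bins with Gaussian weights."""
    nearest = sorted((math.dist(ab, c), q) for q, c in enumerate(centers))[:k]
    weights = [(q, math.exp(-d * d / (2 * sigma ** 2))) for d, q in nearest]
    total = sum(w for _, w in weights)
    target = [0.0] * len(centers)
    for q, w in weights:
        target[q] = w / total            # normalize so the target sums to 1
    return target

def rebalanced_ce(pred, target, class_weights):
    """One pixel's loss term: -v(Z) * sum_q Z_q * log(Zhat_q)."""
    q_star = max(range(len(target)), key=target.__getitem__)
    ce = -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)
    return class_weights[q_star] * ce

target = soft_encode((10.0, -10.0), TOY_CENTERS)
uniform_pred = [1.0 / len(TOY_CENTERS)] * len(TOY_CENTERS)

# Made-up weights: treat the ground-truth bin as a rare color, weighted 2x.
weights = [1.0] * len(TOY_CENTERS)
weights[target.index(max(target))] = 2.0
loss = rebalanced_ce(uniform_pred, target, weights)
```

With a uniform prediction the plain cross-entropy is log 25 regardless of the target, so the doubled weight makes the effect of rebalancing easy to see: errors on rare colors cost more than errors on common, grayish ones.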


  • Unlike the previous two models, this model contains two neural networks: a generator and a discriminator. The generator works like the previous models: it takes in a grayscale image and outputs a predicted color image. The discriminator is trained to classify the predicted color image as false and the ground-truth color image as true. The pipeline of the model on a grayscale image and the corresponding color image is shown in the figure below.


  • Two loss functions are applied here. First, the discriminator loss is the average of its loss on a true image and its loss on a fake image. Each of these is an MSE loss computed per pixel against a label of 0 (fake) or 1 (true). This loss captures how well the discriminator distinguishes the ground-truth image from the generated image. Second, the generator loss is λ times the L1 loss between the predicted image and the true image, plus the discriminator loss on the predicted image with the label true. This encourages the generator to produce images that are close to the true image and that the discriminator accepts as real.
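
The two losses can be written out on plain lists of numbers to make the bookkeeping concrete. This is a minimal sketch, not the training code: the function names are hypothetical, the L1 term is averaged over pixels (the text does not say whether it is a mean or a sum), and the default `lam=100.0` is only a placeholder because the value of λ is not given.

```python
def mse(values, label):
    """Mean squared error between discriminator outputs and a constant label."""
    return sum((v - label) ** 2 for v in values) / len(values)

def l1(pred, true):
    """Mean absolute difference between predicted and true pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def discriminator_loss(d_real, d_fake):
    """Average of the loss on a true image (label 1) and on a fake image (label 0)."""
    return 0.5 * (mse(d_real, 1.0) + mse(d_fake, 0.0))

def generator_loss(d_fake, pred_img, true_img, lam=100.0):
    """lam * L1(pred, true) plus the discriminator loss on the fake with label 1."""
    return lam * l1(pred_img, true_img) + mse(d_fake, 1.0)

# An undecided discriminator outputs 0.5 everywhere.
d_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])                  # 0.25
# A generator whose output is far from the target and is rejected outright.
g_loss = generator_loss([0.0], [0.0, 0.0], [1.0, 1.0], lam=10.0)     # 11.0
```

The L1 term keeps the generated colors close to the ground truth, while the adversarial term rewards outputs that the discriminator scores as true; λ sets the balance between the two.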


  • For colorization, the ultimate goal is to colorize a grayscale image so that it looks plausible to a human observer. Therefore, to test each model's performance, it is essential to have humans evaluate the quality manually.