Deep learning: How much can image augmentation do?

Sijmen van der Willik
Mar 19, 2019


A while ago, when using image augmentation (if you’re unfamiliar, check this post), I started to wonder: how far can I take this image augmentation?

Of course, there’s always the “it depends” for every domain, model, task, yadda yadda yadda. Instead, let’s just run some tests.

We will be using the Kaggle cats vs. dogs dataset. Only a subset of the full set will be used: 1000 images for training (well, more on that later) and 1000 for testing, each half cats and half dogs.

We will create a number of datasets: for each dataset we pick a number of images from the training set and use augmentation to bring the total to 1000 images.

╔══════════╦════════════════╦═════════╦════════════╦══════════╗
║ N source ║ N augmentation ║ N total ║ Multiplier ║ Fraction ║
╠══════════╬════════════════╬═════════╬════════════╬══════════╣
║       10 ║            990 ║    1000 ║        99x ║     0.99 ║
║       20 ║            980 ║    1000 ║        49x ║     0.98 ║
║       50 ║            950 ║    1000 ║        19x ║     0.95 ║
║      100 ║            900 ║    1000 ║         9x ║     0.90 ║
║      200 ║            800 ║    1000 ║         4x ║     0.80 ║
║      500 ║            500 ║    1000 ║         1x ║     0.50 ║
║     1000 ║              0 ║    1000 ║         0x ║     0.00 ║
╚══════════╩════════════════╩═════════╩════════════╩══════════╝

The above table shows the structure of each dataset in terms of the number of different source images.

  • N source is the number of unique images taken from the original set; it is also used as the name of the dataset later on
  • N augmentation is the number of augmented images created and added to the set
  • N total is the total number of images in the set
  • Multiplier is the number of times each source image is augmented
  • Fraction is the fraction of augmented images in the set

Note that the final dataset 1000 contains no augmented images whatsoever.
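
The counts follow directly from the number of source images and the fixed total of 1000. As a quick sketch (plain Python, the names are mine):

TOTAL = 1000  # every dataset ends up with exactly 1000 images

for n_source in [10, 20, 50, 100, 200, 500, 1000]:
    n_aug = TOTAL - n_source         # augmented images to add
    multiplier = n_aug // n_source   # augmented copies made of each source image
    fraction = n_aug / TOTAL         # fraction of augmented images in the set
    print(n_source, n_aug, multiplier, fraction)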

The 10 images used as a source for augmentation for the first dataset.

I used the imgaug library for Python; the augmentation sequence is as follows:

from imgaug import augmenters as iaa

seq = iaa.Sequential([
    iaa.Affine(rotate=(-45, 45)),  # rotate by -45 to 45 degrees
    iaa.Crop(px=((0, 100), (0, 100), (0, 100), (0, 100))),  # crop 0-100 px from each side
    iaa.Pad(px=((0, 40), (0, 40), (0, 40), (0, 40))),  # pad 0-40 px on each side
    iaa.Fliplr(0.5),  # flip over the vertical axis with 0.5 probability
    iaa.GaussianBlur((0.1, 2.0)),  # blur with a random sigma between 0.1 and 2.0
    iaa.Multiply((0.8, 1.2), per_channel=True)  # change brightness per channel
], random_order=True)

Or in words: rotation, cropping, padding, horizontal flipping, blurring, and brightness modification. Note that the brightness modification (Multiply) is done with a separately drawn number for each channel, meaning it can change colors. Below is an example of six of the source images and a few of their augmented versions.
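
To give an idea of how each set could be put together: a minimal sketch (my own helper, not the exact script used for the post) that takes the chosen source images and fills the set up to 1000 with augmented copies produced by the seq defined above:

import numpy as np

def build_dataset(source_images, total=1000):
    # e.g. 10 source images -> 990 augmented copies, i.e. 99 per source image
    copies_per_image = (total - len(source_images)) // len(source_images)
    images = list(source_images)
    for img in source_images:
        for _ in range(copies_per_image):
            images.append(seq.augment_image(img))  # new random parameters every call
    return np.array(images)

The labels would simply be carried along in the same way.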

Examples for six source images (first column) and five different augmentations for each source image.

We will create a small CNN using Keras as follows:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense

# three convolution + max-pooling blocks
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# fully connected classifier head with a single sigmoid output
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

The input shape is whatever you choose to resize the images to; I’ve chosen 150 by 150 pixels.

I’ve trained the same network for each dataset for a total of 20 epochs.
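
The compile and fit calls aren’t shown here, so below is a hedged sketch of what they could look like. The binary cross-entropy loss (single sigmoid output) and the 20 epochs follow from the text, while the optimizer, batch size, and variable names are my assumptions:

# assuming input_shape = (150, 150, 3) was defined before building the model
model.compile(loss='binary_crossentropy',   # two classes, single sigmoid output
              optimizer='rmsprop',          # assumption: not specified in the post
              metrics=['accuracy'])

model.fit(x_train, y_train,                 # one of the seven 1000-image datasets
          epochs=20,
          batch_size=32,                    # assumption
          validation_data=(x_test, y_test)) # the 1000-image test set, used as validation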

Accuracy on the training set and the validation set after each epoch of training.

As expected, it overfits very quickly. On the training data, the model gets pretty close to perfect accuracy, while the validation accuracy is barely better than random guessing, which would give 0.5 for two classes.

Now for what we’ve been waiting for: the results of training on the other six datasets. I’ve not included the training accuracy for the other datasets, as it goes pretty much to 1.0 in every case.

Accuracy on the validation set after each epoch of training for each of seven datasets.

Yep, you’re right, that’s a pretty terrible plot. Let’s run the same training for each dataset a couple of times and average the results to smooth out the graph (and also for science).
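
The averaging itself is straightforward; a minimal sketch, assuming the per-epoch validation accuracy of every run is collected in a list (variable names are mine; depending on the Keras version the history key is 'val_acc' or 'val_accuracy'):

import numpy as np

# run_histories: one Keras History object per training run on the same dataset
val_accs = np.array([h.history['val_acc'] for h in run_histories])  # shape (n_runs, n_epochs)
mean_val_acc = val_accs.mean(axis=0)  # averaged curve, one value per epoch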

Accuracy on the validation set after each epoch for each of seven datasets averaged over twenty runs done for each set.

So that’s slightly more readable. A pretty large jump in performance can be seen between the 10 and 20 datasets, and then again between 500 and 1000.

While I was definitely expecting the super bad performance of the 10 dataset, I didn’t expect the 20 dataset to already be making such a leap.

The accuracy of the 10 and 20 datasets peaks after a single epoch. Less diverse data means the network converges faster on the data it has available, so this was to be expected.

As you need more data to effectively represent more features, the jumps could be explained by crossing a threshold necessary to capture another level of abstraction, i.e. another feature useful for successful classification.

For example, there may be a number of features which distinctly identify a cat. For every feature the network learns to recognize successfully, its accuracy goes up. When a previously undetected feature starts being detected, a jump in accuracy can be expected.

I could also look further into this by examining the newly correctly classified images: images that are now classified correctly but were misclassified by a network trained on less diverse data. What do these images have in common? Does this explain a jump in accuracy?
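
A sketch of what that comparison could look like, assuming two trained models and the labelled test images are available (the variable names are mine):

import numpy as np

# predicted classes (0/1) from a model trained on few vs. many source images
preds_small = (model_small.predict(x_test) > 0.5).astype(int).ravel()
preds_large = (model_large.predict(x_test) > 0.5).astype(int).ravel()

# indices of images the larger-data model gets right but the smaller-data model got wrong
newly_correct = np.where((preds_large == y_test) & (preds_small != y_test))[0]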

Augmentation could create more diverse example data to represent these features; we would just need smarter augmentation. But then, how could this augmentation method know what to do without actually looking at more data? Maybe something like Deep Local Features trained on similar data could help?

As usual, trying to figure something out leads to more questions than answers, but that’s all part of the fun.

Final note: the overall accuracy is pretty low (0.90+ is definitely possible). Using the same datasets with a different network or classifier might make a (big) difference, but that is for another time.
