I've been playing with the MNIST handwritten digit dataset, which can be downloaded from Dr. Yann LeCun's website here. The data is provided in the form of gzipped IDX files, which are not supported by NumPy¹, so I wrote a small module to import and export these files. You can download idx.py here.

The module is relatively simple to use:

>>> import idx
>>> train_images = idx.loadidx("train-images-idx3-ubyte.gz")
>>> train_images.shape
(60000, 28, 28)
>>> train_labels = idx.loadidx("train-labels-idx1-ubyte.gz")
>>> train_labels.shape
(60000,)
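The source of idx.py isn't shown here, but the IDX format itself is simple: a four-byte magic number (two zero bytes, a type code, and the number of dimensions), followed by one big-endian uint32 per dimension, followed by the raw data. A minimal loader along those lines might look like this (a sketch, not the actual idx.py implementation):

```python
import gzip
import struct

import numpy as np

# IDX type codes, per the format description on Yann LeCun's MNIST page
_IDX_DTYPES = {
    0x08: np.dtype("u1"),   # unsigned byte
    0x09: np.dtype("i1"),   # signed byte
    0x0B: np.dtype(">i2"),  # big-endian short
    0x0C: np.dtype(">i4"),  # big-endian int
    0x0D: np.dtype(">f4"),  # big-endian float
    0x0E: np.dtype(">f8"),  # big-endian double
}


def load_idx(path):
    """Read an IDX file (optionally gzipped) into a NumPy array."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        data = f.read()
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    # One big-endian uint32 per dimension, immediately after the magic number
    shape = struct.unpack(">" + "I" * ndim, data[4 : 4 + 4 * ndim])
    arr = np.frombuffer(data, dtype=_IDX_DTYPES[dtype_code], offset=4 + 4 * ndim)
    return arr.reshape(shape)
```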

Augment data by adding a little bit of normally distributed noise and save:

>>> import numpy as np
>>> noisy_images = (np.random.normal(size=train_images.shape) + train_images
... ).clip(0, 255).astype('uint8')
>>> augmented_images = np.vstack((train_images, noisy_images))
>>> idx.saveidx("augmented-images-idx3-ubyte.gz", augmented_images)
>>> augmented_labels = np.concatenate((train_labels, train_labels))
>>> idx.saveidx("augmented-labels-idx1-ubyte.gz", augmented_labels)

Note that augmented-images-idx3-ubyte.gz is a 23.7 MiB file, so it'll take a little while to write to disk.
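Noise with a standard deviation of 1 is a very gentle perturbation on a 0–255 intensity scale. If you want stronger augmentation, the amplitude is easy to parameterize; here's a small helper (the `sigma` parameter is my own invention, not part of idx.py):

```python
import numpy as np


def add_noise(images, sigma=8.0, seed=None):
    """Return a noisy copy of uint8 images, clipped back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=images.shape)
    # Work in float, then clip and cast back so values stay valid uint8
    return (images.astype(np.float64) + noise).clip(0, 255).astype(np.uint8)
```

Passing a seed makes the augmentation reproducible between runs.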

Both loadidx and saveidx automatically recognize and handle gzipped files by their .gz extension. If for some reason you need one of the other compression formats in the standard library, or support for file streams, let me know and I'll add it.
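Writing is the mirror image of reading: pack the header, then the array bytes in big-endian order, and pick the opener from the extension. A sketch of how such a writer could look (again, not the actual idx.py code):

```python
import gzip
import struct

import numpy as np

# NumPy (kind, itemsize) pairs mapped back to IDX type codes
_IDX_CODES = {
    ("u", 1): 0x08,
    ("i", 1): 0x09,
    ("i", 2): 0x0B,
    ("i", 4): 0x0C,
    ("f", 4): 0x0D,
    ("f", 8): 0x0E,
}


def save_idx(path, arr):
    """Write a NumPy array as an IDX file, gzipped if path ends in .gz."""
    code = _IDX_CODES[(arr.dtype.kind, arr.dtype.itemsize)]
    header = struct.pack(">HBB", 0, code, arr.ndim)
    header += struct.pack(">" + "I" * arr.ndim, *arr.shape)
    # IDX stores multi-byte values big-endian
    body = arr.astype(arr.dtype.newbyteorder(">")).tobytes()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wb") as f:
        f.write(header + body)
```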

Incidentally, I have started using Black to format my Python source files. One reason I like Black is that it automatically breaks long lines (80 columns in my case). In Emacs, I use blacken (available through MELPA) to conveniently format whole buffers.

  1. There is at least one package available on PyPI, but I thought I could do better.