DNN with backpropagation in C++, part 7

Putting it all together

Previous posts can be found at:

“DNN with backpropagation in C++, part 6. Cross-entropy Error”

“DNN with backpropagation in C++, part 5. Adding Softmax layer”

“DNN with backpropagation in C++, part 4. Adding sigmoid activation”

“DNN with backpropagation in C++, part 3. Dense layer with backpropagation and bias”

“DNN with backpropagation in C++, part 2. Two layer NN with backpropagation”

“DNN with backpropagation in C++, part 1. Dense layer with backpropagation”

So far we have build a simple DNN consisting of two Dense layers, Sigmoid activation layer and output Softmax layer.

Now let’s train this DNN to classify handwritten digit images.

MNIST dataset

For training I’ll use MNIST dataset of handwritten digit images. Tbe dataset contains 60000 training images and labels, and 10000 images and labels for validation.

I’ll focus on C++ implementation only. Python equivalent won’t be added because there are many existing MNIST classification examples available in Python for all major deep learning frameworks.

First, let’s download the dataset and extract it into data folder.

$ mkdir data && cd data
$ wget https://github.com/alexgl-github/alexgl-github.github.io/raw/main/mnist/t10k-images-idx3-ubyte.gz
$ wget https://github.com/alexgl-github/alexgl-github.github.io/raw/main/mnist/t10k-labels-idx1-ubyte.gz
$ wget https://github.com/alexgl-github/alexgl-github.github.io/raw/main/mnist/train-images-idx3-ubyte.gz
$ wget https://github.com/alexgl-github/alexgl-github.github.io/raw/main/mnist/train-labels-idx1-ubyte.gz
$ gunzip *
$ cd ..

Dataset images and labels binary files have the following format.

Labes binary file format:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801       magic number
                                          (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The header contains magic number and total number of labels in the dataset. Each label has size of 1 byte and is between 0 and 9 indicating handwritten digit value.

Images binary file format:

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
                                          (MSB first)
0004     32 bit integer  10000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

File header contains magic number, total number of images in the dataset, image height and width. Each image record contains pixel values for one image of size rows by pixels per row, in row scan order format.

Pixels are one pixel per byte with values between 0 and 255. All images are of height 28 rows and width of 28 pixels per row.

Example of MNIST image for label 6:

mnist_image

C++ implementaiton of MNIST dataset parser

Using MNIST format information, we can implement C++ code for reading MNIST images and labels. The values in MNIST binary files are in MSB (big-endian) format, so I’ll need to add MSB to LSB conversion to the implementation.

Include header files used in the code.

#include <cstdio>
#include <vector>
#include <array>
#include <algorithm>
#include <cassert>
#include <array>
#include <chrono>
#include <sstream>
#include <string>
#include <iterator>
#include <variant>
#include <random>
#include "mnist.h"
#include "f1.h"

using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::microseconds;
using std::chrono::milliseconds;
using std::chrono::seconds;

Add comment with specification reference and format of the binaries.

/*
 * Full dataset format specification is at http://yann.lecun.com/exdb/mnist/
 *
 * TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
 * [offset] [type]          [value]          [description]
 * 0000     32 bit integer  0x00000801(2049) magic number (MSB first)
 * 0004     32 bit integer  60000            number of items
 * 0008     unsigned byte   ??               label
 * 0009     unsigned byte   ??               label
 * ........
 * xxxx     unsigned byte   ??               label
 * The labels values are 0 to 9.
 *
 * TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
 * [offset] [type]          [value]          [description]
 * 0000     32 bit integer  0x00000803(2051) magic number
 * 0004     32 bit integer  60000            number of images
 * 0008     32 bit integer  28               number of rows
 * 0012     32 bit integer  28               number of columns
 * 0016     unsigned byte   ??               pixel
 * 0017     unsigned byte   ??               pixel
 * ........
 * xxxx     unsigned byte   ??               pixel
 *
 * Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
 *
 */

Template C++ class for reading MNIST dataset.

It contains:

dataset object constructor from provided image and labels binary file path;
read_next(image, label) for reading next image and next label values, and scaling pixel values to floating point [0, 1) range;
rewind() to reset dataset file pointer to the beginning;
other helper functions.

/*
 * MNIST dataset class
 *
 * num_classes=10, image size=28x28 are as per dataset specifiction
 *
 * typename T=float is type of the data returned by read_next_image call
 *
 */
template<size_t num_classes=10, size_t image_size=28*28, typename T=float>
struct mnist
{
  enum error
    {
      MNIST_OK = 0,
      MNIST_EOF = -1,
      MNIST_ERROR = -2
    };

  /*
   * Header of images binary file
   */
  struct images_header
  {
    int32_t magic;
    int32_t count;
    int32_t num_rows;
    int32_t num_cols;
  };

  images_header images;

  /*
   * Header of labels binary file
   */
  struct labels_header
  {
    int32_t magic;
    int32_t count;
  };

  labels_header labels;

  /* images file descriptor */
  int fd_images = -1;
  /* labels file descriptor */
  int fd_labels = -1;
  /* images magic number */
  const int32_t magic_image_value = 0x00000803;
  /* labels magic number */
  const int32_t magic_label_value = 0x00000801;

  /*
   * Initialize MNIST dataset from image file and labels file
   * Read and verify magic numbers
   * Verify number if images matches number of labels
   * Verify read image size matches template parameter image_size
   */
  mnist(const char* path_images, const char* path_labels)
  {
    int ret;

    /*
     * Read and verify images header
     */
    fd_images = open(path_images, O_RDONLY);
    assert(fd_images != -1);

    ret = read(fd_images, &images, sizeof(images));
    assert(ret == sizeof(images));
    assert(endian_swap<int32_t>(images.magic) == magic_image_value);
    images.count = endian_swap<int32_t>(images.count);
    images.num_rows = endian_swap<int32_t>(images.num_rows);
    images.num_cols = endian_swap<int32_t>(images.num_cols);
    assert(images.num_rows * images.num_cols == image_size);

    /*
     * Read labels header
     */
    fd_labels = open(path_labels, O_RDONLY);
    assert(fd_labels != -1);

    read(fd_labels, &labels, sizeof(labels));
    assert(endian_swap<int32_t>(labels.magic) == magic_label_value);
    labels.count = endian_swap<int32_t>(labels.count);
    assert(labels.count == images.count);
  }

  /*
   * Close file descriptors in dtor
   */
  ~mnist()
  {
    close(fd_images);
    close(fd_labels);
  }

  /*
   * Read image pixel values
   */
  int read_next_image(std::array<T, image_size>& data)
  {
    /*
     * Read uint_8 pixel values
     */
    std::array<uint8_t, image_size> raw_data;
    ssize_t ret = read(fd_images, raw_data.data(), image_size);
    if (ret == 0)
      {
        return MNIST_EOF;
      }

    assert(ret == image_size);

    /*
     * convert to floating point and normalize
     */
    std::transform(raw_data.begin(), raw_data.end(), data.begin(),
              [](const uint8_t x) {
                return static_cast<T>(x) / 256.0;
              }
              );
    return MNIST_OK;
  }

  /*
   * Read image label and return on one-hot encoded formt
   */
  int read_next_label(std::array<T, num_classes>& label)
  {
    uint8_t label_index = 0;

    /*
     * Fill one hot array with zeros
     */
    std::fill(label.begin(), label.end(), 0);

    /*
     * Read label value
     */
    ssize_t ret = read(fd_labels, &label_index, sizeof(label_index));

    if (ret == 0)
      {
        return MNIST_EOF;
      }

    assert(ret == sizeof(label_index));
    assert(label_index < num_classes);

    /*
     * Set one hot array at label value index to 1
     */
    label[label_index] = 1.0;
    return MNIST_OK;
  }

  /*
   * Wrapper for read_next_image() and read_next_label()
   */
  int read_next(std::array<T, image_size>& image, std::array<T, num_classes>& label)
  {
    auto ret1 = read_next_image(image);
    auto ret2 = read_next_label(label);
    if (ret1 == MNIST_EOF || ret2 == MNIST_EOF)
      {
        return MNIST_EOF;
      }
    assert(ret1 == MNIST_OK);
    assert(ret2 == MNIST_OK);
    return MNIST_OK;
  }

  /*
   * Reset file offsets to the beginning of data
   * skipping headers
   */
  void rewind()
  {
    if (fd_images != -1 && fd_labels != -1)
      {
        /*
         * seek to data offsets in labels and images. For offsets see start of mnist.h
         */
        lseek(fd_images, sizeof(images), SEEK_SET);
        lseek(fd_labels, sizeof(labels), SEEK_SET);
      }
  }

  /*
   * Utility to convert betwwen big/little endiannes
   */
  template<typename S = uint32_t>
  static S endian_swap(S val)
  {
    typedef union
    {
      S val;
      unsigned char array[sizeof(S)];
    } swap_t;
    swap_t src, dst;
    src.val = val;
    for (size_t i = 0; i < sizeof(S); i++)
      {
        dst.array[i] = src.array[sizeof(S) - i - 1];
      }
    return dst.val;
  }

};

F1 score

To validate training results I’ll be using multi-class F1 score metric. F1 score is a very common metric used to rate accuracy of a classification DNN.

Each class score is harmonic mean of precision and recall scores. Output score is mean of all class scores.

\[F1(class) = \frac {2 * Precision(class) * Recall(class)} {Precision(class) + Recall(class)}\]

where

\[Precision(class) = \frac {TruePositives(class)} {TruePositives(class) + FalsePositives(class)}\] \[Recall(class) = \frac {TruePositives(class)} {TruePositives(class) + FalseNegatives(class)}\]

F1 score calculation

Let’s consider the following prediction counters:

image of a plane was classified as plane 5 times, as car 1 time, as boat 0 times;
image of a car was classified as car 4 times, as plane 1 time, as boat 2 times;
image of a boat was classified as boat 3 times, as plane 1 time, as car 0 times.

f1_image

Let’s calculate F1 scores for class “plane”:

\[TP(plane) = 5 \\ FP(plane) = 2 \\ FN(plane) = 1\] \[Precision(plane) = \frac {TP(plane)} {TP(plane) + FP(plane)} = \frac {5} {(5 + 2)} = \frac {5} {7}\] \[Recall(plane) = \frac {TP(plane)} {TP(plane) + FN(plane)} = \frac {5} {(5 + 1)} = \frac {5} {6}\] \[F1(plane) = \frac {2 * Precision(plane) * Recall(plane)} {Precision(plane) + Fecall(plane)} = \frac {2 * \frac {5} {7} * \frac {5} {6}} {\frac {5} {7} + \frac {5} {6}} \approx 0.76\]

Repeating similar calculation for “car” and “boat”:

\[F1(car) \approx 0.66 \\ F1(boat) \approx 0.66\]

F1 is average of class F1 scores

\[F1 = \frac {F1(plane) + F1(car) + F1(boat)} {3} \approx 0.7\]

C++ implementation of F1 score calculation

#pragma once
#include <array>
#include <algorithm>

/*
 * F1 score class
 *
 * Computes F1 score for multi-class classification
 *
 * f1_score = 2 * Preision * Recall / (Precision + Recall)
 *
 * TP = TruePositive
 * FP = FalsePositive
 * FN = FalseNegative
 *
 * Precision(class) = TP(class) / (TP(class) + FP(class))
 * Recall(class) = TP(class) / (TP(class) + FN(class))
 *
 *
 */
template<size_t num_classes=10, typename T=float>
struct f1
{
  std::array<int, num_classes> tp_fp;
  std::array<int, num_classes> tp_fn;
  std::array<int, num_classes> tp;
  static constexpr float eps = 1e-7;

  /*
   * Default f1 class constructor
   */
  f1()
  {
    reset();
  }

  /*
   * Reset accumulated TP, FP, FN counters
   */
  void reset()
  {
    tp_fp.fill({});
    tp_fn.fill({});
    tp.fill({});
  }

  /*
   * Update TP, FP, FN counters
   *  y_true is one-hot label
   *  y_pred is predicted probabilites vector output from softmax
   */
  void update(const std::array<T, num_classes>& y_true, const std::array<T, num_classes>& y_pred)
  {
    auto idx_true = std::distance(y_true.begin(), std::max_element(y_true.begin(), y_true.end()));
    auto idx_pred = std::distance(y_pred.begin(), std::max_element(y_pred.begin(), y_pred.end()));
    tp_fp[idx_pred] += 1;
    tp_fn[idx_true] += 1;
    if (idx_true == idx_pred)
      {
        tp[idx_true] += 1;
      }
  }

  /*
   * Get f1 score value as average of f1 class scores
   */
  float score()
  {

    std::array<float, num_classes> precision;
    std::array<float, num_classes> recall;
    std::array<float, num_classes> scores;

    std::transform(tp.begin(), tp.end(), tp_fp.begin(), precision.begin(),
                   [](const int& tp_i, const int& fp_i)
                   {
                     return static_cast<float>(tp_i) / static_cast<float>(fp_i + eps);
                   });
    std::transform(tp.begin(), tp.end(), tp_fn.begin(), recall.begin(),
                   [](const int& tp_i, const int& fn_i)
                   {
                     return static_cast<float>(tp_i) / static_cast<float>(fn_i + eps);
                   });
    std::transform(precision.begin(), precision.end(), recall.begin(), scores.begin(),
                   [](const float& precision_i, const float& recall_i)
                   {
                     auto score =  2.0 * precision_i * recall_i / (precision_i + recall_i + eps);
                     return score;
                   });

    auto score_total = std::accumulate(scores.begin(), scores.end(), static_cast<float>(0.0));
    return score_total / static_cast<float>(num_classes);
  }
};

Training loop

C++ module contains code for DNN layers, training and validation loops. It is also reporting loss and F1 score for train and validation datasets.

I have added some optimizations to speed up dense layer backward path:

Dense backward() function has been split into gradient backpropagation and weight update parts. Gradient backpropagation is done on each iteration, and weight update is done once per batch, this improves compute time.
Also I keep transposed matrix of weight pointers in Dense object. It is popuulated once in the Dense class constructor, and is used in backward() function, therefore we don’t need to transpose Dense layer weights in each backward() function call.

Include required heared files:

#include <cstdio>
#include <vector>
#include <array>
#include <algorithm>
#include <cassert>
#include <array>
#include <chrono>
#include <sstream>
#include <string>
#include <iterator>
#include <variant>
#include <random>
#include "mnist.h"
#include "f1.h"

using namespace std;

Clocks will be used to benchmark training loop performance.

using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::microseconds;
using std::chrono::milliseconds;
using std::chrono::seconds;

Constant and random initializers used for dense layer weights and biases. In this example dense layer weights are initialized with random uniform initializer, and biases are initailized with zero initializer.

/*
 * Constant weight intializer
 */
const float const_one = 1.0;
const float const_zero = 0.0;
const float const_one_half = 0.0;

template<float const& value = const_zero>
constexpr auto const_initializer = []() -> float
{
  return value;
};

/*
 * Random uniform weights initializer
 */
constexpr auto random_uniform_initializer = []() -> float
{
  /*
   * Return random values in the range [-1.0, 1.0]
   */
  return 2.0 * static_cast<float>(rand()) / static_cast<float>(RAND_MAX) - 1.0;
};

Dense class template with parameters:

num_inputs: length of layer input vector
num_outputs: length of dense layer output vector
T: type of input, output, weights, biases
weights_initializer: function to use to initialize layer weights
bias_initializer: function to use to initialize layer biases

/*
 * Dense layer class template
 *
 * Parameters:
 *  num_inputs: number of inputs to Dense layer
 *  num_outputs: number of Dense layer outputs
 *  T: input, output, and weights type in the dense layer
 *  initializer: weights initializer function
 */
template<size_t num_inputs, size_t num_outputs, typename T = float,
         T (*weights_initializer)() = random_uniform_initializer,
         T (*bias_initializer)() = const_initializer<const_zero> >
struct Dense
{

  /*
   * input outut vector type definitions
   */
  typedef array<T, num_inputs> input_vector;
  typedef array<T, num_outputs> output_vector;

  /*
   * dense layer weights matrix W, used in y = X * W
   * weights are transposed to speed up forward() computation
   */
  vector<input_vector> weights;

  /*
   * bias vector
   */
  output_vector bias;

  /*
   * Flag to enable/disable bias
   */
  bool use_bias = true;

  /*
   * Matrix of transposed weights W pointers, used to speed up backward() path
   */
  vector<array<T*, num_outputs>> weights_transposed;

  /*
   * dw is accumulating weight updates in backward() pass.
   */
  vector<input_vector> dw;

  /*
   * db is accumulating bias updates in backward() pass.
   */
  output_vector db;

  /*
   * x is for saving input to forward() call, used later in the backward() pass.
   */
  input_vector x;

  /*
   * Default dense layer constructor
   */
  Dense(bool _use_bias=true)
  {
    /*
     * Create num_outputs x num_inputs weights matrix
     */
    weights.resize(num_outputs);
    for (input_vector& w: weights)
      {
        generate(w.begin(), w.end(), *weights_initializer);
      }

    /*
     * Ctreate transposed array of weighst pointers
     */
    weights_transposed.resize(num_inputs);
    for (size_t i = 0; i < num_inputs; i++)
      {
        for (size_t j = 0; j < num_outputs; j++)
          {
            weights_transposed[i][j] = &weights[j][i];
          }
      }

    /*
     * Initialize bias vector
     */
    use_bias = _use_bias;
    generate(bias.begin(), bias.end(), *bias_initializer);

    /*
     * Initialize dw, db
     */
    dw.resize(num_outputs);
    reset_gradients();
  }

  /*
   * Dense layer constructor from provided weigths and biases
   * Note: weights are stored transposed
   */
  Dense(const array<array<T, num_inputs>, num_outputs>& weights_init,
        const array<T, num_outputs>& biases_init)
  {
    /*
     * Create num_outputs x num_inputs weights matrix
     */
    for (auto weights_row: weights_init)
      {
        weights.push_back(weights_row);
      }

    /*
     * Initialize bias vector
     */
    bias = biases_init;
  }

  /*
   * Dense layer constructor from provided weigths. Bias is not used
   * Note: weights are stored transposed
   */
  Dense(const array<array<T, num_inputs>, num_outputs>& weights_init)
  {
    /*
     * Create num_outputs x num_inputs weights matrix
     */
    for (auto weights_row: weights_init)
      {
        weights.push_back(weights_row);
      }

    use_bias = false;
  }

  /*
   * Dense layer forward pass
   * Computes X * W + B
   *  X - input row vector
   *  W - weights matrix
   *  B - bias row wector
   */
  output_vector forward(const input_vector& input_x)
  {
    /*
     * Save the last input X
     */
    x = input_x;

    /*
     * Check for input size mismatch
     */
    assert(x.size() == weights[0].size());

    /*
     * Layer output is dot product of input with weights
     */
    output_vector y;
    transform(weights.begin(), weights.end(), bias.begin(), y.begin(),
              [this](const input_vector& w, T bias)
              {
                T y_i = inner_product(w.begin(), w.end(), x.begin(), 0.0);
                if (use_bias)
                  {
                    y_i += bias;
                  }
                return y_i;
              }
              );

    return y;
  }


  /*
   * Dense layer backward pass
   */
  input_vector backward(const output_vector& grad)
  {
    /*
     * Weight update according to SGD algorithm with momentum = 0.0 is:
     *  w = w - learning_rate * d_loss/dw
     *
     * d_loss/dw = dloss/dy * dy/dw
     * d_loss/dbias = dloss/dy * dy/dbias
     *
     * dloss/dy is input gradient grad
     *
     * dy/dw is :
     *  y = w[0]*x[0] + w[1] * x[1] +... + w[n] * x[n] + bias
     *  dy/dw[i] = x[i]
     *
     * dy/dbias is :
     *  dy/dbias = 1
     *
     * For clarity we:
     *  first compute dw
     *  second update weights by subtracting dw
     */

    /*
     * Compute backpropagated gradient
     */
    input_vector grad_out;
    transform(weights_transposed.begin(), weights_transposed.end(), grad_out.begin(),
              [grad](array<T*, num_outputs>& w)
              {
                T val = inner_product(w.begin(), w.end(), grad.begin(), 0.0,
                                      [](const T& l, const T& r)
                                      {
                                        return l + r;
                                      },
                                      [](const T* l, const T& r)
                                      {
                                        return *l * r;
                                      }
                                      );
                return val;
              });

    /*
     * accumulate weight updates
     * compute dw = dw + outer(x, grad)
     */
    transform(dw.begin(), dw.end(), grad.begin(), dw.begin(),
              [this](input_vector& left, const T& grad_i)
              {
                /* compute outer product for each row */
                auto row = x;
                for_each(row.begin(), row.end(), [grad_i](T &xi){ xi *= grad_i;});

                /* accumulate into dw */
                std::transform(left.begin(), left.end(),
                               row.begin(), left.begin(), std::plus<T>());
                return left;
              });

    /*
     * accumulate bias updates
     */
    std::transform(db.begin(), db.end(), grad.begin(), db.begin(), std::plus<T>());

    return grad_out;
  }

  /*
   * Update Dense layer weigts/biases using accumulated dw/db
   */
  void train(float learning_rate)
  {
    /*
     * compute w = w - learning_rate * dw
     */
    transform(weights.begin(), weights.end(), dw.begin(), weights.begin(),
              [learning_rate](input_vector& left, input_vector& right)
              {
                transform(left.begin(), left.end(), right.begin(), left.begin(),
                          [learning_rate](const T& w_i, const T& dw_i)
                          {
                            return w_i - learning_rate * dw_i;
                          });
                return left;
              });


    /*
     * compute bias = bias - learning_rate * db
     */
    if (use_bias)
      {
        /*
         * compute bias = bias - grad
         */
        transform(bias.begin(), bias.end(), db.begin(), bias.begin(),
                  [learning_rate](const T& bias_i, const T& db_i)
                  {
                    return bias_i - learning_rate * db_i;
                  });
      }

    /*
     * Reset accumulated dw and db
     */
    reset_gradients();
  }

  /*
   * Reset weigth and bias gradient accumulators
   */
  void reset_gradientss()
  {
    for (input_vector& dw_i: dw)
      {
        generate(dw_i.begin(), dw_i.end(), const_initializer<const_zero>);
      }
    generate(db.begin(), db.end(), const_initializer<const_zero>);
  }

  /*
   * Helper function to convert Dense layer to string
   * Used for printing the layer weights and biases
   */
  operator std::string() const
  {
    std::ostringstream ret;
    ret.precision(7);

    /*
     * output weights
     */
    ret << "weights:" << std::endl;
    for (int y=0; y < weights[0].size(); y++)
      {
        for (int x=0; x < weights.size(); x++)
          {
            if (weights[x][y] >= 0)
              ret << " ";
            ret << std::fixed << weights[x][y] << " ";
          }
        ret << std::endl;
      }

    if (use_bias)
      {
        /*
         * output biases
         */
        ret << "bias:" << std::endl;
        for (auto b: bias)
          {
            if (b >= 0)
              ret << " ";
            ret << std::fixed << b << " ";
          }
        ret << std::endl;
      }

    return ret.str();
  }

  /*
   * Helper function to cout Dense layer object
   */
  friend ostream& operator<<(ostream& os, const Dense& dense)
  {
    os << (string)dense;
    return os;
  }

};

Sigmoid layer class template includes forward() and backward() functions.

Forward path computes sigmoid(x) = 1 / (1 + exp(-x))

backward() computes mutiple of gradient and sigmoid derivative.

/*
 * Sigmoid layer class template
 */
template<size_t num_inputs, typename T = float>
struct Sigmoid
{
  typedef array<T, num_inputs> input_vector;
  typedef array<T, num_inputs> output_vector;

  /*
   * x is for saving input to forward() call, used later in the backward() pass.
   */
  input_vector x;

  /*
   * Sigmoid forward pass
   */
  output_vector forward(const input_vector& input_x)
  {
    output_vector y;
    x = input_x;
    transform(x.begin(), x.end(), y.begin(),
              [](const T& yi)
              {
                T out = 1.0  / (1.0 + expf(-yi));
                return out;
              });
    return y;
  }

  /*
   * Sigmoid backward pass
   */
  input_vector backward(const output_vector grad)
  {
    input_vector grad_output;

    const output_vector y = forward(x);
    transform(y.begin(), y.end(), grad.begin(), grad_output.begin(),
              [](const T& y_i, const T& grad_i)
              {
                T out = grad_i * y_i * (1 - y_i);
                return out;
              });
    return grad_output;
  }

  /*
   * No trainabele weights in Sigmoid
   */
  void train(float lr)
  {
  }

};

Softmax layer implementation includes forward and backward path.

Forward path computes Softmax(x) = exp(x) / sum (exp(x))

Backward path is dot product of input gradient and Softmax Jacobian as per math section above.

/*
 * Softmax layer class template
 */
template<size_t num_inputs, typename T = float>
struct Softmax
{
  typedef array<T, num_inputs> input_vector;
  typedef array<T, num_inputs> output_vector;

  /*
   * x is for saving input to forward() call, and is used in the backward() pass.
   */
  input_vector x;

  /*
   * Softmax forward function
   */
  output_vector forward(const input_vector& input_x)
  {
    output_vector y;

    /*
     * Save input to forward()
     */
    x = input_x;

    /*
     * compute exp(x_i) / sum(exp(x_i), i=1..N)
     */
    transform(x.begin(), x.end(), y.begin(),
              [](const T& yi)
              {
                T out = expf(yi);
                return out;
              });

    T sum = accumulate(y.begin(), y.end(), static_cast<T>(0.0));
    for_each(y.begin(), y.end(), [sum](T &yi){ yi /= sum;});

    return y;
  }

  /*
   * Softmax backward function
   */
  input_vector backward(const output_vector& grad_inp)
  {
    input_vector grad_out;
    vector<input_vector> J;

    /*
     * Compute Jacobian of Softmax
     */
    const input_vector y = forward(x);
    int s_i_j = 0;
    for (auto y_i: y)
      {
        auto row = y;
        for_each(row.begin(), row.end(), [y_i](T& y_j){ y_j = -y_i * y_j;});
        row[s_i_j] += y_i;
        s_i_j++;
        J.push_back(row);
      }

    /*
     * Compute dot product of gradient and Softmax Jacobian
     */
    transform(J.begin(), J.end(), grad_out.begin(),
              [grad_inp](const input_vector& j)
              {
                T val = inner_product(j.begin(), j.end(), grad_inp.begin(), 0.0);
                return val;
              }
              );

    return grad_out;

  }

  /*
   * No trainable weights in softmax
   */
  void train(float lr)
  {
  }

};

Crossentropy class computes categorical cross-entopy loss for onehot labels input y and probabilities input yhat.

/*
 * Categorical Crossentropy loss
 *
 * Forward:
 *  E = - sum(y * log(yhat))
 *
 * Parameters:
 *  num_inputs: number of inputs to loss function.
 *  T: input type, float by defaut.
 */
template<size_t num_inputs, typename T = float>
struct CCE
{

  typedef array<T, num_inputs> input_vector;

  /*
   * Forward pass computes CCE loss for inputs y (label) and yhat (predicted)
   */
  static T forward(const input_vector& y, const input_vector& yhat)
  {
    T loss = transform_reduce(y.begin(), y.end(), yhat.begin(), 0.0, plus<T>(),
                              [](const T& y_i, const T& yhat_i)
                              {
                                return y_i * logf(yhat_i);
                              }
                              );
    return -1 * loss;
  }

  /*
   * Backward pass computes dloss/dy for inputs yhat (label) and y (predicted):
   *
   * dloss/dy = - yhat/y
   *
   */
  static input_vector backward(const input_vector& y, const input_vector& yhat)
  {
    array<T, num_inputs> de_dy;

    transform(y.begin(), y.end(), yhat.begin(), de_dy.begin(),
              [](const T& y_i, const T& yhat_i)
              {
                return -1 * y_i / yhat_i;
              }
              );
    return de_dy;
  }

};

Sequential class template constructs DNN from template parameters. For example 4 layer DNN can be created as follows:

Sequential<Dense<image_size, 128>,
           Sigmoid<128>,
           Dense<128, num_classes>,
           Softmax<num_classes>> net;

Sequential provides high level forward(), backward(), and train() functions:

y = net.forward(x)
net.backward(loss_gradient)
net.train(learning_rate);

/*
 * Sequential class template
 *
 *  Creates DNN layer from template list
 *  Implements higher level forward(), backward() and train() functions
 */
template<typename... T>
struct Sequential
{
  /*
   * layers array hosld DN layer objects
   */
  std::array<std::variant<T...>, sizeof...(T)> layers;

  /*
   * Sequential constructor
   */
  Sequential()
  {
    /*
     * Create DNN layers from the template list
     */
    auto create_layers = [this]<std::size_t... I>(std::index_sequence<I...>)
      {
        (void(layers[I].template emplace<I>(T())),...);
      };
    create_layers(std::make_index_sequence <sizeof...(T)>());
  }

  /*
   * Sequential forward pass will call each layer forward() function
   */
  template<size_t index=sizeof...(T)-1, typename Tn>
  auto forward(Tn& x)
  {
    if constexpr(index == 0)
       {
         return std::get<index>(layers[index]).forward(x);
       }
    else
      {
	auto y_prev = forward<index-1>(x);
	return std::get<index>(layers[index]).forward(y_prev);
      }
  }


  /*
   * Sequential backward pass will call each layer backward() function
   */
  template<size_t index=0, typename Tn>
  auto backward(Tn& dy)
  {
    if constexpr(index == sizeof...(T)-1)
       {
         return std::get<index>(layers[index]).backward(dy);
       }
    else
      {
	auto dy_prev = backward<index+1>(dy);
	return std::get<index>(layers[index]).backward(dy_prev);
      }
  }

  /*
   * Sequential class train() invokes each layer train() function
   */
  void train(float learning_rate)
  {
    [this, learning_rate]<std::size_t... I> (std::index_sequence<I...>)
      {
	(void(std::get<I>(layers[I]).train(learning_rate)), ...);
      }(std::make_index_sequence <sizeof...(T)>());
  }
};

The main() function:

creates instanses of train and validation dataset;
constructs a DNN consisting of two dense layers, sigmoid layer, softmax layer;
starts training loop for 5 epochs;
iterates over entire train datased in each epoch;
computes f1 score for validation datased after each epoch.

/*
 * DNN train and validation loops are implemented in the main() function.
 */
int main(void)
{
  /*
   * Path to MNIST train and validation data
   */
  const char* train_images_path = "./data/train-images-idx3-ubyte";
  const char* train_labels_path = "./data/train-labels-idx1-ubyte";
  const char* validation_images_path = "./data/t10k-images-idx3-ubyte";
  const char* validation_labels_path = "./data/t10k-labels-idx1-ubyte";

  /*
   * Number of classes in MNIST dataset
   */
  const int num_classes = 10;

  /*
   * MNIST image dimentions
   */
  const int num_rows = 28;
  const int num_cols = 28;
  const int image_size = num_rows * num_cols;

  /*
   * Train for 5 epochs
   */
  const int num_epochs = 5;

  /*
   * Print report every 20000 iterations
   */
  const int report_interval = 30000;

  /*
   * Storage for next image and label
   */
  std::array<float, image_size> image;
  std::array<float, num_classes> label;

  /*
   * Training learning rate and batch size
   */
  float learning_rate = 0.001;
  int batch_size = 100;

  /*
   * MNIST train dataset
   */
  mnist train_dataset(train_images_path, train_labels_path);

  /*
   * MNIST validation dataset
   */
  mnist validation_dataset(validation_images_path, validation_labels_path);


  printf("train dataset: %d images and %d labels\n", train_dataset.images.count, train_dataset.labels.count);
  printf("validation dataset: %d images and %d labels\n", validation_dataset.images.count, validation_dataset.labels.count);

  /*
   * Number of iterations in epoch is number of records in the dataset
   */
  const auto iterations = train_dataset.images.count;

  /*
   * Create DNN layers and the loss
   */
  Sequential<Dense<image_size, 128>, Sigmoid<128>, Dense<128, num_classes>, Softmax<num_classes>> net;

  CCE<num_classes> loss_fn;
  f1<num_classes, float> f1_train;
  f1<num_classes, float> f1_validation;

  /*
   * shortcut for mnust error code
   */
  using mnist_error = mnist<10, 28*28, float>::error;

  /*
   * Training loop
   * Train for num_epochs
   */
  for (auto epoch=0; epoch < num_epochs; epoch++)
    {
      /*
       * Reset dataset positons, loss accuracy counters, and clocks
       */
      train_dataset.rewind();
      float loss_epoch = 0;
      f1_train.reset();
      auto ts = high_resolution_clock::now();

      /*
       * Train for number of iterations per each epoch
       */
      for (auto iter = 0; iter < iterations / batch_size; iter++)
        {

          /*
           * Repeat forward path for batch_size
           */
          for (auto batch = 0; batch < batch_size; batch++)
            {

              auto ret = train_dataset.read_next(image, label);
              if (ret == mnist_error::MNIST_EOF)
                {
                  break;
                }

              /*
               * Compute Dense layer output y for input x
               */
              auto y4 = net.forward(image);
              auto loss = loss_fn.forward(label, y4);

              /*
               * Update f1 score
               */
              f1_train.update(label, y4);

              /*
               * Back propagate loss
               */
              auto dloss_dy4 = loss_fn.backward(label, y4);
              net.backward(dloss_dy4);

              /*
               * Accumulate loss for reporting
               */
              loss_epoch += loss;
            }

          /*
           * Update dense layers weights once per batch
           */
          net.train(learning_rate);
          /*
           * Print loss and accuracy per each reporting interval
           */
          if (epoch == 0 && iter == 0)
            {
              printf("epoch=%d/%d iteration=%d/%d loss=%.5f f1=%.5f\n",
                     epoch+1,
                     num_epochs,
                     (iter+1), iterations/batch_size,
                     loss_epoch / ((iter + 1)*batch_size),
                     f1_train.score());
            }
          if ( (((iter+1) * batch_size) % report_interval) == 0)
            {
              printf("epoch=%d/%d iteration=%d/%d loss=%.5f f1=%.5f\n",
                     epoch+1,
                     num_epochs,
                     (iter+1), iterations/batch_size,
                     loss_epoch / ((iter + 1)*batch_size),
                     f1_train.score());
            }
        }

      /*
       * Capture run time per epoch
       */
      auto te = high_resolution_clock::now();
      auto dt = (float)duration_cast<seconds>(te - ts).count();

      /*
       * Average loss per epoch
       */
      loss_epoch = loss_epoch / iterations;

      /*
       * Print epoch stats
       */
      printf("epoch %d/%d time/epoch=%.5f sec; time left=%.4f hr; avg loss=%.5f; f1=%.5f\n",
             epoch+1,
             num_epochs,
             dt, dt * (num_epochs - epoch - 1) / (60*60),
             loss_epoch,
             f1_train.score());


      /*
       * Validation loop
       */
      float loss_validation = 0;
      validation_dataset.rewind();
      f1_validation.reset();

      /*
       * Repeat each epoch for entire validation dataset
       */
      for (auto iter = 0; iter < validation_dataset.images.count; iter++)
        {
          /*
           * Read next image and label
           */
          auto ret = validation_dataset.read_next(image, label);
          if (ret == mnist_error::MNIST_EOF)
            {
              break;
            }

          /*
           * Forward path
           */
          auto y4 = net.forward(image);
          auto loss = loss_fn.forward(label, y4);

          /*
           * Update validation loss and counters
           */
          loss_validation += loss;
          f1_validation.update(label, y4);
        }

      /*
       * Report validation loss and f1 score
       */
      loss_validation = loss_validation / validation_dataset.images.count;
      printf("epoch %d/%d validation loss: %f  f1: %.5f\n",
             epoch+1, num_epochs, loss_validation, f1_validation.score());
    }

  return 0;
}

Let’s build the source code and run the MNIST training example

It will run for 5 epochs, processing 60000 training images and 10000 validation images in each epoch. Cross-entropy loss should be decrementing each epoch, and accuracy F1 score should be going up.

$ g++ -o dense7 -Wall -std=c++2a dense7.cpp && ./dense7
train dataset: 60000 images and 60000 labels
validation dataset: 10000 images and 10000 labels
epoch=1/5 iteration=1/600 loss=5.35573 f1=0.06373
epoch=1/5 iteration=300/600 loss=0.66855 f1=0.82115
epoch=1/5 iteration=600/600 loss=0.48756 f1=0.86409
epoch 1/5 time/epoch=88.00000 sec; time left=0.0978 hr; avg loss=0.48756; f1=0.86409
epoch 1/5 validation loss: 0.269416  f1: 0.91865
epoch=2/5 iteration=300/600 loss=0.23929 f1=0.92960
epoch=2/5 iteration=600/600 loss=0.22786 f1=0.93199
epoch 2/5 time/epoch=90.00000 sec; time left=0.0750 hr; avg loss=0.22786; f1=0.93199
epoch 2/5 validation loss: 0.215759  f1: 0.93523
epoch=3/5 iteration=300/600 loss=0.18506 f1=0.94489
epoch=3/5 iteration=600/600 loss=0.18007 f1=0.94619
epoch 3/5 time/epoch=91.00000 sec; time left=0.0506 hr; avg loss=0.18007; f1=0.94619
epoch 3/5 validation loss: 0.190085  f1: 0.94221
epoch=4/5 iteration=300/600 loss=0.15368 f1=0.95358
epoch=4/5 iteration=600/600 loss=0.15124 f1=0.95460
epoch 4/5 time/epoch=91.00000 sec; time left=0.0253 hr; avg loss=0.15124; f1=0.95460
epoch 4/5 validation loss: 0.174170  f1: 0.94573
epoch=5/5 iteration=300/600 loss=0.13183 f1=0.96054
epoch=5/5 iteration=600/600 loss=0.13087 f1=0.96092
epoch 5/5 time/epoch=92.00000 sec; time left=0.0000 hr; avg loss=0.13087; f1=0.96092
epoch 5/5 validation loss: 0.163480  f1: 0.94946

As one can see, after 5 epochs cross-entropy error decreased from 6.68 to 0.13. At the same time accuracy F1 metric incerased from the initial 0.06 to 0.96 (96% of accurate classifications).

C++ implementation is available at dense7.cpp, mnist.h, and f1.h

Written on June 6, 2021