Two layer NN with backpropagation in C++
Now let’s implement a two-layer neural network in C++.
As a starting point we’ll use the source code from “DNN with backpropagation in C++”, part 1.
Again, to keep it simple we’ll assume:
there’s no bias in the dense layers;
there’s no non-linear activation;
the loss function is Mean Squared Error.
The two-layer neural net’s predicted output \(\hat Y\) is:
\[Y_{1} = X * W_{1}\]
\[\hat Y = Y_{1} * W_{2}\]
where
\(X\) is the input vector
\(W_{1}\) are the weights of dense layer 1
\(Y_{1}\) is the output of dense layer 1
\(W_{2}\) are the weights of dense layer 2
\(Y\) is the expected (true) output of the neural net:
\[Y = \left( \begin{array}{ccc}
y_{0} & y_{1} & \ldots & y_{N-1} \\
\end{array} \right)\]
\(\hat Y\) is the predicted output vector:
\[\hat Y = \left( \begin{array}{ccc}
\hat y_{0} & \hat y_{1} & \ldots & \hat y_{N-1} \\
\end{array} \right)\]
The Mean Squared Error (MSE) loss between the predicted \(\hat Y\) and expected \(Y\) is
\[E = MSE(Y, \hat Y) = \frac {1} {N} \sum_{i=0}^{N-1} ( Y_{i} - \hat Y_{i} )^2\]
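For example, for expected output \(Y = (2, 1)\) and predicted output \(\hat Y = (5, 5)\) (the values we’ll see in the examples below), \(E = \frac {(2 - 5)^2 + (1 - 5)^2} {2} = 12.5\).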
Error backpropagation.
For input \(X\), we want to minimize the MSE difference between our network output and the expected output by adjusting the weights of both dense layers. To do that we need the loss gradients with respect to each layer’s weights:
\[\frac {\partial E} {\partial W_{1}}\]
\[\frac {\partial E} {\partial W_{2}}\]
The adjustment for dense layer 2 weights is the same as in the previous post “Dense layer with backpropagation in C++”.
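For reference, applying the chain rule the same way (and using the same notation as below) gives the dense layer 2 weight gradient:
\[\frac {\partial E} {\partial W_{2}} = \frac {\partial E} {\partial \hat Y} * \frac {\partial \hat Y} {\partial W_{2}} = \frac {2 * ( \hat Y - {Y} )} {N} * Y_{1}^T\]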
Let’s find the gradient for the weights of dense layer 1.
\[\hat Y = X * W_{1} * W_{2}\]
Using the chain rule:
\[\frac {\partial E} {\partial W_{1}} = \frac {\partial E} {\partial \hat Y} * \frac {\partial \hat Y} {\partial W_{1}}\]
or
\[\frac {\partial E} {\partial W_{1}} = \frac {\partial E} {\partial \hat Y} * \frac {\partial \hat Y} {\partial Y_{1}} * \frac {\partial Y_{1}} {\partial W_{1}}\]
where
\[\frac {\partial E} {\partial \hat Y} = \frac {2 * ( \hat Y - {Y} )} {N}\]
\[\frac {\partial \hat Y} {\partial Y_{1}} = W_{2}^T\]
\[\frac {\partial Y_{1}} {\partial W_{1}} = X^T\]
Finally, the dense layer 1 weight gradient is:
\[\frac {\partial E} {\partial W_{1}} = \frac {2 * ( \hat Y - {Y} )} {N} * W_{2}^T * X^T\]
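As a quick sanity check of these formulas, here is a minimal standalone NumPy sketch (separate from the implementations below) that evaluates them with the same input, expected output, and all-ones weights used throughout this post:
import numpy as np

N = 2                                   # number of outputs
x = np.array([[2.0, 0.5]])              # input, shape (1, 2)
y_true = np.array([[2.0, 1.0]])         # expected output
W1 = np.ones((2, 2))                    # dense layer 1 weights
W2 = np.ones((2, 2))                    # dense layer 2 weights

# forward pass
Y1 = x @ W1                             # [[2.5 2.5]]
y_hat = Y1 @ W2                         # [[5. 5.]]

# dE/dY_hat = 2 * (Y_hat - Y) / N
dE_dy = 2.0 * (y_hat - y_true) / N      # [[3. 4.]]

# dE/dW2 = Y1^T * dE/dY_hat
dE_dW2 = Y1.T @ dE_dy                   # [[ 7.5 10. ] [ 7.5 10. ]]

# dE/dW1 = X^T * (dE/dY_hat * W2^T)
dE_dW1 = x.T @ (dE_dy @ W2.T)           # [[14.  14. ] [ 3.5  3.5]]
The resulting \([[14, 14], [3.5, 3.5]]\) and \([[7.5, 10], [7.5, 10]]\) gradients match the grad values and weight updates reported by the TF2 and C++ runs below.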
First, let’s write a Python implementation with TF2. We’ll use it to validate the C++ code in the following section.
For this experiment I’ve used the following software versions:
$ python3 -m pip freeze | grep "numpy\|tensorflow"
numpy==1.19.5
tensorflow==2.5.0rc2
$ g++ --version
g++ 9.3.0
The Python code will be very similar to the Python sample from the previous post.
Import TF and Keras. We’ll define a network with 2 inputs and 2 outputs.
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
import numpy as np

num_inputs = 2
num_outputs = 2
# Create two layer model
model = tf.keras.Sequential()

# No bias, no activation, initialize weights with 1.0
layer1 = Dense(units=num_inputs, use_bias=False, activation=None,
               kernel_initializer=tf.keras.initializers.ones())
layer2 = Dense(units=num_outputs, use_bias=False, activation=None,
               kernel_initializer=tf.keras.initializers.ones())

model.add(layer1)
model.add(layer2)
Use mean squared error for the loss function.
# use MSE as loss function
loss_fn = tf.keras.losses.MeanSquaredError()
Hardcode the model input and expected model output. We’ll use the same array values later in the C++ implementation.
# Arbitrary model input
x = np.array([2.0, 0.5])

# Expected output
y_true = np.array([2.0, 1.0])
Use the Stochastic Gradient Descent (SGD) optimizer.
The SGD weight update rule is
\(W = W - LR * \nabla\)
where \(\nabla\) is the weight gradient and \(LR\) is the learning rate.
For now we’ll assume a learning rate of 1.0.
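For example, with \(LR = 1.0\), a weight of \(1.0\) and a gradient of \(14.0\) give an updated weight of \(1.0 - 14.0 = -13.0\), which is exactly what we’ll see in the updated layer 1 weights below.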
# SGD update rule for parameter w with gradient g when momentum is 0 is as follows:
# w = w - learning_rate * g
#
# For simplicity make learning_rate=1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=1.0, momentum=0.0)
In the training loop we’ll compute the model output for input x, then compute and backpropagate the loss.
# Get model output y for input x, compute loss, and record gradients
with tf.GradientTape(persistent=True) as tape:
    # get model output y for input x
    # add newaxis for batch size of 1
    xt = tf.convert_to_tensor(x[np.newaxis, ...])
    tape.watch(xt)
    y = model(xt)

    # obtain MSE loss
    loss = loss_fn(y_true, y)

# model output gradient with respect to the weights
dy_dw = tape.gradient(y, model.trainable_variables)

# loss gradient with respect to loss input y
dloss_dy = tape.gradient(loss, y)

# adjust Dense layer weights
grad = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grad, model.trainable_variables))
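Note that persistent=True is required here because we call tape.gradient() more than once on the same tape.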
Finally, we’ll print the inputs, outputs, gradients, and updated Dense layer weights.
# print model input and output excluding batch dimension
print(f"input x={x}")
print(f"output y={y[0]}")
print(f"expected output y_true={y_true}")

# print MSE loss
print(f"loss={loss}")

# print loss gradients
print("dloss_dy={}".format(*[v.numpy() for v in dloss_dy]))

# print weight gradients d_loss/d_w
print("grad=\n{}".format(*[v.numpy() for v in grad]))

# print updated dense layer weights
print("updated weights=")
print(*[v.numpy() for v in model.trainable_variables], sep="\n")
After running the Python example we get:
$ python3 dense2.py
input x=[2.  0.5]
output y=[5. 5.]
expected output y_true=[2. 1.]
loss=12.5
dloss_dy=[3. 4.]
grad=
[[14.  14. ]
 [ 3.5  3.5]]
updated weights=
[[-13.  -13. ]
 [ -2.5  -2.5]]
[[-6.5 -9. ]
 [-6.5 -9. ]]
Let’s code the same example in C++.
In the C++ implementation we’ll compute
\[\frac {2 * ( \hat Y - {Y} )} {N} * W_{2}^T\]
as the output of the dense layer’s backward() function.
This will allow us to feed the backward() output of one layer as the dloss_dy input to the previous layer’s backward() function, as sketched below.
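To make the chaining concrete, here is a minimal NumPy sketch of the pattern (illustrative only; dense_backward is a hypothetical helper and is not part of the C++ code below):
import numpy as np

def dense_backward(w, x, dloss_dy):
    # gradient with respect to the layer input, computed with the pre-update weights
    dloss_dx = dloss_dy @ w.T
    # SGD weight update with learning rate 1.0: w = w - outer(x, dloss_dy)
    w -= np.outer(x, dloss_dy)
    return dloss_dx

x = np.array([2.0, 0.5])
y_true = np.array([2.0, 1.0])
w1 = np.ones((2, 2))
w2 = np.ones((2, 2))

# forward pass
y1 = x @ w1
y_hat = y1 @ w2

# backward pass: each layer consumes the gradient returned by the next layer
dloss_dy = 2.0 * (y_hat - y_true) / y_true.size   # [3. 4.]
bw2 = dense_backward(w2, y1, dloss_dy)            # [7. 7.]
bw1 = dense_backward(w1, x, bw2)
After the two dense_backward() calls, w1 and w2 hold the same updated weights as the TF2 run above and the C++ run below.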
We’ll implement it using the C++ STL.
The chrono headers are included for the clocks used to benchmark run time.
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cassert>
#include <numeric>
#include <array>
#include <chrono>
#include <iostream>
#include <sstream>
#include <string>
#include <functional>
#include <iterator>

using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::microseconds;
A lambda will be used to pretty-print inputs, outputs, and layer weights.
/*
 * Print helper function
 */
auto print_fn = [](const float& x) -> void { printf("%.1f ", x); };
Dense layer weights initializer.
/*
 * Constant 1.0 weight initializer
 */
static auto ones_initializer = []() -> float
{
  return 1.0;
};
The Dense layer class template includes forward() and backward() functions.
The backward() function was modified from the previous example to return the backpropagated gradient.
/*
 * Dense layer class template
 *
 * Parameters:
 *  num_inputs: number of inputs to Dense layer
 *  num_outputs: number of Dense layer outputs
 *  T: input, output, and weights type in the dense layer
 *  initializer: weights initializer function
 */
template<size_t num_inputs, size_t num_outputs, typename T = float,
         T (*initializer)() = ones_initializer>
struct Dense
{
  typedef array<T, num_inputs> input_vector;
  typedef array<T, num_outputs> output_vector;

  vector<input_vector> weights;

  /*
   * Dense layer constructor
   */
  Dense()
  {
    /*
     * Create num_outputs x num_inputs weight matrix
     */
    weights.resize(num_outputs);
    for (input_vector& w: weights)
    {
      generate(w.begin(), w.end(), *initializer);
    }
  }
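  /*
   * Dense layer forward pass: y[i] = dot(weights[i], x)
   *
   * Note: a minimal sketch, assumed to match the forward()
   * implementation from the previous post.
   */
  output_vector forward(const input_vector& x)
  {
    output_vector y;
    transform(weights.begin(), weights.end(), y.begin(),
              [&x](const input_vector& w)
              {
                return inner_product(w.begin(), w.end(), x.begin(), static_cast<T>(0));
              });
    return y;
  }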
  /*
   * Dense layer backward pass
   */
  input_vector backward(input_vector& input, output_vector dloss_dy)
  {
    /*
     * Weight update according to SGD algorithm with momentum = 0.0 is:
     *  w = w - learning_rate * d_loss/dw
     *
     * For simplicity assume learning_rate = 1.0
     *
     * d_loss/dw = dloss/dy * dy/dw
     *
     * dy/dw is:
     *  y = w[0]*x[0] + w[1]*x[1] + ... + w[n]*x[n]
     *  dy/dw[i] = x[i]
     *
     * For clarity we:
     *  assume learning_rate = 1.0
     *  first compute dw
     *  second update weights by subtracting dw
     */

    /*
     * compute dw
     *  dw = outer(x, dloss_dy)
     */
    vector<input_vector> dw;
    for (auto dloss_dy_i: dloss_dy)
    {
      auto row = input;
      for_each(row.begin(), row.end(), [dloss_dy_i](T& xi){ xi *= dloss_dy_i; });
      dw.push_back(row);
    }

    /*
     * Compute backpropagated gradient before updating the weights:
     *  ret = dloss_dy * W^T, i.e. ret[j] = sum_i weights[i][j] * dloss_dy[i]
     */
    input_vector ret;
    ret.fill(static_cast<T>(0));
    for (size_t i = 0; i < num_outputs; i++)
    {
      transform(ret.begin(), ret.end(), weights[i].begin(), ret.begin(),
                [&dloss_dy, i](const T& acc, const T& w)
                {
                  return acc + w * dloss_dy[i];
                });
    }

    /*
     * compute w = w - dw
     * assume learning rate = 1.0
     */
    transform(weights.begin(), weights.end(), dw.begin(), weights.begin(),
              [](input_vector& left, input_vector& right)
              {
                transform(left.begin(), left.end(), right.begin(),
                          left.begin(), minus<T>());
                return left;
              });

    return ret;
  }
  /*
   * Helper function to convert Dense layer to string
   * Used for printing the layer
   */
  operator std::string() const
  {
    std::ostringstream ret;
    ret.precision(1);
    for (size_t y = 0; y < weights[0].size(); y++)
    {
      for (size_t x = 0; x < weights.size(); x++)
      {
        if (weights[x][y] >= 0)
          ret << " ";
        ret << std::fixed << weights[x][y] << " ";
      }
      ret << std::endl;
    }
    return ret.str();
  }

  /*
   * Helper function to cout Dense layer object
   */
  friend ostream& operator<<(ostream& os, const Dense& dense)
  {
    os << (string)dense;
    return os;
  }
};
The Mean Squared Error class is the same as in the previous post.
/*
 * Mean Squared Error loss class
 * Parameters:
 *  num_inputs: number of inputs to MSE function.
 *  T: input type, float by default.
 */
template<size_t num_inputs, typename T = float>
struct MSE
{
  /*
   * Forward pass computes MSE loss for inputs y (label) and yhat (predicted)
   */
  static T forward(const array<T, num_inputs>& y, const array<T, num_inputs>& yhat)
  {
    T loss = transform_reduce(y.begin(), y.end(), yhat.begin(), 0.0, plus<T>(),
                              [](const T& left, const T& right)
                              {
                                return (left - right) * (left - right);
                              });
    return loss / num_inputs;
  }

  /*
   * Backward pass computes dloss/dyhat for inputs y (label) and yhat (predicted)
   *
   *  loss = sum((yhat[i] - y[i])^2) / N
   *         i=0...N-1
   *  where N is number of inputs
   *
   *  d_loss/d_yhat[i] = 2 * (yhat[i] - y[i]) / N
   */
  static array<T, num_inputs> backward(const array<T, num_inputs>& y,
                                       const array<T, num_inputs>& yhat)
  {
    array<T, num_inputs> de_dy;
    transform(y.begin(), y.end(), yhat.begin(), de_dy.begin(),
              [](const T& left, const T& right)
              {
                return 2 * (right - left) / num_inputs;
              });
    return de_dy;
  }
};
Finally, in the main function, we’ll declare the input x and expected output ytrue arrays, containing the same values as in our Python example.
Then we’ll compute the forward and backward passes, and print the network output and updated weights.
int main(void)
{
  const int num_inputs = 2;
  const int num_outputs = 2;
  const int num_iterations = 1000;

  array<float, num_inputs> x = {2.0, 0.5};
  array<float, num_outputs> ytrue = {2.0, 1.0};

  /*
   * Create dense layers and MSE loss
   */
  Dense<num_inputs, num_outputs> dense1;
  Dense<num_inputs, num_outputs> dense2;
  MSE<num_outputs> mse_loss;

  /*
   * Compute Dense layer outputs y1, y2 for input x
   */
  auto y1 = dense1.forward(x);
  auto y2 = dense2.forward(y1);

  /*
   * Compute MSE loss for output y2 and label ytrue
   */
  auto loss = mse_loss.forward(ytrue, y2);

  /*
   * Benchmark Dense layer inference latency
   */
  auto ts = high_resolution_clock::now();
  for (auto iter = 0; iter < num_iterations; iter++)
  {
    y1 = dense1.forward(x);
    y2 = dense2.forward(y1);
  }
  auto te = high_resolution_clock::now();
  auto dt_us = (float)duration_cast<microseconds>(te - ts).count() / num_iterations;

  /*
   * Print DNN input x
   */
  printf("input x=");
  for_each(x.begin(), x.end(), print_fn);
  printf("\n");

  /*
   * Print DNN outputs y1, y2
   */
  printf("output y1=");
  for_each(y1.begin(), y1.end(), print_fn);
  printf("\n");

  printf("output y2=");
  for_each(y2.begin(), y2.end(), print_fn);
  printf("\n");

  /*
   * Print loss for output y2 and label ytrue
   */
  printf("loss: %f\n", loss);

  /*
   * Compute dloss/dy gradients
   */
  auto dloss_dy = mse_loss.backward(ytrue, y2);

  /*
   * Back propagate loss
   */
  auto bw2 = dense2.backward(y1, dloss_dy);
  dense1.backward(x, bw2);

  /*
   * print dloss/dy
   */
  printf("d(loss)/dy: ");
  for_each(dloss_dy.begin(), dloss_dy.end(), print_fn);
  printf("\n");

  /*
   * Print updated Dense layer weights
   */
  printf("updated dense 1 layer weights:\n%s", ((string)dense1).c_str());
  printf("updated dense 2 layer weights:\n%s", ((string)dense2).c_str());

  /*
   * Print average latency
   */
  printf("time dt=%f usec\n", dt_us);

  return 0;
}
After compiling and running the C++ example we get:
$ g++ -o dense2 -std=c++2a dense2.cpp && ./dense2
input x=2.0 0.5
output y1=2.5 2.5
output y2=5.0 5.0
loss: 12.500000
d(loss)/dy: 3.0 4.0
updated dense 1 layer weights:
-13.0 -13.0
-2.5 -2.5
updated dense 2 layer weights:
-6.5 -9.0
-6.5 -9.0
time dt=0.187000 usec
As one can verify, the forward pass output of the C++ implementation matches the Python code.
The gradients and Dense layer weights after backpropagation also match between the Python and C++ code.
Python source code for this example is at dense2.py.
The C++ implementation is at dense2.cpp.