Dense layer with backpropagation in C++
Let’s code a DNN in C++.
Initially we’ll try to keep it as simple as possible.
The network will consist of a single fully connected Dense layer.
It should support forward and back propagation.
We assume:
there's no bias in the dense layer;
there's no non-linear activation;
the loss function is Mean Squared Error.
The Dense layer output \(\hat Y\) is the matrix product of input \(X\) and weights \(W\):
\[\hat Y = X W\]
Where:
\(\hat Y\) is the predicted output vector:
\[\hat Y = \left( \begin{array}{ccc}
\hat y_{0} & \hat y_{1} & \ldots & \hat y_{N-1} \\
\end{array} \right)\]
\(X\) is the input vector:
\[X = \left( \begin{array}{ccc}
x_{0} & x_{1} & \ldots & x_{M-1} \\
\end{array} \right)\]
\(W\) is the weights matrix:
\[W = \left( \begin{array}{ccc}
w_{0,0} & w_{0,1} & \ldots & w_{0,N-1} \\
w_{1,0} & w_{1,1} & \ldots & w_{1,N-1} \\
\vdots & \vdots & \ldots & \vdots \\
w_{M-1,0} & w_{M-1,1} & \ldots & w_{M-1,N-1} \\
\end{array} \right)\]
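For example, with input \(X = \left( \begin{array}{ccc} 2 & 0.5 & 1 \end{array} \right)\) and all weights initialized to 1.0 (the values used in the experiment below):
\[\hat Y = \left( \begin{array}{ccc} 2 & 0.5 & 1 \end{array} \right) \left( \begin{array}{cc} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{array} \right) = \left( \begin{array}{cc} 3.5 & 3.5 \end{array} \right)\]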
The Mean Squared Error (MSE) loss between the predicted output \(\hat Y\) and the expected output \(Y\) is:
\[E (Y, \hat Y) = \frac {1} {N} \sum_{i=0}^{N-1} ( Y_{i} - \hat Y_{i} )^2\]
Where \(Y\) is the expected output vector:
\[Y = \left( \begin{array}{ccc}
y_{0} & y_{1} & \ldots & y_{N-1} \\
\end{array} \right)\]
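For example, with expected output \(Y = \left( \begin{array}{cc} 1.5 & 1.0 \end{array} \right)\) and the predicted \(\hat Y = \left( \begin{array}{cc} 3.5 & 3.5 \end{array} \right)\) from above:
\[E = \frac {1} {2} \left( (1.5 - 3.5)^2 + (1.0 - 3.5)^2 \right) = \frac {4 + 6.25} {2} = 5.125\]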
Error backpropagation
For input \(X\), we want to minimize the MSE loss between our network output and the expected output by adjusting the Dense layer weights along the error gradient \(\frac {\partial E} {\partial W}\):
\[W_{t+1} = W_{t} - \alpha * \frac {\partial E} {\partial W}\]
Here \(\alpha\) is the learning rate and \(\frac {\partial E} {\partial W}\) is the error gradient with respect to the weights.
Let's find the error gradient \(\frac {\partial E} {\partial W}\). Since
\[\hat Y = X W\]
applying the chain rule gives:
\[\frac {\partial E (\hat Y) } {\partial W} = \frac {\partial E} {\partial \hat Y} * \frac {\partial \hat Y} {\partial W}\]
where
\[\frac {\partial E} {\partial \hat Y} = \frac {2} {N} ( \hat {Y} - Y )\]
and, since \(\hat y_j = \sum_{i=0}^{M-1} x_i w_{i,j}\) gives \(\frac {\partial \hat y_j} {\partial w_{i,j}} = x_i\) for each element,
\[\frac {\partial \hat Y} {\partial W} = \frac {\partial (X W)} {\partial W} = X^T\]
Finally, the error gradient \(\frac {\partial E} {\partial W}\) used for the weights update is:
\[\frac {\partial E} {\partial W} = \frac {2} {N} ( \hat {Y} - Y ) * X^T\]
or, written as an outer product:
\[\frac {\partial E} {\partial W} = \frac {2} {N} ( \hat {Y} - Y ) \otimes X\]
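Checking against the example values: \(\frac {\partial E} {\partial \hat Y} = \frac {2} {2} \left( \begin{array}{cc} 3.5 - 1.5 & 3.5 - 1.0 \end{array} \right) = \left( \begin{array}{cc} 2 & 2.5 \end{array} \right)\), and its outer product with \(X = \left( \begin{array}{ccc} 2 & 0.5 & 1 \end{array} \right)\) gives
\[\frac {\partial E} {\partial W} = \left( \begin{array}{cc} 4 & 5 \\ 1 & 1.25 \\ 2 & 2.5 \end{array} \right)\]
These are exactly the dloss_dy and grad values printed by the programs below.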
Let’s write a Python implementation with TF2/Keras. We’ll use it to validate the C++ code in the following section.
For this experiment I’ve used the following software versions:
$ python3 -m pip freeze | grep "numpy\|tensorflow"
numpy==1.19.5
tensorflow==2.5.0rc2
$ g++ --version
g++ 9.3.0
Import TF and Keras. We’ll define a network with 3 inputs and 2 outputs.
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop
import numpy as np
num_inputs = 3
num_outputs = 2
Define Keras sequential network with single Dense layer.
# Create one layer model
model = tf.keras.Sequential()
# No bias, no activation, initialize weights with 1.0
model.add(Dense(units=num_outputs, use_bias=False, activation=None,
                kernel_initializer=tf.keras.initializers.ones()))
Use mean squared error as the loss function.
# use MSE as loss function
loss_fn = tf.keras.losses.MeanSquaredError()
Hardcode the model input and expected model output. We’ll use the same array values later in the C++ implementation.
# Arbitrary model input
x = np.array([2.0, 0.5, 1])
# Expected output
y_true = np.array([1.5, 1.0])
Use the Stochastic Gradient Descent (SGD) optimizer.
The SGD weight update rule is
\(W = W - LR * \nabla\)
where \(\nabla\) is the weight gradient and \(LR\) is the learning rate.
For now we’ll use a learning rate equal to 1.0.
# SGD update rule for parameter w with gradient g when momentum is 0 is as follows:
# w = w - learning_rate * g
#
# For simplicity make learning_rate=1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=1.0, momentum=0.0)
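As a concrete check with the values above: all weights start at 1.0, and the gradient entry for \(w_{0,0}\) is \(x_0 \cdot \frac {\partial E} {\partial \hat y_0} = 2 \cdot 2 = 4\), so
\[w_{0,0} \leftarrow 1 - 1.0 \cdot 4 = -3\]
which matches the first entry of the updated weights printed below.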
In the training loop we’ll compute the model output for input x, compute the loss, and backpropagate it.
# Get model output y for input x, compute loss, and record gradients
with tf.GradientTape(persistent=True) as tape:
    # get model output y for input x
    # add newaxis for batch size of 1
    xt = tf.convert_to_tensor(x[np.newaxis, ...])
    tape.watch(xt)
    y = model(xt)
    # obtain MSE loss
    loss = loss_fn(y_true, y)

# loss gradient with respect to loss input y
dloss_dy = tape.gradient(loss, y)

# adjust Dense layer weights
grad = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grad, model.trainable_variables))
Finally we’ll print inputs, outputs, gradients, and updated Dense layer weights.
# print model input and output excluding batch dimension
print(f"input x={x}")
print(f"output y={y[0]}")
print(f"expected output y_true={y_true}")
# print MSE loss
print(f"loss={loss}")
# print loss gradients
print(f"dloss_dy={dloss_dy[0].numpy()}")
# print weight gradients d_loss/d_w
print(f"grad=\n{grad[0].numpy()}")
# print updated dense layer weights
print(f"vars=\n{model.trainable_variables[0].numpy()}")
After running the Python example we get:
$ python3 dense.py
input x=[2.  0.5 1. ]
output y=[3.5 3.5]
expected output y_true=[1.5 1. ]
loss=5.125
dloss_dy=[2.  2.5]
grad=
[[4.   5.  ]
 [1.   1.25]
 [2.   2.5 ]]
vars=
[[-3.   -4.  ]
 [ 0.   -0.25]
 [-1.   -1.5 ]]
Let’s code the same example in C++.
We’ll implement it using the C++ STL.
The chrono header is included for the clocks used to benchmark run time.
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cassert>
#include <numeric>
#include <array>
#include <chrono>
#include <iostream>
#include <string>
#include <functional>
#include <iterator>

using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::microseconds;
Dense layer weights initializer
/*
 * Constant 1.0 weight initializer
 */
static auto ones_initializer = []() -> float
{
  return 1.0;
};
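The initializer template parameter is just a capture-less function, so other schemes can be plugged in the same way. As a hypothetical illustration (not used in this example; it would additionally need <cstdlib> for rand()):
/*
 * Hypothetical uniform random initializer in [-0.5, 0.5],
 * usable as e.g. Dense<3, 2, float, uniform_initializer>
 */
static auto uniform_initializer = []() -> float
{
  return static_cast<float>(rand()) / RAND_MAX - 0.5f;
};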
Dense layer class template includes forward() and backward() functions.
/*
 * Dense layer class template
 *
 * Parameters:
 *  num_inputs: number of inputs to Dense layer
 *  num_outputs: number of Dense layer outputs
 *  T: input, output, and weights type in the dense layer
 *  initializer: weights initializer function
 */
template<size_t num_inputs, size_t num_outputs, typename T = float,
         T (*initializer)() = ones_initializer>
struct Dense
{
  typedef array<T, num_inputs> input_vector;
  typedef array<T, num_outputs> output_vector;

  /*
   * Layer weights
   */
  vector<input_vector> weights;

  /*
   * Dense class constructor
   */
  Dense()
  {
    /*
     * Create num_outputs x num_inputs weights matrix
     */
    weights.resize(num_outputs);
    for (input_vector& w: weights)
    {
      generate(w.begin(), w.end(), *initializer);
    }
  }
  /*
   * Dense forward pass
   */
  array<T, num_outputs> forward(const input_vector& x)
  {
    /*
     * Check for input size mismatch
     */
    assert(x.size() == weights[0].size());

    /*
     * Each layer output is the dot product of the input with one weights row
     */
    array<T, num_outputs> activation;
    transform(weights.begin(), weights.end(), activation.begin(),
              [x](const input_vector& w)
              {
                T val = inner_product(x.begin(), x.end(), w.begin(), 0.0);
                return val;
              });
    return activation;
  }
  /*
   * Dense layer backward pass
   */
  void backward(const input_vector& input, const output_vector& dloss_dy)
  {
    /*
     * Weight update according to SGD algorithm with momentum = 0.0 is:
     *  w = w - learning_rate * d_loss/dw
     *
     * For simplicity assume learning_rate = 1.0
     *
     * d_loss/dw = dloss/dy * dy/dw
     *
     * dy/dw is:
     *  y = w[0]*x[0] + w[1]*x[1] + ... + w[n]*x[n]
     *  dy/dw[i] = x[i]
     *
     * For clarity we:
     *  assume learning_rate = 1.0
     *  first compute dw
     *  second update weights by subtracting dw
     */

    /*
     * compute dw
     * dw = outer(x, dloss_dy)
     */
    vector<input_vector> dw;
    for (auto const& dloss_dyi: dloss_dy)
    {
      auto row = input;
      for_each(row.begin(), row.end(), [dloss_dyi](T& xi){ xi *= dloss_dyi; });
      dw.emplace_back(row);
    }

    /*
     * compute w = w - dw
     * assume learning rate = 1.0
     */
    transform(weights.begin(), weights.end(), dw.begin(), weights.begin(),
              [](input_vector& left, const input_vector& right)
              {
                transform(left.begin(), left.end(), right.begin(),
                          left.begin(), minus<T>());
                return left;
              });
  }
  /*
   * Helper function to convert Dense layer to string.
   * Used for printing the layer weights.
   */
  operator std::string() const
  {
    string ret;
    for (size_t y = 0; y < weights[0].size(); y++)
    {
      for (size_t x = 0; x < weights.size(); x++)
      {
        ret += to_string(weights[x][y]) + " ";
      }
      ret += "\n";
    }
    return ret;
  }

  /*
   * Helper function to cout Dense layer weights
   */
  friend ostream& operator<<(ostream& os, const Dense& dense)
  {
    os << (string)dense;
    return os;
  }
};
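As a quick sanity check of the class on its own (a minimal sketch assuming the includes and Dense definition above, e.g. inside a main()):
Dense<3, 2> layer;                    // 3 inputs, 2 outputs, all weights 1.0
array<float, 3> x = {2.0, 0.5, 1.0};
auto y = layer.forward(x);            // yields {3.5, 3.5}
cout << layer;                        // prints the weights as a 3x2 matrix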
The Mean Squared Error class will need its own forward and backward functions.
/*
 * Mean Squared Error loss class
 * Parameters:
 *  num_inputs: number of inputs to MSE function.
 *  T: input type, float by default.
 */
template<size_t num_inputs, typename T = float>
struct MSE
{
  /*
   * Forward pass computes MSE loss for inputs y (label) and yhat (predicted)
   */
  static T forward(const array<T, num_inputs>& y, const array<T, num_inputs>& yhat)
  {
    T loss = transform_reduce(y.begin(), y.end(), yhat.begin(), 0.0, plus<T>(),
                              [](const T& left, const T& right)
                              {
                                return (left - right) * (left - right);
                              });
    return loss / num_inputs;
  }

  /*
   * Backward pass computes dloss/dyhat for inputs y (label) and yhat (predicted)
   *
   * loss = sum((yhat[i] - y[i])^2) / N
   *   i=0...N-1
   * where N is number of inputs
   *
   * d_loss/d_yhat[i] = 2 * (yhat[i] - y[i]) / N
   */
  static array<T, num_inputs> backward(const array<T, num_inputs>& y,
                                       const array<T, num_inputs>& yhat)
  {
    array<T, num_inputs> dloss_dy;
    transform(y.begin(), y.end(), yhat.begin(), dloss_dy.begin(),
              [](const T& left, const T& right)
              {
                return 2 * (right - left) / num_inputs;
              });
    return dloss_dy;
  }
};
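A quick numerical check of the loss class against the values worked out earlier (a minimal sketch assuming the MSE definition above):
array<float, 2> y_true = {1.5, 1.0};
array<float, 2> yhat = {3.5, 3.5};
auto loss = MSE<2>::forward(y_true, yhat);      // ((1.5-3.5)^2 + (1.0-3.5)^2) / 2 = 5.125
auto dloss_dy = MSE<2>::backward(y_true, yhat); // {2.0, 2.5}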
Finally, in the main function, we'll declare the input x and expected output y_true arrays, containing the same values as in our Python example.
Then we'll compute the forward and backward passes, and print the network output and updated weights.
int main(void)
{
  const int num_inputs = 3;
  const int num_outputs = 2;
  const int num_iterations = 1000;
  auto print_fn = [](const float& x) -> void { printf("%.5f ", x); };

  array<float, num_inputs> x = {2.0, 0.5, 1.0};
  array<float, num_outputs> y_true = {1.5, 1.0};

  /*
   * Create dense layer and MSE loss
   */
  Dense<num_inputs, num_outputs> dense;
  MSE<num_outputs> mse_loss;

  /*
   * Compute Dense layer output y for input x
   */
  auto yhat = dense.forward(x);

  /*
   * Compute MSE loss for output y and expected y_true
   */
  auto loss = mse_loss.forward(y_true, yhat);

  /*
   * Run inference 1000 times and benchmark dense layer latency
   */
  auto ts = high_resolution_clock::now();
  for (auto iter = 0; iter < num_iterations; iter++) dense.forward(x);
  auto te = high_resolution_clock::now();
  auto dt_us = (float)duration_cast<microseconds>(te - ts).count() / num_iterations;

  /*
   * Print DNN input x
   */
  printf("input x=");
  for_each(x.begin(), x.end(), print_fn);
  printf("\n");

  /*
   * Print DNN output y
   */
  printf("output y=");
  for_each(yhat.begin(), yhat.end(), print_fn);
  printf("\n");

  /*
   * Print expected output y_true
   */
  printf("expected output y=");
  for_each(y_true.begin(), y_true.end(), print_fn);
  printf("\n");

  /*
   * Print loss for output y and label y_true
   */
  printf("loss: %f\n", loss);

  /*
   * Compute dloss/dy gradients
   */
  auto dloss_dy = mse_loss.backward(y_true, yhat);

  /*
   * Back propagate loss
   */
  dense.backward(x, dloss_dy);

  /*
   * print dloss/dy
   */
  printf("loss gradient: ");
  for_each(dloss_dy.begin(), dloss_dy.end(), print_fn);
  printf("\n");

  /*
   * Print updated Dense layer weights
   */
  printf("updated dense layer weights:\n%s", ((string)dense).c_str());

  /*
   * Print average latency
   */
  printf("time dt=%f usec\n", dt_us);

  return 0;
}
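A possible compile command (the exact flags are my assumption; transform_reduce requires C++17):
$ g++ -std=c++17 -O2 -o dense dense.cpp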
After compiling and running the C++ example we get:
$ ./dense
input x=2.00000 0.50000 1.00000
output y=3.50000 3.50000
expected output y=1.50000 1.00000
loss: 5.125000
loss gradient: 2.00000 2.50000
updated dense layer weights:
-3.000000 -4.000000
0.000000 -0.250000
-1.000000 -1.500000
time dt=0.093000 usec
As one can verify, the forward pass output of the C++ implementation matches the Python code.
The gradients and Dense layer weights after backpropagation also match between the Python and C++ code.
The Python source code for this example is at dense.py
The C++ implementation is at dense.cpp