[
  {
    "path": ".clang-format",
    "content": "AccessModifierOffset: -4\nAlignAfterOpenBracket: true\nAlignConsecutiveAssignments: true\nAlignConsecutiveDeclarations: false\nAlignEscapedNewlinesLeft: true\nAlignTrailingComments: true\nAllowAllParametersOfDeclarationOnNextLine: false\nAllowShortBlocksOnASingleLine: false\nAllowShortCaseLabelsOnASingleLine: false\nAllowShortFunctionsOnASingleLine: false\nAllowShortIfStatementsOnASingleLine: false\nAllowShortLoopsOnASingleLine: false\nAlwaysBreakAfterReturnType: None\nAlwaysBreakBeforeMultilineStrings: true\nAlwaysBreakTemplateDeclarations: true\nBinPackArguments: false\nBinPackParameters: false\nBreakBeforeBraces: Custom\nBraceWrapping:\n        AfterClass: true\n        AfterControlStatement: true\n        AfterEnum: true\n        AfterFunction: true\n        AfterNamespace: true\n        AfterObjCDeclaration: true\n        AfterStruct: true\n        AfterUnion: true\n        AfterExternBlock: true\n        BeforeCatch: true\n        BeforeElse: true\n        IndentBraces: false\n        SplitEmptyFunction: false\n        SplitEmptyRecord: false\n        SplitEmptyNamespace: false\nBreakBeforeBinaryOperators: All\nBreakBeforeTernaryOperators: true\nBreakConstructorInitializers: BeforeComma\nBreakStringLiterals: true\nColumnLimit: 80\nCommentPragmas: ''\nCompactNamespaces: false\nConstructorInitializerAllOnOneLineOrOnePerLine: false\nConstructorInitializerIndentWidth: 4\nContinuationIndentWidth: 4\nCpp11BracedListStyle: true\nDerivePointerBinding: false\nFixNamespaceComments: true\nIndentCaseLabels: false\nIndentPPDirectives: AfterHash\nIndentWidth: 4\nIndentWrappedFunctionNames: false\nKeepEmptyLinesAtTheStartOfBlocks: false\nLanguage: Cpp\nMaxEmptyLinesToKeep: 1\nNamespaceIndentation: Inner\nPenaltyBreakBeforeFirstCallParameter: 0\nPenaltyBreakComment: 0\nPenaltyBreakFirstLessLess: 0\nPenaltyBreakString: 1\nPenaltyExcessCharacter: 10\nPenaltyReturnTypeOnItsOwnLine: 20\nPointerAlignment: Left\nSortIncludes: true\nSortUsingDeclarations: 
true\nSpaceAfterTemplateKeyword: true\nSpaceBeforeAssignmentOperators: true\nSpaceBeforeParens: ControlStatements\nSpaceInEmptyParentheses: false\nSpacesBeforeTrailingComments: 1\nSpacesInAngles: false\nSpacesInCStyleCastParentheses: false\nSpacesInContainerLiterals: false\nSpacesInParentheses: false\nSpacesInSquareBrackets: false\nStandard: C++11\nTabWidth: 4\nUseTab: Never\n"
  },
  {
    "path": ".gitignore",
    "content": "# System\n.DS_Store\n\n# IDE/Editor\n.ccls-cache\n.vscode\n\n# Build\nbuild\nbuild-clang\nbuild-gcc\nbuild-release\n.cache\n\n# Statically generated documentation\nsite/\n"
  },
  {
    "path": "CMakeLists.txt",
    "content": "cmake_minimum_required(VERSION 3.16)\n\nproject(nn_in_a_weekend LANGUAGES CXX)\n\nset(CMAKE_EXPORT_COMPILE_COMMANDS ON)\n\nadd_subdirectory(src)"
  },
  {
    "path": "README.md",
    "content": "# C++ Neural Network in a Weekend\n\nThis repository is the companion code to the article \"Neural Network in a Weekend.\" Readers are welcome to clone the repository and use the code herein as a reference when following along with the article. Pull requests and issues filed for errors and bugs in code or documentation are welcome and appreciated. However, pull requests that introduce new features are unlikely to be considered, as the ultimate goal of this code is to be tractable for a newer practitioner getting started with deep learning architectures.\n\n[Article pdf link](https://github.com/jeremyong/cpp_nn_in_a_weekend/raw/master/doc/DOC.pdf)\n\n## Compilation and Usage\n\n    mkdir build\n    cd build\n    # substitute Ninja for your preferred generator\n    cmake .. -G Ninja\n    ninja\n    # trains the network and writes the learned parameters to disk\n    ./src/nn train ../data/train\n    # evaluate the model loss and accuracy based on the trained parameters\n    ./src/nn evaluate ../data/test ./ff.params\n\nNote that the actual location of the `nn` executable may depend on your build system and build type. For performance reasons, it is recommended to run the training itself with an optimized build, reverting to a development/debug build only when debugging is needed.\n\n## Conventions\n\n1.  Member variables have a single underscore suffix (e.g. `member_variable_`)\n2.  The `F.T.R.` acronym stands for \"For the reader\" and precedes suggestions for experimentation, improvements, or alternative implementations\n3.  Throughout, you may see the type aliases `num_t` and `rne_t`. These aliases refer to `float` and `std::mt19937` respectively and are defined in `Model.hpp` to easily experiment with alternative precisions and random number engines. The reader may wish to make these parameters changeable by other means.\n\n## General Code Structure\n\nThe neural network is modeled as a computational graph. 
The graph itself is the `Model` defined in `Model.hpp`. Nodes in the computational graph override the `Node` base class and must implement various methods to explain how data flows through the node (forwards and backwards).\n\nThe fully-connected feedforward node in this example is implemented as `FFNode` in `FFNode.hpp`. The cross-entropy loss node is implemented in `CELossNode.hpp`. Together, these two nodes are all that is needed to train our example on the MNIST dataset.\n\n## Data\n\nFor your convenience, the MNIST data used to train and test the network is provided uncompressed in the `data/` subdirectory. The data is structured like so:\n\n### Images\n\nImage data can be parsed using code provided in the `MNIST.hpp` header, but the data is described here as well. Multi-byte integers are stored with the MSB first, meaning that on a little-endian architecture, the bytes must be flipped. Image pixel data is stored in row-major order and packed contiguously one after another.\n\n     Bytes\n    [00-03] 0x00000803 (Magic Number: 2051)\n    [04-07] image count\n    [08-11] rows\n    [12-15] columns\n    [16]    pixel[0, 0]\n    [17]    pixel[0, 1]\n    ...\n\n### Labels\n\nLabel data is parsed according to the following byte layout:\n\n     Bytes\n    [00-03] 0x00000801 (Magic Number: 2049)\n    [04-07] label count\n    [8]     label 1\n    [9]     label 2\n    ...\n\nThe parser provided by the `MNIST` input node validates the magic numbers to ensure the machine endianness is as expected, and also validates that the image data and label data sizes match.\n"
  },
  {
    "path": "doc/DOC.html",
    "content": "<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"\" xml:lang=\"\">\n<head>\n  <meta charset=\"utf-8\" />\n  <meta name=\"generator\" content=\"pandoc\" />\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, user-scalable=yes\" />\n  <meta name=\"author\" content=\"Jeremy Ong\" />\n  <title>C++ Neural Network in a Weekend</title>\n  <style>\n    code{white-space: pre-wrap;}\n    span.smallcaps{font-variant: small-caps;}\n    span.underline{text-decoration: underline;}\n    div.column{display: inline-block; vertical-align: top; width: 50%;}\n    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}\n    ul.task-list{list-style: none;}\n    pre > code.sourceCode { white-space: pre; position: relative; }\n    pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }\n    pre > code.sourceCode > span:empty { height: 1.2em; }\n    code.sourceCode > span { color: inherit; text-decoration: inherit; }\n    div.sourceCode { margin: 1em 0; }\n    pre.sourceCode { margin: 0; }\n    @media screen {\n    div.sourceCode { overflow: auto; }\n    }\n    @media print {\n    pre > code.sourceCode { white-space: pre-wrap; }\n    pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }\n    }\n    pre.numberSource code\n      { counter-reset: source-line 0; }\n    pre.numberSource code > span\n      { position: relative; left: -4em; counter-increment: source-line; }\n    pre.numberSource code > span > a:first-child::before\n      { content: counter(source-line);\n        position: relative; left: -1em; text-align: right; vertical-align: baseline;\n        border: none; display: inline-block;\n        -webkit-touch-callout: none; -webkit-user-select: none;\n        -khtml-user-select: none; -moz-user-select: none;\n        -ms-user-select: none; user-select: none;\n        padding: 0 4px; width: 4em;\n        color: #aaaaaa;\n      }\n    pre.numberSource { margin-left: 3em; border-left: 
1px solid #aaaaaa;  padding-left: 4px; }\n    div.sourceCode\n      {   }\n    @media screen {\n    pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }\n    }\n    code span.al { color: #ff0000; font-weight: bold; } /* Alert */\n    code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */\n    code span.at { color: #7d9029; } /* Attribute */\n    code span.bn { color: #40a070; } /* BaseN */\n    code span.bu { } /* BuiltIn */\n    code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */\n    code span.ch { color: #4070a0; } /* Char */\n    code span.cn { color: #880000; } /* Constant */\n    code span.co { color: #60a0b0; font-style: italic; } /* Comment */\n    code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */\n    code span.do { color: #ba2121; font-style: italic; } /* Documentation */\n    code span.dt { color: #902000; } /* DataType */\n    code span.dv { color: #40a070; } /* DecVal */\n    code span.er { color: #ff0000; font-weight: bold; } /* Error */\n    code span.ex { } /* Extension */\n    code span.fl { color: #40a070; } /* Float */\n    code span.fu { color: #06287e; } /* Function */\n    code span.im { } /* Import */\n    code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */\n    code span.kw { color: #007020; font-weight: bold; } /* Keyword */\n    code span.op { color: #666666; } /* Operator */\n    code span.ot { color: #007020; } /* Other */\n    code span.pp { color: #bc7a00; } /* Preprocessor */\n    code span.sc { color: #4070a0; } /* SpecialChar */\n    code span.ss { color: #bb6688; } /* SpecialString */\n    code span.st { color: #4070a0; } /* String */\n    code span.va { color: #19177c; } /* Variable */\n    code span.vs { color: #4070a0; } /* VerbatimString */\n    code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */\n  </style>\n  <script 
src=\"https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.11.1/katex.min.js\"></script>\n  <script>document.addEventListener(\"DOMContentLoaded\", function () {\n   var mathElements = document.getElementsByClassName(\"math\");\n   var macros = [];\n   for (var i = 0; i < mathElements.length; i++) {\n    var texText = mathElements[i].firstChild;\n    if (mathElements[i].tagName == \"SPAN\") {\n     katex.render(texText.data, mathElements[i], {\n      displayMode: mathElements[i].classList.contains('display'),\n      throwOnError: false,\n      macros: macros,\n      fleqn: false\n     });\n  }}});\n  </script>\n  <link rel=\"stylesheet\" href=\"https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.11.1/katex.min.css\" />\n  <!--[if lt IE 9]>\n    <script src=\"//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js\"></script>\n  <![endif]-->\n  \n</head>\n<body>\n<header id=\"title-block-header\">\n<h1 class=\"title\">C++ Neural Network in a Weekend</h1>\n<p class=\"author\">Jeremy Ong</p>\n</header>\n<h2 id=\"introduction\">Introduction</h2>\n<p>Would you like to write a neural network from start to finish? Are you perhaps shaky on some of the fundamental concepts and derivations, such as categorical cross-entropy loss or backpropagation? Alternatively, would you like an introduction to machine learning without relying on “magical” frameworks that seem to perform AI miracles with only a few lines of code (and just as little intuition)? If so, this article was written for you.</p>\n<p>Deep learning as a technology and discipline has been booming. Nearly every facet of deep learning is teeming with progress and healthy competition to achieve state of the art performance and efficiency. It’s no surprise that resources tend to emphasize the “latest and greatest” in feats such as object recognition, natural language parsing, “deep fakes”, and more. In contrast, fewer resources expand as much on the practical <em>engineering</em> aspects of deep learning. 
That is, how should a deep learning framework be structured? How do you go about rolling your own infrastructure instead of relying on Keras, PyTorch, TensorFlow, or any of the other dominant frameworks? Whether you wish to write your own for learning purposes, or you need to deploy a neural network on a constrained (e.g. embedded) device, there is plenty to be gained from authoring a neural network from scratch.</p>\n<p>The neural network outlined here is hosted on <a href=\"https://github.com/jeremyong/cpp_nn_in_a_weekend\">GitHub</a> and has enough abstractions to vaguely resemble a production network, without being so over-engineered as to be indigestible in a sitting or two. The training and test data provided is the venerable <a href=\"http://yann.lecun.com/exdb/mnist/\">MNIST</a> dataset of handwritten digits. While more exotic (and original) datasets exist, MNIST is chosen here because its sheer ubiquity guarantees you can find corresponding literature to help drive further experimentation, or troubleshoot when things go wrong.</p>\n<h2 id=\"background\">Background</h2>\n<p>This section serves as a moderately high-level description of the major mathematical underpinnings of neural networks and may be safely skipped by those who prefer to jump straight to the code.</p>\n<p>Suppose we have a task we would like a machine learning model to complete (e.g. recognizing handwritten digits). At a high level, we need to perform the following tasks:</p>\n<ol type=\"1\">\n<li>First, we must conceptualize the task as a “function” such that the inputs and outputs of the task can be described in a concrete mathematical sense (and are therefore amenable to programming).</li>\n<li>Second, we need a way to quantify the degree to which our model is performing poorly against a known set of correct answers. 
This is typically denoted as the <em>loss</em> or <em>objective</em> function of the model.</li>\n<li>Third, we need an <em>optimization strategy</em> which will describe how to adjust the model after feedback is provided regarding the model’s performance as per the loss function described above.</li>\n<li>Fourth, we need a <em>regularization strategy</em> to address inadvertently tuning the model with a high degree of specificity to our training data, at the cost of generalized performance when handling inputs not yet encountered.</li>\n<li>Fifth, we need an <em>architecture</em> for our model, including how inputs are transformed into outputs and an enumeration of all the adjustable parameters the model supports.</li>\n<li>Finally, we need a robust <em>implementation</em> that executes the above within memory and execution budgets, accounting for floating-point stability, reproducibility, and a number of other engineering-related matters.</li>\n</ol>\n<p><em>Deep learning</em> is distinct from other machine learning models in that the architecture is heavily over-parameterized and based on simpler <em>building blocks</em> as opposed to bespoke components. The building blocks used are neurons, or particular arrangements of neurons, typically organized as layers. Over the course of training a deep learning model, it is expected that <em>features</em> of the inputs are learned and manifested as various parameter values in these neurons. This is in contrast to traditional machine learning, where features are not learned, but implemented directly.</p>\n<h3 id=\"categorical-cross-entropy-loss\">Categorical Cross-Entropy Loss</h3>\n<p>More concretely, the task at hand is to train a model to recognize a 28 by 28 pixel handwritten greyscale digit. For simplicity, our model will interpret the data as a flattened 784-dimensional vector. 
Instead of describing the architecture of the model first, we’ll start with understanding what the model should output and how to assess the model’s performance. The output of our model will be a 10-dimensional vector, representing the probability distribution of the supplied input. That is, each element of the output vector indicates the model’s estimation of the probability that the digit’s value matches the corresponding element index. For example, if the model outputs:</p>\n<p><span class=\"math display\">M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]</span></p>\n<p>for some input image <span class=\"math inline\">\\mathbf{I}</span>, we interpret this to mean that the model believes there is an equal chance that the examined digit is a 2 or a 3.</p>\n<p>Next, we should consider how to quantify the model’s loss. Suppose, for example, that the image <span class=\"math inline\">\\mathbf{I}</span> actually corresponded to the digit “7” (our model made a horrible prediction!), how might we penalize the model? In this case, we know that the <em>actual</em> probability distribution is the following:</p>\n<p><span class=\"math display\">\\left[0, 0, 0, 0, 0, 0, 0, 1, 0, 0\\right]</span></p>\n<p>This is known as a “one-hot” encoded vector, but it may be helpful to think of it as a probability distribution given a set of events that are mutually exclusive (a digit cannot be both a “7” <em>and</em> a “3” for instance).</p>\n<p>Fortunately, information theory provides us some guidance on defining an easy-to-compute loss function which quantifies the dissimilarities between two probability distributions. If the probability of an event <span class=\"math inline\">E</span> is given as <span class=\"math inline\">P(E)</span>, then the <em>entropy</em> of this event is given as <span class=\"math inline\">-\\log P(E)</span>. The negation ensures that this is a positive quantity, and by inspection, the entropy increases as an event becomes less likely. 
Conversely, in the limit as <span class=\"math inline\">P(E)</span> approaches <span class=\"math inline\">1</span>, the entropy shrinks to <span class=\"math inline\">0</span>. While several interpretations of entropy are possible, the pertinent interpretation here is that entropy is a <em>measure of the information conveyed when a particular event occurs</em>. That the “sun rose this morning” is a fairly mundane observation but being told “the sun exploded” is sure to pique your attention. Because we are reasonably certain that the sun rises each morning (with near 100% confidence), that “the sun rises” is an event that conveys little additional information when it occurs.</p>\n<p>Next, let’s consider entropy in the context of a probability distribution. Given a discrete random variable <span class=\"math inline\">X</span> which can take on values <span class=\"math inline\">x_0, \\dots, x_{n-1}</span> with probabilities <span class=\"math inline\">p(x_0), \\dots, p(x_{n-1})</span>, the entropy of the random variable <span class=\"math inline\">X</span> is defined as:</p>\n<p><span class=\"math display\">H(X) = -\\sum_{x \\in X} p(x) \\log p(x)</span></p>\n<p>For example, suppose <span class=\"math inline\">W</span> is a binary random variable that represents today’s weather, which can either be “sunny” or “rainy”. The entropy <span class=\"math inline\">H(W)</span> can be given as:</p>\n<p><span class=\"math display\">H(W) = -S\\log S - (1 - S) \\log (1 - S)</span></p>\n<p>where <span class=\"math inline\">S</span> is the probability of a sunny day, and hence <span class=\"math inline\">1 - S</span> is the probability of a rainy day. As a binary random variable, the summation over weighted entropies expands to only two terms. What does this quantity mean? If we were to describe it in words, each term of the sum in the entropy calculation corresponds to the information of a particular event, weighted by the probability of the event. 
Thus, the entropy of the distribution is literally the <em>expected amount of information contained in an event</em> for a given distribution. If we plot <span class=\"math inline\">-S\\log S - (1 - S) \\log(1 - S)</span> as a function of <span class=\"math inline\">S</span>, we will see something like this:</p>\n<figure>\n<img src=\"plots\\6094492350593652429.png\" class=\"matplotlib\" />\n</figure>\n<p>As a minor note, while <span class=\"math inline\">\\log 0</span> is an undefined quantity, information theorists treat <span class=\"math inline\">0 \\log 0</span> as <span class=\"math inline\">0</span> by convention, which agrees with the limit <span class=\"math inline\">\\lim_{p\\rightarrow 0} p\\log p = 0</span>. Intuitively, the expected entropy should be unaffected by the set of impossible events.</p>\n<p>As you might expect, when the distribution is 50-50, the uncertainty of a binary random variable is maximal, and by extension the amount of information contained in each event is maximized too. Put another way, if you lived in an area where it was always sunny, you wouldn’t <em>learn anything</em> if someone told you it was sunny today. However, in a tropical region characterized by capricious weather, information conveyed about the weather is far more meaningful.</p>\n<p>In the previous example, we weighted the event entropies according to the event’s probability distribution. What would happen if, instead, we used weights corresponding to a <em>different</em> probability distribution? This is known as the <em>cross entropy</em>:</p>\n<p><span class=\"math display\">H(p, q) = -\\sum_{x \\in X} p(x)\\log q(x)</span></p>\n<p>To get some intuition about this, first, we note that if <span class=\"math inline\">p(x) = q(x), \\forall x\\in X</span>, the cross entropy trivially matches the self-entropy. Let’s go back to our binary entropy example and visualize what it looks like if we chose a completely <em>incorrect</em> distribution. 
Specifically, suppose we computed the cross entropy where if the probability of a sunny day is <span class=\"math inline\">S</span>, we weight the entropy with <span class=\"math inline\">1 - S</span> instead of <span class=\"math inline\">S</span> as in the self-entropy formula.</p>\n<figure>\n<img src=\"plots\\-6767785830879840565.png\" class=\"matplotlib\" />\n</figure>\n<p>If you compare the values with the previous figure, you’ll see that the cross entropy diverges from the self-entropy everywhere except <span class=\"math inline\">0.5</span>, where <span class=\"math inline\">S = 1 - S</span>. The difference between the cross entropy <span class=\"math inline\">H(p, q)</span> and entropy <span class=\"math inline\">H(p)</span> then provides a <em>measure of error</em> between the presumed distribution <span class=\"math inline\">q</span> and the true distribution <span class=\"math inline\">p</span>. This difference is also known as the <a href=\"https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence\">Kullback-Leibler divergence</a> or KL divergence for short.</p>\n<p>Because the true distribution <span class=\"math inline\">p</span> is fixed, its entropy <span class=\"math inline\">H(p)</span> is a constant that does not depend on the model. This is why in practice, we will generally seek to minimize the cross entropy between <span class=\"math inline\">p</span> and a predicted distribution <span class=\"math inline\">q</span>, which by extension will minimize the Kullback-Leibler divergence as well.</p>\n<p>Now, we have the tools to know if our model is succeeding or not! Given an estimation of a sample’s label as before:</p>\n<p><span class=\"math display\">M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]</span></p>\n<p>we will treat our model’s output as a predicted probability distribution of the sample digit’s classification from 0 to 9. 
Then, we compute the cross entropy between this prediction and the true distribution, which will be in the form of a one-hot vector. Supposing the actual digit is 3 in this particular case (<span class=\"math inline\">P(3) = 1</span>), and using base-10 logarithms:</p>\n<p><span class=\"math display\"> \\sum_{x\\in \\{0,\\dots, 9\\}} -P(x) \\log Q(x) = -P(3) \\log(Q(3)) = -\\log(0.5) \\approx 0.301 </span></p>\n<p>Let’s make a few observations before continuing. First, for a one-hot vector, the entropy is 0 (can you see why?). Second, by pretending the correct digit above is <span class=\"math inline\">3</span> and not, say, <span class=\"math inline\">7</span>, we conveniently avoided <span class=\"math inline\">\\log 0</span> showing up in the final expression. A common remedy for this singularity is to add a small <span class=\"math inline\">\\epsilon</span> to the log argument, but we’ll discuss this in more detail later.</p>\n<h3 id=\"creating-our-approximation-function-with-a-neural-network\">Creating our Approximation Function with a Neural Network</h3>\n<p>Now that we know how to evaluate our model, we’ll need to decide how to go about making predictions in the form of a probability distribution. Our model will need to take 28x28 images as inputs (which, as mentioned before, will be flattened to 784x1 vectors for simplicity). 
Let’s enumerate the properties our model will need:</p>\n<ol type=\"1\">\n<li>Parameterization - our model will need parameters we can adjust to “fit” the model to the data</li>\n<li>Nonlinearity - it is assuredly not the case that the probability distribution can be modeled with a set of linear equations</li>\n<li>Differentiability - the gradient of our model’s output with respect to any given parameter indicates the <em>impact</em> of that parameter on the final result</li>\n</ol>\n<p>There are an infinite number of functions that fit these criteria, but here, we’ll use a simple feedforward network with a single hidden layer.</p>\n<p><img src=\"C:\\Users\\jeremy\\Developer\\nn_in_a_weekend\\doc/219994768ec294ab99637a7747627aeebd998d41.svg\" /></p>\n<p>A few quick notes regarding notation: a superscript of the form <span class=\"math inline\">[i]</span> is used to denote the <span class=\"math inline\">i</span>th layer. A subscript is used to denote a particular element within a layer or vector. The vector <span class=\"math inline\">\\mathbf{x}</span> is usually reserved for training samples, and the vector <span class=\"math inline\">\\mathbf{y}</span> is typically reserved for sample labels (i.e. the desired “answer” for a given sample). The vector <span class=\"math inline\">\\hat{\\mathbf{y}}</span> is used to denote a model’s predicted labels for a given input.</p>\n<p>On the far left, we have the input layer with <span class=\"math inline\">784</span> nodes corresponding to each of the 28 by 28 pixels in an individual sample. Each <span class=\"math inline\">x_i^{[0]}</span> is a floating point value between 0 and 1 inclusive. Because the data is encoded with 8 bits of precision, there are 256 possible values for each input. 
Each of the 784 input values fans out to each of the nodes in the hidden layer without modification.</p>\n<p>In the center hidden layer, we have a variable number of nodes that each receive all 784 inputs, perform some processing, and fan out the result to the output nodes on the far right. That is, each node in the hidden layer transforms a <span class=\"math inline\">\\mathbb{R}^{784}</span> vector into a scalar output, so as a whole, the <span class=\"math inline\">n</span> nodes collectively need to map <span class=\"math inline\">\\mathbb{R}^{784} \\rightarrow \\mathbb{R}^n</span>. The simplest way to do this is with an <span class=\"math inline\">n\\times 784</span> matrix (treating inputs as column vectors). Modeling the hidden layer this way, each of the <span class=\"math inline\">n</span> nodes in the hidden layer is associated with a single row in our <span class=\"math inline\">\\mathbb{R}^{n\\times 784}</span> matrix. Each entry of this matrix is referred to as a <em>weight</em>.</p>\n<p>We still have two issues we need to address however. First, a matrix provides a linear mapping between two spaces, and linear maps take <span class=\"math inline\">0</span> to <span class=\"math inline\">0</span> (you can visualize such maps as planes through the origin). Thus, such fully-connected layers typically add a <em>bias</em> to each output node to turn the map into an affine map. This enables the model to respond to zeroes in the input. Thus, the hidden layer as a whole has now both a weight matrix, and also a bias vector. A linear mapping with a constant bias is commonly referred to as an <em>affine map</em>.</p>\n<p>The second issue is that our hidden layer’s now-affine mapping still scales linearly with the input, and one of our requirements for our approximation function was nonlinearity (a strict prerequisite for universality). Thus, we perform one final non-linear operation on the result of the affine map. 
This is known as the <em>activation function</em>, and an infinite number of choices present themselves here. In practice, the <em>rectifier function</em>, defined below, is a perennial choice.</p>\n<p><span class=\"math display\">f(x) = \\max(0, x)</span></p>\n<figure>\n<img src=\"plots\\-1637788021081228918.png\" class=\"matplotlib\" />\n</figure>\n<p>The rectifier is popular for having a number of desirable properties.</p>\n<ol type=\"1\">\n<li>Easy to compute</li>\n<li>Easy to differentiate (except at 0, which has not been found to be a problem in practice)</li>\n<li>Sparse activation, which aids in addressing model overfitting and “unlearning” useful weights</li>\n</ol>\n<p>As our hidden layer units will use this rectifier just before emitting their final output to the next layer, our hidden units may be called <em>rectified linear units</em> or ReLUs for short.</p>\n<p>Summarizing our hidden layer, the output of each unit in the layer can be written as:</p>\n<p><span class=\"math display\">a_i^{[1]} = \\max(0, W_{i}^{[1]} \\cdot \\mathbf{x}^{[0]} + b_i^{[1]})</span></p>\n<p>It’s common to refer to the final activated output of a neural network layer as the vector <span class=\"math inline\">\\mathbf{a}</span>, and the result of the internal affine map <span class=\"math inline\">\\mathbf{z}</span>. Using this notation and considering the output of the hidden layer as a whole as a vector quantity, we can write:</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\mathbf{z}^{[1]} &amp;= \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]} \\\\\n\\mathbf{a}^{[1]} &amp;= \\max(\\mathbf{0}, \\mathbf{z}^{[1]}) \\\\\n\\mathbf{a}^{[1]}, \\mathbf{b}^{[1]} &amp;\\in \\mathbb{R}^n \\\\\n\\mathbf{W}^{[1]} &amp;\\in \\mathbb{R}^{n\\times 784} \\\\\n\\mathbf{x}^{[0]} &amp;\\in \\mathbb{R}^{784}\n\\end{aligned}\n</span></p>\n<p>The last layer to consider is the output layer. 
As with the hidden layer, we need a dimensionality transform, in this case, taking vectors in <span class=\"math inline\">\\mathbb{R}^n</span> and mapping them to vectors in <span class=\"math inline\">\\mathbb{R}^{10}</span> (corresponding to the 10 possible digits in the target output). As before, we will use an affine map with the appropriately sized weight matrix and bias vector. Here, however, the rectifier isn’t suitable as an activation function because we want to emit a probability distribution. To be a valid probability distribution, each output of the output layer must be in the range <span class=\"math inline\">[0, 1]</span>, and the sum of all outputs must equal <span class=\"math inline\">1</span>. The most common activation function used to achieve this is the <em>softmax function</em>:</p>\n<p><span class=\"math display\">\\mathrm{softmax}(\\mathbf{z})_i = \\frac{\\exp(z_i)}{\\sum_j \\exp(z_j)}</span></p>\n<p>Given a vector input <span class=\"math inline\">\\mathbf{z}</span>, each component of the softmax output (as a vector quantity) is given as per the expression above. The exponential functions conveniently map negative numbers to positive numbers, and the denominator ensures that all outputs will be between 0 and 1, and sum to 1 as desired. There are other reasons why an exponential function is used here, stemming from our choice of a loss function (based on the underpinning notion of maximum-likelihood estimation), but we won’t get into that in too much detail here (consult the further reading section at the end to learn more). 
Suffice it to say that an additional benefit of the exponential function is its clean interaction with the logarithm used in our choice of loss function, especially when we need to compute gradients in the next section.</p>\n<p>Summarizing our neural network architecture, with two weight matrices and two bias vectors, we can construct two affine maps which map vectors in <span class=\"math inline\">\\mathbb{R}^{784}</span> to <span class=\"math inline\">\\mathbb{R}^n</span> to <span class=\"math inline\">\\mathbb{R}^{10}</span>. Prior to forwarding the results of one affine map as the input of the next, we employ an activation function to add non-linearity to the model. First, we use a linear rectifier and second, we use a softmax function, ensuring that we end up with a nice discrete probability distribution with 10 possible events corresponding to the 10 digits.</p>\n<p>Our network is small enough that we can actually write out the entire process as a single function using the notation we’ve built so far:</p>\n<p><span class=\"math display\">f(\\mathbf{x}^{[0]}) = \\mathbf{y}^{[2]} = \\mathrm{softmax}\\left(\\mathbf{W}^{[2]}\\left(\\max\\left(\\mathbf{0}, \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]}\\right) \\right) + \\mathbf{b}^{[2]} \\right)</span></p>\n<h3 id=\"optimizing-our-network\">Optimizing our network</h3>\n<p>We now have a model given above which can turn our 784-dimensional inputs into a 10-element probability distribution, <em>and</em> we have a way to evaluate the accuracy of each prediction. Next, we need a reliable way to improve the model based on the feedback provided by our loss function. This is known as function <em>optimization</em>, and most methods of model optimization are based on the principle of <em>gradient descent</em>.</p>\n<p>The idea is quite simple. 
Given a function with a set of parameters which we’ll denote <span class=\"math inline\">\\bm{\\theta}</span>, the partial derivative of that function with respect to a given parameter <span class=\"math inline\">\\theta_i \\in \\bm{\\theta}</span> tells us the overall <em>impact</em> of <span class=\"math inline\">\\theta_i</span> on the final result. In our model, we have many parameters; each weight and bias constitutes an individually tunable parameter. Thus, our strategy is as follows: given a set of input samples, compute the loss our model produces for each sample. Then, compute the partial derivatives of that loss with respect to <em>every parameter</em> in our model. Finally, adjust each parameter in proportion to its impact on the final loss. Mathematically, this process is described below (note that the superscript <span class=\"math inline\">(i)</span> is used to denote the <span class=\"math inline\">i</span>-th sample):</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\mathrm{Total~Loss} &amp;= \\sum_i J(\\mathbf{x}^{(i)}; \\bm\\theta) \\\\\n\\mathrm{Compute}~ &amp;\\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)})}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in \\bm\\theta \\\\\n\\mathrm{Adjust}~ &amp; \\theta_j \\rightarrow \\theta_j - \\eta \\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)})}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in\\bm\\theta\\\\\n\\end{aligned}\n</span></p>\n<p>Here, there is some flexibility in the choice of <span class=\"math inline\">\\eta</span>, often referred to as the <em>learning rate</em>. A small <span class=\"math inline\">\\eta</span> promotes more conservative and accurate steps, but at the cost of needing more updates (and therefore more computation) to train the model. A large <span class=\"math inline\">\\eta</span> on the other hand results in larger updates to our model per training cycle, but may result in instability. 
Updating in the above fashion should adjust the model such that it will produce a smaller loss given the same inputs.</p>\n<p>In practice, the size of the input set may be very large, rendering it intractable to evaluate the model on every single training sample in the sum above before adjusting parameters. Thus, a common strategy is to use <em>stochastic gradient descent</em> (abbreviated SGD) and perform loss-gradient-based adjustments after evaluating smaller batches of samples. Concretely, the MNIST handwritten digits database contains 60,000 training samples. If we were to train our model using gradient descent in the strictest sense, we would execute the following pseudocode:</p>\n<pre><code>model.init()\n\nfor i in num_training_cycles\n    loss &lt;- 0\n\n    for n in 60000\n        x &lt;- MNIST.data[n]\n        y &lt;- model.predict(x)\n        loss += loss(y, MNIST.labels[n])\n    \n    model.gradient_descent(loss)</code></pre>\n<p>In contrast, SGD pseudocode would look like:</p>\n<pre><code>model.init()\n\nfor i in num_batches\n    loss &lt;- 0\n    for j in batch_size\n        n &lt;- i * batch_size + j\n        x &lt;- MNIST.data[n]\n        y &lt;- model.predict(x)\n        loss += loss(y, MNIST.labels[n])\n    \n    model.gradient_descent(loss)</code></pre>\n<p>SGD is very similar, but the batch size can be much smaller than the amount of training data available. This enables the model to get more frequent updates and waste fewer cycles, especially at the start of training when the model is likely wildly inaccurate.</p>\n<p>When it comes time to compute the gradients, we are fortunate to have made the prescient choice of constructing our model solely with elementary functions in a manner conducive to relatively painless differentiation. However, we still must exercise care as there is plenty of bookkeeping involved. 
We will evaluate loss-gradients with respect to individual parameters when we walk through the implementation later, but for now, let’s establish a few preliminary results.</p>\n<p>Recall that our choice of loss function was the categorical cross-entropy function, reproduced below:</p>\n<p><span class=\"math display\">J_{CE}(\\mathbf{\\hat{y}}, \\mathbf{y}) = -\\sum_{i} y_i \\log{\\hat{y}_i}</span></p>\n<p>The index <span class=\"math inline\">i</span> is enumerated over the set of possible outcomes (i.e. the set of digits from 0 to 9). The quantities <span class=\"math inline\">y_i</span> are the elements of the one-hot label corresponding to the correct outcome, and <span class=\"math inline\">\\hat{\\mathbf{y}}</span> is the discrete probability distribution emitted by our model. We compute <span class=\"math inline\">\\partial J_{CE}/\\partial \\hat{y}_i</span> like so:</p>\n<p><span class=\"math display\">\\frac{\\partial J_{CE}}{\\partial \\hat{y}_i} = -\\frac{y_i}{\\hat{y}_i}</span></p>\n<p>Notice that for a one-hot vector, this partial derivative vanishes whenever <span class=\"math inline\">i</span> corresponds to an incorrect outcome.</p>\n<p>Working backwards in our model, we next provide the partial derivative of the softmax function:</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\mathrm{softmax}(\\mathbf{z})_i &amp;= \\frac{\\exp{z_i}}{\\sum_j \\exp{z_j}} \\\\\n\\frac{\\partial \\left(\\mathrm{softmax}(\\mathbf{z})_i\\right)}{\\partial z_k} &amp;=\n\\begin{dcases}\n\\frac{\\left(\\sum_j\\exp{z_j}\\right)\\exp{z_i} - \\exp{2z_i}}{\\left(\\sum_j\\exp{z_j}\\right)^2}&amp; i = k \\\\\n\\frac{-\\exp{z_i}\\exp{z_k}}{\\left(\\sum_j\\exp{z_j}\\right)^2}&amp; i \\neq k\n\\end{dcases} \\\\\n&amp;= \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) &amp; i = k \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_k &amp; i \\neq k\n\\end{cases}\n\\end{aligned}\n</span></p>\n<p>The last 
set of equations follows from factorizing and rearranging the expressions preceding it. It’s often confusing to newer practitioners that the partial derivative of softmax needs this unique treatment. The key observation is that softmax is a vector-function. It accepts a vector as an input and emits a vector as an output. It also “mixes” the input components, thereby imposing a functional dependence of <em>every output component</em> on <em>every input component</em>. The lone <span class=\"math inline\">\\exp{z_i}</span> in the numerator of the softmax equation creates an asymmetric dependence of the output component on the input components.</p>\n<p>Finally, let’s consider the partial derivative of the linear rectifier.</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\mathrm{ReLU}(z) &amp;= \\max(0, z) \\\\\n\\frac{\\partial \\mathrm{ReLU}(z)}{\\partial z} &amp;=\n\\begin{cases}\n0 &amp; z &lt; 0 \\\\\n\\mathrm{undefined} &amp; z = 0 \\\\\n1 &amp; z &gt; 0\n\\end{cases}\n\\end{aligned}\n</span></p>\n<p>While the partial derivative <em>exactly</em> at 0 is undefined, in practice, the derivative is simply assigned to 0. Why the non-differentiability at 0 isn’t an issue has been a subject of practical debate for a long time. Here is a simple line of thinking to justify ignoring the apparent issue. Consider a rectifier function that is nudged <em>ever so slightly</em> to the right such that the kink sits at <span class=\"math inline\">\\epsilon / 2</span>, where <span class=\"math inline\">\\epsilon</span> is the smallest positive floating point number the machine can represent. In this case, the model will never produce a value that sits directly on this kink, and as far as the computer is concerned, we never encounter a point where this function is non-differentiable. We can even imagine an infinitesimal curve that smooths out the function at that kink if we want. 
Either way, experimentally, the linear rectifier remains one of the most effective activation functions for reasons mentioned, so we have no reason to discredit it over a technicality.</p>\n<p>Now that we can compute partial derivatives of all the nonlinear functions in our neural network (and presumably the linear functions as well), we are prepared to compute loss gradients with respect to any parameter in the network. Our tool of choice is the venerable chain rule of calculus:</p>\n<p><span class=\"math display\">\\left.\\frac{\\partial f(g(x))}{\\partial x}\\right\\rvert_x = \\left.\\frac{\\partial f}{\\partial g}\\right\\rvert_{g(x)} \\left.\\frac{\\partial g}{\\partial x}\\right\\rvert_x</span></p>\n<p>This gives us the partial derivative of a composite function <span class=\"math inline\">f\\circ g</span> evaluated at a particular value of <span class=\"math inline\">x</span>. Our model itself is a series of composite functions, and as we can now compute the partials of each individual component in the model, we are ready to begin implementation in the next section.</p>\n<h2 id=\"setting-up\">Setting up</h2>\n<p>Our project will leverage <a href=\"https://cmake.org\">CMake</a> as the meta-build system to support as many operating systems and compilers as possible. A modern C++ compiler will also be needed to compile the code. As of this writing, the code has been tested with GCC 10.1.0 and Clang 10.0.0. You should feel free to simply adapt the code to your compiler and build system of choice. To emphasize the independent nature of this project, <em>no further dependencies are needed</em>. At your discretion, you may opt to use external testing frameworks, matrix and math libraries, data structures, or any other external dependency as you see fit. 
If you’re a newer C++ practitioner, you are welcome to model the structure of the final project hosted on GitHub <a href=\"https://github.com/jeremyong/nn_in_a_weekend\">here</a>.</p>\n<p>In addition, you will need the data hosted on the MNIST database website linked <a href=\"http://yann.lecun.com/exdb/mnist/\">here</a>. The four files available there consist of training images, training labels, test images, and test labels.</p>\n<p>It is highly recommended that you attempt to clone the repository and get things running (instructions in the README will always be kept up to date). The code presented in this article will not be completely exhaustive, but will touch on all the major points, omitting only rudimentary helper functions and uninteresting details for brevity. Alternatively, a valid approach may be to simply follow along the implementation notes below and attempt to blaze your own trail. Both branches are viable approaches for learning.</p>\n<h2 id=\"implementation\">Implementation</h2>\n<h3 id=\"the-computational-graph\">The Computational Graph</h3>\n<p>The network we will be constructing is purely sequential. Inputs flow from left to right and the only connections made are between one layer and the layer immediately succeeding it. In reality, many production-grade neural networks specialized for computer vision, natural language processing, and other domains rely on architectures that are non-sequential. Examples include ResNet, which introduces connections between layers that are not adjacent, and various recurrent neural networks, which have a cyclic topology (outputs of the model are fed back as inputs to the model). Thus, it’s useful to think of the model as a whole as a <em>computational graph</em>. While we won’t be employing any complicated computational graph topologies here, we will still structure the code with this notion in mind. 
Each layer of our network will be modeled as a <code>Node</code> with data flowing forwards and backwards through the node during training. Providing support for a fully general computational graph (i.e. non-sequential) is outside the scope of this tutorial, but some scaffolding will be provided should you want to extend it yourself in the future. For now, here is the interface we’ll use:</p>\n<div class=\"sourceCode\" id=\"cb3\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb3-1\"><a href=\"#cb3-1\" aria-hidden=\"true\"></a><span class=\"pp\">#include </span><span class=\"im\">&lt;cstdint&gt;</span></span>\n<span id=\"cb3-2\"><a href=\"#cb3-2\" aria-hidden=\"true\"></a><span class=\"pp\">#include </span><span class=\"im\">&lt;string&gt;</span></span>\n<span id=\"cb3-3\"><a href=\"#cb3-3\" aria-hidden=\"true\"></a><span class=\"pp\">#include </span><span class=\"im\">&lt;vector&gt;</span></span>\n<span id=\"cb3-4\"><a href=\"#cb3-4\" aria-hidden=\"true\"></a></span>\n<span id=\"cb3-5\"><a href=\"#cb3-5\" aria-hidden=\"true\"></a><span class=\"kw\">using</span> <span class=\"dt\">num_t</span> = <span class=\"dt\">float</span>;</span>\n<span id=\"cb3-6\"><a href=\"#cb3-6\" aria-hidden=\"true\"></a><span class=\"kw\">using</span> <span class=\"dt\">rne_t</span> = <span class=\"bu\">std::</span>mt19937;</span>\n<span id=\"cb3-7\"><a href=\"#cb3-7\" aria-hidden=\"true\"></a></span>\n<span id=\"cb3-8\"><a href=\"#cb3-8\" aria-hidden=\"true\"></a><span class=\"co\">// To be defined later. 
This class encapsulates all the nodes in our graph </span></span>\n<span id=\"cb3-9\"><a href=\"#cb3-9\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> Model;</span>\n<span id=\"cb3-10\"><a href=\"#cb3-10\" aria-hidden=\"true\"></a></span>\n<span id=\"cb3-11\"><a href=\"#cb3-11\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> Node</span>\n<span id=\"cb3-12\"><a href=\"#cb3-12\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb3-13\"><a href=\"#cb3-13\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb3-14\"><a href=\"#cb3-14\" aria-hidden=\"true\"></a>    Node(Model&amp; model, <span class=\"bu\">std::</span>string name);</span>\n<span id=\"cb3-15\"><a href=\"#cb3-15\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-16\"><a href=\"#cb3-16\" aria-hidden=\"true\"></a>    <span class=\"co\">// Nodes must describe how they should be initialized</span></span>\n<span id=\"cb3-17\"><a href=\"#cb3-17\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">void</span> init(<span class=\"dt\">rne_t</span>&amp; rne) = <span class=\"dv\">0</span>;</span>\n<span id=\"cb3-18\"><a href=\"#cb3-18\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-19\"><a href=\"#cb3-19\" aria-hidden=\"true\"></a>    <span class=\"co\">// During forward propagation, nodes transform input data and feed results</span></span>\n<span id=\"cb3-20\"><a href=\"#cb3-20\" aria-hidden=\"true\"></a>    <span class=\"co\">// to all subsequent nodes</span></span>\n<span id=\"cb3-21\"><a href=\"#cb3-21\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">void</span> forward(<span class=\"dt\">num_t</span>* inputs) = <span class=\"dv\">0</span>;</span>\n<span id=\"cb3-22\"><a href=\"#cb3-22\" aria-hidden=\"true\"></a></span>\n<span id=\"cb3-23\"><a href=\"#cb3-23\" aria-hidden=\"true\"></a>    <span class=\"co\">// During reverse propagation, nodes receive loss gradients to its 
previous</span></span>\n<span id=\"cb3-24\"><a href=\"#cb3-24\" aria-hidden=\"true\"></a>    <span class=\"co\">// outputs and compute gradients with respect to each tunable parameter</span></span>\n<span id=\"cb3-25\"><a href=\"#cb3-25\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">void</span> reverse(<span class=\"dt\">num_t</span>* gradients) = <span class=\"dv\">0</span>;</span>\n<span id=\"cb3-26\"><a href=\"#cb3-26\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-27\"><a href=\"#cb3-27\" aria-hidden=\"true\"></a>    <span class=\"co\">// If the node has tunable parameters, this method should be overridden</span></span>\n<span id=\"cb3-28\"><a href=\"#cb3-28\" aria-hidden=\"true\"></a>    <span class=\"co\">// to reflect the quantity of tunable parameters</span></span>\n<span id=\"cb3-29\"><a href=\"#cb3-29\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">size_t</span> param_count() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span> { <span class=\"cf\">return</span> <span class=\"dv\">0</span>; }</span>\n<span id=\"cb3-30\"><a href=\"#cb3-30\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-31\"><a href=\"#cb3-31\" aria-hidden=\"true\"></a>    <span class=\"co\">// Accessor for parameter by index</span></span>\n<span id=\"cb3-32\"><a href=\"#cb3-32\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">num_t</span>* param(<span class=\"dt\">size_t</span> index) { <span class=\"cf\">return</span> <span class=\"kw\">nullptr</span>; }</span>\n<span id=\"cb3-33\"><a href=\"#cb3-33\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-34\"><a href=\"#cb3-34\" aria-hidden=\"true\"></a>    <span class=\"co\">// Access for loss-gradient with respect to a parameter specified by index</span></span>\n<span id=\"cb3-35\"><a href=\"#cb3-35\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">num_t</span>* 
gradient(<span class=\"dt\">size_t</span> index) { <span class=\"cf\">return</span> <span class=\"kw\">nullptr</span>; }</span>\n<span id=\"cb3-36\"><a href=\"#cb3-36\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-37\"><a href=\"#cb3-37\" aria-hidden=\"true\"></a>    <span class=\"co\">// Human-readable name for debugging purposes</span></span>\n<span id=\"cb3-38\"><a href=\"#cb3-38\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>string <span class=\"at\">const</span>&amp; name() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span> { <span class=\"cf\">return</span> <span class=\"va\">name_</span>; }</span>\n<span id=\"cb3-39\"><a href=\"#cb3-39\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-40\"><a href=\"#cb3-40\" aria-hidden=\"true\"></a>    <span class=\"co\">// Information dump for debugging purposes</span></span>\n<span id=\"cb3-41\"><a href=\"#cb3-41\" aria-hidden=\"true\"></a>    <span class=\"kw\">virtual</span> <span class=\"dt\">void</span> print() <span class=\"at\">const</span> = <span class=\"dv\">0</span>;</span>\n<span id=\"cb3-42\"><a href=\"#cb3-42\" aria-hidden=\"true\"></a></span>\n<span id=\"cb3-43\"><a href=\"#cb3-43\" aria-hidden=\"true\"></a><span class=\"kw\">protected</span>:</span>\n<span id=\"cb3-44\"><a href=\"#cb3-44\" aria-hidden=\"true\"></a>    <span class=\"kw\">friend</span> <span class=\"kw\">class</span> Model;</span>\n<span id=\"cb3-45\"><a href=\"#cb3-45\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb3-46\"><a href=\"#cb3-46\" aria-hidden=\"true\"></a>    Model&amp; <span class=\"va\">model_</span>;</span>\n<span id=\"cb3-47\"><a href=\"#cb3-47\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>string <span class=\"va\">name_</span>;</span>\n<span id=\"cb3-48\"><a href=\"#cb3-48\" aria-hidden=\"true\"></a>    <span class=\"co\">// Nodes that precede this node in the computational graph</span></span>\n<span id=\"cb3-49\"><a href=\"#cb3-49\" aria-hidden=\"true\"></a>   
 <span class=\"bu\">std::</span>vector&lt;Node*&gt; <span class=\"va\">antecedents_</span>;</span>\n<span id=\"cb3-50\"><a href=\"#cb3-50\" aria-hidden=\"true\"></a>    <span class=\"co\">// Nodes that succeed this node in the computational graph</span></span>\n<span id=\"cb3-51\"><a href=\"#cb3-51\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;Node*&gt; <span class=\"va\">subsequents_</span>;</span>\n<span id=\"cb3-52\"><a href=\"#cb3-52\" aria-hidden=\"true\"></a>};</span></code></pre></div>\n<p>The bulk of the implementation will consist of implementing this interface for each of the nodes shown in the diagram below.</p>\n<p><img src=\"C:\\Users\\jeremy\\Developer\\nn_in_a_weekend\\doc/cb34710dce878a77b3fd5f3e7e4746403aaaefcc.svg\" /></p>\n<p>The first node (<code>MNIST</code>) will be responsible for acquiring new training samples and feeding them to the next layer for processing. In addition, it will provide an accessor that the final categorical cross-entropy loss node will use to query the correct label for that sample (the “label query”). The hidden node will perform the affine transform and apply the linear rectification activation. The output node will also perform an affine transform, but will then apply the softmax function. Finally, the loss node will compute the loss of the predicted distribution based on the queried label for a given sample.</p>\n<p>In the figure above, solid arrows from left to right indicate data flow during the <em>feedforward</em> or <em>evaluation</em> portion of the model’s execution. Each solid arrow corresponds to a data vector emitted by the source, and ingested by the destination. The dashed arrows from right to left indicate data flow during the <em>backpropagation</em> or <em>reverse accumulation</em> portion of the algorithm. 
These arrows correspond to gradient vectors of the evaluated loss with respect to the outputs passed during the feedforward phase. For example, as seen above, the hidden node is expected to forward data to the output node (<span class=\"math inline\">\\mathbf{a}^{[1]}</span>). Later, after the model prediction has been computed and the loss evaluated, the gradient of the loss with respect to those outputs is expected (<span class=\"math inline\">\\partial J_{CE}/\\partial a^{[1]}_i</span> for each <span class=\"math inline\">a_i^{[1]}</span> in <span class=\"math inline\">\\mathbf{a}^{[1]}</span>).</p>\n<p>When simply evaluating the model (without training), the final loss node will simply be omitted from the graph. In addition, no back-propagation of gradients will occur as the model parameters are ossified during evaluation.</p>\n<p>The model class interface shown below will be used to house all the nodes in the computational graph, and provide various routines that are useful for operating over all constituent nodes as a collection.</p>\n<div class=\"sourceCode\" id=\"cb4\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb4-1\"><a href=\"#cb4-1\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> Model</span>\n<span id=\"cb4-2\"><a href=\"#cb4-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb4-3\"><a href=\"#cb4-3\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb4-4\"><a href=\"#cb4-4\" aria-hidden=\"true\"></a>    Model(<span class=\"bu\">std::</span>string name);</span>\n<span id=\"cb4-5\"><a href=\"#cb4-5\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb4-6\"><a href=\"#cb4-6\" aria-hidden=\"true\"></a>    <span class=\"co\">// Add a node to the model, forwarding arguments to the node&#39;s constructor</span></span>\n<span id=\"cb4-7\"><a href=\"#cb4-7\" aria-hidden=\"true\"></a>    <span class=\"kw\">template</span> &lt;<span class=\"kw\">typename</span> <span class=\"dt\">Node_t</span>, 
<span class=\"kw\">typename</span>... T&gt;</span>\n<span id=\"cb4-8\"><a href=\"#cb4-8\" aria-hidden=\"true\"></a>    <span class=\"dt\">Node_t</span>&amp; add_node(T&amp;&amp;... args)</span>\n<span id=\"cb4-9\"><a href=\"#cb4-9\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb4-10\"><a href=\"#cb4-10\" aria-hidden=\"true\"></a>        <span class=\"va\">nodes_</span>.emplace_back(</span>\n<span id=\"cb4-11\"><a href=\"#cb4-11\" aria-hidden=\"true\"></a>            <span class=\"bu\">std::</span>make_unique&lt;<span class=\"dt\">Node_t</span>&gt;(*<span class=\"kw\">this</span>, <span class=\"bu\">std::</span>forward&lt;T&gt;(args)...));</span>\n<span id=\"cb4-12\"><a href=\"#cb4-12\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">Node_t</span>&amp;&gt;(*<span class=\"va\">nodes_</span>.back());</span>\n<span id=\"cb4-13\"><a href=\"#cb4-13\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb4-14\"><a href=\"#cb4-14\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-15\"><a href=\"#cb4-15\" aria-hidden=\"true\"></a>    <span class=\"co\">// Create a dependency between two constituent nodes</span></span>\n<span id=\"cb4-16\"><a href=\"#cb4-16\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> create_edge(Node&amp; dst, Node&amp; src);</span>\n<span id=\"cb4-17\"><a href=\"#cb4-17\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-18\"><a href=\"#cb4-18\" aria-hidden=\"true\"></a>    <span class=\"co\">// Initialize the parameters of all nodes with the provided seed. If the</span></span>\n<span id=\"cb4-19\"><a href=\"#cb4-19\" aria-hidden=\"true\"></a>    <span class=\"co\">// seed is 0, a new random seed is chosen instead. 
Returns the seed used.</span></span>\n<span id=\"cb4-20\"><a href=\"#cb4-20\" aria-hidden=\"true\"></a>    <span class=\"dt\">rne_t</span>::<span class=\"dt\">result_type</span> init(<span class=\"dt\">rne_t</span>::<span class=\"dt\">result_type</span> seed = <span class=\"dv\">0</span>);</span>\n<span id=\"cb4-21\"><a href=\"#cb4-21\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-22\"><a href=\"#cb4-22\" aria-hidden=\"true\"></a>    <span class=\"co\">// Adjust all model parameters of constituent nodes using the</span></span>\n<span id=\"cb4-23\"><a href=\"#cb4-23\" aria-hidden=\"true\"></a>    <span class=\"co\">// provided optimizer (shown later)</span></span>\n<span id=\"cb4-24\"><a href=\"#cb4-24\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> train(Optimizer&amp; optimizer);</span>\n<span id=\"cb4-25\"><a href=\"#cb4-25\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-26\"><a href=\"#cb4-26\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>string <span class=\"at\">const</span>&amp; name() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span></span>\n<span id=\"cb4-27\"><a href=\"#cb4-27\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb4-28\"><a href=\"#cb4-28\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> <span class=\"va\">name_</span>;</span>\n<span id=\"cb4-29\"><a href=\"#cb4-29\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb4-30\"><a href=\"#cb4-30\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-31\"><a href=\"#cb4-31\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> print() <span class=\"at\">const</span>;</span>\n<span id=\"cb4-32\"><a href=\"#cb4-32\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-33\"><a href=\"#cb4-33\" aria-hidden=\"true\"></a>    <span class=\"co\">// Routines for saving and loading model parameters to and from disk</span></span>\n<span id=\"cb4-34\"><a href=\"#cb4-34\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> 
save(<span class=\"bu\">std::</span>ofstream&amp; out);</span>\n<span id=\"cb4-35\"><a href=\"#cb4-35\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> load(<span class=\"bu\">std::</span>ifstream&amp; in);</span>\n<span id=\"cb4-36\"><a href=\"#cb4-36\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-37\"><a href=\"#cb4-37\" aria-hidden=\"true\"></a><span class=\"kw\">private</span>:</span>\n<span id=\"cb4-38\"><a href=\"#cb4-38\" aria-hidden=\"true\"></a>    <span class=\"kw\">friend</span> <span class=\"kw\">class</span> Node;</span>\n<span id=\"cb4-39\"><a href=\"#cb4-39\" aria-hidden=\"true\"></a></span>\n<span id=\"cb4-40\"><a href=\"#cb4-40\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>string <span class=\"va\">name_</span>;</span>\n<span id=\"cb4-41\"><a href=\"#cb4-41\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"bu\">std::</span>unique_ptr&lt;Node&gt;&gt; <span class=\"va\">nodes_</span>;</span>\n<span id=\"cb4-42\"><a href=\"#cb4-42\" aria-hidden=\"true\"></a>};</span></code></pre></div>\n<h3 id=\"training-data-and-labels\">Training Data and Labels</h3>\n<p>All machine learning pipelines must consider how to ingest data and labels. Data refers to the information the model is expected to use to make inferences and predictions. Labels correspond to the “correct answer” for each data sample, used to compute losses and train the model. 
The interface of the MNIST data parser is shown below, implemented as a <code>Node</code> subclass.</p>\n<div class=\"sourceCode\" id=\"cb5\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb5-1\"><a href=\"#cb5-1\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> MNIST : <span class=\"kw\">public</span> Node</span>\n<span id=\"cb5-2\"><a href=\"#cb5-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb5-3\"><a href=\"#cb5-3\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb5-4\"><a href=\"#cb5-4\" aria-hidden=\"true\"></a>    <span class=\"kw\">constexpr</span> <span class=\"at\">static</span> <span class=\"dt\">size_t</span> DIM = <span class=\"dv\">28</span> * <span class=\"dv\">28</span>;</span>\n<span id=\"cb5-5\"><a href=\"#cb5-5\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-6\"><a href=\"#cb5-6\" aria-hidden=\"true\"></a>    <span class=\"co\">// The constructor receives an input filestream corresponding to the</span></span>\n<span id=\"cb5-7\"><a href=\"#cb5-7\" aria-hidden=\"true\"></a>    <span class=\"co\">// data samples and labels</span></span>\n<span id=\"cb5-8\"><a href=\"#cb5-8\" aria-hidden=\"true\"></a>    MNIST(Model&amp; model, <span class=\"bu\">std::</span>ifstream&amp; images, <span class=\"bu\">std::</span>ifstream&amp; labels);</span>\n<span id=\"cb5-9\"><a href=\"#cb5-9\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-10\"><a href=\"#cb5-10\" aria-hidden=\"true\"></a>    <span class=\"co\">// This is an input node and has no parameters to initialize</span></span>\n<span id=\"cb5-11\"><a href=\"#cb5-11\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> init(<span class=\"dt\">rne_t</span>&amp;) <span class=\"kw\">override</span> {}</span>\n<span id=\"cb5-12\"><a href=\"#cb5-12\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-13\"><a href=\"#cb5-13\" aria-hidden=\"true\"></a>    <span class=\"co\">// Read the next sample and label and forward the 
data</span></span>\n<span id=\"cb5-14\"><a href=\"#cb5-14\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> forward(<span class=\"dt\">num_t</span>* data = <span class=\"kw\">nullptr</span>) <span class=\"kw\">override</span>;</span>\n<span id=\"cb5-15\"><a href=\"#cb5-15\" aria-hidden=\"true\"></a></span>\n<span id=\"cb5-16\"><a href=\"#cb5-16\" aria-hidden=\"true\"></a>    <span class=\"co\">// No optimization is done in this node so this is a no-op</span></span>\n<span id=\"cb5-17\"><a href=\"#cb5-17\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> reverse(<span class=\"dt\">num_t</span>* gradients = <span class=\"kw\">nullptr</span>) <span class=\"kw\">override</span> {}</span>\n<span id=\"cb5-18\"><a href=\"#cb5-18\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-19\"><a href=\"#cb5-19\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> print() <span class=\"at\">const</span> <span class=\"kw\">override</span>;</span>\n<span id=\"cb5-20\"><a href=\"#cb5-20\" aria-hidden=\"true\"></a></span>\n<span id=\"cb5-21\"><a href=\"#cb5-21\" aria-hidden=\"true\"></a>    <span class=\"co\">// Consume the next sample and label from the file streams</span></span>\n<span id=\"cb5-22\"><a href=\"#cb5-22\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> read_next();</span>\n<span id=\"cb5-23\"><a href=\"#cb5-23\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-24\"><a href=\"#cb5-24\" aria-hidden=\"true\"></a>    <span class=\"co\">// Accessor for the most recently read sample</span></span>\n<span id=\"cb5-25\"><a href=\"#cb5-25\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"at\">const</span>* data() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span></span>\n<span id=\"cb5-26\"><a href=\"#cb5-26\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb5-27\"><a href=\"#cb5-27\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> <span 
class=\"va\">data_</span>;</span>\n<span id=\"cb5-28\"><a href=\"#cb5-28\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb5-29\"><a href=\"#cb5-29\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-30\"><a href=\"#cb5-30\" aria-hidden=\"true\"></a>    <span class=\"co\">// Accessor for the most recently read label (one-hot encoded)</span></span>\n<span id=\"cb5-31\"><a href=\"#cb5-31\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"at\">const</span>* label() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span></span>\n<span id=\"cb5-32\"><a href=\"#cb5-32\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb5-33\"><a href=\"#cb5-33\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> <span class=\"va\">label_</span>;</span>\n<span id=\"cb5-34\"><a href=\"#cb5-34\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb5-35\"><a href=\"#cb5-35\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-36\"><a href=\"#cb5-36\" aria-hidden=\"true\"></a>    <span class=\"co\">// Quick ASCII visualization of the last digit read</span></span>\n<span id=\"cb5-37\"><a href=\"#cb5-37\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> print_last();</span>\n<span id=\"cb5-38\"><a href=\"#cb5-38\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb5-39\"><a href=\"#cb5-39\" aria-hidden=\"true\"></a><span class=\"kw\">private</span>:</span>\n<span id=\"cb5-40\"><a href=\"#cb5-40\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>ifstream&amp; <span class=\"va\">images_</span>;</span>\n<span id=\"cb5-41\"><a href=\"#cb5-41\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>ifstream&amp; <span class=\"va\">labels_</span>;</span>\n<span id=\"cb5-42\"><a href=\"#cb5-42\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> <span class=\"va\">image_count_</span>;</span>\n<span id=\"cb5-43\"><a href=\"#cb5-43\" aria-hidden=\"true\"></a></span>\n<span id=\"cb5-44\"><a href=\"#cb5-44\" aria-hidden=\"true\"></a>    <span 
class=\"dt\">char</span> <span class=\"va\">buf_</span>[DIM];</span>\n<span id=\"cb5-45\"><a href=\"#cb5-45\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">data_</span>[DIM];</span>\n<span id=\"cb5-46\"><a href=\"#cb5-46\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">label_</span>[<span class=\"dv\">10</span>];</span>\n<span id=\"cb5-47\"><a href=\"#cb5-47\" aria-hidden=\"true\"></a>};</span></code></pre></div>\n<p>In the constructor, we must verify that the files passed as arguments are valid MNIST data and label files. Both files start with distinct “magic values” as a quick sanity check. The sample file starts with 2051 encoded as a 4-byte big-endian unsigned integer, whereas the label file starts with 2049. For the data file, the magic number is followed by the image count and image dimensions. The label file magic number is followed by the label count (expected to match the image count).</p>\n<p>To consume big-endian unsigned integers from the file stream, we’ll use a simple routine:</p>\n<div class=\"sourceCode\" id=\"cb6\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb6-1\"><a href=\"#cb6-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> read_be(<span class=\"bu\">std::</span>ifstream&amp; in, <span class=\"dt\">uint32_t</span>* out)</span>\n<span id=\"cb6-2\"><a href=\"#cb6-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb6-3\"><a href=\"#cb6-3\" aria-hidden=\"true\"></a>    <span class=\"dt\">char</span>* buf = <span class=\"kw\">reinterpret_cast</span>&lt;<span class=\"dt\">char</span>*&gt;(out);</span>\n<span id=\"cb6-4\"><a href=\"#cb6-4\" aria-hidden=\"true\"></a>    in.read(buf, <span class=\"dv\">4</span>);</span>\n<span id=\"cb6-5\"><a href=\"#cb6-5\" aria-hidden=\"true\"></a></span>\n<span id=\"cb6-6\"><a href=\"#cb6-6\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>swap(buf[<span class=\"dv\">0</span>], buf[<span 
class=\"dv\">3</span>]);</span>\n<span id=\"cb6-7\"><a href=\"#cb6-7\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>swap(buf[<span class=\"dv\">1</span>], buf[<span class=\"dv\">2</span>]);</span>\n<span id=\"cb6-8\"><a href=\"#cb6-8\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>If you happen to be using a big-endian processor, you will not need to perform the byte swaps, but most desktop and mobile architectures are little-endian.</p>\n<p>The constructor implementation, which parses the magic numbers and the other header fields, is shown below:</p>\n<div class=\"sourceCode\" id=\"cb7\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb7-1\"><a href=\"#cb7-1\" aria-hidden=\"true\"></a>MNIST::MNIST(Model&amp; model, <span class=\"bu\">std::</span>ifstream&amp; images, <span class=\"bu\">std::</span>ifstream&amp; labels)</span>\n<span id=\"cb7-2\"><a href=\"#cb7-2\" aria-hidden=\"true\"></a>    : Node{model, <span class=\"st\">&quot;MNIST input&quot;</span>}</span>\n<span id=\"cb7-3\"><a href=\"#cb7-3\" aria-hidden=\"true\"></a>    , <span class=\"va\">images_</span>{images}</span>\n<span id=\"cb7-4\"><a href=\"#cb7-4\" aria-hidden=\"true\"></a>    , <span class=\"va\">labels_</span>{labels}</span>\n<span id=\"cb7-5\"><a href=\"#cb7-5\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb7-6\"><a href=\"#cb7-6\" aria-hidden=\"true\"></a>    <span class=\"co\">// Confirm that passed input file streams are well-formed MNIST data sets</span></span>\n<span id=\"cb7-7\"><a href=\"#cb7-7\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> image_magic;</span>\n<span id=\"cb7-8\"><a href=\"#cb7-8\" aria-hidden=\"true\"></a>    read_be(images, &amp;image_magic);</span>\n<span id=\"cb7-9\"><a href=\"#cb7-9\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (image_magic != <span class=\"dv\">2051</span>)</span>\n<span id=\"cb7-10\"><a href=\"#cb7-10\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb7-11\"><a 
href=\"#cb7-11\" aria-hidden=\"true\"></a>        <span class=\"cf\">throw</span> <span class=\"bu\">std::</span>runtime_error{<span class=\"st\">&quot;Images file appears to be malformed&quot;</span>};</span>\n<span id=\"cb7-12\"><a href=\"#cb7-12\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb7-13\"><a href=\"#cb7-13\" aria-hidden=\"true\"></a>    read_be(images, &amp;<span class=\"va\">image_count_</span>);</span>\n<span id=\"cb7-14\"><a href=\"#cb7-14\" aria-hidden=\"true\"></a></span>\n<span id=\"cb7-15\"><a href=\"#cb7-15\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> labels_magic;</span>\n<span id=\"cb7-16\"><a href=\"#cb7-16\" aria-hidden=\"true\"></a>    read_be(labels, &amp;labels_magic);</span>\n<span id=\"cb7-17\"><a href=\"#cb7-17\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (labels_magic != <span class=\"dv\">2049</span>)</span>\n<span id=\"cb7-18\"><a href=\"#cb7-18\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb7-19\"><a href=\"#cb7-19\" aria-hidden=\"true\"></a>        <span class=\"cf\">throw</span> <span class=\"bu\">std::</span>runtime_error{<span class=\"st\">&quot;Labels file appears to be malformed&quot;</span>};</span>\n<span id=\"cb7-20\"><a href=\"#cb7-20\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb7-21\"><a href=\"#cb7-21\" aria-hidden=\"true\"></a></span>\n<span id=\"cb7-22\"><a href=\"#cb7-22\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> label_count;</span>\n<span id=\"cb7-23\"><a href=\"#cb7-23\" aria-hidden=\"true\"></a>    read_be(labels, &amp;label_count);</span>\n<span id=\"cb7-24\"><a href=\"#cb7-24\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (label_count != <span class=\"va\">image_count_</span>)</span>\n<span id=\"cb7-25\"><a href=\"#cb7-25\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb7-26\"><a href=\"#cb7-26\" aria-hidden=\"true\"></a>        <span class=\"cf\">throw</span> <span 
class=\"bu\">std::</span>runtime_error(</span>\n<span id=\"cb7-27\"><a href=\"#cb7-27\" aria-hidden=\"true\"></a>            <span class=\"st\">&quot;Label count did not match the number of images supplied&quot;</span>);</span>\n<span id=\"cb7-28\"><a href=\"#cb7-28\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb7-29\"><a href=\"#cb7-29\" aria-hidden=\"true\"></a></span>\n<span id=\"cb7-30\"><a href=\"#cb7-30\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> rows;</span>\n<span id=\"cb7-31\"><a href=\"#cb7-31\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint32_t</span> columns;</span>\n<span id=\"cb7-32\"><a href=\"#cb7-32\" aria-hidden=\"true\"></a>    read_be(images, &amp;rows);</span>\n<span id=\"cb7-33\"><a href=\"#cb7-33\" aria-hidden=\"true\"></a>    read_be(images, &amp;columns);</span>\n<span id=\"cb7-34\"><a href=\"#cb7-34\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (rows != <span class=\"dv\">28</span> || columns != <span class=\"dv\">28</span>)</span>\n<span id=\"cb7-35\"><a href=\"#cb7-35\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb7-36\"><a href=\"#cb7-36\" aria-hidden=\"true\"></a>        <span class=\"cf\">throw</span> <span class=\"bu\">std::</span>runtime_error{</span>\n<span id=\"cb7-37\"><a href=\"#cb7-37\" aria-hidden=\"true\"></a>            <span class=\"st\">&quot;Expected 28x28 images, non-MNIST data supplied&quot;</span>};</span>\n<span id=\"cb7-38\"><a href=\"#cb7-38\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb7-39\"><a href=\"#cb7-39\" aria-hidden=\"true\"></a></span>\n<span id=\"cb7-40\"><a href=\"#cb7-40\" aria-hidden=\"true\"></a>    printf(<span class=\"st\">&quot;Loaded images file with </span><span class=\"sc\">%u</span><span class=\"st\"> entries</span><span class=\"sc\">\\n</span><span class=\"st\">&quot;</span>, <span class=\"va\">image_count_</span>);</span>\n<span id=\"cb7-41\"><a href=\"#cb7-41\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>Next, let’s 
implement <code>MNIST::read_next</code>, which consumes the next sample and label from the file streams:</p>\n<div class=\"sourceCode\" id=\"cb8\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb8-1\"><a href=\"#cb8-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> MNIST::read_next()</span>\n<span id=\"cb8-2\"><a href=\"#cb8-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb8-3\"><a href=\"#cb8-3\" aria-hidden=\"true\"></a>    <span class=\"va\">images_</span>.read(<span class=\"va\">buf_</span>, DIM);</span>\n<span id=\"cb8-4\"><a href=\"#cb8-4\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> inv = <span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>} / <span class=\"dt\">num_t</span>{<span class=\"fl\">255.0</span>};</span>\n<span id=\"cb8-5\"><a href=\"#cb8-5\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != DIM; ++i)</span>\n<span id=\"cb8-6\"><a href=\"#cb8-6\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb8-7\"><a href=\"#cb8-7\" aria-hidden=\"true\"></a>        <span class=\"va\">data_</span>[i] = <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">uint8_t</span>&gt;(<span class=\"va\">buf_</span>[i]) * inv;</span>\n<span id=\"cb8-8\"><a href=\"#cb8-8\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb8-9\"><a href=\"#cb8-9\" aria-hidden=\"true\"></a></span>\n<span id=\"cb8-10\"><a href=\"#cb8-10\" aria-hidden=\"true\"></a>    <span class=\"dt\">char</span> label;</span>\n<span id=\"cb8-11\"><a href=\"#cb8-11\" aria-hidden=\"true\"></a>    <span class=\"va\">labels_</span>.read(&amp;label, <span class=\"dv\">1</span>);</span>\n<span id=\"cb8-12\"><a href=\"#cb8-12\" aria-hidden=\"true\"></a></span>\n<span id=\"cb8-13\"><a href=\"#cb8-13\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span 
class=\"dv\">10</span>; ++i)</span>\n<span id=\"cb8-14\"><a href=\"#cb8-14\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb8-15\"><a href=\"#cb8-15\" aria-hidden=\"true\"></a>        <span class=\"va\">label_</span>[i] = <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb8-16\"><a href=\"#cb8-16\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb8-17\"><a href=\"#cb8-17\" aria-hidden=\"true\"></a>    <span class=\"va\">label_</span>[<span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">uint8_t</span>&gt;(label)] = <span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>};</span>\n<span id=\"cb8-18\"><a href=\"#cb8-18\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>Note that each label is stored in the file as a single unsigned byte holding the digit value (0 through 9), but we convert it to a one-hot encoding for the loss computation later. If your application can assume that the labels will be one-hot encoded, this conversion may not be necessary and a more efficient implementation is possible.</p>\n<p>To verify our work, let’s write up a quick-and-dirty ASCII printer for the last read digit and try our parser out. If you have a rendering backend (written in, say, Vulkan, D3D12, or OpenGL) 
at your disposal, you may wish to use that instead for a cleaner visualization.</p>\n<div class=\"sourceCode\" id=\"cb9\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb9-1\"><a href=\"#cb9-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> MNIST::print_last()</span>\n<span id=\"cb9-2\"><a href=\"#cb9-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb9-3\"><a href=\"#cb9-3\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"dv\">10</span>; ++i)</span>\n<span id=\"cb9-4\"><a href=\"#cb9-4\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb9-5\"><a href=\"#cb9-5\" aria-hidden=\"true\"></a>        <span class=\"cf\">if</span> (<span class=\"va\">label_</span>[i] == <span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>})</span>\n<span id=\"cb9-6\"><a href=\"#cb9-6\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb9-7\"><a href=\"#cb9-7\" aria-hidden=\"true\"></a>            printf(<span class=\"st\">&quot;This is a </span><span class=\"sc\">%zu</span><span class=\"st\">:</span><span class=\"sc\">\\n</span><span class=\"st\">&quot;</span>, i);</span>\n<span id=\"cb9-8\"><a href=\"#cb9-8\" aria-hidden=\"true\"></a>            <span class=\"cf\">break</span>;</span>\n<span id=\"cb9-9\"><a href=\"#cb9-9\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb9-10\"><a href=\"#cb9-10\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb9-11\"><a href=\"#cb9-11\" aria-hidden=\"true\"></a></span>\n<span id=\"cb9-12\"><a href=\"#cb9-12\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"dv\">28</span>; ++i)</span>\n<span id=\"cb9-13\"><a href=\"#cb9-13\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb9-14\"><a href=\"#cb9-14\" aria-hidden=\"true\"></a>        <span class=\"dt\">size_t</span> offset = i * <span 
class=\"dv\">28</span>;</span>\n<span id=\"cb9-15\"><a href=\"#cb9-15\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"dv\">28</span>; ++j)</span>\n<span id=\"cb9-16\"><a href=\"#cb9-16\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb9-17\"><a href=\"#cb9-17\" aria-hidden=\"true\"></a>            <span class=\"cf\">if</span> (<span class=\"va\">data_</span>[offset + j] &gt; <span class=\"dt\">num_t</span>{<span class=\"fl\">0.5</span>})</span>\n<span id=\"cb9-18\"><a href=\"#cb9-18\" aria-hidden=\"true\"></a>            {</span>\n<span id=\"cb9-19\"><a href=\"#cb9-19\" aria-hidden=\"true\"></a>                <span class=\"cf\">if</span> (<span class=\"va\">data_</span>[offset + j] &gt; <span class=\"dt\">num_t</span>{<span class=\"fl\">0.9</span>})</span>\n<span id=\"cb9-20\"><a href=\"#cb9-20\" aria-hidden=\"true\"></a>                {</span>\n<span id=\"cb9-21\"><a href=\"#cb9-21\" aria-hidden=\"true\"></a>                    printf(<span class=\"st\">&quot;#&quot;</span>);</span>\n<span id=\"cb9-22\"><a href=\"#cb9-22\" aria-hidden=\"true\"></a>                }</span>\n<span id=\"cb9-23\"><a href=\"#cb9-23\" aria-hidden=\"true\"></a>                <span class=\"cf\">else</span> <span class=\"cf\">if</span> (<span class=\"va\">data_</span>[offset + j] &gt; <span class=\"dt\">num_t</span>{<span class=\"fl\">0.7</span>})</span>\n<span id=\"cb9-24\"><a href=\"#cb9-24\" aria-hidden=\"true\"></a>                {</span>\n<span id=\"cb9-25\"><a href=\"#cb9-25\" aria-hidden=\"true\"></a>                    printf(<span class=\"st\">&quot;*&quot;</span>);</span>\n<span id=\"cb9-26\"><a href=\"#cb9-26\" aria-hidden=\"true\"></a>                }</span>\n<span id=\"cb9-27\"><a href=\"#cb9-27\" aria-hidden=\"true\"></a>                <span class=\"cf\">else</span></span>\n<span id=\"cb9-28\"><a href=\"#cb9-28\" aria-hidden=\"true\"></a>                
{</span>\n<span id=\"cb9-29\"><a href=\"#cb9-29\" aria-hidden=\"true\"></a>                    printf(<span class=\"st\">&quot;.&quot;</span>);</span>\n<span id=\"cb9-30\"><a href=\"#cb9-30\" aria-hidden=\"true\"></a>                }</span>\n<span id=\"cb9-31\"><a href=\"#cb9-31\" aria-hidden=\"true\"></a>            }</span>\n<span id=\"cb9-32\"><a href=\"#cb9-32\" aria-hidden=\"true\"></a>            <span class=\"cf\">else</span></span>\n<span id=\"cb9-33\"><a href=\"#cb9-33\" aria-hidden=\"true\"></a>            {</span>\n<span id=\"cb9-34\"><a href=\"#cb9-34\" aria-hidden=\"true\"></a>                printf(<span class=\"st\">&quot; &quot;</span>);</span>\n<span id=\"cb9-35\"><a href=\"#cb9-35\" aria-hidden=\"true\"></a>            }</span>\n<span id=\"cb9-36\"><a href=\"#cb9-36\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb9-37\"><a href=\"#cb9-37\" aria-hidden=\"true\"></a>        printf(<span class=\"st\">&quot;</span><span class=\"sc\">\\n</span><span class=\"st\">&quot;</span>);</span>\n<span id=\"cb9-38\"><a href=\"#cb9-38\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb9-39\"><a href=\"#cb9-39\" aria-hidden=\"true\"></a>    printf(<span class=\"st\">&quot;</span><span class=\"sc\">\\n</span><span class=\"st\">&quot;</span>);</span>\n<span id=\"cb9-40\"><a href=\"#cb9-40\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>On my machine, consuming the evaluation data and printing it produces the following result (the first sample from the test data is shown):</p>\n<pre><code>This is a 7:\n                            \n       *..                  \n      *#####********.       \n          .*#*####*##.      \n                   ##       \n                   #*       \n                  ##        \n                 .##        \n                 ##         \n                .#*         \n                *#          \n                #*          \n               ##           \n              *#.           
\n             *#*            \n             ##             \n            *#              \n           .##              \n           ###              \n           ##*              \n           #*</code></pre>\n<p>so we can be somewhat confident that our MNIST data ingestor is working properly. The only remaining routine we need to implement is <code>MNIST::forward</code>, which should consume the next sample and forward the data to all subsequent nodes in the graph.</p>\n<div class=\"sourceCode\" id=\"cb11\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb11-1\"><a href=\"#cb11-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> MNIST::forward(<span class=\"dt\">num_t</span>* data)</span>\n<span id=\"cb11-2\"><a href=\"#cb11-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb11-3\"><a href=\"#cb11-3\" aria-hidden=\"true\"></a>    read_next();</span>\n<span id=\"cb11-4\"><a href=\"#cb11-4\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (Node* node : <span class=\"va\">subsequents_</span>)</span>\n<span id=\"cb11-5\"><a href=\"#cb11-5\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb11-6\"><a href=\"#cb11-6\" aria-hidden=\"true\"></a>        node-&gt;forward(<span class=\"va\">data_</span>);</span>\n<span id=\"cb11-7\"><a href=\"#cb11-7\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb11-8\"><a href=\"#cb11-8\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>Such an interface ensures our <code>MNIST</code> node will be interoperable with networks that aren’t purely sequential.</p>\n<h3 id=\"the-feedforward-node\">The Feedforward Node</h3>\n<p>The hidden and output nodes have much in common and so will be implemented in terms of a single feedforward node class. The feedforward node will need a configurable activation function and dimensionality. 
Here’s the interface for the <code>FFNode</code>:</p>\n<div class=\"sourceCode\" id=\"cb12\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb12-1\"><a href=\"#cb12-1\" aria-hidden=\"true\"></a><span class=\"kw\">enum</span> <span class=\"kw\">class</span> Activation</span>\n<span id=\"cb12-2\"><a href=\"#cb12-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb12-3\"><a href=\"#cb12-3\" aria-hidden=\"true\"></a>    ReLU,</span>\n<span id=\"cb12-4\"><a href=\"#cb12-4\" aria-hidden=\"true\"></a>    Softmax</span>\n<span id=\"cb12-5\"><a href=\"#cb12-5\" aria-hidden=\"true\"></a>};</span>\n<span id=\"cb12-6\"><a href=\"#cb12-6\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-7\"><a href=\"#cb12-7\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> FFNode : <span class=\"kw\">public</span> Node</span>\n<span id=\"cb12-8\"><a href=\"#cb12-8\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb12-9\"><a href=\"#cb12-9\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb12-10\"><a href=\"#cb12-10\" aria-hidden=\"true\"></a>    <span class=\"co\">// A feedforward node is defined by the activation</span></span>\n<span id=\"cb12-11\"><a href=\"#cb12-11\" aria-hidden=\"true\"></a>    <span class=\"co\">// function and input/output dimensionality</span></span>\n<span id=\"cb12-12\"><a href=\"#cb12-12\" aria-hidden=\"true\"></a>    FFNode(Model&amp; model,</span>\n<span id=\"cb12-13\"><a href=\"#cb12-13\" aria-hidden=\"true\"></a>           <span class=\"bu\">std::</span>string name,</span>\n<span id=\"cb12-14\"><a href=\"#cb12-14\" aria-hidden=\"true\"></a>           Activation activation,</span>\n<span id=\"cb12-15\"><a href=\"#cb12-15\" aria-hidden=\"true\"></a>           <span class=\"dt\">uint16_t</span> output_size,</span>\n<span id=\"cb12-16\"><a href=\"#cb12-16\" aria-hidden=\"true\"></a>           <span class=\"dt\">uint16_t</span> input_size);</span>\n<span id=\"cb12-17\"><a href=\"#cb12-17\" 
aria-hidden=\"true\"></a></span>\n<span id=\"cb12-18\"><a href=\"#cb12-18\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> init(<span class=\"dt\">rne_t</span>&amp; rne) <span class=\"kw\">override</span>;</span>\n<span id=\"cb12-19\"><a href=\"#cb12-19\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-20\"><a href=\"#cb12-20\" aria-hidden=\"true\"></a>    <span class=\"co\">// The input data should have size input_size_</span></span>\n<span id=\"cb12-21\"><a href=\"#cb12-21\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> forward(<span class=\"dt\">num_t</span>* inputs) <span class=\"kw\">override</span>;</span>\n<span id=\"cb12-22\"><a href=\"#cb12-22\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-23\"><a href=\"#cb12-23\" aria-hidden=\"true\"></a>    <span class=\"co\">// The gradient data should have size output_size_</span></span>\n<span id=\"cb12-24\"><a href=\"#cb12-24\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> reverse(<span class=\"dt\">num_t</span>* gradients) <span class=\"kw\">override</span>;</span>\n<span id=\"cb12-25\"><a href=\"#cb12-25\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-26\"><a href=\"#cb12-26\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> param_count() <span class=\"at\">const</span> <span class=\"kw\">noexcept</span> <span class=\"kw\">override</span></span>\n<span id=\"cb12-27\"><a href=\"#cb12-27\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb12-28\"><a href=\"#cb12-28\" aria-hidden=\"true\"></a>        <span class=\"co\">// Weight matrix entries + bias entries</span></span>\n<span id=\"cb12-29\"><a href=\"#cb12-29\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> (<span class=\"va\">input_size_</span> + <span class=\"dv\">1</span>) * <span class=\"va\">output_size_</span>;</span>\n<span id=\"cb12-30\"><a href=\"#cb12-30\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb12-31\"><a href=\"#cb12-31\" 
aria-hidden=\"true\"></a></span>\n<span id=\"cb12-32\"><a href=\"#cb12-32\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span>* param(<span class=\"dt\">size_t</span> index);</span>\n<span id=\"cb12-33\"><a href=\"#cb12-33\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span>* gradient(<span class=\"dt\">size_t</span> index);</span>\n<span id=\"cb12-34\"><a href=\"#cb12-34\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-35\"><a href=\"#cb12-35\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> print() <span class=\"at\">const</span> <span class=\"kw\">override</span>;</span>\n<span id=\"cb12-36\"><a href=\"#cb12-36\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-37\"><a href=\"#cb12-37\" aria-hidden=\"true\"></a><span class=\"kw\">private</span>:</span>\n<span id=\"cb12-38\"><a href=\"#cb12-38\" aria-hidden=\"true\"></a>    Activation <span class=\"va\">activation_</span>;</span>\n<span id=\"cb12-39\"><a href=\"#cb12-39\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint16_t</span> <span class=\"va\">output_size_</span>;</span>\n<span id=\"cb12-40\"><a href=\"#cb12-40\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint16_t</span> <span class=\"va\">input_size_</span>;</span>\n<span id=\"cb12-41\"><a href=\"#cb12-41\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-42\"><a href=\"#cb12-42\" aria-hidden=\"true\"></a>    <span class=\"co\">/////////////////////</span></span>\n<span id=\"cb12-43\"><a href=\"#cb12-43\" aria-hidden=\"true\"></a>    <span class=\"co\">// Node Parameters //</span></span>\n<span id=\"cb12-44\"><a href=\"#cb12-44\" aria-hidden=\"true\"></a>    <span class=\"co\">/////////////////////</span></span>\n<span id=\"cb12-45\"><a href=\"#cb12-45\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-46\"><a href=\"#cb12-46\" aria-hidden=\"true\"></a>    <span class=\"co\">// weights_.size() := output_size_ * input_size_</span></span>\n<span id=\"cb12-47\"><a href=\"#cb12-47\" aria-hidden=\"true\"></a>    
<span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">weights_</span>;</span>\n<span id=\"cb12-48\"><a href=\"#cb12-48\" aria-hidden=\"true\"></a>    <span class=\"co\">// biases_.size() := output_size_</span></span>\n<span id=\"cb12-49\"><a href=\"#cb12-49\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">biases_</span>;</span>\n<span id=\"cb12-50\"><a href=\"#cb12-50\" aria-hidden=\"true\"></a>    <span class=\"co\">// activations_.size() := output_size_</span></span>\n<span id=\"cb12-51\"><a href=\"#cb12-51\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">activations_</span>;</span>\n<span id=\"cb12-52\"><a href=\"#cb12-52\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-53\"><a href=\"#cb12-53\" aria-hidden=\"true\"></a>    <span class=\"co\">////////////////////</span></span>\n<span id=\"cb12-54\"><a href=\"#cb12-54\" aria-hidden=\"true\"></a>    <span class=\"co\">// Loss Gradients //</span></span>\n<span id=\"cb12-55\"><a href=\"#cb12-55\" aria-hidden=\"true\"></a>    <span class=\"co\">////////////////////</span></span>\n<span id=\"cb12-56\"><a href=\"#cb12-56\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-57\"><a href=\"#cb12-57\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">activation_gradients_</span>;</span>\n<span id=\"cb12-58\"><a href=\"#cb12-58\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-59\"><a href=\"#cb12-59\" aria-hidden=\"true\"></a>    <span class=\"co\">// During the training cycle, parameter loss gradients are accumulated in</span></span>\n<span id=\"cb12-60\"><a href=\"#cb12-60\" aria-hidden=\"true\"></a>    <span class=\"co\">// the following buffers.</span></span>\n<span id=\"cb12-61\"><a href=\"#cb12-61\" aria-hidden=\"true\"></a>    <span 
class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">weight_gradients_</span>;</span>\n<span id=\"cb12-62\"><a href=\"#cb12-62\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">bias_gradients_</span>;</span>\n<span id=\"cb12-63\"><a href=\"#cb12-63\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-64\"><a href=\"#cb12-64\" aria-hidden=\"true\"></a>    <span class=\"co\">// This buffer is used to store temporary gradients used in a SINGLE</span></span>\n<span id=\"cb12-65\"><a href=\"#cb12-65\" aria-hidden=\"true\"></a>    <span class=\"co\">// backpropagation pass. Note that this does not accumulate like the weight</span></span>\n<span id=\"cb12-66\"><a href=\"#cb12-66\" aria-hidden=\"true\"></a>    <span class=\"co\">// and bias gradients do.</span></span>\n<span id=\"cb12-67\"><a href=\"#cb12-67\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">input_gradients_</span>;</span>\n<span id=\"cb12-68\"><a href=\"#cb12-68\" aria-hidden=\"true\"></a></span>\n<span id=\"cb12-69\"><a href=\"#cb12-69\" aria-hidden=\"true\"></a>    <span class=\"co\">// The last input is needed to compute loss gradients with respect to the</span></span>\n<span id=\"cb12-70\"><a href=\"#cb12-70\" aria-hidden=\"true\"></a>    <span class=\"co\">// weights during backpropagation</span></span>\n<span id=\"cb12-71\"><a href=\"#cb12-71\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span>* <span class=\"va\">last_input_</span>;</span>\n<span id=\"cb12-72\"><a href=\"#cb12-72\" aria-hidden=\"true\"></a>};</span></code></pre></div>\n<p>Compared to the <code>MNIST</code> node, the <code>FFNode</code> uses a lot more state to track all tunable parameters (weight matrix elements and biases), as well as the loss gradients corresponding to each parameter. 
The loss gradients must be kept because the parameters are only adjusted after <code>N</code> samples have been evaluated, where <code>N</code> is the chosen batch size in our stochastic gradient descent algorithm. If the purpose of some of the class members here is still opaque, they will show up again later when we implement backpropagation.</p>\n<p>First, we must decide how to initialize the weights and biases of our node. When deciding on a scheme, there are a few key principles to keep in mind. First, the initialization must not exhibit symmetry of any sort. For example, if all the parameters are initialized to the same value, the loss gradients with respect to all individual parameters will be identical, and our network will be no better than a network with a single parameter. In addition, we do not want the parameters to be initialized such that they are too large or too small. Most papers that discuss weight initialization strive to ensure that the loss gradients remain in a range where floating-point numbers retain precision (in the range <span class=\"math inline\">[1, 2)</span>). Another criterion is that parameters should generally be initialized such that they are roughly similar in magnitude. Parameters that deviate too far from the mean are likely to either dominate the loss gradients or produce too small a signal to contribute. Proper parameter initialization is but a small part of addressing a larger problem common in neural networks, known as the problem of <em>exploding and vanishing gradients</em>. 
Here, we present the implementation with a couple of references if you wish to dig deeper.</p>\n<div class=\"sourceCode\" id=\"cb13\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb13-1\"><a href=\"#cb13-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> FFNode::init(<span class=\"dt\">rne_t</span>&amp; rne)</span>\n<span id=\"cb13-2\"><a href=\"#cb13-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb13-3\"><a href=\"#cb13-3\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> sigma;</span>\n<span id=\"cb13-4\"><a href=\"#cb13-4\" aria-hidden=\"true\"></a>    <span class=\"cf\">switch</span> (<span class=\"va\">activation_</span>)</span>\n<span id=\"cb13-5\"><a href=\"#cb13-5\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb13-6\"><a href=\"#cb13-6\" aria-hidden=\"true\"></a>    <span class=\"cf\">case</span> Activation::ReLU:</span>\n<span id=\"cb13-7\"><a href=\"#cb13-7\" aria-hidden=\"true\"></a>        <span class=\"co\">// Kaiming He et al. 
weight initialization for ReLU networks</span></span>\n<span id=\"cb13-8\"><a href=\"#cb13-8\" aria-hidden=\"true\"></a>        <span class=\"co\">// https://arxiv.org/pdf/1502.01852.pdf</span></span>\n<span id=\"cb13-9\"><a href=\"#cb13-9\" aria-hidden=\"true\"></a>        <span class=\"co\">//</span></span>\n<span id=\"cb13-10\"><a href=\"#cb13-10\" aria-hidden=\"true\"></a>        <span class=\"co\">// Suggests using a normal distribution with variance := 2 / n_in</span></span>\n<span id=\"cb13-11\"><a href=\"#cb13-11\" aria-hidden=\"true\"></a>        sigma = <span class=\"bu\">std::</span>sqrt(<span class=\"fl\">2.0</span> / <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">num_t</span>&gt;(<span class=\"va\">input_size_</span>));</span>\n<span id=\"cb13-12\"><a href=\"#cb13-12\" aria-hidden=\"true\"></a>        <span class=\"cf\">break</span>;</span>\n<span id=\"cb13-13\"><a href=\"#cb13-13\" aria-hidden=\"true\"></a>    <span class=\"cf\">case</span> Activation::Softmax:</span>\n<span id=\"cb13-14\"><a href=\"#cb13-14\" aria-hidden=\"true\"></a>    <span class=\"cf\">default</span>:</span>\n<span id=\"cb13-15\"><a href=\"#cb13-15\" aria-hidden=\"true\"></a>        <span class=\"co\">// LeCun initialization as suggested in &quot;Self-Normalizing Neural</span></span>\n<span id=\"cb13-16\"><a href=\"#cb13-16\" aria-hidden=\"true\"></a>        <span class=\"co\">// Networks&quot;</span></span>\n<span id=\"cb13-17\"><a href=\"#cb13-17\" aria-hidden=\"true\"></a>        <span class=\"co\">// https://arxiv.org/pdf/1706.02515.pdf</span></span>\n<span id=\"cb13-18\"><a href=\"#cb13-18\" aria-hidden=\"true\"></a>        sigma = <span class=\"bu\">std::</span>sqrt(<span class=\"fl\">1.0</span> / <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">num_t</span>&gt;(<span class=\"va\">input_size_</span>));</span>\n<span id=\"cb13-19\"><a href=\"#cb13-19\" aria-hidden=\"true\"></a>        <span class=\"cf\">break</span>;</span>\n<span id=\"cb13-20\"><a 
href=\"#cb13-20\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb13-21\"><a href=\"#cb13-21\" aria-hidden=\"true\"></a></span>\n<span id=\"cb13-22\"><a href=\"#cb13-22\" aria-hidden=\"true\"></a>    <span class=\"co\">// </span><span class=\"al\">NOTE</span><span class=\"co\">: Unfortunately, the C++ standard does not guarantee that the results</span></span>\n<span id=\"cb13-23\"><a href=\"#cb13-23\" aria-hidden=\"true\"></a>    <span class=\"co\">// obtained from a distribution function will be identical given the same</span></span>\n<span id=\"cb13-24\"><a href=\"#cb13-24\" aria-hidden=\"true\"></a>    <span class=\"co\">// inputs across different compilers and platforms. A production ML</span></span>\n<span id=\"cb13-25\"><a href=\"#cb13-25\" aria-hidden=\"true\"></a>    <span class=\"co\">// framework will likely implement its own distributions to provide</span></span>\n<span id=\"cb13-26\"><a href=\"#cb13-26\" aria-hidden=\"true\"></a>    <span class=\"co\">// deterministic results.</span></span>\n<span id=\"cb13-27\"><a href=\"#cb13-27\" aria-hidden=\"true\"></a>    <span class=\"kw\">auto</span> dist = <span class=\"bu\">std::</span>normal_distribution&lt;<span class=\"dt\">num_t</span>&gt;{<span class=\"fl\">0.0</span>, sigma};</span>\n<span id=\"cb13-28\"><a href=\"#cb13-28\" aria-hidden=\"true\"></a></span>\n<span id=\"cb13-29\"><a href=\"#cb13-29\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">num_t</span>&amp; w : <span class=\"va\">weights_</span>)</span>\n<span id=\"cb13-30\"><a href=\"#cb13-30\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb13-31\"><a href=\"#cb13-31\" aria-hidden=\"true\"></a>        w = dist(rne);</span>\n<span id=\"cb13-32\"><a href=\"#cb13-32\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb13-33\"><a href=\"#cb13-33\" aria-hidden=\"true\"></a></span>\n<span id=\"cb13-34\"><a href=\"#cb13-34\" aria-hidden=\"true\"></a>    <span class=\"co\">// </span><span 
class=\"al\">NOTE</span><span class=\"co\">: Setting biases to zero is a common practice, as is initializing the</span></span>\n<span id=\"cb13-35\"><a href=\"#cb13-35\" aria-hidden=\"true\"></a>    <span class=\"co\">// bias to a small value (e.g. on the order of 0.01). It is unclear if the</span></span>\n<span id=\"cb13-36\"><a href=\"#cb13-36\" aria-hidden=\"true\"></a>    <span class=\"co\">// latter produces a consistent result over the former, but the thinking is</span></span>\n<span id=\"cb13-37\"><a href=\"#cb13-37\" aria-hidden=\"true\"></a>    <span class=\"co\">// that a non-zero bias will ensure that the neuron always &quot;fires&quot; at the</span></span>\n<span id=\"cb13-38\"><a href=\"#cb13-38\" aria-hidden=\"true\"></a>    <span class=\"co\">// beginning to produce a signal.</span></span>\n<span id=\"cb13-39\"><a href=\"#cb13-39\" aria-hidden=\"true\"></a>    <span class=\"co\">//</span></span>\n<span id=\"cb13-40\"><a href=\"#cb13-40\" aria-hidden=\"true\"></a>    <span class=\"co\">// Here, we initialize all biases to a small number, but the reader should</span></span>\n<span id=\"cb13-41\"><a href=\"#cb13-41\" aria-hidden=\"true\"></a>    <span class=\"co\">// consider experimenting with other approaches.</span></span>\n<span id=\"cb13-42\"><a href=\"#cb13-42\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">num_t</span>&amp; b : <span class=\"va\">biases_</span>)</span>\n<span id=\"cb13-43\"><a href=\"#cb13-43\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb13-44\"><a href=\"#cb13-44\" aria-hidden=\"true\"></a>        b = <span class=\"fl\">0.01</span>;</span>\n<span id=\"cb13-45\"><a href=\"#cb13-45\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb13-46\"><a href=\"#cb13-46\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>The common theme is that the distribution of random weights scales roughly as the inverse square root of the input vector size. 
This way, the distribution of the node’s output will fall in a “nice” range with respect to floating-point precision. Other initialization schemes are of course possible, and in some cases critical depending on the choice of activation function.</p>\n<p>With weights and biases initialized, it’s time to implement <code>FFNode::forward</code>. The plan is straightforward: for both the ReLU and softmax nodes, first perform the affine transform <span class=\"math inline\">\\mathbf{W}\\mathbf{x} + \\mathbf{b}</span>, then apply the activation function, which is either the linear rectifier or the softmax function. Here’s what this looks like:</p>\n<div class=\"sourceCode\" id=\"cb14\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb14-1\"><a href=\"#cb14-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> FFNode::forward(<span class=\"dt\">num_t</span>* inputs)</span>\n<span id=\"cb14-2\"><a href=\"#cb14-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb14-3\"><a href=\"#cb14-3\" aria-hidden=\"true\"></a>    <span class=\"co\">// Remember the last input data for backpropagation later</span></span>\n<span id=\"cb14-4\"><a href=\"#cb14-4\" aria-hidden=\"true\"></a>    <span class=\"va\">last_input_</span> = inputs;</span>\n<span id=\"cb14-5\"><a href=\"#cb14-5\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-6\"><a href=\"#cb14-6\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb14-7\"><a href=\"#cb14-7\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb14-8\"><a href=\"#cb14-8\" aria-hidden=\"true\"></a>        <span class=\"co\">// For each output, compute the dot product of the input data</span></span>\n<span id=\"cb14-9\"><a href=\"#cb14-9\" aria-hidden=\"true\"></a>        <span class=\"co\">// with the corresponding weight row, then add the bias</span></span>\n<span id=\"cb14-10\"><a 
href=\"#cb14-10\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-11\"><a href=\"#cb14-11\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span> z{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb14-12\"><a href=\"#cb14-12\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-13\"><a href=\"#cb14-13\" aria-hidden=\"true\"></a>        <span class=\"dt\">size_t</span> offset = i * <span class=\"va\">input_size_</span>;</span>\n<span id=\"cb14-14\"><a href=\"#cb14-14\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-15\"><a href=\"#cb14-15\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">input_size_</span>; ++j)</span>\n<span id=\"cb14-16\"><a href=\"#cb14-16\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb14-17\"><a href=\"#cb14-17\" aria-hidden=\"true\"></a>            z += <span class=\"va\">weights_</span>[offset + j] * inputs[j];</span>\n<span id=\"cb14-18\"><a href=\"#cb14-18\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb14-19\"><a href=\"#cb14-19\" aria-hidden=\"true\"></a>        <span class=\"co\">// Add neuron bias</span></span>\n<span id=\"cb14-20\"><a href=\"#cb14-20\" aria-hidden=\"true\"></a>        z += <span class=\"va\">biases_</span>[i];</span>\n<span id=\"cb14-21\"><a href=\"#cb14-21\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-22\"><a href=\"#cb14-22\" aria-hidden=\"true\"></a>        <span class=\"cf\">switch</span> (<span class=\"va\">activation_</span>)</span>\n<span id=\"cb14-23\"><a href=\"#cb14-23\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb14-24\"><a href=\"#cb14-24\" aria-hidden=\"true\"></a>        <span class=\"cf\">case</span> Activation::ReLU:</span>\n<span id=\"cb14-25\"><a href=\"#cb14-25\" aria-hidden=\"true\"></a>            <span class=\"va\">activations_</span>[i] = <span class=\"bu\">std::</span>max(z, <span class=\"dt\">num_t</span>{<span 
class=\"fl\">0.0</span>});</span>\n<span id=\"cb14-26\"><a href=\"#cb14-26\" aria-hidden=\"true\"></a>            <span class=\"cf\">break</span>;</span>\n<span id=\"cb14-27\"><a href=\"#cb14-27\" aria-hidden=\"true\"></a>        <span class=\"cf\">case</span> Activation::Softmax:</span>\n<span id=\"cb14-28\"><a href=\"#cb14-28\" aria-hidden=\"true\"></a>        <span class=\"cf\">default</span>:</span>\n<span id=\"cb14-29\"><a href=\"#cb14-29\" aria-hidden=\"true\"></a>            <span class=\"va\">activations_</span>[i] = <span class=\"bu\">std::</span>exp(z);</span>\n<span id=\"cb14-30\"><a href=\"#cb14-30\" aria-hidden=\"true\"></a>            <span class=\"cf\">break</span>;</span>\n<span id=\"cb14-31\"><a href=\"#cb14-31\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb14-32\"><a href=\"#cb14-32\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb14-33\"><a href=\"#cb14-33\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-34\"><a href=\"#cb14-34\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (<span class=\"va\">activation_</span> == Activation::Softmax)</span>\n<span id=\"cb14-35\"><a href=\"#cb14-35\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb14-36\"><a href=\"#cb14-36\" aria-hidden=\"true\"></a>        <span class=\"co\">// softmax(z)_i = exp(z_i) / \\sum_j(exp(z_j))</span></span>\n<span id=\"cb14-37\"><a href=\"#cb14-37\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span> sum_exp_z{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb14-38\"><a href=\"#cb14-38\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb14-39\"><a href=\"#cb14-39\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb14-40\"><a href=\"#cb14-40\" aria-hidden=\"true\"></a>            <span class=\"co\">// </span><span class=\"al\">NOTE</span><span class=\"co\">: with exploding 
gradients, it is quite easy for this</span></span>\n<span id=\"cb14-41\"><a href=\"#cb14-41\" aria-hidden=\"true\"></a>            <span class=\"co\">// exponential function to overflow, which will result in NaNs</span></span>\n<span id=\"cb14-42\"><a href=\"#cb14-42\" aria-hidden=\"true\"></a>            <span class=\"co\">// infecting the network.</span></span>\n<span id=\"cb14-43\"><a href=\"#cb14-43\" aria-hidden=\"true\"></a>            sum_exp_z += <span class=\"va\">activations_</span>[i];</span>\n<span id=\"cb14-44\"><a href=\"#cb14-44\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb14-45\"><a href=\"#cb14-45\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span> inv_sum_exp_z = <span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>} / sum_exp_z;</span>\n<span id=\"cb14-46\"><a href=\"#cb14-46\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb14-47\"><a href=\"#cb14-47\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb14-48\"><a href=\"#cb14-48\" aria-hidden=\"true\"></a>            <span class=\"va\">activations_</span>[i] *= inv_sum_exp_z;</span>\n<span id=\"cb14-49\"><a href=\"#cb14-49\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb14-50\"><a href=\"#cb14-50\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb14-51\"><a href=\"#cb14-51\" aria-hidden=\"true\"></a></span>\n<span id=\"cb14-52\"><a href=\"#cb14-52\" aria-hidden=\"true\"></a>    <span class=\"co\">// Forward activation data to all subsequent nodes in the computational</span></span>\n<span id=\"cb14-53\"><a href=\"#cb14-53\" aria-hidden=\"true\"></a>    <span class=\"co\">// graph</span></span>\n<span id=\"cb14-54\"><a href=\"#cb14-54\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (Node* subsequent : <span class=\"va\">subsequents_</span>)</span>\n<span id=\"cb14-55\"><a 
href=\"#cb14-55\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb14-56\"><a href=\"#cb14-56\" aria-hidden=\"true\"></a>        subsequent-&gt;forward(<span class=\"va\">activations_</span>.data());</span>\n<span id=\"cb14-57\"><a href=\"#cb14-57\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb14-58\"><a href=\"#cb14-58\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>As before, we forward all final results to all subsequent nodes even though there will only be a single subsequent node in this case. Whenever writing code as above, it is prudent to consider all potential corner cases which could result in the myriad issues that arise in floating-point computation:</p>\n<ul>\n<li>Loss of precision</li>\n<li>Floating point overflow and underflow</li>\n<li>Divide by zero</li>\n</ul>\n<p>Loss of precision easily occurs in a number of situations, such as subtracting two quantities of similar size, or adding quantities with greatly different magnitudes. Floating point overflow typically occurs when repeatedly performing an operation such that an accumulator explodes to <span class=\"math inline\">\\infty</span> or <span class=\"math inline\">-\\infty</span>, while underflow occurs when a result becomes too close to zero to be represented. In our case, the use of <code>std::exp</code> is the operation that stands out. We will not implement a stable softmax here, but the following identity can be used to improve its stability should you need it:</p>\n<p><span class=\"math display\">\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i = \\mathrm{softmax}(\\mathbf{z})_i</span></p>\n<p>In this expression, <span class=\"math inline\">\\mathbf{C}</span> is a constant vector whose elements are all equal to the same value <span class=\"math inline\">C</span>. 
Expanding the definition of softmax in the LHS gives:</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i &amp;= \\frac{\\exp{(z_i + C)}}{\\sum_j\\exp{(z_j + C)}} \\\\\n&amp;= \\frac\n    {\\exp{z_i}\\exp{C}}\n    {\\left(\\sum_j\\exp{z_j}\\right)\\exp C} \\\\\n&amp;= \\mathrm{softmax}(\\mathbf{z})_i &amp;&amp; \\blacksquare\n\\end{aligned}\n</span></p>\n<p>Thus, if we are concerned about saturating <code>std::exp</code> with a large argument, we can simply set <span class=\"math inline\">C</span> to be the additive inverse of the largest element of <span class=\"math inline\">\\mathbf{z}</span>, that is <span class=\"math inline\">C = -\\max_j z_j</span>. Performing this each time we apply softmax guarantees that the largest argument passed to <code>std::exp</code> is zero, which rules out overflow. Elements far below the maximum may still underflow to zero, but this is usually harmless.</p>\n<p>As a practical implementor’s trick, it is possible to enable floating point exception traps so that the program traps (on Linux, by raising <code>SIGFPE</code>) as soon as an invalid operation or an overflow occurs, rather than silently producing a <code>NaN</code> or an infinity. Using glibc for example, where <code>feenableexcept</code> is available as a GNU extension, we can trap floating point exceptions using</p>\n<div class=\"sourceCode\" id=\"cb15\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb15-1\"><a href=\"#cb15-1\" aria-hidden=\"true\"></a><span class=\"pp\">#include </span><span class=\"im\">&lt;cfenv&gt;</span></span>\n<span id=\"cb15-2\"><a href=\"#cb15-2\" aria-hidden=\"true\"></a></span>\n<span id=\"cb15-3\"><a href=\"#cb15-3\" aria-hidden=\"true\"></a>feenableexcept(FE_INVALID | FE_OVERFLOW);</span></code></pre></div>\n<p>It is also possible to trap exceptions only in the regions where you anticipate a potential issue (which improves the overall throughput of the network). 
In the interest of brevity, please consult your compiler’s documentation for how to do this.</p>\n<p>You may have noticed the first line of our routine:</p>\n<div class=\"sourceCode\" id=\"cb16\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb16-1\"><a href=\"#cb16-1\" aria-hidden=\"true\"></a><span class=\"va\">last_input_</span> = inputs;</span></code></pre></div>\n<p>Here, we retain a pointer to the data ingested by the feedforward node for a full training cycle, because it is needed again during backpropagation. Before delving into any derivations, let’s first present the code for the backpropagation of gradients through our feedforward node and dissect it immediately afterwards.</p>\n<div class=\"sourceCode\" id=\"cb17\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb17-1\"><a href=\"#cb17-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> FFNode::reverse(<span class=\"dt\">num_t</span>* gradients)</span>\n<span id=\"cb17-2\"><a href=\"#cb17-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb17-3\"><a href=\"#cb17-3\" aria-hidden=\"true\"></a>    <span class=\"co\">// First, we compute dJ/dz as dJ/dg(z) * dg(z)/dz and store it in our</span></span>\n<span id=\"cb17-4\"><a href=\"#cb17-4\" aria-hidden=\"true\"></a>    <span class=\"co\">// activation_gradients_ array</span></span>\n<span id=\"cb17-5\"><a href=\"#cb17-5\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb17-6\"><a href=\"#cb17-6\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb17-7\"><a href=\"#cb17-7\" aria-hidden=\"true\"></a>        <span class=\"co\">// dg(z)/dz</span></span>\n<span id=\"cb17-8\"><a href=\"#cb17-8\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span> activation_grad{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb17-9\"><a href=\"#cb17-9\" aria-hidden=\"true\"></a>        <span 
class=\"cf\">switch</span> (<span class=\"va\">activation_</span>)</span>\n<span id=\"cb17-10\"><a href=\"#cb17-10\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb17-11\"><a href=\"#cb17-11\" aria-hidden=\"true\"></a>        <span class=\"cf\">case</span> Activation::ReLU:</span>\n<span id=\"cb17-12\"><a href=\"#cb17-12\" aria-hidden=\"true\"></a>            <span class=\"cf\">if</span> (<span class=\"va\">activations_</span>[i] &gt; <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>})</span>\n<span id=\"cb17-13\"><a href=\"#cb17-13\" aria-hidden=\"true\"></a>            {</span>\n<span id=\"cb17-14\"><a href=\"#cb17-14\" aria-hidden=\"true\"></a>                activation_grad = <span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>};</span>\n<span id=\"cb17-15\"><a href=\"#cb17-15\" aria-hidden=\"true\"></a>            }</span>\n<span id=\"cb17-16\"><a href=\"#cb17-16\" aria-hidden=\"true\"></a>            <span class=\"cf\">else</span></span>\n<span id=\"cb17-17\"><a href=\"#cb17-17\" aria-hidden=\"true\"></a>            {</span>\n<span id=\"cb17-18\"><a href=\"#cb17-18\" aria-hidden=\"true\"></a>                activation_grad = <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb17-19\"><a href=\"#cb17-19\" aria-hidden=\"true\"></a>            }</span>\n<span id=\"cb17-20\"><a href=\"#cb17-20\" aria-hidden=\"true\"></a>            <span class=\"co\">// dJ/dz = dJ/dg(z) * dg(z)/dz</span></span>\n<span id=\"cb17-21\"><a href=\"#cb17-21\" aria-hidden=\"true\"></a>            <span class=\"va\">activation_gradients_</span>[i] = gradients[i] * activation_grad;</span>\n<span id=\"cb17-22\"><a href=\"#cb17-22\" aria-hidden=\"true\"></a>            <span class=\"cf\">break</span>;</span>\n<span id=\"cb17-23\"><a href=\"#cb17-23\" aria-hidden=\"true\"></a>        <span class=\"cf\">case</span> Activation::Softmax:</span>\n<span id=\"cb17-24\"><a href=\"#cb17-24\" aria-hidden=\"true\"></a>        <span 
class=\"cf\">default</span>:</span>\n<span id=\"cb17-25\"><a href=\"#cb17-25\" aria-hidden=\"true\"></a>            <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">output_size_</span>; ++j)</span>\n<span id=\"cb17-26\"><a href=\"#cb17-26\" aria-hidden=\"true\"></a>            {</span>\n<span id=\"cb17-27\"><a href=\"#cb17-27\" aria-hidden=\"true\"></a>                <span class=\"cf\">if</span> (i == j)</span>\n<span id=\"cb17-28\"><a href=\"#cb17-28\" aria-hidden=\"true\"></a>                {</span>\n<span id=\"cb17-29\"><a href=\"#cb17-29\" aria-hidden=\"true\"></a>                    activation_grad += <span class=\"va\">activations_</span>[i]</span>\n<span id=\"cb17-30\"><a href=\"#cb17-30\" aria-hidden=\"true\"></a>                                       * (<span class=\"dt\">num_t</span>{<span class=\"fl\">1.0</span>} - <span class=\"va\">activations_</span>[i])</span>\n<span id=\"cb17-31\"><a href=\"#cb17-31\" aria-hidden=\"true\"></a>                                       * gradients[j];</span>\n<span id=\"cb17-32\"><a href=\"#cb17-32\" aria-hidden=\"true\"></a>                }</span>\n<span id=\"cb17-33\"><a href=\"#cb17-33\" aria-hidden=\"true\"></a>                <span class=\"cf\">else</span></span>\n<span id=\"cb17-34\"><a href=\"#cb17-34\" aria-hidden=\"true\"></a>                {</span>\n<span id=\"cb17-35\"><a href=\"#cb17-35\" aria-hidden=\"true\"></a>                    activation_grad</span>\n<span id=\"cb17-36\"><a href=\"#cb17-36\" aria-hidden=\"true\"></a>                        += -<span class=\"va\">activations_</span>[i] * <span class=\"va\">activations_</span>[j] * gradients[j];</span>\n<span id=\"cb17-37\"><a href=\"#cb17-37\" aria-hidden=\"true\"></a>                }</span>\n<span id=\"cb17-38\"><a href=\"#cb17-38\" aria-hidden=\"true\"></a>            }</span>\n<span id=\"cb17-39\"><a href=\"#cb17-39\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-40\"><a 
href=\"#cb17-40\" aria-hidden=\"true\"></a>            <span class=\"va\">activation_gradients_</span>[i] = activation_grad;</span>\n<span id=\"cb17-41\"><a href=\"#cb17-41\" aria-hidden=\"true\"></a>            <span class=\"cf\">break</span>;</span>\n<span id=\"cb17-42\"><a href=\"#cb17-42\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb17-43\"><a href=\"#cb17-43\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb17-44\"><a href=\"#cb17-44\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-45\"><a href=\"#cb17-45\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb17-46\"><a href=\"#cb17-46\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb17-47\"><a href=\"#cb17-47\" aria-hidden=\"true\"></a>        <span class=\"co\">// dJ/db_i = dJ/dz_i * dz_i/db_i = dJ/dz_i</span></span>\n<span id=\"cb17-48\"><a href=\"#cb17-48\" aria-hidden=\"true\"></a>        <span class=\"va\">bias_gradients_</span>[i] += <span class=\"va\">activation_gradients_</span>[i];</span>\n<span id=\"cb17-49\"><a href=\"#cb17-49\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb17-50\"><a href=\"#cb17-50\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-51\"><a href=\"#cb17-51\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>fill(<span class=\"va\">input_gradients_</span>.begin(), <span class=\"va\">input_gradients_</span>.end(), <span class=\"dv\">0</span>);</span>\n<span id=\"cb17-52\"><a href=\"#cb17-52\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-53\"><a href=\"#cb17-53\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb17-54\"><a href=\"#cb17-54\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb17-55\"><a href=\"#cb17-55\" aria-hidden=\"true\"></a>        
<span class=\"dt\">size_t</span> offset = i * <span class=\"va\">input_size_</span>;</span>\n<span id=\"cb17-56\"><a href=\"#cb17-56\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">input_size_</span>; ++j)</span>\n<span id=\"cb17-57\"><a href=\"#cb17-57\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb17-58\"><a href=\"#cb17-58\" aria-hidden=\"true\"></a>            <span class=\"va\">input_gradients_</span>[j]</span>\n<span id=\"cb17-59\"><a href=\"#cb17-59\" aria-hidden=\"true\"></a>                += <span class=\"va\">weights_</span>[offset + j] * <span class=\"va\">activation_gradients_</span>[i];</span>\n<span id=\"cb17-60\"><a href=\"#cb17-60\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb17-61\"><a href=\"#cb17-61\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb17-62\"><a href=\"#cb17-62\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-63\"><a href=\"#cb17-63\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">input_size_</span>; ++i)</span>\n<span id=\"cb17-64\"><a href=\"#cb17-64\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb17-65\"><a href=\"#cb17-65\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">output_size_</span>; ++j)</span>\n<span id=\"cb17-66\"><a href=\"#cb17-66\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb17-67\"><a href=\"#cb17-67\" aria-hidden=\"true\"></a>            <span class=\"va\">weight_gradients_</span>[j * <span class=\"va\">input_size_</span> + i]</span>\n<span id=\"cb17-68\"><a href=\"#cb17-68\" aria-hidden=\"true\"></a>                += <span class=\"va\">last_input_</span>[i] * <span class=\"va\">activation_gradients_</span>[j];</span>\n<span id=\"cb17-69\"><a 
href=\"#cb17-69\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb17-70\"><a href=\"#cb17-70\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb17-71\"><a href=\"#cb17-71\" aria-hidden=\"true\"></a></span>\n<span id=\"cb17-72\"><a href=\"#cb17-72\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (Node* node : <span class=\"va\">antecedents_</span>)</span>\n<span id=\"cb17-73\"><a href=\"#cb17-73\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb17-74\"><a href=\"#cb17-74\" aria-hidden=\"true\"></a>        node-&gt;reverse(<span class=\"va\">input_gradients_</span>.data());</span>\n<span id=\"cb17-75\"><a href=\"#cb17-75\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb17-76\"><a href=\"#cb17-76\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>This code is likely more difficult to digest, so let’s break it down into parts. During reverse accumulation (aka backpropagation), we will be given the loss gradients with respect to all of the outputs from the most recent forward pass, written mathematically as <span class=\"math inline\">\\partial J_{CE}/\\partial a_i</span> for each output scalar <span class=\"math inline\">a_i</span>. 
Given that information, we need to perform the following tasks:</p>\n<ol type=\"1\">\n<li>Compute <span class=\"math inline\">\\partial J_{CE}/\\partial w_{ij}</span> for each weight in our weight matrix</li>\n<li>Compute <span class=\"math inline\">\\partial J_{CE}/\\partial b_i</span> for each bias in our bias vector</li>\n<li>Compute <span class=\"math inline\">\\partial J_{CE}/\\partial x_i</span> for each input scalar in the most recent forward pass</li>\n<li>Propagate all the loss gradients with respect to the inputs in step 3 back to the antecedent nodes</li>\n</ol>\n<p>As all outputs pass through an activation function, we will need the derivative of the activation, <span class=\"math inline\">\\partial g(\\mathbf{z})_j/\\partial z_i</span>, where <span class=\"math inline\">g</span> is either the linear rectifier or the softmax function. Both derivatives are computed in the background section, so we’ll just recite the results here. For the linear rectifier, <span class=\"math inline\">\\partial g(\\mathbf{z})_i/\\partial z_i</span> will simply be 1 if <span class=\"math inline\">a_i &gt; 0</span>, and 0 otherwise. The softmax gradient is slightly more involved: because every output of the softmax depends on every <span class=\"math inline\">z_i</span> and contributes additively to the loss, we require a sum over all outputs here:</p>\n<p><span class=\"math display\">\\frac{\\partial J_{CE}}{\\partial z_i} = \\sum_{j} \\frac{\\partial J_{CE}}{\\partial a_j} \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) &amp; i = j \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j &amp; i \\neq j\n\\end{cases}</span></p>\n<p>The factors <span class=\"math inline\">\\partial J_{CE}/\\partial a_j</span> come from the chain rule and are passed in from the subsequent node. 
These incoming gradients are scaled by the activation derivatives <span class=\"math inline\">\\partial a_i/\\partial z_i</span>, and the results are stored in <code>activation_gradients_</code> in the top portion of <code>FFNode::reverse</code>. Equivalently, by the chain rule, we are caching <span class=\"math inline\">\\partial J_{CE}/\\partial z_i</span> in <code>activation_gradients_</code> for each <span class=\"math inline\">i</span>. Because the loss gradients with respect to every parameter and input have a functional dependence on the activation function gradients, all results computed in tasks 1 through 4 above will depend on <code>activation_gradients_</code>.</p>\n<h4 id=\"computing-bias-gradients\">Computing bias gradients</h4>\n<p>The bias gradients are the easiest to compute due to how they show up in the expression. Since a node’s output is given as</p>\n<p><span class=\"math display\">a_i = g\\left(z_i\\right), \\qquad z_i = \\mathbf{W}_i \\cdot \\mathbf{x} + b_i</span></p>\n<p>for some activation function <span class=\"math inline\">g</span>, the derivative with respect to <span class=\"math inline\">b_i</span> is just</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\frac{\\partial{a_i}}{\\partial b_i} &amp;= \\frac{\\partial g}{\\partial z_i}\\frac{\\partial z_i}{\\partial b_i} \\\\\n&amp;= \\frac{\\partial g}{\\partial z_i}\n\\end{aligned}\n</span></p>\n<p>Thus, because <span class=\"math inline\">\\partial z_i/\\partial b_i = 1</span>, the loss gradient with respect to each bias is simply <span class=\"math inline\">\\partial J_{CE}/\\partial z_i</span>, and we can accumulate the result stored in <code>activation_gradients_</code> directly. Please take note! 
The code that performs this update is</p>\n<div class=\"sourceCode\" id=\"cb18\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb18-1\"><a href=\"#cb18-1\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb18-2\"><a href=\"#cb18-2\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb18-3\"><a href=\"#cb18-3\" aria-hidden=\"true\"></a>        <span class=\"va\">bias_gradients_</span>[i] += <span class=\"va\">activation_gradients_</span>[i];</span>\n<span id=\"cb18-4\"><a href=\"#cb18-4\" aria-hidden=\"true\"></a>    }</span></code></pre></div>\n<p>The following code would <em>not</em> be correct:</p>\n<div class=\"sourceCode\" id=\"cb19\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb19-1\"><a href=\"#cb19-1\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb19-2\"><a href=\"#cb19-2\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb19-3\"><a href=\"#cb19-3\" aria-hidden=\"true\"></a>        <span class=\"co\">// </span><span class=\"al\">NOTE</span><span class=\"co\">: WRONG! 
Will only work for batch sizes of 1</span></span>\n<span id=\"cb19-4\"><a href=\"#cb19-4\" aria-hidden=\"true\"></a>        <span class=\"va\">bias_gradients_</span>[i] = <span class=\"va\">activation_gradients_</span>[i];</span>\n<span id=\"cb19-5\"><a href=\"#cb19-5\" aria-hidden=\"true\"></a>    }</span></code></pre></div>\n<p>As the admonition in the comment suggests, while it’s helpful to conceptualize the loss gradient as something that resets every time we perform a forward and reverse pass of a training sample, in actuality, we require the gradients of the <em>cumulative mean loss accrued while evaluating the entire batch</em> for stochastic gradient descent. Luckily, because the losses per sample accumulate additively, the gradients of the loss with respect to all parameters in the model also update additively.</p>\n<h4 id=\"computing-the-weight-gradients\">Computing the weight gradients</h4>\n<p>The weight gradients are slightly more involved than the bias gradients, but are still relatively easy to compute with a bit of bookkeeping. For any given weight <span class=\"math inline\">w_{ij}</span>, we can observe that such a weight participates only in the evaluation of <span class=\"math inline\">z_i</span>. That is:</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\frac{\\partial \\mathbf{z}}{\\partial w_{ij}} &amp;= \\frac{\\partial z_i}{\\partial w_{ij}} \\\\\n&amp;= \\frac{\\partial \\left(\\mathbf{w}_{i} \\cdot \\mathbf{x} + b_i\\right)}{\\partial w_{ij}} \\\\\n&amp;= x_j \\\\\n\\end{aligned}\n</span></p>\n<p><span class=\"math display\">\n\\boxed{\\frac{\\partial J_{CE}}{\\partial w_{ij}} = \\frac{\\partial J_{CE}}{\\partial a_i}\\frac{\\partial a_i}{\\partial z_i}x_j}\n</span></p>\n<p>The boxed result shows the final loss gradient with respect to a weight parameter. 
The weight gradient accumulation appears in the following code, where all <span class=\"math inline\">N \\times M</span> weights are updated in a couple of nested loops:</p>\n<div class=\"sourceCode\" id=\"cb20\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb20-1\"><a href=\"#cb20-1\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">input_size_</span>; ++i)</span>\n<span id=\"cb20-2\"><a href=\"#cb20-2\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb20-3\"><a href=\"#cb20-3\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">output_size_</span>; ++j)</span>\n<span id=\"cb20-4\"><a href=\"#cb20-4\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb20-5\"><a href=\"#cb20-5\" aria-hidden=\"true\"></a>            <span class=\"va\">weight_gradients_</span>[j * <span class=\"va\">input_size_</span> + i]</span>\n<span id=\"cb20-6\"><a href=\"#cb20-6\" aria-hidden=\"true\"></a>                += <span class=\"va\">last_input_</span>[i] * <span class=\"va\">activation_gradients_</span>[j];</span>\n<span id=\"cb20-7\"><a href=\"#cb20-7\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb20-8\"><a href=\"#cb20-8\" aria-hidden=\"true\"></a>    }</span></code></pre></div>\n<h4 id=\"computing-the-input-gradients\">Computing the input gradients</h4>\n<p>The last set of gradients we need to compute are the loss gradients with respect to the inputs, to be forwarded to the antecedent node. This calculation is similar to the calculation of the weight gradients in terms of the linear dependence. However, it is important to note that a given input participates in the computation of <em>all</em> output scalars. 
Thus, we expect each individual input gradient to be a summation.</p>\n<p><span class=\"math display\">\n\\frac{\\partial J_{CE}}{\\partial x_i} = \\sum_j \\frac{\\partial J_{CE}}{\\partial a_j}\\frac{\\partial a_j}{\\partial z_j}w_{ji}\n</span></p>\n<p>The code that computes the input gradients is defined here:</p>\n<div class=\"sourceCode\" id=\"cb21\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb21-1\"><a href=\"#cb21-1\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>fill(<span class=\"va\">input_gradients_</span>.begin(), <span class=\"va\">input_gradients_</span>.end(), <span class=\"dv\">0</span>);</span>\n<span id=\"cb21-2\"><a href=\"#cb21-2\" aria-hidden=\"true\"></a></span>\n<span id=\"cb21-3\"><a href=\"#cb21-3\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">output_size_</span>; ++i)</span>\n<span id=\"cb21-4\"><a href=\"#cb21-4\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb21-5\"><a href=\"#cb21-5\" aria-hidden=\"true\"></a>        <span class=\"dt\">size_t</span> offset = i * <span class=\"va\">input_size_</span>;</span>\n<span id=\"cb21-6\"><a href=\"#cb21-6\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"va\">input_size_</span>; ++j)</span>\n<span id=\"cb21-7\"><a href=\"#cb21-7\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb21-8\"><a href=\"#cb21-8\" aria-hidden=\"true\"></a>            <span class=\"va\">input_gradients_</span>[j]</span>\n<span id=\"cb21-9\"><a href=\"#cb21-9\" aria-hidden=\"true\"></a>                += <span class=\"va\">weights_</span>[offset + j] * <span class=\"va\">activation_gradients_</span>[i];</span>\n<span id=\"cb21-10\"><a href=\"#cb21-10\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb21-11\"><a href=\"#cb21-11\" aria-hidden=\"true\"></a>    
}</span></code></pre></div>\n<p>Note that unlike the weight and bias gradients, which accumulate while training an entire batch of samples, the input gradients here are ephemeral and reset every pass since they only depend on the evaluation of an individual sample.</p>\n<p>Finally, to complete the <code>FFNode::reverse</code> method, the computed input gradients are passed backwards for use in an antecedent node’s gradient update (reproduced below). The code as presented <em>does not work</em> with non-sequential computational graphs, but is meant to provide a starting point for further experimentation.</p>\n<div class=\"sourceCode\" id=\"cb22\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb22-1\"><a href=\"#cb22-1\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (Node* node : <span class=\"va\">antecedents_</span>)</span>\n<span id=\"cb22-2\"><a href=\"#cb22-2\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb22-3\"><a href=\"#cb22-3\" aria-hidden=\"true\"></a>        node-&gt;reverse(<span class=\"va\">input_gradients_</span>.data());</span>\n<span id=\"cb22-4\"><a href=\"#cb22-4\" aria-hidden=\"true\"></a>    }</span></code></pre></div>\n<h3 id=\"the-categorical-cross-entropy-loss-node\">The Categorical Cross-Entropy Loss Node</h3>\n<p>The last node we need to implement is the node which computes the categorical cross-entropy of the prediction. 
A possible class definition for such a node is shown below:</p>\n<div class=\"sourceCode\" id=\"cb23\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb23-1\"><a href=\"#cb23-1\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> CCELossNode : <span class=\"kw\">public</span> Node</span>\n<span id=\"cb23-2\"><a href=\"#cb23-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb23-3\"><a href=\"#cb23-3\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb23-4\"><a href=\"#cb23-4\" aria-hidden=\"true\"></a>    CCELossNode(Model&amp; model,</span>\n<span id=\"cb23-5\"><a href=\"#cb23-5\" aria-hidden=\"true\"></a>                <span class=\"bu\">std::</span>string name,</span>\n<span id=\"cb23-6\"><a href=\"#cb23-6\" aria-hidden=\"true\"></a>                <span class=\"dt\">uint16_t</span> input_size,</span>\n<span id=\"cb23-7\"><a href=\"#cb23-7\" aria-hidden=\"true\"></a>                <span class=\"dt\">size_t</span> batch_size);</span>\n<span id=\"cb23-8\"><a href=\"#cb23-8\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-9\"><a href=\"#cb23-9\" aria-hidden=\"true\"></a>    <span class=\"co\">// No initialization is needed for this node</span></span>\n<span id=\"cb23-10\"><a href=\"#cb23-10\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> init(<span class=\"dt\">rne_t</span>&amp;) <span class=\"kw\">override</span> {}</span>\n<span id=\"cb23-11\"><a href=\"#cb23-11\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-12\"><a href=\"#cb23-12\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> forward(<span class=\"dt\">num_t</span>* inputs) <span class=\"kw\">override</span>;</span>\n<span id=\"cb23-13\"><a href=\"#cb23-13\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-14\"><a href=\"#cb23-14\" aria-hidden=\"true\"></a>    <span class=\"co\">// As a loss node, the argument to this method is ignored (the gradient of</span></span>\n<span id=\"cb23-15\"><a href=\"#cb23-15\" 
aria-hidden=\"true\"></a>    <span class=\"co\">// the loss with respect to itself is unity)</span></span>\n<span id=\"cb23-16\"><a href=\"#cb23-16\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> reverse(<span class=\"dt\">num_t</span>* gradients = <span class=\"kw\">nullptr</span>) <span class=\"kw\">override</span>;</span>\n<span id=\"cb23-17\"><a href=\"#cb23-17\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-18\"><a href=\"#cb23-18\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> print() <span class=\"at\">const</span> <span class=\"kw\">override</span>;</span>\n<span id=\"cb23-19\"><a href=\"#cb23-19\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-20\"><a href=\"#cb23-20\" aria-hidden=\"true\"></a>    <span class=\"co\">// During training, this must be set to the expected target distribution</span></span>\n<span id=\"cb23-21\"><a href=\"#cb23-21\" aria-hidden=\"true\"></a>    <span class=\"co\">// for a given sample</span></span>\n<span id=\"cb23-22\"><a href=\"#cb23-22\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> set_target(<span class=\"dt\">num_t</span> <span class=\"at\">const</span>* target)</span>\n<span id=\"cb23-23\"><a href=\"#cb23-23\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb23-24\"><a href=\"#cb23-24\" aria-hidden=\"true\"></a>        <span class=\"va\">target_</span> = target;</span>\n<span id=\"cb23-25\"><a href=\"#cb23-25\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb23-26\"><a href=\"#cb23-26\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-27\"><a href=\"#cb23-27\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> accuracy() <span class=\"at\">const</span>;</span>\n<span id=\"cb23-28\"><a href=\"#cb23-28\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> avg_loss() <span class=\"at\">const</span>;</span>\n<span id=\"cb23-29\"><a href=\"#cb23-29\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> reset_score();</span>\n<span 
id=\"cb23-30\"><a href=\"#cb23-30\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-31\"><a href=\"#cb23-31\" aria-hidden=\"true\"></a><span class=\"kw\">private</span>:</span>\n<span id=\"cb23-32\"><a href=\"#cb23-32\" aria-hidden=\"true\"></a>    <span class=\"dt\">uint16_t</span> <span class=\"va\">input_size_</span>;</span>\n<span id=\"cb23-33\"><a href=\"#cb23-33\" aria-hidden=\"true\"></a></span>\n<span id=\"cb23-34\"><a href=\"#cb23-34\" aria-hidden=\"true\"></a>    <span class=\"co\">// We minimize the average loss, not the net loss so that the losses</span></span>\n<span id=\"cb23-35\"><a href=\"#cb23-35\" aria-hidden=\"true\"></a>    <span class=\"co\">// produced do not scale with batch size (which allows us to keep training</span></span>\n<span id=\"cb23-36\"><a href=\"#cb23-36\" aria-hidden=\"true\"></a>    <span class=\"co\">// parameters constant)</span></span>\n<span id=\"cb23-37\"><a href=\"#cb23-37\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">inv_batch_size_</span>;</span>\n<span id=\"cb23-38\"><a href=\"#cb23-38\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">loss_</span>;</span>\n<span id=\"cb23-39\"><a href=\"#cb23-39\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"at\">const</span>* <span class=\"va\">target_</span>;</span>\n<span id=\"cb23-40\"><a href=\"#cb23-40\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span>* <span class=\"va\">last_input_</span>;</span>\n<span id=\"cb23-41\"><a href=\"#cb23-41\" aria-hidden=\"true\"></a>    <span class=\"co\">// Stores the last active classification in the target one-hot encoding</span></span>\n<span id=\"cb23-42\"><a href=\"#cb23-42\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> <span class=\"va\">active_</span>;</span>\n<span id=\"cb23-43\"><a href=\"#cb23-43\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">cumulative_loss_</span>{<span 
class=\"fl\">0.0</span>};</span>\n<span id=\"cb23-44\"><a href=\"#cb23-44\" aria-hidden=\"true\"></a>    <span class=\"co\">// Store running counts of correct and incorrect predictions</span></span>\n<span id=\"cb23-45\"><a href=\"#cb23-45\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> <span class=\"va\">correct_</span>   = <span class=\"dv\">0</span>;</span>\n<span id=\"cb23-46\"><a href=\"#cb23-46\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> <span class=\"va\">incorrect_</span> = <span class=\"dv\">0</span>;</span>\n<span id=\"cb23-47\"><a href=\"#cb23-47\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>vector&lt;<span class=\"dt\">num_t</span>&gt; <span class=\"va\">gradients_</span>;</span>\n<span id=\"cb23-48\"><a href=\"#cb23-48\" aria-hidden=\"true\"></a>};</span></code></pre></div>\n<p>The <code>CCELossNode</code> is similar to other nodes in that it implements a forward pass for computing the loss of a given sample, and a reverse pass to compute gradients of that loss and pass them back to the antecedent node. Distinct from the previous nodes is that the argument to <code>CCELossNode::reverse</code> is ignored as the loss node is not expected to have any subsequents.</p>\n<p>The implementation of <code>CCELossNode::forward</code> follows from the definition of cross-entropy, recalled here with some modifications:</p>\n<p><span class=\"math display\">J_{CE}(\\hat{\\mathbf{y}}, \\mathbf{y}) = -\\sum_j y_j \\log{\\left(\\max(\\hat{y}_j, \\epsilon) \\right)} </span></p>\n<p><span class=\"math inline\">J</span> is the common symbol ascribed to the cost or objective function, while <span class=\"math inline\">\\hat{y}</span> and <span class=\"math inline\">y</span> refer to the predicted distribution and correct distribution respectively. In addition, the argument of the logarithm is clamped with a small <span class=\"math inline\">\\epsilon</span> to avoid a numerical singularity. 
The implementation is as follows:</p>\n<div class=\"sourceCode\" id=\"cb24\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb24-1\"><a href=\"#cb24-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> CCELossNode::forward(<span class=\"dt\">num_t</span>* data)</span>\n<span id=\"cb24-2\"><a href=\"#cb24-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb24-3\"><a href=\"#cb24-3\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> max{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb24-4\"><a href=\"#cb24-4\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> max_index;</span>\n<span id=\"cb24-5\"><a href=\"#cb24-5\" aria-hidden=\"true\"></a></span>\n<span id=\"cb24-6\"><a href=\"#cb24-6\" aria-hidden=\"true\"></a>    <span class=\"va\">loss_</span> = <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb24-7\"><a href=\"#cb24-7\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">input_size_</span>; ++i)</span>\n<span id=\"cb24-8\"><a href=\"#cb24-8\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb24-9\"><a href=\"#cb24-9\" aria-hidden=\"true\"></a>        <span class=\"cf\">if</span> (data[i] &gt; max)</span>\n<span id=\"cb24-10\"><a href=\"#cb24-10\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb24-11\"><a href=\"#cb24-11\" aria-hidden=\"true\"></a>            max_index = i;</span>\n<span id=\"cb24-12\"><a href=\"#cb24-12\" aria-hidden=\"true\"></a>            max       = data[i];</span>\n<span id=\"cb24-13\"><a href=\"#cb24-13\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb24-14\"><a href=\"#cb24-14\" aria-hidden=\"true\"></a></span>\n<span id=\"cb24-15\"><a href=\"#cb24-15\" aria-hidden=\"true\"></a>        <span class=\"va\">loss_</span> -= <span class=\"va\">target_</span>[i]</span>\n<span id=\"cb24-16\"><a href=\"#cb24-16\" 
aria-hidden=\"true\"></a>                 * <span class=\"bu\">std::</span>log(</span>\n<span id=\"cb24-17\"><a href=\"#cb24-17\" aria-hidden=\"true\"></a>                     <span class=\"bu\">std::</span>max(data[i], <span class=\"bu\">std::</span>numeric_limits&lt;<span class=\"dt\">num_t</span>&gt;::epsilon()));</span>\n<span id=\"cb24-18\"><a href=\"#cb24-18\" aria-hidden=\"true\"></a></span>\n<span id=\"cb24-19\"><a href=\"#cb24-19\" aria-hidden=\"true\"></a>        <span class=\"cf\">if</span> (<span class=\"va\">target_</span>[i] != <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>})</span>\n<span id=\"cb24-20\"><a href=\"#cb24-20\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb24-21\"><a href=\"#cb24-21\" aria-hidden=\"true\"></a>            <span class=\"va\">active_</span> = i;</span>\n<span id=\"cb24-22\"><a href=\"#cb24-22\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb24-23\"><a href=\"#cb24-23\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb24-24\"><a href=\"#cb24-24\" aria-hidden=\"true\"></a></span>\n<span id=\"cb24-25\"><a href=\"#cb24-25\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (max_index == <span class=\"va\">active_</span>)</span>\n<span id=\"cb24-26\"><a href=\"#cb24-26\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb24-27\"><a href=\"#cb24-27\" aria-hidden=\"true\"></a>        ++<span class=\"va\">correct_</span>;</span>\n<span id=\"cb24-28\"><a href=\"#cb24-28\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb24-29\"><a href=\"#cb24-29\" aria-hidden=\"true\"></a>    <span class=\"cf\">else</span></span>\n<span id=\"cb24-30\"><a href=\"#cb24-30\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb24-31\"><a href=\"#cb24-31\" aria-hidden=\"true\"></a>        ++<span class=\"va\">incorrect_</span>;</span>\n<span id=\"cb24-32\"><a href=\"#cb24-32\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb24-33\"><a href=\"#cb24-33\" aria-hidden=\"true\"></a></span>\n<span 
id=\"cb24-34\"><a href=\"#cb24-34\" aria-hidden=\"true\"></a>    <span class=\"va\">cumulative_loss_</span> += <span class=\"va\">loss_</span>;</span>\n<span id=\"cb24-35\"><a href=\"#cb24-35\" aria-hidden=\"true\"></a></span>\n<span id=\"cb24-36\"><a href=\"#cb24-36\" aria-hidden=\"true\"></a>    <span class=\"co\">// Store the data pointer to compute gradients later</span></span>\n<span id=\"cb24-37\"><a href=\"#cb24-37\" aria-hidden=\"true\"></a>    <span class=\"va\">last_input_</span> = data;</span>\n<span id=\"cb24-38\"><a href=\"#cb24-38\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>As with the feedforward node, a pointer to the inputs to the node is preserved to compute gradients later. A bit of bookkeeping is also done so we can track accuracy and accumulate loss during a batch. The derivative of the loss of an individual sample with respect to the inputs is also fairly straightforward.</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\frac{\\partial J_{CE}}{\\partial{\\hat{y}_i}} &amp;= \\frac{\\partial \\left(-\\sum_j y_j\\log{\\left(\\max(\\hat{y}_j, \\epsilon)\\right)}\\right)}{\\partial \\hat{y}_i} \\\\\n&amp;= -\\frac{y_i}{\\max(\\hat{y}_i, \\epsilon)}\n\\end{aligned}\n</span></p>\n<p>The implementation is similarly straightforward. 
As with the other nodes with loss gradients, the loss gradients with respect to all inputs are forwarded to antecedent nodes.</p>\n<div class=\"sourceCode\" id=\"cb25\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb25-1\"><a href=\"#cb25-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> CCELossNode::reverse(<span class=\"dt\">num_t</span>* data)</span>\n<span id=\"cb25-2\"><a href=\"#cb25-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb25-3\"><a href=\"#cb25-3\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"va\">input_size_</span>; ++i)</span>\n<span id=\"cb25-4\"><a href=\"#cb25-4\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb25-5\"><a href=\"#cb25-5\" aria-hidden=\"true\"></a>        <span class=\"va\">gradients_</span>[i] = -<span class=\"va\">inv_batch_size_</span> * <span class=\"va\">target_</span>[i]</span>\n<span id=\"cb25-6\"><a href=\"#cb25-6\" aria-hidden=\"true\"></a>            / <span class=\"bu\">std::</span>max(<span class=\"va\">last_input_</span>[i], <span class=\"bu\">std::</span>numeric_limits&lt;<span class=\"dt\">num_t</span>&gt;::epsilon());</span>\n<span id=\"cb25-7\"><a href=\"#cb25-7\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb25-8\"><a href=\"#cb25-8\" aria-hidden=\"true\"></a></span>\n<span id=\"cb25-9\"><a href=\"#cb25-9\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (Node* node : <span class=\"va\">antecedents_</span>)</span>\n<span id=\"cb25-10\"><a href=\"#cb25-10\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb25-11\"><a href=\"#cb25-11\" aria-hidden=\"true\"></a>        node-&gt;reverse(<span class=\"va\">gradients_</span>.data());</span>\n<span id=\"cb25-12\"><a href=\"#cb25-12\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb25-13\"><a href=\"#cb25-13\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>One thing to keep in mind here is that 
this implementation is <em>not</em> the most efficient implementation possible for a softmax layer feeding to a cross-entropy loss function by any stretch. The code and derivation here are completely general for arbitrary sample probability distributions. If, however, we can assume that the target distribution is one-hot encoded, then all gradients in this node will either be 0 or <span class=\"math inline\">-1/\\hat{y}_k</span> where <span class=\"math inline\">k</span> is the active label in the one-hot target. Upon substitution in the previous layer, it should be clear that important cancellations are possible that dramatically simplify the gradient computations in the softmax layer. Here’s the simplification, again assuming that the <span class=\"math inline\">k</span>th index is the correct label:</p>\n<p><span class=\"math display\">\n\\begin{aligned}\n\\frac{\\partial J_{CE}}{\\partial \\mathrm{softmax}(\\mathbf{z})_i} &amp;= \\frac{\\partial J_{CE}}{\\partial a_i}\\sum_{j} \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) &amp; i = j \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j &amp; i \\neq j\n\\end{cases} \\\\\n&amp;= \\begin{dcases}\n-\\frac{\\mathrm{softmax}(\\mathbf{z})_k(1 - \\mathrm{softmax}(\\mathbf{z})_k)}{\\mathrm{softmax}(\\mathbf{z})_k} &amp; i = k \\\\\n\\frac{\\mathrm{softmax}(\\mathbf{z})_i\\mathrm{softmax}(\\mathbf{z})_k}{\\mathrm{softmax}(\\mathbf{z})_k} &amp; i \\neq k\\\\\n\\end{dcases} \\\\\n&amp;= \\begin{dcases}\n\\mathrm{softmax}(\\mathbf{z})_k - 1 &amp; i = k \\\\\n\\mathrm{softmax}(\\mathbf{z})_i &amp; i \\neq k\n\\end{dcases}\n\\end{aligned}\n</span></p>\n<p>When following the computation above, remember that <span class=\"math inline\">\\partial J_{CE} / \\partial a_i</span> is 0 for all <span class=\"math inline\">i \\neq k</span>. 
Thus, the only term in the sum that survives is the term corresponding to <span class=\"math inline\">j = k</span>, at which point we break out the differentiation depending on whether <span class=\"math inline\">i = k</span> or <span class=\"math inline\">i \\neq k</span>.</p>\n<p>This is an elegant result! Essentially, the gradient of the loss with respect to the softmax input behind an emitted probability <span class=\"math inline\">p(x)</span> is simply <span class=\"math inline\">p(x)</span> if <span class=\"math inline\">x</span> was not the correct label, and <span class=\"math inline\">p(x) - 1</span> if it was. Considering the effect of gradient descent, this should check out with our intuition. The optimizer seeks to suppress probabilities predicted that should have been 0, and increase probabilities predicted that should have been 1. Check for yourself that after gradient descent is performed, the gradients derived here will nudge the model in the appropriate direction.</p>\n<p>This sort of optimization highlights an important observation about backpropagation, namely, that backpropagation does not guarantee any sort of optimality beyond a worst-case performance ceiling. Several production neural networks have architectures that employ heuristics to identify optimizations such as this one, but the problem of generating a perfect computational strategy is NP-hard and so is not covered here. 
The code provided here will remain in the general form, despite being slower, in the interest of maintaining generality and not adding complexity, but you are encouraged to consider abstractions to permit this type of optimization in your own architecture (a useful keyword to aid your research is <em>common subexpression elimination</em> or <em>CSE</em> for short).</p>\n<p>The last thing we need to provide for <code>CCELossNode</code> are a few helper routines:</p>\n<div class=\"sourceCode\" id=\"cb26\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb26-1\"><a href=\"#cb26-1\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> CCELossNode::print() <span class=\"at\">const</span></span>\n<span id=\"cb26-2\"><a href=\"#cb26-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb26-3\"><a href=\"#cb26-3\" aria-hidden=\"true\"></a>    <span class=\"bu\">std::</span>printf(<span class=\"st\">&quot;Avg Loss: </span><span class=\"sc\">%f\\t%f%%</span><span class=\"st\"> correct</span><span class=\"sc\">\\n</span><span class=\"st\">&quot;</span>, avg_loss(), accuracy() * <span class=\"fl\">100.0</span>);</span>\n<span id=\"cb26-4\"><a href=\"#cb26-4\" aria-hidden=\"true\"></a>}</span>\n<span id=\"cb26-5\"><a href=\"#cb26-5\" aria-hidden=\"true\"></a></span>\n<span id=\"cb26-6\"><a href=\"#cb26-6\" aria-hidden=\"true\"></a><span class=\"dt\">num_t</span> CCELossNode::accuracy() <span class=\"at\">const</span></span>\n<span id=\"cb26-7\"><a href=\"#cb26-7\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb26-8\"><a href=\"#cb26-8\" aria-hidden=\"true\"></a>    <span class=\"cf\">return</span> <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">num_t</span>&gt;(<span class=\"va\">correct_</span>)</span>\n<span id=\"cb26-9\"><a href=\"#cb26-9\" aria-hidden=\"true\"></a>           / <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">num_t</span>&gt;(<span class=\"va\">correct_</span> + <span 
class=\"va\">incorrect_</span>);</span>\n<span id=\"cb26-10\"><a href=\"#cb26-10\" aria-hidden=\"true\"></a>}</span>\n<span id=\"cb26-11\"><a href=\"#cb26-11\" aria-hidden=\"true\"></a><span class=\"dt\">num_t</span> CCELossNode::avg_loss() <span class=\"at\">const</span></span>\n<span id=\"cb26-12\"><a href=\"#cb26-12\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb26-13\"><a href=\"#cb26-13\" aria-hidden=\"true\"></a>    <span class=\"cf\">return</span> <span class=\"va\">cumulative_loss_</span> / <span class=\"kw\">static_cast</span>&lt;<span class=\"dt\">num_t</span>&gt;(<span class=\"va\">correct_</span> + <span class=\"va\">incorrect_</span>);</span>\n<span id=\"cb26-14\"><a href=\"#cb26-14\" aria-hidden=\"true\"></a>}</span>\n<span id=\"cb26-15\"><a href=\"#cb26-15\" aria-hidden=\"true\"></a></span>\n<span id=\"cb26-16\"><a href=\"#cb26-16\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> CCELossNode::reset_score()</span>\n<span id=\"cb26-17\"><a href=\"#cb26-17\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb26-18\"><a href=\"#cb26-18\" aria-hidden=\"true\"></a>    <span class=\"va\">cumulative_loss_</span> = <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb26-19\"><a href=\"#cb26-19\" aria-hidden=\"true\"></a>    <span class=\"va\">correct_</span>         = <span class=\"dv\">0</span>;</span>\n<span id=\"cb26-20\"><a href=\"#cb26-20\" aria-hidden=\"true\"></a>    <span class=\"va\">incorrect_</span>       = <span class=\"dv\">0</span>;</span>\n<span id=\"cb26-21\"><a href=\"#cb26-21\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>These routines let us observe the performance of our network during training in terms of both loss and accuracy.</p>\n<h3 id=\"gradient-descent-optimizer\">Gradient Descent Optimizer</h3>\n<p>At some point after loss gradients with respect to model parameters have accumulated, the gradients will need to be used to actually adjust the parameters themselves. 
This is provided by the <code>GDOptimizer</code> class implemented as below:</p>\n<div class=\"sourceCode\" id=\"cb27\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb27-1\"><a href=\"#cb27-1\" aria-hidden=\"true\"></a><span class=\"kw\">class</span> GDOptimizer : <span class=\"kw\">public</span> Optimizer</span>\n<span id=\"cb27-2\"><a href=\"#cb27-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb27-3\"><a href=\"#cb27-3\" aria-hidden=\"true\"></a><span class=\"kw\">public</span>:</span>\n<span id=\"cb27-4\"><a href=\"#cb27-4\" aria-hidden=\"true\"></a>    <span class=\"co\">// &quot;Eta&quot; is the commonly accepted character used to denote the learning</span></span>\n<span id=\"cb27-5\"><a href=\"#cb27-5\" aria-hidden=\"true\"></a>    <span class=\"co\">// rate. Given a loss gradient dJ/dp for some parameter p, during gradient</span></span>\n<span id=\"cb27-6\"><a href=\"#cb27-6\" aria-hidden=\"true\"></a>    <span class=\"co\">// descent, p will be adjusted such that p&#39; = p - eta * dJ/dp.</span></span>\n<span id=\"cb27-7\"><a href=\"#cb27-7\" aria-hidden=\"true\"></a>    GDOptimizer(<span class=\"dt\">num_t</span> eta) : <span class=\"va\">eta_</span>{eta} {}</span>\n<span id=\"cb27-8\"><a href=\"#cb27-8\" aria-hidden=\"true\"></a></span>\n<span id=\"cb27-9\"><a href=\"#cb27-9\" aria-hidden=\"true\"></a>    <span class=\"co\">// This should be invoked at the end of each batch&#39;s evaluation. 
The</span></span>\n<span id=\"cb27-10\"><a href=\"#cb27-10\" aria-hidden=\"true\"></a>    <span class=\"co\">// interface technically permits the use of different optimizers for</span></span>\n<span id=\"cb27-11\"><a href=\"#cb27-11\" aria-hidden=\"true\"></a>    <span class=\"co\">// different segments of the computational graph.</span></span>\n<span id=\"cb27-12\"><a href=\"#cb27-12\" aria-hidden=\"true\"></a>    <span class=\"dt\">void</span> train(Node&amp; node) <span class=\"kw\">override</span>;</span>\n<span id=\"cb27-13\"><a href=\"#cb27-13\" aria-hidden=\"true\"></a></span>\n<span id=\"cb27-14\"><a href=\"#cb27-14\" aria-hidden=\"true\"></a><span class=\"kw\">private</span>:</span>\n<span id=\"cb27-15\"><a href=\"#cb27-15\" aria-hidden=\"true\"></a>    <span class=\"dt\">num_t</span> <span class=\"va\">eta_</span>;</span>\n<span id=\"cb27-16\"><a href=\"#cb27-16\" aria-hidden=\"true\"></a>};</span>\n<span id=\"cb27-17\"><a href=\"#cb27-17\" aria-hidden=\"true\"></a></span>\n<span id=\"cb27-18\"><a href=\"#cb27-18\" aria-hidden=\"true\"></a><span class=\"dt\">void</span> GDOptimizer::train(Node&amp; node)</span>\n<span id=\"cb27-19\"><a href=\"#cb27-19\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb27-20\"><a href=\"#cb27-20\" aria-hidden=\"true\"></a>    <span class=\"dt\">size_t</span> param_count = node.param_count();</span>\n<span id=\"cb27-21\"><a href=\"#cb27-21\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != param_count; ++i)</span>\n<span id=\"cb27-22\"><a href=\"#cb27-22\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb27-23\"><a href=\"#cb27-23\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span>&amp; param    = *node.param(i);</span>\n<span id=\"cb27-24\"><a href=\"#cb27-24\" aria-hidden=\"true\"></a>        <span class=\"dt\">num_t</span>&amp; gradient = *node.gradient(i);</span>\n<span id=\"cb27-25\"><a href=\"#cb27-25\" 
aria-hidden=\"true\"></a></span>\n<span id=\"cb27-26\"><a href=\"#cb27-26\" aria-hidden=\"true\"></a>        param = param - <span class=\"va\">eta_</span> * gradient;</span>\n<span id=\"cb27-27\"><a href=\"#cb27-27\" aria-hidden=\"true\"></a></span>\n<span id=\"cb27-28\"><a href=\"#cb27-28\" aria-hidden=\"true\"></a>        <span class=\"co\">// Reset the gradient which will be accumulated again in the next</span></span>\n<span id=\"cb27-29\"><a href=\"#cb27-29\" aria-hidden=\"true\"></a>        <span class=\"co\">// training epoch</span></span>\n<span id=\"cb27-30\"><a href=\"#cb27-30\" aria-hidden=\"true\"></a>        gradient = <span class=\"dt\">num_t</span>{<span class=\"fl\">0.0</span>};</span>\n<span id=\"cb27-31\"><a href=\"#cb27-31\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb27-32\"><a href=\"#cb27-32\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>Not shown is the <code>Optimizer</code> class interface which simply provides a virtual <code>train</code> method. As you implement more sophisticated optimizers, you will find that more state may be needed to perform necessary tasks (e.g. computing gradient moving averages). Also implicit in this implementation is that our <code>Node</code> classes need to provide an indexing scheme for each parameter as well as an accessor for the total number of parameters. 
For example, accessing the <code>FFNode</code> parameters is a fairly simple matter:</p>\n<div class=\"sourceCode\" id=\"cb28\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb28-1\"><a href=\"#cb28-1\" aria-hidden=\"true\"></a><span class=\"dt\">num_t</span>* FFNode::param(<span class=\"dt\">size_t</span> index)</span>\n<span id=\"cb28-2\"><a href=\"#cb28-2\" aria-hidden=\"true\"></a>{</span>\n<span id=\"cb28-3\"><a href=\"#cb28-3\" aria-hidden=\"true\"></a>    <span class=\"cf\">if</span> (index &lt; <span class=\"va\">weights_</span>.size())</span>\n<span id=\"cb28-4\"><a href=\"#cb28-4\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb28-5\"><a href=\"#cb28-5\" aria-hidden=\"true\"></a>        <span class=\"cf\">return</span> &amp;<span class=\"va\">weights_</span>[index];</span>\n<span id=\"cb28-6\"><a href=\"#cb28-6\" aria-hidden=\"true\"></a>    }</span>\n<span id=\"cb28-7\"><a href=\"#cb28-7\" aria-hidden=\"true\"></a>    <span class=\"cf\">return</span> &amp;<span class=\"va\">biases_</span>[index - <span class=\"va\">weights_</span>.size()];</span>\n<span id=\"cb28-8\"><a href=\"#cb28-8\" aria-hidden=\"true\"></a>}</span></code></pre></div>\n<p>The parameters are indexed 0 through the return value of <code>Node::param_count()</code> minus one. Note that the optimizer doesn’t care whether the parameter accessed in this way is a weight, bias, average, etc. As a trainable parameter, the only thing that matters during gradient descent is the current value and the loss gradient.</p>\n<h2 id=\"tying-it-all-together\">Tying it all Together</h2>\n<p>Now that we have the individual nodes implemented, all that remains is to wire things up and start training! 
This is how we can construct a model with input, hidden, output, and loss nodes, all wired sequentially.</p>\n<div class=\"sourceCode\" id=\"cb29\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb29-1\"><a href=\"#cb29-1\" aria-hidden=\"true\"></a>    Model model{<span class=\"st\">&quot;ff&quot;</span>};</span>\n<span id=\"cb29-2\"><a href=\"#cb29-2\" aria-hidden=\"true\"></a></span>\n<span id=\"cb29-3\"><a href=\"#cb29-3\" aria-hidden=\"true\"></a>    MNIST&amp; mnist = model.add_node&lt;MNIST&gt;(images, labels);</span>\n<span id=\"cb29-4\"><a href=\"#cb29-4\" aria-hidden=\"true\"></a></span>\n<span id=\"cb29-5\"><a href=\"#cb29-5\" aria-hidden=\"true\"></a>    FFNode&amp; hidden = model.add_node&lt;FFNode&gt;(<span class=\"st\">&quot;hidden&quot;</span>, Activation::ReLU, <span class=\"dv\">32</span>, <span class=\"dv\">784</span>);</span>\n<span id=\"cb29-6\"><a href=\"#cb29-6\" aria-hidden=\"true\"></a></span>\n<span id=\"cb29-7\"><a href=\"#cb29-7\" aria-hidden=\"true\"></a>    FFNode&amp; output</span>\n<span id=\"cb29-8\"><a href=\"#cb29-8\" aria-hidden=\"true\"></a>        = model.add_node&lt;FFNode&gt;(<span class=\"st\">&quot;output&quot;</span>, Activation::Softmax, <span class=\"dv\">10</span>, <span class=\"dv\">32</span>);</span>\n<span id=\"cb29-9\"><a href=\"#cb29-9\" aria-hidden=\"true\"></a></span>\n<span id=\"cb29-10\"><a href=\"#cb29-10\" aria-hidden=\"true\"></a>    CCELossNode&amp; loss = model.add_node&lt;CCELossNode&gt;(<span class=\"st\">&quot;loss&quot;</span>, <span class=\"dv\">10</span>, batch_size);</span>\n<span id=\"cb29-11\"><a href=\"#cb29-11\" aria-hidden=\"true\"></a>    loss.set_target(mnist.label());</span>\n<span id=\"cb29-12\"><a href=\"#cb29-12\" aria-hidden=\"true\"></a></span>\n<span id=\"cb29-13\"><a href=\"#cb29-13\" aria-hidden=\"true\"></a>    model.create_edge(hidden, mnist);</span>\n<span id=\"cb29-14\"><a href=\"#cb29-14\" aria-hidden=\"true\"></a>    model.create_edge(output, 
hidden);</span>\n<span id=\"cb29-15\"><a href=\"#cb29-15\" aria-hidden=\"true\"></a>    model.create_edge(loss, output);</span>\n<span id=\"cb29-16\"><a href=\"#cb29-16\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb29-17\"><a href=\"#cb29-17\" aria-hidden=\"true\"></a>    <span class=\"co\">// This function should visit all constituent nodes and initialize</span></span>\n<span id=\"cb29-18\"><a href=\"#cb29-18\" aria-hidden=\"true\"></a>    <span class=\"co\">// their parameters</span></span>\n<span id=\"cb29-19\"><a href=\"#cb29-19\" aria-hidden=\"true\"></a>    model.init();</span>\n<span id=\"cb29-20\"><a href=\"#cb29-20\" aria-hidden=\"true\"></a>    </span>\n<span id=\"cb29-21\"><a href=\"#cb29-21\" aria-hidden=\"true\"></a>    <span class=\"co\">// Create a gradient descent optimizer with a hardcoded learning rate</span></span>\n<span id=\"cb29-22\"><a href=\"#cb29-22\" aria-hidden=\"true\"></a>    GDOptimizer optimizer{<span class=\"dt\">num_t</span>{<span class=\"fl\">0.3</span>}};</span></code></pre></div>\n<p>As mentioned before, the “edges” are somewhat cosmetic as none of our nodes actually support multiple node inputs or outputs. An actual implementation that would support such a non-sequential topology will likely need a sort of signals and slots abstraction. 
The interface provided here is strictly to impress on you the importance of the abstraction of our neural network as a computational graph, which is critical when additional complexity is added later.</p>\n<p>With this, we are ready to implement the core loop of the training algorithm.</p>\n<div class=\"sourceCode\" id=\"cb30\"><pre class=\"sourceCode cpp\"><code class=\"sourceCode cpp\"><span id=\"cb30-1\"><a href=\"#cb30-1\" aria-hidden=\"true\"></a>    <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> i = <span class=\"dv\">0</span>; i != <span class=\"dv\">256</span>; ++i)</span>\n<span id=\"cb30-2\"><a href=\"#cb30-2\" aria-hidden=\"true\"></a>    {</span>\n<span id=\"cb30-3\"><a href=\"#cb30-3\" aria-hidden=\"true\"></a>        <span class=\"cf\">for</span> (<span class=\"dt\">size_t</span> j = <span class=\"dv\">0</span>; j != <span class=\"dv\">64</span>; ++j)</span>\n<span id=\"cb30-4\"><a href=\"#cb30-4\" aria-hidden=\"true\"></a>        {</span>\n<span id=\"cb30-5\"><a href=\"#cb30-5\" aria-hidden=\"true\"></a>            mnist.forward();</span>\n<span id=\"cb30-6\"><a href=\"#cb30-6\" aria-hidden=\"true\"></a>            loss.reverse();</span>\n<span id=\"cb30-7\"><a href=\"#cb30-7\" aria-hidden=\"true\"></a>        }</span>\n<span id=\"cb30-8\"><a href=\"#cb30-8\" aria-hidden=\"true\"></a></span>\n<span id=\"cb30-9\"><a href=\"#cb30-9\" aria-hidden=\"true\"></a>        model.train(optimizer);</span>\n<span id=\"cb30-10\"><a href=\"#cb30-10\" aria-hidden=\"true\"></a>    }</span></code></pre></div>\n<p>Here, we train our model over 256 batches. Each batch consists of 64 samples, and for each sample, we invoke <code>MNIST::forward</code> and <code>CCELossNode::reverse</code>. During the forward pass, our <code>MNIST</code> node extracts a new sample and label and forwards the sample data to the next node. This data propagates through the network until the final output distribution is passed to the loss node and losses are computed. 
All this occurs within the single line: <code>mnist.forward()</code>. In the subsequent line, gradients are computed and passed back until the reverse accumulation terminates at the <code>MNIST</code> node again. After all gradients for the batch are accumulated, the model can <code>train</code>, which invokes the optimizer on each node to adjust all of that node’s parameters.</p>\n<p>After adding some additional logging, the results of the network look like this:</p>\n<pre><code>Executing training routine\nLoaded images file with 60000 entries\nhidden: 784 -&gt; 32\noutput: 32 -&gt; 10\nInitializing model parameters with seed: 116726080\nAvg Loss: 0.254111  96.875000% correct</code></pre>\n<p>To evaluate the efficacy of the model, we can serialize all the parameters to disk, load them up, disable the training step, and evaluate the model on the test data. For this particular run, the results were as follows:</p>\n<pre><code>Executing evaluation routine\nLoaded images file with 10000 entries\nhidden: 784 -&gt; 32\noutput: 32 -&gt; 10\nAvg Loss: 0.292608  91.009998% correct</code></pre>\n<p>As you can see, the accuracy dropped on the test data relative to the training data. This is a hallmark characteristic of <em>overfitting</em>, which is to be expected given that we haven’t implemented any regularization whatsoever! That said, 91% accuracy isn’t all that bad when we consider the fact that our model has no notion of pixel-adjacency whatsoever. For image data, convolutional networks are a far more apt architecture than the one chosen for this demonstration.</p>\n<h3 id=\"regularization\">Regularization</h3>\n<p>Regularization will not be implemented as part of this self-contained neural network, but it is such a fundamental part of most deep learning frameworks that we’ll discuss it here.</p>\n<p>Often, the dimensionality of our model will be much higher than what is strictly needed to make accurate predictions. 
This stems from the fact that we seldom know a priori how many features are needed for the model to be successful. Thus, the likelihood of overfitting increases as more training data is fed into the model. The primary tool to combat overfitting is <em>regularization</em>. Loosely speaking, regularization is any strategy employed to restrict the hypothesis space of fit-functions the model can occupy to prevent overfitting.</p>\n<p>What is meant by restricting the hypothesis space, you might ask? The idea is to consider the entire family of functions spanned by the model’s entire parameter vector. If our model has 10000 parameters (many networks will easily exceed this), each unique 10000-dimensional vector corresponds to a possible solution. However, we know it’s unlikely that certain parameters should be vastly greater in magnitude than others in a theoretically <em>optimal</em> condition. Models with “strange” parameter vectors that are unlikely to be the optimal solution are likely converged on as a result of overfitting. Therefore, it makes sense to consider ways to constrain the space this parameter vector may occupy.</p>\n<p>The most common approach to achieve this is to add a penalty term to the loss function that is a function of the weights. For example, here is the cross-entropy loss with the so-called <span class=\"math inline\">L^2</span> regularizer (also known as the ridge regularizer) added:</p>\n<p><span class=\"math display\">-\\sum_{x\\in X} y_x \\log{\\hat{y}_x} + \\frac{\\lambda}{2} \\mathbf{w}^{T}\\mathbf{w}</span></p>\n<p>In a slight abuse of notation, <span class=\"math inline\">\\mathbf{w}</span> here corresponds to a vector containing every weight in our network. The factor <span class=\"math inline\">\\lambda</span> is a constant we can choose to adjust the penalty size. Note that when a regularizer is used, we <em>expect training loss to increase</em>. 
The tradeoff is that we simultaneously <em>expect test loss to decrease</em>. Tuning the regularization rate <span class=\"math inline\">\\lambda</span> is a routine problem for model fitting in the wild.</p>\n<p>By modifying the loss function, in principle, all loss gradients must change as well. Fortunately, as we’ve only added a quadratic term to the loss, the only change to the gradient will be an additional linear term <span class=\"math inline\">\\lambda\\mathbf{w}</span>. This means we don’t have to add a ton of code to modify all the gradient calculations thus far. Instead, we can simply <em>decay</em> the weight based on a percentage of the weight’s magnitude when we adjust the weight after each batch is performed. You will often hear this type of regularization referred to as simply <em>weight decay</em> for this reason.</p>\n<p>To implement <span class=\"math inline\">L^2</span> regularization, simply add a percentage of a weight’s value to its loss gradient. Crucially, do not adjust bias parameters in the same way. We only wish to penalize parameters for which increased magnitude corresponds with more complex models. Bias parameters are simply scalar offsets regardless of their value, and do not scale the inputs. Thus, attempting to regularize them will likely increase <em>both</em> training and test error.</p>\n<h2 id=\"where-to-go-from-here\">Where to go from here</h2>\n<p>At this point, our toy network is complete. With any luck, you’ve taken away a few key patterns that will aid in both your intuition about how deep learning techniques work, and your efforts to actually implement them. The implementation presented here is both far from complete, and far from ideal. Critically missing is adequate visualization for the error rate as a function of training time, mis-predicted samples, and the model parameters themselves. Without visualization, model tuning can be time consuming, verging on impossible. 
In addition, our model training samples are always ingested in the order they are provided in the training file. In practice, this sequence should be shuffled to avoid introducing training bias.</p>\n<p>Here are a few additional things you can try, in no particular order.</p>\n<ul>\n<li>Add various regularization modes such as <span class=\"math inline\">L^2</span>, <span class=\"math inline\">L^1</span>, or dropout.</li>\n<li>Track loss reduction momentum to implement <em>early stopping</em>, thereby reducing wasted training cycles</li>\n<li>Implement a convolution node with a variable sized weight filter. You will likely need to implement the max-pooling operation as well.</li>\n<li>Implement a batch-normalization node.</li>\n<li>Modify the interfaces provided here so that <code>Node::forward</code> and <code>Node::reverse</code> also pass slot ids to handle nodes with multiple inputs and outputs.</li>\n<li>Leverage the slots abstraction above to implement a residual network.</li>\n<li>Improve efficiency by adding support for SIMD or GPU-based compute kernels.</li>\n<li>Add multithreading to allow separate batches to be trained simultaneously.</li>\n<li>Provide alternative optimizers that decay the learning rate over time, or decay the learning rate as a function of loss momentum.</li>\n<li>Add a “meta-training” feature that can tune <em>hyperparameters</em> used to configure your model (e.g. learning rate, regularization rate, network depth, layer dimension).</li>\n<li>Pick a research paper you’re interested in and endeavor to implement it end to end.</li>\n</ul>\n<p>As you can see, the sky’s the limit and there is simply no end to the amount of work possible to improve a neural network’s ability to learn and make inferences. A good body of work is also there to improve tooling around data ingestion, model configuration serialization, automated testing, continuous learning in the cloud, etc. 
Crucially though, new research and development is constantly in the works in this ever-changing field. On top of studying deep learning as a discipline in and of itself, there is plenty of room for specialization in particular domains, be it computer vision, NLP, epidemiology, or something else. My hope is that for some of you, the neural network in a weekend may take the form of a neural network in a fulfilling career or lifetime.</p>\n<h4 id=\"further-reading\">Further Reading</h4>\n<p>If you get a single book, <em>Deep Learning</em> (listed first in the following table) is highly recommended as a relatively self-contained text with cogent explanations written in a readable style. As you venture into attempting to perform ML tasks in a particular domain, search for a relatively recent highly cited “survey” paper, which should introduce you to the main ideas and give you a starting point for further research. <a href=\"https://arxiv.org/pdf/1907.09408.pdf\">Here</a> is an example of one such survey paper, in this case with an emphasis on object detection.</p>\n<table>\n<colgroup>\n<col style=\"width: 33%\" />\n<col style=\"width: 33%\" />\n<col style=\"width: 33%\" />\n</colgroup>\n<thead>\n<tr class=\"header\">\n<th>Title</th>\n<th>Authors</th>\n<th>Description</th>\n</tr>\n</thead>\n<tbody>\n<tr class=\"odd\">\n<td><em>Deep Learning</em></td>\n<td>Ian Goodfellow, Yoshua Bengio, and Aaron Courville</td>\n<td>Seminal text on the theory and practice of using neural networks to learn and perform tasks</td>\n</tr>\n<tr class=\"even\">\n<td><em>Numerical Methods for Scientists and Engineers</em></td>\n<td>R. W. 
Hamming</td>\n<td>Excellent general text covering important topics such as floating point precision and various approximation methods</td>\n</tr>\n<tr class=\"odd\">\n<td><em>Standard notations for Deep Learning</em> (<a href=\"https://cs230.stanford.edu/files/Notation.pdf\">link</a>)</td>\n<td>Stanford CS230 Course Notes</td>\n<td>Cheatsheet covering standard notation used by many texts and papers</td>\n</tr>\n<tr class=\"even\">\n<td><em>Neural Networks and Deep Learning</em> (<a href=\"http://neuralnetworksanddeeplearning.com/index.html\">link</a>)</td>\n<td>Michael Nielsen</td>\n<td>A gentler introduction to the theory and practice of neural networks</td>\n</tr>\n<tr class=\"odd\">\n<td><em>Automatic Differentiation in Machine Learning: a Survey</em> (<a href=\"https://arxiv.org/pdf/1502.05767.pdf\">link</a>)</td>\n<td>Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind</td>\n<td>Excellent survey paper documenting the various algorithms used for computational differentiation including viable alternatives to backpropagation</td>\n</tr>\n</tbody>\n</table>\n</body>\n</html>\n"
  },
  {
    "path": "doc/DOC.md",
    "content": "---\ntitle: C++ Neural Network in a Weekend\nauthor: Jeremy Ong\nheader-includes: |\n  \\usepackage{amsmath}\n  \\usepackage{tikz}\n  \\usepackage{mathtools}\n  \\usepackage{amsthm}\n  \\usepackage{amssymb}\n  \\usepackage{bm}\n  \\usetikzlibrary{positioning}\n  \\usetikzlibrary{arrows}\n  \\usetikzlibrary{shapes}\n  \\usetikzlibrary{calc}\n---\n\n## Introduction\n\nWould you like to write a neural network from start to finish?\nAre you perhaps shaky on some of the fundamental concepts and derivations, such as categorical cross-entropy loss or backpropagation?\nAlternatively, would you like an introduction to machine learning without relying on \"magical\" frameworks that seem to perform AI miracles with only a few lines of code (and just as little intuition)?\nIf so, this article was written for you.\n\nDeep learning as a technology and discipline has been booming.\nNearly every facet of deep learning is teeming with progress and healthy competition to achieve state of the art performance and efficiency.\nIt's no surprise that resources tend to emphasize the \"latest and greatest\" in feats such as object recognition, natural language parsing, \"deep fakes\", and more.\nIn contrast, fewer resources expand as much on the practical *engineering* aspects of deep learning.\nThat is, how should a deep learning framework be structured?\nHow do you go about rolling your own infrastructure instead of relying on Keras, Pytorch, Tensorflow, or any of the other dominant frameworks?\nWhether you wish to write your own for learning purposes,\nor if you need to deploy a neural network on a constrained (i.e. 
embedded) device,\nthere is plenty to be gained from authoring a neural network from scratch.\n\nThe neural network outlined here is hosted on [github](https://github.com/jeremyong/cpp_nn_in_a_weekend) and has enough abstractions to vaguely resemble a production network, without being overly engineered as to be indigestible in a sitting or two.\nThe training and test data provided is the venerable [MNIST](http://yann.lecun.com/exdb/mnist/) dataset of handwritten digits.\nWhile more exotic (and original) datasets exist, MNIST is chosen here because its sheer ubiquity guarantees you can find corresponding literature to help drive further experimentation, or troubleshoot when things go wrong.\n\n## Background\n\nThis section serves as a moderately high-level description of the major mathematical underpinnings of neural networks and may be safely skipped by those who prefer to jump straight to the code.\n\nSuppose we have a task we would like a machine learning model to complete (e.g. recognizing handwritten digits).\nAt a high level, we need to perform the following tasks:\n\n1. First, we must conceptualize the task as a \"function\" such that the inputs and outputs of the task can be described in a concrete mathematical sense (amenable for programmability).\n2. Second, we need a way to quantify the degree to which our model is performing poorly against a known set of correct answers. This is typically denoted as the *loss* or *objective* function of the model.\n3. Third, we need an *optimization strategy* which will describe how to adjust the model after feedback is provided regarding the model's performance as per the loss function described above.\n4. Fourth, we need a *regularization strategy* to address inadvertently tuning the model with a high degree of specificity to our training data, at the cost of generalized performance when handling inputs not yet encountered.\n5. 
Fifth, we need an *architecture* for our model, including how inputs are transformed into outputs and an enumeration of all the adjustable parameters the model supports.\n6. Finally, we need a robust *implementation* that executes the above within memory and execution budgets, accounting for floating-point stability, reproducibility, and a number of other engineering-related matters.\n\n*Deep learning* is distinct from other machine learning models in that the architecture is heavily over-parameterized and based on simpler *building blocks* as opposed to bespoke components.\nThe building blocks used are neurons, or particular arrangements of neurons, typically organized as layers.\nOver the course of training a deep learning model, it is expected that *features* of the inputs are learned and manifested as various parameter values in these neurons.\nThis is in contrast to traditional machine learning, where features are not learned, but implemented directly.\n\n### Categorical Cross-Entropy Loss\n\nMore concretely, the task at hand is to train a model to recognize a 28 by 28 pixel handwritten greyscale digit.\nFor simplicity, our model will interpret the data as a flattened 784-dimensional vector.\nInstead of describing the architecture of the model first, we'll start with understanding what the model should output\nand how to assess the model's performance.\nThe output of our model will be a 10-dimensional vector, representing a probability distribution over the ten possible digits for the supplied input.\nThat is, each element of the output vector indicates the model's estimation of the probability that the digit's value matches the corresponding element index.\nFor example, if the model outputs:\n\n$$M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]$$\n\nfor some input image $\\mathbf{I}$, we interpret this to mean that the model believes the examined digit is equally likely to be a 2 or a 3.\n\nNext, we should consider how to quantify the model's loss.\nSuppose, 
for example, that the image $\\mathbf{I}$ actually corresponded to the digit \"7\" (our model made a horrible prediction!),\nhow might we penalize the model?\nIn this case, we know that the *actual* probability distribution is the following:\n\n$$\\left[0, 0, 0, 0, 0, 0, 0, 1, 0, 0\\right]$$\n\nThis is known as a \"one-hot\" encoded vector, but it may be helpful to think of it as a probability distribution given a set of events that are mutually exclusive (a digit cannot be both a \"7\" *and* a \"3\" for instance).\n\nFortunately, information theory provides us some guidance on defining an easy-to-compute loss function which quantifies the dissimilarities between two probability distributions.\nIf the probability of an event $E$ is given as $P(E)$, then the *entropy* of this event is given as $-\\log P(E)$.\nThe negation ensures that this is a non-negative quantity, and by inspection, the entropy increases as an event becomes less likely.\nConversely, in the limit as $P(E)$ approaches $1$, the entropy shrinks to $0$.\nWhile several interpretations of entropy are possible, the pertinent interpretation here is that entropy is a *measure of the information conveyed when a particular event occurs*.\nThat the \"sun rose this morning\" is a fairly mundane observation but being told \"the sun exploded\" is sure to pique your attention.\nBecause we are reasonably certain that the sun rises each morning (with near 100% confidence), that \"the sun rises\" is an event that conveys little additional information when it occurs.\n\nLet's consider next entropy in the context of a probability distribution.\nGiven a discrete random variable $X$ which can take on values $x_0, \\dots, x_{n-1}$ with\nprobabilities $p(x_0), \\dots, p(x_{n-1})$, the entropy of the random variable $X$ is defined as:\n\n$$H(X) = -\\sum_{x \\in X} p(x) \\log p(x)$$\n\nFor example, suppose $W$ is a binary random variable that represents today's weather, which can either be \"sunny\" or 
\"rainy\".\nThe entropy $H(W)$ can be given as:\n\n$$H(W) = -S\\log S - (1 - S) \\log (1 - S)$$\n\nwhere $S$ is the probability of a sunny day, and hence $1 - S$ is the probability of a rainy day.\nBecause $W$ is binary, the summation over weighted entropies expands to only two terms.\nWhat does this quantity mean?\nIf we were to describe it in words, each term of the sum in the entropy calculation corresponds to the information of a particular event, weighted by the probability of the event.\nThus, the entropy of the distribution is literally the *expected amount of information contained in an event* for a given distribution.\nIf we plot $-S\\log S - (1 - S) \\log(1 - S)$ as a function of $S$, we will see something like this:\n\n```{.matplotlib}\nimport matplotlib.pyplot as plt\nimport array as arr\nimport math as math\n\ns = arr.array('f')\ns.append(0)\nh = arr.array('f')\nh.append(0)\n\nlast = 0\nn = 30\nfor i in range(0, n):\n    last += 1 / (n + 1)\n    s.append(last)\n    h.append(-last * math.log(last) - (1 - last) * math.log(1 - last))\n\ns.append(1.0)\nh.append(0)\n\nplt.figure()\nplt.plot(s, h)\nplt.xlabel('$S$')\nplt.ylabel('$H(S) = -S\\log S - (1 - S)\\log (1 - S)$')\nplt.title('Binary Entropy')\n```\n\nAs a minor note, while $\\log 0$ is undefined, $\\lim_{p\\rightarrow 0^+} p\\log p = 0$, so information theorists adopt the convention that $0\\log 0 = 0$.\nIntuitively, the expected entropy should be unaffected by the set of impossible events.\n\nAs you might expect, when the distribution is 50-50, the uncertainty of a binary random variable is maximal,\nand by extension the amount of information contained in each event is maximized too.\nPut another way, if you lived in an area where it was always sunny, you wouldn't *learn anything*\nif someone told you it was sunny today. 
However, in a tropical region characterized by capricious weather,\ninformation conveyed about the weather is far more meaningful.\n\nIn the previous example, we weighted the event entropies according to the event's probability distribution.\nWhat would happen if, instead, we used weights corresponding to a *different* probability distribution?\nThis is known as the *cross entropy*:\n\n$$H(p, q) = -\\sum_{x \\in X} p(x)\\log q(x)$$\n\nTo get some intuition about this, first, we note that if $p(x) = q(x), \\forall x\\in X$, the cross entropy trivially matches the self-entropy.\nLet's go back to our binary entropy example and visualize what it looks like if we chose a completely *incorrect* distribution.\nSpecifically, suppose we computed the cross entropy where if the probability of a sunny day is $S$, we weight the entropy with $1 - S$ instead of $S$ as in the self-entropy formula.\n\n```{.matplotlib}\nimport matplotlib.pyplot as plt\nimport array as arr\nimport math as math\n\ns = arr.array('f')\nh = arr.array('f')\n\nlast = 0\nn = 30\nfor i in range(0, n):\n    last += 1 / (n + 1)\n    s.append(last)\n    h.append(-(1 - last) * math.log(last) - last * math.log(1 - last))\n\nplt.figure()\nplt.plot(s, h)\nplt.xlabel('$S$')\nplt.ylabel('$-(1-S)\\log S - S\\log (1 - S)$')\nplt.title('Cross entropy with mismatched distribution')\n```\n\nIf you compare the values with the previous figure, you'll see that the cross entropy diverges from the self-entropy\neverywhere except $0.5$, where $S = 1 - S$. 
The difference between the cross entropy $H(p, q)$ and entropy $H(p)$\nthen provides a *measure of error* between the presumed distribution $q$ and the true distribution $p$.\nThis difference is also known as the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)\nor KL divergence for short.\n\nGiven that the true distribution $p$ is fixed, its entropy $H(p)$ is constant.\nThis is why in practice, we will generally seek to minimize the cross entropy between $p$ and a predicted distribution $q$,\nwhich by extension will minimize the Kullback-Leibler divergence as well.\n\nNow, we have the tools to know if our model is succeeding or not! Given an estimation of a sample's label as before:\n\n$$M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]$$\n\nwe will treat our model's output as a predicted probability distribution of the sample digit's classification from 0 to 9.\nThen, we compute the cross entropy between this prediction and the true distribution, which will be in the form of a one-hot vector.\nSupposing the actual digit is 3 in this particular case ($P(3) = 1$):\n\n$$ \\sum_{x\\in \\{0,\\dots, 9\\}} -P(x) \\log Q(x) = -P(3) \\log(Q(3)) = -\\log(0.5) \\approx 0.693 $$\n\nHere $\\log$ denotes the natural logarithm; with a base-10 logarithm, the value would be approximately $0.301$.\n\nLet's make a few observations before continuing. First, for a one-hot vector, the entropy is 0 (can you see why?).\nSecond, by pretending the correct digit above is $3$ and not, say, $7$, we conveniently avoided $\\log 0$ showing up\nin the final expression. 
A common way to sidestep this singularity is to add a small $\\epsilon$ to the log argument,\nbut we'll discuss this in more detail later.\n\n### Creating our Approximation Function with a Neural Network\n\nNow that we know how to evaluate our model, we'll need to decide how to go about making predictions in the form of a probability distribution.\nOur model will need to take 28x28 images as inputs (which, as mentioned before, will be flattened to 784x1 vectors for simplicity).\nLet's enumerate the properties our model will need:\n\n1. Parameterization - our model will need parameters we can adjust to \"fit\" the model to the data\n2. Nonlinearity - it is assuredly not the case that the probability distribution can be modeled with a set of linear equations\n3. Differentiability - the gradient of our model's output with respect to any given parameter indicates the *impact* of that parameter on the final result\n\nThere are an infinite number of functions that fit these criteria, but here, we'll use a simple feedforward network with a single hidden layer.\n\n\\begin{center}\n\\begin{tikzpicture}[x=1.5cm, y=1cm, >=stealth]\n\n\\tikzset{%\n    every neuron/.style = {\n        circle,\n        draw,\n        minimum size=0.5cm\n    },\n    neuron missing/.style = {\n        draw=none,\n        scale=1.5,\n        text height=0.3333cm,\n        execute at begin node=\\color{black}$\\vdots$\n    }\n}\n\n\n\\foreach \\m/\\l [count=\\y] in {1,2,3,missing,missing,783,784}\n  \\node [every neuron/.try, neuron \\m/.try] (input-\\m) at (0,2.5-\\y) {};\n\n\\foreach \\m [count=\\y] in {1,2,3,missing,4}\n  \\node [every neuron/.try, neuron \\m/.try ] (hidden-\\m) at (2,2-\\y*1.15) {};\n\n\\foreach \\m [count=\\y] in {1,2,missing,10}\n  \\node [every neuron/.try, neuron \\m/.try ] (output-\\m) at (4,1.25-\\y) {};\n\n\\foreach \\l in {1,2,3,783,784}\n  \\draw [<-] (input-\\l) -- ++(-1,0)\n    node [above, midway] {$x_{\\l}^{[0]}$};\n\n\\foreach \\l [count=\\i] in 
{1,2,3}\n  \\node [above] at (hidden-\\i.north) {$h_\\l^{[1]}$};\n\n\\node [below] at (hidden-4.south) {$h_n^{[1]}$};\n\n\\foreach \\l in {1,2,10}\n  \\draw [->] (output-\\l) -- ++(1,0)\n    node [above, midway] {$\\hat{y}_{\\l}^{[2]}$};\n\n\\foreach \\i in {1,2,3,783,784}\n{\n  \\draw [->] (input-\\i) -- (hidden-4);\n  \\foreach \\j in {1,...,3}\n    \\draw [->] (input-\\i) -- (hidden-\\j);\n}\n\n\\foreach \\i in {1,2,3,4}\n{\n  \\draw [->] (hidden-\\i) -- (output-10);\n  \\foreach \\j in {1,...,2}\n    \\draw [->] (hidden-\\i) -- (output-\\j);\n}\n\n\\foreach \\l [count=\\x from 0] in {Input, Hidden, Output}\n  \\node [align=center, above] at (\\x*2,2) {\\l \\\\ layer};\n\n\\end{tikzpicture}\n\\end{center}\n\nA few quick notes regarding notation: a superscript of the form $[i]$ is used to denote the $i$th layer.\nA subscript is used to denote a particular element within a layer or vector. The vector $\\mathbf{x}$ is\nusually reserved for training samples, and the vector $\\mathbf{y}$ is typically reserved for sample labels\n(i.e. the desired \"answer\" for a given sample). 
The vector $\\hat{\\mathbf{y}}$ is used to denote a model's\npredicted labels for a given input.\n\nOn the far left, we have the input layer with $784$ nodes corresponding to each of the 28 by 28 pixels in an individual sample.\nEach $x_i^{[0]}$ is a floating point value between 0 and 1 inclusive.\nBecause the data is encoded with 8 bits of precision, there are 256 possible values for each input.\nEach of the 784 input values fans out to each of the nodes in the hidden layer without modification.\n\nIn the center hidden layer, we have a variable number of nodes that each receive all 784 inputs, perform some processing,\nand fan out the result to the output nodes on the far right.\nThat is, each node in the hidden layer transforms a $\\mathbb{R}^{784}$ vector into a scalar output,\nso as a whole, the $n$ nodes collectively need to map $\\mathbb{R}^{784} \\rightarrow \\mathbb{R}^n$.\nThe simplest way to do this is with an $n\\times 784$ matrix (treating inputs as column vectors).\nModeling the hidden layer this way, each of the $n$ nodes in the hidden layer is associated with a single row\nin our $\\mathbb{R}^{n\\times 784}$ matrix. Each entry of this matrix is referred to as a *weight*.\n\nWe still have two issues we need to address, however. First, a matrix provides a linear mapping between\ntwo spaces, and linear maps take $0$ to $0$ (you can visualize such maps as planes through the origin).\nThus, such fully-connected layers typically add a *bias* to each output node to turn the map into an affine map.\nThis enables the model to respond to zeroes in the input. The hidden layer as a whole now has both\na weight matrix and a bias vector. 
A linear mapping with a constant bias is commonly referred to as\nan *affine map*.\n\nThe second issue is that our hidden layer's now-affine mapping still scales linearly with the input, and one of our\nrequirements for our approximation function was nonlinearity (a strict prerequisite for universality).\nThus, we perform one final non-linear operation on the result of the affine map.\nThis is known as the *activation function*, and an infinite number of choices present themselves here.\nIn practice, the *rectifier function*, defined below, is a perennial choice.\n\n$$f(x) = \\max(0, x)$$\n\n```{.matplotlib}\nimport matplotlib.pyplot as plt\nimport array as arr\nimport math as math\n\nf = arr.array('f')\nf.append(0)\nf.append(0)\nf.append(1)\nx = arr.array('f')\nx.append(-1)\nx.append(-0)\nx.append(1)\n\nplt.figure()\nplt.plot(x, f)\nplt.xlabel('$x$')\nplt.ylabel('$\\max(0, x)$')\nplt.title('Rectifier function')\n```\n\nThe rectifier is popular for having a number of desirable properties.\n\n1. Easy to compute\n2. Easy to differentiate (except at 0, which has not been found to be a problem in practice)\n3. Sparse activation, which aids in addressing model overfitting and \"unlearning\" useful weights\n\nAs our hidden layer units will use this rectifier just before emitting their final output to the next layer,\nour hidden units may be called *rectified linear units* or ReLUs for short.\n\nSummarizing our hidden layer, the output of each unit in the layer can be written as:\n\n$$a_i^{[1]} = \\max(0, W_{i}^{[1]} \\cdot \\mathbf{x}^{[0]} + b_i^{[1]})$$\n\nIt's common to refer to the final activated output of a neural network layer as the vector $\\mathbf{a}$, and to the result of the internal\naffine map as $\\mathbf{z}$. 
Using this notation and considering the output of the hidden layer as a whole as a vector quantity, we can write:\n\n$$\n\\begin{aligned}\n\\mathbf{z}^{[1]} &= \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]} \\\\\n\\mathbf{a}^{[1]} &= \\max(\\mathbf{0}, \\mathbf{z}^{[1]}) \\\\\n\\mathbf{a}^{[1]}, \\mathbf{b}^{[1]} &\\in \\mathbb{R}^n \\\\\n\\mathbf{W}^{[1]} &\\in \\mathbb{R}^{n\\times 784} \\\\\n\\mathbf{x}^{[0]} &\\in \\mathbb{R}^{784}\n\\end{aligned}\n$$\n\nThe last layer to consider is the output layer. As with the hidden layer, we need a dimensionality transform,\nin this case, taking vectors in $\\mathbb{R}^n$ and mapping them to vectors in $\\mathbb{R}^{10}$ (corresponding to the 10 possible digits in the target output).\nAs before, we will use an affine map with the appropriately sized weight matrix and bias vector.\nHere, however, the rectifier isn't suitable as an activation function because we want to emit a probability distribution.\nTo be a valid probability distribution, each output of the output layer must be in the range $[0, 1]$, and the sum of all outputs must equal $1$.\nThe most common activation function used to achieve this is the *softmax function*:\n\n$$\\mathrm{softmax}(\\mathbf{z})_i = \\frac{\\exp(z_i)}{\\sum_j \\exp(z_j)}$$\n\nGiven a vector input $\\mathbf{z}$, each component of the softmax output (as a vector quantity) is given by the expression above.\nThe exponential function conveniently maps every real number, negative or not, to a positive number, and the denominator ensures\nthat all outputs will be between 0 and 1, and sum to 1 as desired. There are other reasons why an exponential function\nis used here, stemming from our choice of a loss function (based on the underpinning notion of maximum-likelihood estimation),\nbut we won't get into that in too much detail here (consult the further reading section at the end to learn more). 
Suffice it to say\nthat an additional benefit of the exponential function is its clean interaction with the logarithm used in our choice of\nloss function, especially when we need to compute gradients in the next section.\n\nSummarizing our neural network architecture, with two weight matrices and two bias vectors,\nwe can construct two affine maps which map vectors in $\\mathbb{R}^{784}$ to $\\mathbb{R}^n$ to $\\mathbb{R}^{10}$.\nPrior to forwarding the results of one affine map as the input of the next, we employ an activation function to add\nnon-linearity to the model. First, we use a linear rectifier and second, we use a softmax function, ensuring\nthat we end up with a nice discrete probability distribution with 10 possible events corresponding to the 10 digits.\n\nOur network is small enough that we can actually write out the entire process as a single function using the notation we've built so far:\n\n$$f(\\mathbf{x}^{[0]}) = \\hat{\\mathbf{y}}^{[2]} = \\mathrm{softmax}\\left(\\mathbf{W}^{[2]}\\left(\\max\\left(\\mathbf{0}, \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]}\\right) \\right) + \\mathbf{b}^{[2]} \\right)$$\n\n### Optimizing our network\n\nWe now have a model given above which can turn our 784-dimensional inputs into a 10-element probability distribution,\n*and* we have a way to evaluate the accuracy of each prediction.\nNext, we need a reliable way to improve the model based on the feedback provided by our loss function.\nThis is known as function *optimization*, and most methods of model optimization are based on the principle of *gradient descent*.\n\nThe idea is quite simple.\nGiven a function with a set of parameters which we'll denote $\\bm{\\theta}$, the partial derivative of that function with respect to a\ngiven parameter $\\theta_i \\in \\bm{\\theta}$ tells us the overall *impact* of $\\theta_i$ on the final result.\nIn our model, we have many parameters; each weight and bias constitutes an individually tunable parameter.\nThus, 
our strategy should be as follows: given a set of input samples, compute the loss our model produces for each sample.\nThen, compute the partial derivatives of that loss with respect to *every parameter* in our model.\nFinally, adjust each parameter in proportion to its impact on the final loss.\nMathematically, this process is described below (note that the superscript $(i)$ is used to denote the $i$-th sample):\n\n$$\n\\begin{aligned}\n\\mathrm{Total~Loss} &= \\sum_i J(\\mathbf{x}^{(i)}; \\bm\\theta) \\\\\n\\mathrm{Compute}~ &\\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)}; \\bm\\theta)}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in \\bm\\theta \\\\\n\\mathrm{Adjust}~ & \\theta_j \\rightarrow \\theta_j - \\eta \\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)}; \\bm\\theta)}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in\\bm\\theta\\\\\n\\end{aligned}\n$$\n\nHere, there is some flexibility in the choice of $\\eta$, often referred to as the *learning rate*.\nA small $\\eta$ promotes more conservative and accurate steps, but at the cost of needing more updates (and thus more training time) to converge.\nA large $\\eta$, on the other hand, results in larger updates to our model per training cycle, but may result in instability.\nUpdating in the above fashion should adjust the model such that it will produce a smaller loss given the same inputs.\n\nIn practice, the size of the input set may be very large, rendering it intractable to evaluate the model on every single\ntraining sample in the sum above before adjusting parameters.\nThus, a common strategy is to use *stochastic gradient descent* (abbrev. SGD) and perform loss-gradient-based adjustments after\nevaluating smaller batches of samples. 
Concretely, the MNIST handwritten digits database contains 60,000 training samples.\nIf we were to train our model using gradient descent in the strictest sense, we would execute the following pseudocode:\n\n```\nmodel.init()\n\nfor i in num_training_cycles\n    total_loss <- 0\n\n    for n in 60000\n        x <- MNIST.data[n]\n        y <- model.predict(x)\n        total_loss += loss(y, MNIST.labels[n])\n\n    model.gradient_descent(total_loss)\n```\n\nIn contrast, SGD pseudocode would look like:\n\n```\nmodel.init()\n\nfor i in num_batches\n    total_loss <- 0\n    for j in batch_size\n        n <- i * batch_size + j\n        x <- MNIST.data[n]\n        y <- model.predict(x)\n        total_loss += loss(y, MNIST.labels[n])\n\n    model.gradient_descent(total_loss)\n```\n\nSGD is very similar, but the batch size can be much smaller than the amount of training data available.\nThis enables the model to get more frequent updates and waste fewer cycles, especially at the start of training when the model is likely wildly inaccurate.\n\nWhen it comes time to compute the gradients, we are fortunate to have made the prescient choice of constructing our model solely with elementary functions\nin a manner conducive to relatively painless differentiation. However, we still must exercise care\nas there is plenty of bookkeeping involved. We will evaluate loss-gradients with respect to individual parameters when we walk through the implementation\nlater, but for now, let's establish a few preliminary results.\n\nRecall that our choice of loss function was the categorical cross entropy function, reproduced below:\n\n$$J_{CE}(\\mathbf{\\hat{y}}, \\mathbf{y}) = -\\sum_{i} y_i \\log{\\hat{y}_i}$$\n\nThe index $i$ is enumerated over the set of possible outcomes (i.e. the set of digits from 0 to 9).\nThe quantities $y_i$ are the elements of the one-hot label corresponding to the correct outcome, and\n$\\hat{\\mathbf{y}}$ is the discrete probability distribution emitted by our model. 
We compute $\\partial J_{CE}/\\partial \\hat{y}_i$ like so:\n\n$$\\frac{\\partial J_{CE}}{\\partial \\hat{y}_i} = -\\frac{y_i}{\\hat{y}_i}$$\n\nNotice that for a one-hot vector, this partial derivative vanishes whenever $i$ corresponds to an incorrect outcome.\n\nWorking backwards in our model, we next provide the partial derivative of the softmax function:\n\n$$\n\\begin{aligned}\n\\mathrm{softmax}(\\mathbf{z})_i &= \\frac{\\exp{z_i}}{\\sum_j \\exp{z_j}} \\\\\n\\frac{\\partial \\left(\\mathrm{softmax}(\\mathbf{z})_i\\right)}{\\partial z_k} &=\n\\begin{dcases}\n\\frac{\\left(\\sum_j\\exp{z_j}\\right)\\exp{z_i} - \\exp{2z_i}}{\\left(\\sum_j\\exp{z_j}\\right)^2}& i = k \\\\\n\\frac{-\\exp{z_i}\\exp{z_k}}{\\left(\\sum_j\\exp{z_j}\\right)^2}& i \\neq k\n\\end{dcases} \\\\\n&= \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = k \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_k & i \\neq k\n\\end{cases}\n\\end{aligned}\n$$\n\nThe last set of equations follow from factorizing and rearranging the expressions preceding it.\nIt's often confusing to newer practitioners that the partial derivative of softmax needs this\nunique treatment. The key observation is that softmax is a vector-function. It accepts a vector\nas an input and emits a vector as an output. It also \"mixes\" the input components, thereby imposing\na functional dependence of *every output component* on *every input component*. 
The lone $\\exp{z_i}$ in the numerator of the\nsoftmax equation creates an asymmetric dependence of the output component on the input components.\n\nFinally, let's consider the partial derivative of the linear rectifier.\n\n$$\n\\begin{aligned}\n\\mathrm{ReLU}(z) &= \\max(0, z) \\\\\n\\frac{\\partial \\mathrm{ReLU}(z)}{\\partial z} &=\n\\begin{cases}\n0 & z < 0 \\\\\n\\mathrm{undefined} & z = 0 \\\\\n1 & z > 0\n\\end{cases}\n\\end{aligned}\n$$\n\nWhile the partial derivative *exactly* at 0 is undefined, in practice, the derivative there is simply assigned the value 0.\nWhy the non-differentiability at 0 isn't an issue has been a subject of practical debate for a long time.\nHere is a simple line of thinking to dismiss the apparent issue. Consider a rectifier function that is nudged\n*ever so slightly* to the right such that the kink sits at $\\epsilon / 2$, where $\\epsilon$ is the\nsmallest positive floating point number the machine can represent. In this case, the model will never produce\na value that sits directly on this kink, and as far as the computer is concerned, we never encounter\na point where this function is non-differentiable. We can even imagine an infinitesimal curve that smooths out\nthe function at that kink if we want. 
Either way, experimentally, the linear rectifier remains\none of the most effective activation functions for the reasons mentioned, so we have no reason to discredit it over\na technicality.\n\nNow that we can compute partial derivatives of all the nonlinear functions in our neural network (and presumably the linear functions as well),\nwe are prepared to compute loss gradients with respect to any parameter in the network.\nOur tool of choice is the venerable chain rule of calculus:\n\n$$\\left.\\frac{\\partial f(g(x))}{\\partial x}\\right\\rvert_x = \\left.\\frac{\\partial f}{\\partial g}\\right\\rvert_{g(x)} \\left.\\frac{\\partial g}{\\partial x}\\right\\rvert_x$$\n\nThis gives us the partial derivative of a composite function $f\\circ g$ evaluated at a particular value of $x$.\nOur model itself is a series of composite functions, and as we can now compute the partials of each individual component\nin the model, we are ready to begin implementation in the next section.\n\n## Setting up\n\nOur project will leverage [CMake](https://cmake.org) as the meta-build system to support as many operating systems\nand compilers as possible. A modern C++ compiler will also be needed to compile the code. As of this writing,\nthe code has been tested with GCC 10.1.0 and Clang 10.0.0. You should feel free to simply adapt the code to your\ncompiler and build system of choice. To emphasize the independent nature of this project, *no further dependencies are needed*.\nAt your discretion, you may opt to use external testing frameworks, matrix and math libraries, data structures,\nor any other external dependency as you see fit. 
If you're a newer C++ practitioner, you are welcome to model your project's structure on the final\nproject hosted on GitHub [here](https://github.com/jeremyong/nn_in_a_weekend).\n\nIn addition, you will need the data hosted on the MNIST database website linked [here](http://yann.lecun.com/exdb/mnist/).\nThe four files available there consist of training images, training labels, test images, and test labels.\n\nIt is highly recommended that you attempt to clone the repository and get things running (instructions in the README will\nalways be kept up to date). The code presented in this article will not be completely exhaustive, but will touch on all\nthe major points, eschewing only various rudimentary helper functions or uninteresting details for brevity. Alternatively,\na valid approach may be to simply follow along with the implementation notes below and attempt to blaze your own trail. Both\nare viable approaches for learning.\n\n## Implementation\n\n### The Computational Graph\n\nThe network we will be constructing is purely sequential.\nInputs flow from left to right and the only connections made are between one layer and the layer immediately succeeding it.\nIn reality, many production-grade neural networks specialized for computer vision, natural language processing,\nand other domains rely on architectures that are non-sequential.\nExamples include ResNet, which introduces connections between layers that are not adjacent, and various recurrent\nneural networks, which have a cyclic topology (outputs of the model are fed back as inputs to the model).\nThus, it's useful to think of the model as a whole as a *computational graph*.\nWhile we won't be employing any complicated computational graph topologies here, we will still structure the\ncode with this notion in mind. Each layer of our network will be modeled as a `Node` with data flowing forwards\nand backwards through the node during training. Providing support for a fully general computational graph (i.e. 
non-sequential)\nis outside the scope of this tutorial, but some scaffolding will be provided should you want to extend it yourself\nin the future. For now, here is the interface we'll use:\n\n```c++\n#include <cstddef>\n#include <cstdint>\n#include <random>\n#include <string>\n#include <vector>\n\nusing num_t = float;\nusing rne_t = std::mt19937;\n\n// To be defined later. This class encapsulates all the nodes in our graph\nclass Model;\n\nclass Node\n{\npublic:\n    Node(Model& model, std::string name);\n\n    // Nodes are owned and destroyed through base-class pointers, so the\n    // destructor must be virtual\n    virtual ~Node() = default;\n\n    // Nodes must describe how they should be initialized\n    virtual void init(rne_t& rne) = 0;\n\n    // During forward propagation, nodes transform input data and feed results\n    // to all subsequent nodes\n    virtual void forward(num_t* inputs) = 0;\n\n    // During reverse propagation, nodes receive loss gradients with respect to\n    // their previous outputs and compute gradients with respect to each\n    // tunable parameter\n    virtual void reverse(num_t* gradients) = 0;\n\n    // If the node has tunable parameters, this method should be overridden\n    // to reflect the quantity of tunable parameters\n    virtual size_t param_count() const noexcept { return 0; }\n\n    // Accessor for parameter by index\n    virtual num_t* param(size_t index) { return nullptr; }\n\n    // Accessor for loss-gradient with respect to a parameter specified by index\n    virtual num_t* gradient(size_t index) { return nullptr; }\n\n    // Human-readable name for debugging purposes\n    std::string const& name() const noexcept { return name_; }\n\n    // Information dump for debugging purposes\n    virtual void print() const = 0;\n\nprotected:\n    friend class Model;\n\n    Model& model_;\n    std::string name_;\n    // Nodes that precede this node in the computational graph\n    std::vector<Node*> antecedents_;\n    // Nodes that succeed this node in the computational graph\n    std::vector<Node*> subsequents_;\n};\n```\n\nThe bulk of the implementation will consist of implementing this 
interface for all the nodes in our network, each of which is shown in the diagram below.\n\n\\begin{center}\n\n\\tikzstyle{block} = [rectangle, draw, text width=6em, text centered, rounded corners, minimum height=4em]\n\n\\begin{tikzpicture}[node distance = 3cm, auto]\n\n\\node [block] (MNIST) {MNIST};\n\\node [block, right of=MNIST] (hidden) {Hidden (ReLU)};\n\\node [block, right of=hidden] (output) {Output (Softmax)};\n\\node [block, right of=output, dashed] (loss) {Loss (Cross-entropy)};\n\n\\draw [->] (MNIST.10) -- (hidden.170);\n\\draw [->] (hidden.10) -- (output.170);\n\\draw [<-, dashed] (hidden.350) -- (output.190);\n\\draw [->] (output.10) -- (loss.170);\n\\draw [<-, dashed] (output.350) -- (loss.190);\n\\draw [->,dashed] (loss.south) -- ($(loss.south) + (0, -.5cm) $) -- node[below]{Label query} ($(MNIST.south) + (0, -.5cm) $)-- (MNIST.south);\n\n\\end{tikzpicture}\n\\end{center}\n\nThe first node (`MNIST`) will be responsible for acquiring new training samples and feeding them to the next layer for processing.\nIn addition, it will provide an accessor that the final categorical cross-entropy loss node will use to query the correct label for that sample (the \"label query\").\nThe hidden node will perform the affine transform and apply the linear rectification activation.\nThe output node will also perform an affine transform, but will then apply the softmax function.\nFinally, the loss node will compute the loss of the predicted distribution based on the queried label for a given sample.\n\nIn the figure above, solid arrows from left to right indicate data flow during the *feedforward* or *evaluation*\nportion of the model's execution. Each solid arrow corresponds to a data vector emitted by the source, and ingested\nby the destination. 
The dashed arrows from right to left indicate data flow during the *backpropagation* or *reverse accumulation* portion\nof the algorithm.\nThese arrows correspond to gradient vectors of the evaluated loss with respect to the outputs passed during the feedforward phase.\nFor example, as seen above, the hidden node is expected to forward data to the output node ($\\mathbf{a}^{[1]}$). Later, after the model\nprediction has been computed and the loss evaluated, the hidden node expects to receive in return the gradient of the loss with respect\nto those outputs ($\\partial J_{CE}/\\partial a^{[1]}_i$ for each $a_i^{[1]}$ in $\\mathbf{a}^{[1]}$).\n\nWhen evaluating the model (without training), the final loss node will simply be omitted from the graph.\nIn addition, no backpropagation of gradients will occur as the model parameters are ossified during evaluation.\n\nThe model class interface shown below will be used to house all the nodes in the computational graph,\nand provide various routines that are useful for operating over all constituent nodes as a collection.\n\n```c++\n#include <fstream>\n#include <memory>\n#include <string>\n#include <utility>\n#include <vector>\n\n// To be defined later\nclass Optimizer;\n\nclass Model\n{\npublic:\n    Model(std::string name);\n\n    // Add a node to the model, forwarding arguments to the node's constructor\n    template <typename Node_t, typename... T>\n    Node_t& add_node(T&&... args)\n    {\n        nodes_.emplace_back(\n            std::make_unique<Node_t>(*this, std::forward<T>(args)...));\n        // The node just added is known to be a Node_t, so a static downcast\n        // is safe here\n        return static_cast<Node_t&>(*nodes_.back());\n    }\n\n    // Create a dependency between two constituent nodes\n    void create_edge(Node& dst, Node& src);\n\n    // Initialize the parameters of all nodes with the provided seed. If the\n    // seed is 0, a new random seed is chosen instead. 
Returns the seed used.\n    rne_t::result_type init(rne_t::result_type seed = 0);\n\n    // Adjust all model parameters of constituent nodes using the\n    // provided optimizer (shown later)\n    void train(Optimizer& optimizer);\n\n    std::string const& name() const noexcept\n    {\n        return name_;\n    }\n\n    void print() const;\n\n    // Routines for saving and loading model parameters to and from disk\n    void save(std::ofstream& out);\n    void load(std::ifstream& in);\n\nprivate:\n    friend class Node;\n\n    std::string name_;\n    std::vector<std::unique_ptr<Node>> nodes_;\n};\n```\n\n### Training Data and Labels\n\nAll machine learning pipelines must consider how to ingest data and labels.\nData refers to the information the model is expected to use to make inferences and predictions.\nLabels correspond to the \"correct answer\" for each data sample, used to compute losses and train the model.\nThe interface of the MNIST data parser is shown below, implemented as a `Node` subclass.\n\n```c++\nclass MNIST : public Node\n{\npublic:\n    constexpr static size_t DIM = 28 * 28;\n\n    // The constructor receives input file streams corresponding to the\n    // data samples and labels\n    MNIST(Model& model, std::ifstream& images, std::ifstream& labels);\n\n    // This is an input node and has no parameters to initialize\n    void init(rne_t&) override {}\n\n    // Read the next sample and label and forward the data\n    void forward(num_t* data = nullptr) override;\n\n    // No optimization is done in this node so this is a no-op\n    void reverse(num_t* gradients = nullptr) override {}\n\n    void print() const override;\n\n    // Consume the next sample and label from the file streams\n    void read_next();\n\n    // Accessor for the most recently read sample\n    num_t const* data() const noexcept\n    {\n        return data_;\n    }\n\n    // Accessor for the most recently read label\n    num_t const* label() const noexcept\n    
{\n        return label_;\n    }\n    \n    // Quick ASCII visualization of the last digit read\n    void print_last();\n    \nprivate:\n    std::ifstream& images_;\n    std::ifstream& labels_;\n    uint32_t image_count_;\n\n    char buf_[DIM];\n    num_t data_[DIM];\n    num_t label_[10];\n};\n```\n\nIn the constructor, we must verify that the files passed as arguments are valid\nMNIST data and label files. Both files start with distinct \"magic values\" as a\nquick sanity check. The sample file starts with 2051 encoded as a 4-byte big-endian\nunsigned integer, whereas the label file starts with 2049. For the data file,\nthe magic number is followed by the image count and image dimensions. The\nlabel file magic number is followed by the label count (expected to match the image count).\n\nTo consume big-endian unsigned integers from the file stream, we'll use a simple routine:\n\n```c++\nvoid read_be(std::ifstream& in, uint32_t* out)\n{\n    char* buf = reinterpret_cast<char*>(out);\n    in.read(buf, 4);\n\n    std::swap(buf[0], buf[3]);\n    std::swap(buf[1], buf[2]);\n}\n```\n\nIf you happen to be using a big-endian processor, you will not need to perform the\nbyte swaps, but most desktop and mobile architectures are little-endian.\n\nThe implementation that parses the magic numbers and various other descriptors is\nproduced below:\n\n```cpp\nMNIST::MNIST(Model& model, std::ifstream& images, std::ifstream& labels)\n    : Node{model, \"MNIST input\"}\n    , images_{images}\n    , labels_{labels}\n{\n    // Confirm that passed input file streams are well-formed MNIST data sets\n    uint32_t image_magic;\n    read_be(images, &image_magic);\n    if (image_magic != 2051)\n    {\n        throw std::runtime_error{\"Images file appears to be malformed\"};\n    }\n    read_be(images, &image_count_);\n\n    uint32_t labels_magic;\n    read_be(labels, &labels_magic);\n    if (labels_magic != 2049)\n    {\n        throw std::runtime_error{\"Labels file appears to be 
malformed\"};\n    }\n\n    uint32_t label_count;\n    read_be(labels, &label_count);\n    if (label_count != image_count_)\n    {\n        throw std::runtime_error(\n            \"Label count did not match the number of images supplied\");\n    }\n\n    uint32_t rows;\n    uint32_t columns;\n    read_be(images, &rows);\n    read_be(images, &columns);\n    if (rows != 28 || columns != 28)\n    {\n        throw std::runtime_error{\n            \"Expected 28x28 images, non-MNIST data supplied\"};\n    }\n\n    printf(\"Loaded images file with %u entries\\n\", image_count_);\n}\n```\n\nNext, let's implement `MNIST::read_next`, which will consume the next\nsample and label from the file streams:\n\n```c++\nvoid MNIST::read_next()\n{\n    images_.read(buf_, DIM);\n    num_t inv = num_t{1.0} / num_t{255.0};\n    for (size_t i = 0; i != DIM; ++i)\n    {\n        data_[i] = static_cast<uint8_t>(buf_[i]) * inv;\n    }\n\n    char label;\n    labels_.read(&label, 1);\n\n    for (size_t i = 0; i != 10; ++i)\n    {\n        label_[i] = num_t{0.0};\n    }\n    label_[static_cast<uint8_t>(label)] = num_t{1.0};\n}\n```\n\nFor the labels, note that the label is encoded as a single unsigned byte,\nbut we convert it to a one-hot encoding for loss computation purposes later.\nIf your application can assume that the labels will be one-hot encoded,\nthis conversion may not be necessary and a more efficient implementation is possible.\n\nTo verify our work, let's write up a quick-and-dirty ASCII printer for the last read\ndigit and try our parser out. If you have a rendering backend (written in say, Vulkan,\nD3D12, OpenGL, etc.) 
at your disposal, you may wish to use that instead for a\ncleaner visualization.\n\n```c++\nvoid MNIST::print_last()\n{\n    for (size_t i = 0; i != 10; ++i)\n    {\n        if (label_[i] == num_t{1.0})\n        {\n            printf(\"This is a %zu:\\n\", i);\n            break;\n        }\n    }\n\n    for (size_t i = 0; i != 28; ++i)\n    {\n        size_t offset = i * 28;\n        for (size_t j = 0; j != 28; ++j)\n        {\n            if (data_[offset + j] > num_t{0.5})\n            {\n                if (data_[offset + j] > num_t{0.9})\n                {\n                    printf(\"#\");\n                }\n                else if (data_[offset + j] > num_t{0.7})\n                {\n                    printf(\"*\");\n                }\n                else\n                {\n                    printf(\".\");\n                }\n            }\n            else\n            {\n                printf(\" \");\n            }\n        }\n        printf(\"\\n\");\n    }\n    printf(\"\\n\");\n}\n```\n\nOn my machine, consuming the evaluation data and printing it produces the following result (the first\nsample from the test data is shown):\n\n```\nThis is a 7:\n                            \n       *..                  \n      *#####********.       \n          .*#*####*##.      \n                   ##       \n                   #*       \n                  ##        \n                 .##        \n                 ##         \n                .#*         \n                *#          \n                #*          \n               ##           \n              *#.           \n             *#*            \n             ##             \n            *#              \n           .##              \n           ###              \n           ##*              \n           #*\n```\n\nso we can be somewhat confident that our MNIST data ingestor is working properly. 
The only\nremaining routine we need to implement is `MNIST::forward` which should consume the\nnext sample, and forward the data to all subsequent nodes in the graph.\n\n```cpp\nvoid MNIST::forward(num_t* data)\n{\n    read_next();\n    for (Node* node : subsequents_)\n    {\n        node->forward(data_);\n    }\n}\n```\n\nSuch an interface ensures our `MNIST` node will be interoperable with networks\nthat aren't purely sequential.\n\n### The Feedforward Node\n\nThe hidden and output nodes have much in common and so will be implemented in terms of\na single feedforward node class. The feedforward node will need a configurable activation\nfunction and dimensionality. Here's the interface for the `FFNode`:\n\n```c++\nenum class Activation\n{\n    ReLU,\n    Softmax\n};\n\nclass FFNode : public Node\n{\npublic:\n    // A feedforward node is defined by the activation\n    // function and input/output dimensionality\n    FFNode(Model& model,\n           std::string name,\n           Activation activation,\n           uint16_t output_size,\n           uint16_t input_size);\n\n    void init(rne_t& rne) override;\n\n    // The input data should have size input_size_\n    void forward(num_t* inputs) override;\n\n    // The gradient data should have size output_size_\n    void reverse(num_t* gradients) override;\n\n    size_t param_count() const noexcept override\n    {\n        // Weight matrix entries + bias entries\n        return (input_size_ + 1) * output_size_;\n    }\n\n    num_t* param(size_t index);\n    num_t* gradient(size_t index);\n\n    void print() const override;\n\nprivate:\n    Activation activation_;\n    uint16_t output_size_;\n    uint16_t input_size_;\n\n    /////////////////////\n    // Node Parameters //\n    /////////////////////\n\n    // weights_.size() := output_size_ * input_size_\n    std::vector<num_t> weights_;\n    // biases_.size() := output_size_\n    std::vector<num_t> biases_;\n    // activations_.size() := output_size_\n    
std::vector<num_t> activations_;\n\n    ////////////////////\n    // Loss Gradients //\n    ////////////////////\n\n    std::vector<num_t> activation_gradients_;\n\n    // During the training cycle, parameter loss gradients are accumulated in\n    // the following buffers.\n    std::vector<num_t> weight_gradients_;\n    std::vector<num_t> bias_gradients_;\n\n    // This buffer is used to store temporary gradients used in a SINGLE\n    // backpropagation pass. Note that this does not accumulate like the weight\n    // and bias gradients do.\n    std::vector<num_t> input_gradients_;\n\n    // The last input is needed to compute loss gradients with respect to the\n    // weights during backpropagation\n    num_t* last_input_;\n};\n```\n\nCompared to the `MNIST` node, the `FFNode` uses a lot more state to track\nall tunable parameters (weight matrix elements and biases), as well as the\nloss gradients corresponding to each parameter. The loss gradients must\nbe kept because the parameters are only adjusted\nafter `N` samples have been evaluated, where `N` is the\nchosen batch size in our stochastic gradient descent algorithm. If the\npurpose of some of the class members here is still opaque, they will show up\nlater when we implement backpropagation.\n\nFirst, we must decide how to initialize the weights and biases of our node.\nWhen deciding on a scheme, there are a few key principles to keep in mind.\nFirst, the initialization must not exhibit symmetry of any sort. For example,\nif all the parameters are initialized to the same value, the loss gradients\nwith respect to all individual parameters will be identical, and our network\nwill be no better than a network with a single parameter. In addition, we\ndo not want the parameters to be initialized such that they are too large,\nor too small. 
Most papers that discuss weight initialization strive to ensure\nthat the loss gradients remain in a realm where floating point numbers retain\nprecision (in the range $[1, 2)$). The other criterion is that parameters\nshould generally be initialized such that they are roughly similar in magnitude.\nParameters that deviate too far from the mean are likely to either dominate\nloss gradients, or produce too small a signal to contribute. Proper parameter\ninitialization is but a small part of addressing the larger problem common\nin neural networks known as the problem of *exploding and vanishing gradients*.\nHere, we present the implementation with a couple of references if you wish to\ndig deeper.\n\n```c++\nvoid FFNode::init(rne_t& rne)\n{\n    num_t sigma;\n    switch (activation_)\n    {\n    case Activation::ReLU:\n        // Kaiming He, et al. weight initialization for ReLU networks\n        // https://arxiv.org/pdf/1502.01852.pdf\n        //\n        // Suggests using a normal distribution with variance := 2 / n_in\n        sigma = std::sqrt(2.0 / static_cast<num_t>(input_size_));\n        break;\n    case Activation::Softmax:\n    default:\n        // LeCun initialization as suggested in \"Self-Normalizing Neural\n        // Networks\"\n        // https://arxiv.org/pdf/1706.02515.pdf\n        sigma = std::sqrt(1.0 / static_cast<num_t>(input_size_));\n        break;\n    }\n\n    // NOTE: Unfortunately, the C++ standard does not guarantee that the results\n    // obtained from a distribution function will be identical given the same\n    // inputs across different compilers and platforms. A production ML\n    // framework will likely implement its own distributions to provide\n    // deterministic results.\n    auto dist = std::normal_distribution<num_t>{0.0, sigma};\n\n    for (num_t& w : weights_)\n    {\n        w = dist(rne);\n    }\n\n    // NOTE: Setting biases to zero is a common practice, as is initializing the\n    // bias to a small value (e.g. 
on the order of 0.01). It is unclear if the\n    // latter produces a consistent result over the former, but the thinking is\n    // that a non-zero bias will ensure that the neuron always \"fires\" at the\n    // beginning to produce a signal.\n    //\n    // Here, we initialize all biases to a small number, but the reader should\n    // consider experimenting with other approaches.\n    for (num_t& b : biases_)\n    {\n        b = 0.01;\n    }\n}\n```\n\nThe common theme is that the distribution of random weights scales roughly\nas the inverse square root of the input vector size. This way, the distribution\nof the node's output will fall in a \"nice\" range with respect to floating-point\nprecision. Other initialization schemes are of course possible, and in some cases\ncritical depending on the choice of activation function.\n\nWith weights and biases initialized, it's time to implement `FFNode::forward`.\nThe straightforward plan is, for both the ReLU and softmax nodes, first perform the\naffine transform $\\mathbf{W}\\mathbf{x} + \\mathbf{b}$, then perform the activation function\nwhich will be one of the linear rectifier or the softmax function. 
Here's what this\nlooks like:\n\n\n```c++\nvoid FFNode::forward(num_t* inputs)\n{\n    // Remember the last input data for backpropagation later\n    last_input_ = inputs;\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // For each output vector, compute the dot product of the input data\n        // with the weight vector add the bias\n\n        num_t z{0.0};\n\n        size_t offset = i * input_size_;\n\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            z += weights_[offset + j] * inputs[j];\n        }\n        // Add neuron bias\n        z += biases_[i];\n\n        switch (activation_)\n        {\n        case Activation::ReLU:\n            activations_[i] = std::max(z, num_t{0.0});\n            break;\n        case Activation::Softmax:\n        default:\n            activations_[i] = std::exp(z);\n            break;\n        }\n    }\n\n    if (activation_ == Activation::Softmax)\n    {\n        // softmax(z)_i = exp(z_i) / \\sum_j(exp(z_j))\n        num_t sum_exp_z{0.0};\n        for (size_t i = 0; i != output_size_; ++i)\n        {\n            // NOTE: with exploding gradients, it is quite easy for this\n            // exponential function to overflow, which will result in NaNs\n            // infecting the network.\n            sum_exp_z += activations_[i];\n        }\n        num_t inv_sum_exp_z = num_t{1.0} / sum_exp_z;\n        for (size_t i = 0; i != output_size_; ++i)\n        {\n            activations_[i] *= inv_sum_exp_z;\n        }\n    }\n\n    // Forward activation data to all subsequent nodes in the computational\n    // graph\n    for (Node* subsequent : subsequents_)\n    {\n        subsequent->forward(activations_.data());\n    }\n}\n```\n\nAs before, we forward all final results to all subsequent nodes even though there\nwill only be a single subsequent node in this case. 
Whenever writing code\nas above, it is prudent to consider all potential corner cases which could result\nin the myriad issues that arise in floating-point computation:\n\n- Loss of precision\n- Floating point overflow and underflow\n- Divide by zero\n\nLoss of precision occurs easily in a number of situations, such as subtracting\ntwo quantities of similar size, or adding and multiplying quantities with\ngreatly different magnitudes. Floating point overflow and underflow occur\ntypically when repeatedly performing an operation such that an accumulator explodes\nto $\\infty$ or $-\\infty$. In this case, the use of `std::exp` is one operation\nthat sticks out. We will not implement a stable softmax here, but the following identity\ncan be used to improve its stability should you need it:\n\n$$\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i = \\mathrm{softmax}(\\mathbf{z})_i$$\n\nIn this expression, $\\mathbf{C}$ is a constant vector where all its elements are\nequal in value, denoted $C$. Expanding the definition of softmax in the LHS gives:\n\n$$\n\\begin{aligned}\n\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i &= \\frac{\\exp{(z_i + C)}}{\\sum_j\\exp{(z_j + C)}} \\\\\n&= \\frac\n    {\\exp{z_i}\\exp{C}}\n    {\\left(\\sum_j\\exp{z_j}\\right)\\exp C} \\\\\n&= \\mathrm{softmax}(\\mathbf{z})_i && \\blacksquare\n\\end{aligned}\n$$\n\nThus, if we are concerned about saturating `std::exp` with a large argument, we can simply set\n$C$ to be the additive inverse of the largest element $z_i$ within $\\mathbf{z}$. Performing this\nshift each time we apply softmax keeps every argument to `std::exp` at or below zero, so the\nexponentials cannot overflow (elements far below the maximum may underflow to zero, which is generally harmless).\n\nAs a practical implementor's trick, it is possible to enable floating point exception traps\nso that the program traps as soon as a `NaN` is generated in a floating point register. 
Using libc\nfor example, we can trap floating point exceptions using\n\n```c++\n#include <cfenv>\n\nfeenableexcept(FE_INVALID | FE_OVERFLOW);\n```\n\nIt is also possible to trap exceptions specifically in regions where you anticipate\na potential issue (which enhances the overall throughput of the network). In the interest\nof brevity, please consult your compiler's documentation for how to do this.\n\nOne observation you might have made is the first line of our routine.\n\n```c++\nlast_input_ = inputs;\n```\n\nHere, we retain a pointer to the data ingested by the feedforward node for a full training\ncycle. Before delving into any derivations, let's first present the code for the backpropagation\nof gradients through our feedforward node and dissect it immediately afterwards.\n\n```c++\nvoid FFNode::reverse(num_t* gradients)\n{\n    // First, we compute dJ/dz as dJ/dg(z) * dg(z)/dz and store it in our\n    // activations array\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // dg(z)/dz\n        num_t activation_grad{0.0};\n        switch (activation_)\n        {\n        case Activation::ReLU:\n            if (activations_[i] > num_t{0.0})\n            {\n                activation_grad = num_t{1.0};\n            }\n            else\n            {\n                activation_grad = num_t{0.0};\n            }\n            // dJ/dz = dJ/dg(z) * dg(z)/dz\n            activation_gradients_[i] = gradients[i] * activation_grad;\n            break;\n        case Activation::Softmax:\n        default:\n            for (size_t j = 0; j != output_size_; ++j)\n            {\n                if (i == j)\n                {\n                    activation_grad += activations_[i]\n                                       * (num_t{1.0} - activations_[i])\n                                       * gradients[j];\n                }\n                else\n                {\n                    activation_grad\n                        += -activations_[i] * activations_[j] 
* gradients[j];\n                }\n            }\n\n            activation_gradients_[i] = activation_grad;\n            break;\n        }\n    }\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // dJ/db_i = dJ/dg(z_i) * dJ(g_i)/dz_i.\n        bias_gradients_[i] += activation_gradients_[i];\n    }\n\n    std::fill(input_gradients_.begin(), input_gradients_.end(), 0);\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        size_t offset = i * input_size_;\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            input_gradients_[j]\n                += weights_[offset + j] * activation_gradients_[i];\n        }\n    }\n\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        for (size_t j = 0; j != output_size_; ++j)\n        {\n            weight_gradients_[j * input_size_ + i]\n                += last_input_[i] * activation_gradients_[j];\n        }\n    }\n\n    for (Node* node : antecedents_)\n    {\n        node->reverse(input_gradients_.data());\n    }\n}\n```\n\nThis code is likely more difficult to digest, so let's break it down into parts.\nDuring reverse accumulation (aka backpropagation), we will be given the loss gradients with\nrespect to all of the outputs from the most recent forward pass, written mathematically\nas $\\partial J_{CE}/\\partial a_i$ for each output scalar $a_i$.\nGiven that information, we need to perform the following tasks:\n\n1. Compute $\\partial J_{CE}/\\partial w_{ij}$ for each weight in our weight matrix\n2. Compute $\\partial J_{CE}/\\partial b_i$ for each bias in our bias vector\n3. Compute $\\partial J_{CE}/\\partial x_i$ for each input scalar in the most recent forward pass \n4. 
Propagate all the loss gradients with respect to the inputs in step 3 back to the antecedent nodes\n\nAs all outputs pass through an activation function, we will need\nto compute $\\partial J_{CE}/\\partial g(\\mathbf{z})_i$ where $g$ is one of the linear rectifier or softmax function\ncorresponding to a particular component of the output vector.\nBoth derivatives are computed in the background section, so we'll just recite the results here.\nFor the linear rectifier, $\\partial J_{CE}/\\partial g(\\mathbf{z})_i$ will simply be 1 if $a_i \\neq 0$,\nand 0 otherwise. The softmax gradient is slightly more involved, but because every output\nof the softmax contributes additively to the loss, we require a sum of gradients here:\n\n$$\\frac{\\partial J_{CE}}{\\partial \\mathrm{softmax}(\\mathbf{z})_i} = \\frac{\\partial J_{CE}}{\\partial a_i}\\sum_{j} \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = j \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j & i \\neq j\n\\end{cases}$$\n\nThe factor $\\partial J_{CE}/\\partial a_i$ comes from the chain rule and is passed in from the subsequent node.\nThese intermediate expressions are computed, scaled by $\\partial a_i/\\partial z_i$, and then stored in `activation_gradients_` in the top portion\nof `FFNode::reverse`.\nEquivalently by the chain rule, we are caching in `activation_gradients_` $\\partial J_{CE}/\\partial z_i$ for each $i$.\nBecause the loss gradients with respect to every parameter and input have a functional dependence on the activation function\ngradients, all results computed in tasks 1 through 4 above will depend on `activation_gradients_`.\n\n#### Computing bias gradients\n\nThe bias gradients are the easiest to compute due to how they show up in the expression.\nSince a node's output is given as\n\n$$a_i = g\\left(\\mathbf{W}_i \\cdot \\mathbf{x} + b_i = z_i\\right)$$\n\nfor some activation function $g$, the derivative with 
respect to $b_i$ is just\n\n$$\n\\begin{aligned}\n\\frac{\\partial{a_i}}{\\partial b_i} &= \\frac{\\partial g}{\\partial z_i}\\frac{\\partial z_i}{\\partial b_i} \\\\\n&= \\frac{\\partial g}{\\partial z_i}\n\\end{aligned}\n$$\n\nThus we can simply accumulate the result stored in `activation_gradients_` as the loss gradient with\nrespect to each bias. Please take note! The code that performs this update is\n\n```c++\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        bias_gradients_[i] += activation_gradients_[i];\n    }\n```\n\nThe following code would *not* be correct:\n\n\n```c++\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // NOTE: WRONG! Only correct for batch sizes of 1\n        bias_gradients_[i] = activation_gradients_[i];\n    }\n```\n\nAs the admonition in the comment suggests, while it's helpful to conceptualize the loss gradient\nas something that resets every time we perform a forward and reverse pass of a training sample,\nin actuality, we require the gradients with respect to the *cumulative mean loss accrued while evaluating\nthe entire batch* for stochastic gradient descent. Luckily, because the losses per sample accumulate\nadditively, the gradients of the loss with respect to all parameters in the model also update\nadditively.\n\n#### Computing the weight gradients\n\nThe weight gradients are slightly more involved than the bias gradients, but are still relatively\neasy to compute with a bit of bookkeeping. For any given weight $w_{ij}$, we can observe that such\na weight participates only in the evaluation of $z_i$. 
That is:\n\n$$\n\\begin{aligned}\n\\frac{\\partial \\mathbf{z}}{\\partial w_{ij}} &= \\frac{\\partial z_i}{\\partial w_{ij}} \\\\\n&= \\frac{\\partial (\\mathbf{w}_{i} \\cdot \\mathbf{x}) + b_i}{\\partial w_{ij}} \\\\\n&= x_j \\\\\n\\end{aligned}\n$$\n\n$$\n\\boxed{\\frac{\\partial J_{CE}}{\\partial w_{ij}} = \\frac{\\partial J_{CE}}{\\partial a_i}\\frac{\\partial a_i}{\\partial z_i}x_j}\n$$\n\nThe boxed result shows the final loss gradient with respect to a weight parameter.\nThe weight gradient accumulation appears in the following code, where all $N \\times M$\nweights are updated in a couple of nested loops:\n\n\n```c++\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        for (size_t j = 0; j != output_size_; ++j)\n        {\n            weight_gradients_[j * input_size_ + i]\n                += last_input_[i] * activation_gradients_[j];\n        }\n    }\n```\n\n#### Computing the input gradients\n\nThe last set of gradients we need to compute are the loss gradients with respect to the\ninputs, to be forwarded to the antecedent node. This calculation is similar to the\ncalculation of the weight gradients in terms of the linear dependence. However, it is\nimportant to note that a given input participates in the computation of *all* output\nscalars. 
Thus, we expect each individual input gradient to be a summation.\n\n$$\n\\frac{\\partial J_{CE}}{\\partial x_i} = \\sum_j \\frac{\\partial J_{CE}}{\\partial a_j}\\frac{\\partial a_j}{\\partial z_j}w_{ji}\n$$\n\nThe code that computes the input gradients is defined here:\n\n```c++\n    std::fill(input_gradients_.begin(), input_gradients_.end(), 0);\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        size_t offset = i * input_size_;\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            input_gradients_[j]\n                += weights_[offset + j] * activation_gradients_[i];\n        }\n    }\n```\n\nNote that unlike the weight and bias gradients which accumulate while training an entire\nbatch of samples, the input gradients here are ephemeral and reset every pass since\nthey only depend on the evaluation of an individual sample.\n\nFinally, to complete the `FFNode::reverse` method, the input gradients computed\nare passed backwards for use in an antecedent node's gradient update (reproduced below). 
The code as\npresented *does not work* with non-sequential computational graphs, but is meant\nto provide a starting point for further experimentation.\n\n```c++\n    for (Node* node : antecedents_)\n    {\n        node->reverse(input_gradients_.data());\n    }\n```\n\n### The Categorical Cross-Entropy Loss Node\n\nThe last node we need to implement is the node which computes the categorical cross-entropy of the prediction.\nA possible class definition for such a node is shown below:\n\n```c++\nclass CCELossNode : public Node\n{\npublic:\n    CCELossNode(Model& model,\n                std::string name,\n                uint16_t input_size,\n                size_t batch_size);\n\n    // No initialization is needed for this node\n    void init(rne_t&) override {}\n\n    void forward(num_t* inputs) override;\n\n    // As a loss node, the argument to this method is ignored (the gradient of\n    // the loss with respect to itself is unity)\n    void reverse(num_t* gradients = nullptr) override;\n\n    void print() const override;\n\n    // During training, this must be set to the expected target distribution\n    // for a given sample\n    void set_target(num_t const* target)\n    {\n        target_ = target;\n    }\n\n    num_t accuracy() const;\n    num_t avg_loss() const;\n    void reset_score();\n\nprivate:\n    uint16_t input_size_;\n\n    // We minimize the average loss, not the net loss so that the losses\n    // produced do not scale with batch size (which allows us to keep training\n    // parameters constant)\n    num_t inv_batch_size_;\n    num_t loss_;\n    num_t const* target_;\n    num_t* last_input_;\n    // Stores the last active classification in the target one-hot encoding\n    size_t active_;\n    num_t cumulative_loss_{0.0};\n    // Store running counts of correct and incorrect predictions\n    size_t correct_   = 0;\n    size_t incorrect_ = 0;\n    std::vector<num_t> gradients_;\n};\n```\n\nThe `CCELossNode` is similar to other nodes in that it 
implements a forward pass\nfor computing the loss of a given sample, and a reverse pass to compute gradients\nof that loss and pass them back to the antecedent node. Distinct from the previous\nnodes is that the argument to `CCELossNode::reverse` is ignored as the loss node\nis not expected to have any subsequents.\n\nThe implementation of `CCELossNode::forward` follows from the definition of cross-entropy,\nrecalled here with some modifications:\n\n$$J_{CE}(\\hat{\\mathbf{y}}, \\mathbf{y}) = -\\sum_j y_j \\log{\\left(\\max(\\hat{y}_j, \\epsilon) \\right)} $$\n\n$J$ is the common symbol ascribed to the cost or objective function, while $\\hat{y}$ and $y$ refer\nto the predicted distribution and correct distribution respectively.\nIn addition, the argument of the logarithm is clamped with a small $\\epsilon$\nto avoid a numerical singularity. The implementation is as follows:\n\n```c++\nvoid CCELossNode::forward(num_t* data)\n{\n    num_t max{0.0};\n    size_t max_index;\n\n    loss_ = num_t{0.0};\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        if (data[i] > max)\n        {\n            max_index = i;\n            max       = data[i];\n        }\n\n        loss_ -= target_[i]\n                 * std::log(\n                     std::max(data[i], std::numeric_limits<num_t>::epsilon()));\n\n        if (target_[i] != num_t{0.0})\n        {\n            active_ = i;\n        }\n    }\n\n    if (max_index == active_)\n    {\n        ++correct_;\n    }\n    else\n    {\n        ++incorrect_;\n    }\n\n    cumulative_loss_ += loss_;\n\n    // Store the data pointer to compute gradients later\n    last_input_ = data;\n}\n```\n\nAs with the feedforward node, a pointer to the inputs to the node is preserved to compute gradients\nlater. A bit of bookkeeping is also done so we can track accuracy and accumulate loss during\nbatch. 
The derivative of the loss of an individual sample with respect to the inputs is also\nfairly straightforward.\n\n$$\n\\begin{aligned}\n\\frac{\\partial J_{CE}}{\\partial{\\hat{y}_i}} &= \\frac{\\partial \\left(-\\sum_j y_j\\log{\\left(\\max(\\hat{y}_j, \\epsilon)\\right)}\\right)}{\\partial \\hat{y}_i} \\\\\n&= -\\frac{y_i}{\\max(\\hat{y}_i, \\epsilon)}\n\\end{aligned}\n$$\n\nThe implementation is similarly straightforward. As with the other nodes with loss gradients, the loss gradients\nwith respect to all inputs are forwarded to antecedent nodes.\n\n```c++\nvoid CCELossNode::reverse(num_t* data)\n{\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        gradients_[i] = -inv_batch_size_ * target_[i]\n            / std::max(last_input_[i], std::numeric_limits<num_t>::epsilon());\n    }\n\n    for (Node* node : antecedents_)\n    {\n        node->reverse(gradients_.data());\n    }\n}\n```\n\nOne thing to keep in mind here is that this implementation is *not* the most efficient implementation\npossible for a softmax layer feeding to a cross-entropy loss function by any stretch.\nThe code and derivation here is completely general for arbitrary sample probability distributions.\nIf, however, we can assume that the target distribution is one-hot encoded, then all gradients\nin this node will either be 0 or $-1/\\hat{y}_k$ where $k$ is the active label in the one-hot target.\nUpon substitution in the previous layer, it should be\nclear that important cancellations are possible that dramatically simplify the gradient computations\nin the softmax layer. 
Here's the simplification, again assuming that the $k$th index is the correct\nlabel:\n\n$$\n\\begin{aligned}\n\\frac{\\partial J_{CE}}{\\partial \\mathrm{softmax}(\\mathbf{z})_i} &= \\frac{\\partial J_{CE}}{\\partial a_i}\\sum_{j} \\begin{cases}\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = j \\\\\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j & i \\neq j\n\\end{cases} \\\\\n&= \\begin{dcases}\n-\\frac{\\mathrm{softmax}(\\mathbf{z})_k(1 - \\mathrm{softmax}(\\mathbf{z})_k)}{\\mathrm{softmax}(\\mathbf{z})_k} & i = k \\\\\n\\frac{\\mathrm{softmax}(\\mathbf{z})_i\\mathrm{softmax}(\\mathbf{z})_k}{\\mathrm{softmax}(\\mathbf{z})_k} & i \\neq k\\\\\n\\end{dcases} \\\\\n&= \\begin{dcases}\n\\mathrm{softmax}(\\mathbf{z})_k - 1 & i = k \\\\\n\\mathrm{softmax}(\\mathbf{z})_i & i \\neq k\n\\end{dcases}\n\\end{aligned}\n$$\n\nWhen following the computation above, remember that $\\partial J_{CE} / \\partial a_i$ is 0 for all $i \\neq k$.\nThus, the only term in the sum that survives is the term corresponding to $j = k$, at which point\nwe break out the differentiation depending on whether $i = k$ or $i \\neq k$.\n\nThis is an elegant result! Essentially, the gradient of the loss with respect to an emitted\nprobability $p(x)$ is simply $p(x)$ if $x$ was not the correct label, and $p(x) - 1$ if it was.\nConsidering the effect of gradient descent, this should check out with our intuition. The optimizer\nseeks to suppress probabilities predicted that should have been 0, and increase probabilities predicted\nthat should have been 1. 
Check for yourself that after gradient descent is performed, the gradients\nderived here will nudge the model in the appropriate direction.\n\nThis sort of optimization highlights an important observation about backpropagation,\nnamely, that backpropagation does not guarantee any sort of optimality beyond a worst-case performance ceiling.\nSeveral production neural networks have architectures that employ heuristics to identify optimizations\nsuch as this one, but the problem of generating a perfect computational strategy is NP and so not\ncovered here. The code provided here will remain in the general form, despite being slower in the\ninterest of maintaining generality and not adding complexity, but you are encouraged to consider\nabstractions to permit this type of optimization in your own architecture (a useful keyword to\naid your research is *common subexpression elimination* or *CSE* for short).\n\nThe last thing we need to provide for `CCELossNode` are a few helper routines:\n\n```c++\nvoid CCELossNode::print() const\n{\n    std::printf(\"Avg Loss: %f\\t%f%% correct\\n\", avg_loss(), accuracy() * 100.0);\n}\n\nnum_t CCELossNode::accuracy() const\n{\n    return static_cast<num_t>(correct_)\n           / static_cast<num_t>(correct_ + incorrect_);\n}\nnum_t CCELossNode::avg_loss() const\n{\n    return cumulative_loss_ / static_cast<num_t>(correct_ + incorrect_);\n}\n\nvoid CCELossNode::reset_score()\n{\n    cumulative_loss_ = num_t{0.0};\n    correct_         = 0;\n    incorrect_       = 0;\n}\n```\n\nThese routines let us observe the performance of our network during training in terms\nof both loss and accuracy.\n\n### Gradient Descent Optimizer\n\nAt some point after loss gradients with respect to model parameters have accumulated,\nthe gradients will need to be used to actually adjust the parameters themselves. 
This is provided\nby the `GDOptimizer` class implemented as below:\n\n```c++\nclass GDOptimizer : public Optimizer\n{\npublic:\n    // \"Eta\" is the commonly accepted character used to denote the learning\n    // rate. Given a loss gradient dJ/dp for some parameter p, during gradient\n    // descent, p will be adjusted such that p' = p - eta * dJ/dp.\n    GDOptimizer(num_t eta) : eta_{eta} {}\n\n    // This should be invoked at the end of each batch's evaluation. The\n    // interface technically permits the use of different optimizers for\n    // different segments of the computational graph.\n    void train(Node& node) override;\n\nprivate:\n    num_t eta_;\n};\n\nvoid GDOptimizer::train(Node& node)\n{\n    size_t param_count = node.param_count();\n    for (size_t i = 0; i != param_count; ++i)\n    {\n        num_t& param    = *node.param(i);\n        num_t& gradient = *node.gradient(i);\n\n        param = param - eta_ * gradient;\n\n        // Reset the gradient which will be accumulated again in the next\n        // training epoch\n        gradient = num_t{0.0};\n    }\n}\n```\n\nNot shown is the `Optimizer` class interface which simply provides a virtual `train` method.\nAs you implement more sophisticated optimizers, you will find that more state may be needed\nto perform necessary tasks (e.g. computing gradient moving averages). Also implicit in this\nimplementation is that our `Node` classes need to provide an indexing scheme for each parameter\nas well as an accessor for the total number of parameters. 
For example, accessing the `FFNode` parameters\nis a fairly simple matter:\n\n```c++\nnum_t* FFNode::param(size_t index)\n{\n    if (index < weights_.size())\n    {\n        return &weights_[index];\n    }\n    return &biases_[index - weights_.size()];\n}\n```\n\nThe parameters are indexed 0 through the return value of `Node::param_count()` minus one.\nNote that the optimizer doesn't care whether the parameter accessed in this way is a weight, bias, average, etc.\nAs a trainable parameter, the only thing that matters during gradient descent is the current value\nand the loss gradient.\n\n## Tying it all Together\n\nNow that we have the individual nodes implemented, all that remains is to wire things up and start\ntraining! This is how we can construct a model with input, hidden, output, and loss nodes,\nall wired sequentially.\n\n```c++\n    Model model{\"ff\"};\n\n    // The input and loss nodes are held by pointer since the training loop\n    // invokes them directly\n    MNIST* mnist = &model.add_node<MNIST>(images, labels);\n\n    FFNode& hidden = model.add_node<FFNode>(\"hidden\", Activation::ReLU, 32, 784);\n\n    FFNode& output\n        = model.add_node<FFNode>(\"output\", Activation::Softmax, 10, 32);\n\n    CCELossNode* loss = &model.add_node<CCELossNode>(\"loss\", 10, batch_size);\n    loss->set_target(mnist->label());\n\n    model.create_edge(hidden, *mnist);\n    model.create_edge(output, hidden);\n    model.create_edge(*loss, output);\n    \n    // This function should visit all constituent nodes and initialize\n    // their parameters\n    model.init();\n    \n    // Create a gradient descent optimizer with a hardcoded learning rate\n    GDOptimizer optimizer{num_t{0.3}};\n```\n\nAs mentioned before, the \"edges\" are somewhat cosmetic as none of our nodes actually support\nmultiple node inputs or outputs. An actual implementation that would support such a non-sequential\ntopology will likely need a sort of signals and slots abstraction. 
The interface provided\nhere is strictly to impress on you the importance of the abstraction of our neural network\nas a computational graph, which is critical when additional complexity is added later.\n\nWith this, we are ready to implement the core loop of the training algorithm.\n\n```c++\n    for (size_t i = 0; i != 256; ++i)\n    {\n        for (size_t j = 0; j != 64; ++j)\n        {\n            mnist.forward();\n            loss.reverse();\n        }\n\n        model.train(optimizer);\n    }\n```\n\nHere, we train our model over 256 batches. Each batch consists of 64 samples, and for each\nsample, we invoke `MNIST::forward` and `CCELossNode::reverse`. During the forward pass,\nour `MNIST` node extracts a new sample and label and forwards the sample data to the next node.\nThis data propagates through the network until the final output distribution is passed to the\nloss node and losses are computed. All this occurs within the single line: `mnist.forward()`.\nIn the subsequent line, gradients are computed and passed back until the reverse accumulation\nterminates at the `MNIST` node again. After all gradients for the batch are accumulated, the\nmodel can `train`, which invokes the optimizer on each node in turn to adjust all of the\nmodel's parameters.\n\nAfter adding some additional logging, the results of the network look like this:\n\n```\nExecuting training routine\nLoaded images file with 60000 entries\nhidden: 784 -> 32\noutput: 32 -> 10\nInitializing model parameters with seed: 116726080\nAvg Loss: 0.254111\t96.875000% correct\n```\n\nTo evaluate the efficacy of the model, we can serialize all the parameters to disk, load them\nup, disable the training step, and evaluate the model on the test data. 
For this particular\nrun, the results were as follows:\n\n```\nExecuting evaluation routine\nLoaded images file with 10000 entries\nhidden: 784 -> 32\noutput: 32 -> 10\nAvg Loss: 0.292608\t91.009998% correct\n```\n\nAs you can see, the accuracy dropped on the test data relative to the training data. This\nis a hallmark characteristic of *overfitting*, which is to be expected given that we\nhaven't implemented any regularization whatsoever! That said, 91% accuracy isn't all that bad\nwhen we consider the fact that our model has no notion of pixel-adjacency whatsoever. For\nimage data, convolutional networks are a far more apt architecture than the one chosen for\nthis demonstration.\n\n### Regularization\n\nRegularization will not be implemented as part of this self-contained neural network, but\nit is such a fundamental part of most deep learning frameworks that we'll discuss it here.\n\nOften, the dimensionality of our model will be much higher than what is strictly needed to make\naccurate predictions. This stems from the fact that we seldom know a priori how many features\nare needed for the model to be successful. Thus, the likelihood of overfitting increases as more training\ndata is fed into the model. The primary tool to combat overfitting is *regularization*.\nLoosely speaking, regularization is any strategy employed to restrict the hypothesis\nspace of fit-functions the model can occupy to prevent overfitting.\n\nWhat is meant by restricting the hypothesis space, you might ask? The idea is to consider\nthe entire family of functions spanned by the model's entire parameter vector.\nIf our model has 10000 parameters (many networks will easily exceed this), each unique 10000-dimensional vector\ncorresponds to a possible solution. 
However, we know it's unlikely that certain parameters\nshould be vastly greater in magnitude than others in a theoretically *optimal* condition.\nModels with \"strange\" parameter vectors that are unlikely to be the optimal solution are likely\nconverged on as a result of overfitting.\nTherefore, it makes sense to consider ways to constrain the space this parameter vector may occupy.\n\nThe most common approach to achieve this is to add an additional penalty term to the loss function\nwhich is a function of the weights. For example, here is the cross-entropy loss with the so-called $L^2$\nregularizer (also known as the ridge regularizer) added:\n\n\n$$-\\sum_{x\\in X} y_x \\log{\\hat{y}_x} + \\frac{\\lambda}{2} \\mathbf{w}^{T}\\mathbf{w}$$\n\nIn a slight abuse of notation, $\\mathbf{w}$ here corresponds to a vector containing every\nweight in our network. The factor $\\lambda$ is a constant we can choose to adjust the penalty\nsize. Note that when a regularizer is used, we *expect training loss to increase*. The tradeoff\nis that we simultaneously *expect test loss to decrease*. Tuning the regularization rate $\\lambda$\nis a routine problem for model fitting in the wild.\n\nBy modifying the loss function, in principle, all loss gradients must change as well. Fortunately,\nas we've only added a quadratic term to the loss,\nthe only change to the gradient will be an additional linear term $\\lambda\\mathbf{w}$.\nThis means we don't have to add a ton of code to modify all the gradient calculations thus far.\nInstead, we can simply *decay* the weight based on a percentage of the weight's magnitude\nwhen we adjust the weight after each batch is performed. You will often hear this type of regularization\nreferred to as simply *weight decay* for this reason.\n\nTo implement $L^2$ regularization, simply add a percentage of a weight's value to its loss gradient.\nCrucially, do not adjust bias parameters in the same way. 
We only wish to penalize parameters for\nwhich increased magnitude corresponds with more complex models. Bias parameters are simply scalar\noffsets, regardless of their value, and do not scale the inputs. Thus, attempting to regularize them\nwill likely increase *both* training and test error.\n\n## Where to go from here\n\nAt this point, our toy network is complete. With any luck, you've taken away a few key patterns\nthat will aid in both your intuition about how deep learning techniques work, and your efforts\nto actually implement them. The implementation presented here is both far from complete, and\nfar from ideal. Critically missing is adequate visualization for the error rate as a function of\ntraining time, mis-predicted samples, and the model parameters themselves. Without visualization,\nmodel tuning can be time consuming, verging on impossible. In addition, our model training samples\nare always ingested in the order they are provided in the training file. In practice, this sequence\nshould be shuffled to avoid introducing training bias.\n\nHere are a few additional things you can try, in no particular order.\n\n- Add various regularization modes such as $L^2$, $L^1$, or dropout.\n- Track loss reduction momentum to implement *early stopping*, thereby reducing wasted training cycles.\n- Implement a convolution node with a variable-sized weight filter. 
You will likely need to implement the max-pooling operation as well.\n- Implement a batch-normalization node.\n- Modify the interfaces provided here so that `Node::forward` and `Node::reverse` also pass slot ids to handle nodes with multiple inputs and outputs.\n- Leverage the slots abstraction above to implement a residual network.\n- Improve efficiency by adding support for SIMD or GPU-based compute kernels.\n- Add multithreading to allow separate batches to be trained simultaneously.\n- Provide alternative optimizers that decay the learning rate over time, or decay the learning rate as a function of loss momentum.\n- Add a \"meta-training\" feature that can tune *hyperparameters* used to configure your model (e.g. learning rate, regularization rate, network depth, layer dimension).\n- Pick a research paper you're interested in and endeavor to implement it end to end.\n\nAs you can see, the sky's the limit and there is simply no end to the amount of work possible to\nimprove a neural network's ability to learn and make inferences. A good body of work is also\nthere to improve tooling around data ingestion, model configuration serialization, automated testing,\ncontinuous learning in the cloud, etc. Crucially though, new research and development is constantly\nin the works in this ever-changing field. 
On top of studying deep learning as a discipline in and of itself,\nthere is plenty of room for specialization in particular domains, be it computer vision, NLP, epidemiology, or something else.\nMy hope is that for some of you, the neural network in a weekend may take the form of a neural network in a fulfilling career or lifetime.\n\n#### Further Reading\n\nIf you get a single book, *Deep Learning* (listed first in the following table) is highly recommended as a relatively self-contained text with cogent explanations written in a readable style.\nAs you venture into attempting to perform ML tasks in a particular domain, search for a relatively recent highly cited \"survey\" paper, which should introduce\nyou to the main ideas and give you a starting point for further research. [Here](https://arxiv.org/pdf/1907.09408.pdf) is an example of one such survey paper,\nin this case with an emphasis on object detection.\n\n|Title|Authors|Description|\n|---|---|---|\n|*Deep Learning*|Ian Goodfellow, Yoshua Bengio, and Aaron Courville|Seminal text on the theory and practice of using neural networks to learn and perform tasks|\n|*Numerical Methods for Scientists and Engineers*|R. W. Hamming|Excellent general text covering important topics such as floating point precision and various approximation methods|\n|*Standard notations for Deep Learning* ([link](https://cs230.stanford.edu/files/Notation.pdf))|Stanford CS230 Course Notes|Cheatsheet covering standard notation used by many texts and papers|\n|*Neural Networks and Deep Learning* ([link](http://neuralnetworksanddeeplearning.com/index.html))|Michael Nielsen|A gentler introduction to the theory and practice of neural networks|\n|*Automatic Differentiation in Machine Learning: a Survey* ([link](https://arxiv.org/pdf/1502.05767.pdf))|Atılım Güneş Baydin, Barak A. 
Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind|Excellent survey paper documenting the various algorithms used for computational differentiation including viable alternatives to backpropagation|\n"
  },
  {
    "path": "doc/DOC.tex",
    "content": "% Options for packages loaded elsewhere\r\n\\PassOptionsToPackage{unicode}{hyperref}\r\n\\PassOptionsToPackage{hyphens}{url}\r\n%\r\n\\documentclass[\r\n]{article}\r\n\\usepackage{lmodern}\r\n\\usepackage{amssymb,amsmath}\r\n\\usepackage{ifxetex,ifluatex}\r\n\\ifnum 0\\ifxetex 1\\fi\\ifluatex 1\\fi=0 % if pdftex\r\n  \\usepackage[T1]{fontenc}\r\n  \\usepackage[utf8]{inputenc}\r\n  \\usepackage{textcomp} % provide euro and other symbols\r\n\\else % if luatex or xetex\r\n  \\usepackage{unicode-math}\r\n  \\defaultfontfeatures{Scale=MatchLowercase}\r\n  \\defaultfontfeatures[\\rmfamily]{Ligatures=TeX,Scale=1}\r\n\\fi\r\n% Use upquote if available, for straight quotes in verbatim environments\r\n\\IfFileExists{upquote.sty}{\\usepackage{upquote}}{}\r\n\\IfFileExists{microtype.sty}{% use microtype if available\r\n  \\usepackage[]{microtype}\r\n  \\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts\r\n}{}\r\n\\makeatletter\r\n\\@ifundefined{KOMAClassName}{% if non-KOMA class\r\n  \\IfFileExists{parskip.sty}{%\r\n    \\usepackage{parskip}\r\n  }{% else\r\n    \\setlength{\\parindent}{0pt}\r\n    \\setlength{\\parskip}{6pt plus 2pt minus 1pt}}\r\n}{% if KOMA class\r\n  \\KOMAoptions{parskip=half}}\r\n\\makeatother\r\n\\usepackage{xcolor}\r\n\\IfFileExists{xurl.sty}{\\usepackage{xurl}}{} % add URL line breaks if available\r\n\\IfFileExists{bookmark.sty}{\\usepackage{bookmark}}{\\usepackage{hyperref}}\r\n\\hypersetup{\r\n  pdftitle={C++ Neural Network in a Weekend},\r\n  pdfauthor={Jeremy Ong},\r\n  hidelinks,\r\n  pdfcreator={LaTeX via pandoc}}\r\n\\urlstyle{same} % disable monospaced font for URLs\r\n\\usepackage{color}\r\n\\usepackage{fancyvrb}\r\n\\newcommand{\\VerbBar}{|}\r\n\\newcommand{\\VERB}{\\Verb[commandchars=\\\\\\{\\}]}\r\n\\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\\\\{\\}}\r\n% Add ',fontsize=\\small' for more characters per 
line\r\n\\newenvironment{Shaded}{}{}\r\n\\newcommand{\\AlertTok}[1]{\\textcolor[rgb]{1.00,0.00,0.00}{\\textbf{#1}}}\r\n\\newcommand{\\AnnotationTok}[1]{\\textcolor[rgb]{0.38,0.63,0.69}{\\textbf{\\textit{#1}}}}\r\n\\newcommand{\\AttributeTok}[1]{\\textcolor[rgb]{0.49,0.56,0.16}{#1}}\r\n\\newcommand{\\BaseNTok}[1]{\\textcolor[rgb]{0.25,0.63,0.44}{#1}}\r\n\\newcommand{\\BuiltInTok}[1]{#1}\r\n\\newcommand{\\CharTok}[1]{\\textcolor[rgb]{0.25,0.44,0.63}{#1}}\r\n\\newcommand{\\CommentTok}[1]{\\textcolor[rgb]{0.38,0.63,0.69}{\\textit{#1}}}\r\n\\newcommand{\\CommentVarTok}[1]{\\textcolor[rgb]{0.38,0.63,0.69}{\\textbf{\\textit{#1}}}}\r\n\\newcommand{\\ConstantTok}[1]{\\textcolor[rgb]{0.53,0.00,0.00}{#1}}\r\n\\newcommand{\\ControlFlowTok}[1]{\\textcolor[rgb]{0.00,0.44,0.13}{\\textbf{#1}}}\r\n\\newcommand{\\DataTypeTok}[1]{\\textcolor[rgb]{0.56,0.13,0.00}{#1}}\r\n\\newcommand{\\DecValTok}[1]{\\textcolor[rgb]{0.25,0.63,0.44}{#1}}\r\n\\newcommand{\\DocumentationTok}[1]{\\textcolor[rgb]{0.73,0.13,0.13}{\\textit{#1}}}\r\n\\newcommand{\\ErrorTok}[1]{\\textcolor[rgb]{1.00,0.00,0.00}{\\textbf{#1}}}\r\n\\newcommand{\\ExtensionTok}[1]{#1}\r\n\\newcommand{\\FloatTok}[1]{\\textcolor[rgb]{0.25,0.63,0.44}{#1}}\r\n\\newcommand{\\FunctionTok}[1]{\\textcolor[rgb]{0.02,0.16,0.49}{#1}}\r\n\\newcommand{\\ImportTok}[1]{#1}\r\n\\newcommand{\\InformationTok}[1]{\\textcolor[rgb]{0.38,0.63,0.69}{\\textbf{\\textit{#1}}}}\r\n\\newcommand{\\KeywordTok}[1]{\\textcolor[rgb]{0.00,0.44,0.13}{\\textbf{#1}}}\r\n\\newcommand{\\NormalTok}[1]{#1}\r\n\\newcommand{\\OperatorTok}[1]{\\textcolor[rgb]{0.40,0.40,0.40}{#1}}\r\n\\newcommand{\\OtherTok}[1]{\\textcolor[rgb]{0.00,0.44,0.13}{#1}}\r\n\\newcommand{\\PreprocessorTok}[1]{\\textcolor[rgb]{0.74,0.48,0.00}{#1}}\r\n\\newcommand{\\RegionMarkerTok}[1]{#1}\r\n\\newcommand{\\SpecialCharTok}[1]{\\textcolor[rgb]{0.25,0.44,0.63}{#1}}\r\n\\newcommand{\\SpecialStringTok}[1]{\\textcolor[rgb]{0.73,0.40,0.53}{#1}}\r\n\\newcommand{\\StringTok}[1]{\\textcolor[rgb]{0.25,0.44,0.63}
{#1}}\r\n\\newcommand{\\VariableTok}[1]{\\textcolor[rgb]{0.10,0.09,0.49}{#1}}\r\n\\newcommand{\\VerbatimStringTok}[1]{\\textcolor[rgb]{0.25,0.44,0.63}{#1}}\r\n\\newcommand{\\WarningTok}[1]{\\textcolor[rgb]{0.38,0.63,0.69}{\\textbf{\\textit{#1}}}}\r\n\\usepackage{longtable,booktabs}\r\n% Correct order of tables after \\paragraph or \\subparagraph\r\n\\usepackage{etoolbox}\r\n\\makeatletter\r\n\\patchcmd\\longtable{\\par}{\\if@noskipsec\\mbox{}\\fi\\par}{}{}\r\n\\makeatother\r\n% Allow footnotes in longtable head/foot\r\n\\IfFileExists{footnotehyper.sty}{\\usepackage{footnotehyper}}{\\usepackage{footnote}}\r\n\\makesavenoteenv{longtable}\r\n\\usepackage{graphicx}\r\n\\makeatletter\r\n\\def\\maxwidth{\\ifdim\\Gin@nat@width>\\linewidth\\linewidth\\else\\Gin@nat@width\\fi}\r\n\\def\\maxheight{\\ifdim\\Gin@nat@height>\\textheight\\textheight\\else\\Gin@nat@height\\fi}\r\n\\makeatother\r\n% Scale images if necessary, so that they will not overflow the page\r\n% margins by default, and it is still possible to overwrite the defaults\r\n% using explicit options in \\includegraphics[width, height, ...]{}\r\n\\setkeys{Gin}{width=\\maxwidth,height=\\maxheight,keepaspectratio}\r\n% Set default figure placement to htbp\r\n\\makeatletter\r\n\\def\\fps@figure{htbp}\r\n\\makeatother\r\n\\setlength{\\emergencystretch}{3em} % prevent overfull lines\r\n\\providecommand{\\tightlist}{%\r\n  \\setlength{\\itemsep}{0pt}\\setlength{\\parskip}{0pt}}\r\n\\setcounter{secnumdepth}{-\\maxdimen} % remove section numbering\r\n\\usepackage{amsmath}\r\n\\usepackage{tikz}\r\n\\usepackage{mathtools}\r\n\\usepackage{amsthm}\r\n\\usepackage{amssymb}\r\n\\usepackage{bm}\r\n\\usetikzlibrary{positioning}\r\n\\usetikzlibrary{arrows}\r\n\\usetikzlibrary{shapes}\r\n\\usetikzlibrary{calc}\r\n\\ifluatex\r\n  \\usepackage{selnolig}  % disable illegal ligatures\r\n\\fi\r\n\r\n\\title{C++ Neural Network in a Weekend}\r\n\\author{Jeremy 
Ong}\r\n\\date{}\r\n\r\n\\begin{document}\r\n\\maketitle\r\n\r\n\\hypertarget{introduction}{%\r\n\\subsection{Introduction}\\label{introduction}}\r\n\r\nWould you like to write a neural network from start to finish? Are you\r\nperhaps shaky on some of the fundamental concepts and derivations, such\r\nas categorical cross-entropy loss or backpropagation? Alternatively,\r\nwould you like an introduction to machine learning without relying on\r\n``magical'' frameworks that seem to perform AI miracles with only a few\r\nlines of code (and just as little intuition)? If so, this article was\r\nwritten for you.\r\n\r\nDeep learning as a technology and discipline has been booming. Nearly\r\nevery facet of deep learning is teeming with progress and healthy\r\ncompetition to achieve state of the art performance and efficiency. It's\r\nno surprise that resources tend to emphasize the ``latest and greatest''\r\nin feats such as object recognition, natural language parsing, ``deep\r\nfakes'', and more. In contrast, fewer resources expand as much on the\r\npractical \\emph{engineering} aspects of deep learning. That is, how\r\nshould a deep learning framework be structured? How do you go about\r\nrolling your own infrastructure instead of relying on Keras, Pytorch,\r\nTensorflow, or any of the other dominant frameworks? Whether you wish to\r\nwrite your own for learning purposes, or if you need to deploy a neural\r\nnetwork on a constrained (i.e.~embedded) device, there is plenty to be\r\ngained from authoring a neural network from scratch.\r\n\r\nThe neural network outlined here is hosted on\r\n\\href{https://github.com/jeremyong/cpp_nn_in_a_weekend}{github} and has\r\nenough abstractions to vaguely resemble a production network, without\r\nbeing overly engineered as to be indigestible in a sitting or two. The\r\ntraining and test data provided is the venerable\r\n\\href{http://yann.lecun.com/exdb/mnist/}{MNIST} dataset of handwritten\r\ndigits. 
While more exotic (and original) datasets exist, MNIST is chosen\r\nhere because its sheer ubiquity guarantees you can find corresponding\r\nliterature to help drive further experimentation, or troubleshoot when\r\nthings go wrong.\r\n\r\n\\hypertarget{background}{%\r\n\\subsection{Background}\\label{background}}\r\n\r\nThis section serves as a moderately high-level description of the major\r\nmathematical underpinnings of neural networks and may be safely skipped\r\nby those who prefer to jump straight to the code.\r\n\r\nSuppose we have a task we would like a machine learning model to\r\ncomplete (e.g.~recognizing handwritten digits). At a high level, we need\r\nto perform the following tasks:\r\n\r\n\\begin{enumerate}\r\n\\def\\labelenumi{\\arabic{enumi}.}\r\n\\tightlist\r\n\\item\r\n  First, we must conceptualize the task as a ``function'' such that the\r\n  inputs and outputs of the task can be described in a concrete\r\n  mathematical sense (amenable for programmability).\r\n\\item\r\n  Second, we need a way to quantify the degree to which our model is\r\n  performing poorly against a known set of correct answers. 
This is\r\n  typically denoted as the \\emph{loss} or \\emph{objective} function of\r\n  the model.\r\n\\item\r\n  Third, we need an \\emph{optimization strategy} which will describe how\r\n  to adjust the model after feedback is provided regarding the model's\r\n  performance as per the loss function described above.\r\n\\item\r\n  Fourth, we need a \\emph{regularization strategy} to address\r\n  inadvertently tuning the model with a high degree of specificity to\r\n  our training data, at the cost of generalized performance when\r\n  handling inputs not yet encountered.\r\n\\item\r\n  Fifth, we need an \\emph{architecture} for our model, including how\r\n  inputs are transformed into outputs and an enumeration of all the\r\n  adjustable parameters the model supports.\r\n\\item\r\n  Finally, we need a robust \\emph{implementation} that executes the\r\n  above within memory and execution budgets, accounting for\r\n  floating-point stability, reproducibility, and a number of other\r\n  engineering-related matters.\r\n\\end{enumerate}\r\n\r\n\\emph{Deep learning} is distinct from other machine learning models in\r\nthat the architecture is heavily over-parameterized and based on simpler\r\n\\emph{building blocks} as opposed to bespoke components. The building\r\nblocks used are neurons, or particular arrangements of neurons,\r\ntypically organized as layers. Over the course of training a deep\r\nlearning model, it is expected that \\emph{features} of the inputs are\r\nlearned and manifested as various parameter values in these neurons.\r\nThis is in contrast to traditional machine learning, where features are\r\nnot learned, but implemented directly.\r\n\r\n\\hypertarget{categorical-cross-entropy-loss}{%\r\n\\subsubsection{Categorical Cross-Entropy\r\nLoss}\\label{categorical-cross-entropy-loss}}\r\n\r\nMore concretely, the task at hand is to train a model to recognize a 28\r\nby 28 pixel handwritten greyscale digit. 
For simplicity, our model will\r\ninterpret the data as a flattened 784-dimensional vector. Instead of\r\ndescribing the architecture of the model first, we'll start with\r\nunderstanding what the model should output and how to assess the model's\r\nperformance. The output of our model will be a 10-dimensional vector,\r\nrepresenting the probability distribution of the supplied input. That\r\nis, each element of the output vector indicates the model's estimation\r\nof the probability that the digit's value matches the corresponding\r\nelement index. For example, if the model outputs:\r\n\r\n\\[M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]\\]\r\n\r\nfor some input image \\(\\mathbf{I}\\), we interpret this to mean that the\r\nmodel believes there is an equal chance that the examined digit is a 2\r\nor a 3.\r\n\r\nNext, we should consider how to quantify the model's loss. Suppose, for\r\nexample, that the image \\(\\mathbf{I}\\) actually corresponded to the\r\ndigit ``7'' (our model made a horrible prediction!), how might we\r\npenalize the model? In this case, we know that the \\emph{actual}\r\nprobability distribution is the following:\r\n\r\n\\[\\left[0, 0, 0, 0, 0, 0, 0, 1, 0, 0\\right]\\]\r\n\r\nThis is known as a ``one-hot'' encoded vector, but it may be helpful to\r\nthink of it as a probability distribution given a set of events that are\r\nmutually exclusive (a digit cannot be both a ``7'' \\emph{and} a ``3''\r\nfor instance).\r\n\r\nFortunately, information theory provides us some guidance on defining an\r\neasy-to-compute loss function which quantifies the dissimilarities\r\nbetween two probability distributions. If the probability of an event\r\n\\(E\\) is given as \\(P(E)\\), then the \\emph{entropy} of this event is\r\ngiven as \\(-\\log P(E)\\). The negation ensures that this is a positive\r\nquantity, and by inspection, the entropy increases as an event becomes\r\nless likely. 
Conversely, in the limit as \\(P(E)\\) approaches \\(1\\), the\r\nentropy shrinks to \\(0\\). While several interpretations of entropy are\r\npossible, the pertinent interpretation here is that entropy is a\r\n\\emph{measure of the information conveyed when a particular event\r\noccurs}. That the ``sun rose this morning'' is a fairly mundane\r\nobservation but being told ``the sun exploded'' is sure to pique your\r\nattention. Because we are reasonably certain that the sun rises each\r\nmorning (with near 100\\% confidence), that ``the sun rises'' is an event\r\nthat conveys little additional information when it occurs.\r\n\r\nLet's consider next entropy in the context of a probability\r\ndistribution. Given a discrete random variable \\(X\\) which can take on\r\nvalues \\(x_0, \\dots, x_{n-1}\\) with probabilities\r\n\\(p(x_0), \\dots, p(x_{n-1})\\), the entropy of the random variable \\(X\\)\r\nis defined as:\r\n\r\n\\[H(X) = -\\sum_{x \\in X} p(x) \\log p(x)\\]\r\n\r\nFor example, suppose \\(W\\) is a binary random variable that represents\r\ntoday's weather which can either be ``sunny'' or ``rainy'' (a binary\r\nrandom variable). The entropy \\(H(W)\\) can be given as:\r\n\r\n\\[H(W) = -S\\log S - (1 - S) \\log (1 - S)\\]\r\n\r\nwhere \\(S\\) is the probability of a sunny day, and hence \\(1 - S\\) is\r\nthe probability of a rainy day. As a binary random variable, the\r\nsummation over weighted entropies expands to only two terms. What does\r\nthis quantity mean? If we were to describe it in words, each term of the\r\nsum in the entropy calculation corresponds to the information of a\r\nparticular event, weighted by the probability of the event. Thus, the\r\nentropy of the distribution is literally the \\emph{expected amount of\r\ninformation contained in an event} for a given distribution. 
If we plot\r\n\\(-S\\log S - (1 - S) \\log(1 - S)\\) as a function of \\(S\\), we will see\r\nsomething like this:\r\n\r\n\\begin{figure}\r\n\\centering\r\n\\includegraphics{plots/6094492350593652429.png}\r\n\\caption{}\r\n\\end{figure}\r\n\r\nAs a minor note, while \\(\\log 0\\) is an undefined quantity, information\r\ntheorists accept that \\(\\lim_{p\\rightarrow 0} p\\log p = 0\\) by\r\nconvention. Intuitively, the expected entropy should be unaffected by\r\nthe set of impossible events.\r\n\r\nAs you might expect, when the distribution is 50-50, the uncertainty of\r\na binary random variable is maximal, and by extension the amount of\r\ninformation contained in each event is maximized too. Put another way, if you lived\r\nin an area where it was always sunny, you wouldn't \\emph{learn anything}\r\nif someone told you it was sunny today. However, in a tropical region\r\ncharacterized by capricious weather, information conveyed about the\r\nweather is far more meaningful.\r\n\r\nIn the previous example, we weighted the event entropies according to\r\nthe event's probability distribution. What would happen if, instead, we\r\nused weights corresponding to a \\emph{different} probability\r\ndistribution? This is known as the \\emph{cross entropy}:\r\n\r\n\\[H(p, q) = -\\sum_{x \\in X} p(x)\\log q(x)\\]\r\n\r\nTo get some intuition about this, first, we note that if\r\n\\(p(x) = q(x), \\forall x\\in X\\), the cross entropy trivially matches the\r\nself-entropy. Let's go back to our binary entropy example and visualize\r\nwhat it looks like if we chose a completely \\emph{incorrect}\r\ndistribution. 
Specifically, suppose we computed the cross entropy where\r\nif the probability of a sunny day is \\(S\\), we weight the entropy with\r\n\\(1 - S\\) instead of \\(S\\) as in the self-entropy formula.\r\n\r\n\\begin{figure}\r\n\\centering\r\n\\includegraphics{plots/-6767785830879840565.png}\r\n\\caption{}\r\n\\end{figure}\r\n\r\nIf you compare the values with the previous figure, you'll see that the\r\ncross entropy diverges from the self-entropy everywhere except \\(0.5\\),\r\nwhere \\(S = 1 - S\\). The difference between the cross entropy\r\n\\(H(p, q)\\) and entropy \\(H(p)\\) then provides a \\emph{measure of\r\nerror} between the presumed distribution \\(q\\) and the true distribution\r\n\\(p\\). This difference is also known as the\r\n\\href{https://en.wikipedia.org/wiki/Kullback\\%E2\\%80\\%93Leibler_divergence}{Kullback-Leibler\r\ndivergence} or KL divergence for short.\r\n\r\nGiven that the true probability distribution \\(p\\) is fixed, its\r\nentropy \\(H(p)\\) is constant as well. This is why in\r\npractice, we will generally seek to minimize the cross entropy between\r\n\\(p\\) and a predicted distribution \\(q\\), which by extension will\r\nminimize the Kullback-Leibler divergence as well.\r\n\r\nNow, we have the tools to know if our model is succeeding or not! Given\r\nan estimation of a sample's label as before:\r\n\r\n\\[M(\\mathbf{I}) = \\left[0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0\\right]\\]\r\n\r\nwe will treat our model's output as a predicted probability distribution\r\nof the sample digit's classification from 0 to 9. Then, we compute the\r\ncross entropy between this prediction and the true distribution, which\r\nwill be in the form of a one-hot vector. Supposing the actual digit is 3\r\nin this particular case (\\(P(3) = 1\\)):\r\n\r\n\\[ \\sum_{x\\in \\{0,\\dots, 9\\}} -P(x) \\log Q(x) = -P(3) \\log(Q(3)) = -\\log(0.5) \\approx 0.301 \\]\r\n\r\n(The numeric value shown corresponds to a base-10 logarithm.)\r\n\r\nLet's make a few observations before continuing. 
First, for a one-hot\r\nvector, the entropy is 0 (can you see why?). Second, by pretending the\r\ncorrect digit above is \\(3\\) and not, say, \\(7\\), we conveniently\r\navoided \\(\\log 0\\) showing up in the final expression. A common way\r\nto avoid this singularity is to add a small \\(\\epsilon\\) to the log\r\nargument, but we'll discuss this in more detail later.\r\n\r\n\\hypertarget{creating-our-approximation-function-with-a-neural-network}{%\r\n\\subsubsection{Creating our Approximation Function with a Neural\r\nNetwork}\\label{creating-our-approximation-function-with-a-neural-network}}\r\n\r\nNow that we know how to evaluate our model, we'll need to decide how to\r\ngo about making predictions in the form of a probability distribution.\r\nOur model will need to take as inputs, 28x28 images (which as mentioned\r\nbefore, will be flattened to 784x1 vectors for simplicity). Let's\r\nenumerate the properties our model will need:\r\n\r\n\\begin{enumerate}\r\n\\def\\labelenumi{\\arabic{enumi}.}\r\n\\tightlist\r\n\\item\r\n  Parameterization - our model will need parameters we can adjust to\r\n  ``fit'' the model to the data\r\n\\item\r\n  Nonlinearity - it is assuredly not the case that the probability\r\n  distribution can be modeled with a set of linear equations\r\n\\item\r\n  Differentiability - the gradient of our model's output with respect to\r\n  any given parameter indicates the \\emph{impact} of that parameter on\r\n  the final result\r\n\\end{enumerate}\r\n\r\nThere are an infinite number of functions that fit these criteria, but\r\nhere, we'll use a simple feedforward network with a single hidden layer.\r\n\r\n\\begin{center}\r\n\\begin{tikzpicture}[x=1.5cm, y=1cm, >=stealth]\r\n\r\n\\tikzset{%\r\n    every neuron/.style = {\r\n        circle,\r\n        draw,\r\n        minimum size=0.5cm\r\n    },\r\n    neuron missing/.style = {\r\n        draw=none,\r\n        scale=1.5,\r\n        text height=0.3333cm,\r\n        execute at 
begin node=\\color{black}$\\vdots$\r\n    }\r\n}\r\n\r\n\r\n\\foreach \\m/\\l [count=\\y] in {1,2,3,missing,missing,783,784}\r\n  \\node [every neuron/.try, neuron \\m/.try] (input-\\m) at (0,2.5-\\y) {};\r\n\r\n\\foreach \\m [count=\\y] in {1,2,3,missing,4}\r\n  \\node [every neuron/.try, neuron \\m/.try ] (hidden-\\m) at (2,2-\\y*1.15) {};\r\n\r\n\\foreach \\m [count=\\y] in {1,2,missing,10}\r\n  \\node [every neuron/.try, neuron \\m/.try ] (output-\\m) at (4,1.25-\\y) {};\r\n\r\n\\foreach \\l in {1,2,3,783,784}\r\n  \\draw [<-] (input-\\l) -- ++(-1,0)\r\n    node [above, midway] {$x_{\\l}^{[0]}$};\r\n\r\n\\foreach \\l [count=\\i] in {1,2,3}\r\n  \\node [above] at (hidden-\\i.north) {$h_\\l^{[1]}$};\r\n\r\n\\node [below] at (hidden-4.south) {$h_n^{[1]}$};\r\n\r\n\\foreach \\l in {1,2,10}\r\n  \\draw [->] (output-\\l) -- ++(1,0)\r\n    node [above, midway] {$\\hat{y}_{\\l}^{[2]}$};\r\n\r\n\\foreach \\i in {1,2,3,783,784}\r\n{\r\n  \\draw [->] (input-\\i) -- (hidden-4);\r\n  \\foreach \\j in {1,...,3}\r\n    \\draw [->] (input-\\i) -- (hidden-\\j);\r\n}\r\n\r\n\\foreach \\i in {1,2,3,4}\r\n{\r\n  \\draw [->] (hidden-\\i) -- (output-10);\r\n  \\foreach \\j in {1,...,2}\r\n    \\draw [->] (hidden-\\i) -- (output-\\j);\r\n}\r\n\r\n\\foreach \\l [count=\\x from 0] in {Input, Hidden, Output}\r\n  \\node [align=center, above] at (\\x*2,2) {\\l \\\\ layer};\r\n\r\n\\end{tikzpicture}\r\n\\end{center}\r\n\r\nA few quick notes regarding notation: a superscript of the form \\([i]\\)\r\nis used to denote the \\(i\\)th layer. A subscript is used to denote a\r\nparticular element within a layer or vector. The vector \\(\\mathbf{x}\\)\r\nis usually reserved for training samples, and the vector \\(\\mathbf{y}\\)\r\nis typically reserved for sample labels (i.e.~the desired ``answer'' for\r\na given sample). 
The vector \\(\\hat{\\mathbf{y}}\\) is used to denote a\r\nmodel's predicted labels for a given input.\r\n\r\nOn the far left, we have the input layer with \\(784\\) nodes,\r\none for each of the 28 by 28 pixels in an individual sample.\r\nEach \\(x_i^{[0]}\\) is a floating point value between 0 and 1 inclusive.\r\nBecause the data is encoded with 8 bits of precision, there are 256\r\npossible values for each input. Each of the 784 input values fans out to\r\neach of the nodes in the hidden layer without modification.\r\n\r\nIn the center hidden layer, we have a variable number of nodes that each\r\nreceive all 784 inputs, perform some processing, and fan out the result\r\nto the output nodes on the far right. That is, each node in the hidden\r\nlayer transforms a \\(\\mathbb{R}^{784}\\) vector into a scalar output, so\r\nas a whole, the \\(n\\) nodes collectively need to map\r\n\\(\\mathbb{R}^{784} \\rightarrow \\mathbb{R}^n\\). The simplest way to do this is\r\nwith an \\(n\\times 784\\) matrix (treating inputs as column vectors).\r\nModeling the hidden layer this way, each of the \\(n\\) nodes in the\r\nhidden layer is associated with a single row in our\r\n\\(\\mathbb{R}^{n\\times 784}\\) matrix. Each entry of this matrix is\r\nreferred to as a \\emph{weight}.\r\n\r\nWe still have two issues we need to address, however. First, a matrix\r\nprovides a linear mapping between two spaces, and linear maps take \\(0\\)\r\nto \\(0\\) (you can visualize such maps as planes through the origin).\r\nThus, such fully-connected layers typically add a \\emph{bias} to each\r\noutput node to turn the map into an affine map. This enables the model\r\nto produce a nonzero response to zeroes in the input. Thus, the hidden\r\nlayer as a whole now has both a weight matrix and a bias vector. 
A linear mapping with\r\na constant bias is commonly referred to as an \\emph{affine map}.\r\n\r\nThe second issue is that our hidden layer's now-affine mapping still\r\nscales linearly with the input, and one of our requirements for our\r\napproximation function was nonlinearity (a strict prerequisite for\r\nuniversality). Thus, we perform one final non-linear operation on the\r\nresult of the affine map. This is known as the \\emph{activation\r\nfunction}, and an infinite number of choices present themselves here. In\r\npractice, the \\emph{rectifier function}, defined below, is a perennial\r\nchoice.\r\n\r\n\\[f(x) = \\max(0, x)\\]\r\n\r\n\\begin{figure}\r\n\\centering\r\n\\includegraphics{plots/-1637788021081228918.png}\r\n\\caption{}\r\n\\end{figure}\r\n\r\nThe rectifier is popular for having a number of desirable properties.\r\n\r\n\\begin{enumerate}\r\n\\def\\labelenumi{\\arabic{enumi}.}\r\n\\tightlist\r\n\\item\r\n  Easy to compute\r\n\\item\r\n  Easy to differentiate (except at 0, which has not been found to be a\r\n  problem in practice)\r\n\\item\r\n  Sparse activation, which aids in addressing model overfitting and\r\n  ``unlearning'' useful weights\r\n\\end{enumerate}\r\n\r\nAs our hidden layer units will use this rectifier just before emitting\r\ntheir final output to the next layer, our hidden units may be called\r\n\\emph{rectified linear units} or ReLUs for short.\r\n\r\nSummarizing our hidden layer, the output of each unit in the layer can\r\nbe written as:\r\n\r\n\\[a_i^{[1]} = \\max(0, W_{i}^{[1]} \\cdot \\mathbf{x}^{[0]} + b_i^{[1]})\\]\r\n\r\nIt's common to refer to the final activated output of a neural network\r\nlayer as the vector \\(\\mathbf{a}\\), and the result of the internal\r\naffine map as \\(\\mathbf{z}\\). 
Using this notation and considering the\r\noutput of the hidden layer as a whole as a vector quantity, we can\r\nwrite:\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\mathbf{z}^{[1]} &= \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]} \\\\\r\n\\mathbf{a}^{[1]} &= \\max(\\mathbf{0}, \\mathbf{z}^{[1]}) \\\\\r\n\\mathbf{a}^{[1]}, \\mathbf{b}^{[1]} &\\in \\mathbb{R}^n \\\\\r\n\\mathbf{W}^{[1]} &\\in \\mathbb{R}^{n\\times 784} \\\\\r\n\\mathbf{x}^{[0]} &\\in \\mathbb{R}^{784}\r\n\\end{aligned}\r\n\\]\r\n\r\nThe last layer to consider is the output layer. As with the hidden\r\nlayer, we need a dimensionality transform, in this case, taking vectors\r\nin \\(\\mathbb{R}^n\\) and mapping them to vectors in \\(\\mathbb{R}^{10}\\)\r\n(corresponding to the 10 possible digits in the target output). As\r\nbefore, we will use an affine map with the appropriately sized weight\r\nmatrix and bias vector. Here, however, the rectifier isn't suitable as\r\nan activation function because we want to emit a probability\r\ndistribution. To be a valid probability distribution, each output of the\r\noutput layer must be in the range \\([0, 1]\\), and the sum of all outputs\r\nmust equal \\(1\\). The most common activation function used to achieve\r\nthis is the \\emph{softmax function}:\r\n\r\n\\[\\mathrm{softmax}(\\mathbf{z})_i = \\frac{\\exp(z_i)}{\\sum_j \\exp(z_j)}\\]\r\n\r\nGiven a vector input \\(\\mathbf{z}\\), each component of the softmax output (as a\r\nvector quantity) is given by the expression above. The exponential\r\nfunctions conveniently map negative numbers to positive numbers, and the\r\ndenominator ensures that all outputs will be between 0 and 1, and sum to\r\n1 as desired. There are other reasons why an exponential function is\r\nused here, stemming from our choice of a loss function (based on the\r\nunderpinning notion of maximum-likelihood estimation), but we won't get\r\ninto that in too much detail here (consult the further reading section\r\nat the end to learn more). 
Suffice it to say that an additional benefit\r\nof the exponential function is its clean interaction with the logarithm\r\nused in our choice of loss function, especially when we need to\r\ncompute gradients in the next section.\r\n\r\nSummarizing our neural network architecture, with two weight matrices\r\nand two bias vectors, we can construct two affine maps which map vectors\r\nin \\(\\mathbb{R}^{784}\\) to \\(\\mathbb{R}^n\\) to \\(\\mathbb{R}^{10}\\).\r\nAfter each affine map, we employ an activation function to add\r\nnon-linearity to the model. First, we use a linear rectifier and second,\r\nwe use a softmax function, ensuring that we end up with a nice discrete\r\nprobability distribution with 10 possible events corresponding to the 10\r\ndigits.\r\n\r\nOur network is small enough that we can actually write out the entire\r\nprocess as a single function using the notation we've built so far:\r\n\r\n\\[f(\\mathbf{x}^{[0]}) = \\hat{\\mathbf{y}}^{[2]} = \\mathrm{softmax}\\left(\\mathbf{W}^{[2]}\\left(\\max\\left(\\mathbf{0}, \\mathbf{W}^{[1]}\\mathbf{x}^{[0]} + \\mathbf{b}^{[1]}\\right) \\right) + \\mathbf{b}^{[2]} \\right)\\]\r\n\r\n\\hypertarget{optimizing-our-network}{%\r\n\\subsubsection{Optimizing our network}\\label{optimizing-our-network}}\r\n\r\nWe now have a model given above which can turn our 784-dimensional\r\ninputs into a 10-element probability distribution, \\emph{and} we have a\r\nway to evaluate the accuracy of each prediction. Next, we need a\r\nreliable way to improve the model based on the feedback provided by our\r\nloss function. This is known as function \\emph{optimization}, and most\r\nmethods of model optimization are based on the principle of\r\n\\emph{gradient descent}.\r\n\r\nThe idea is quite simple. 
Given a function with a set of parameters\r\nwhich we'll denote \\(\\bm{\\theta}\\), the partial derivative of that\r\nfunction with respect to a given parameter \\(\\theta_i \\in \\bm{\\theta}\\)\r\ntells us the overall \\emph{impact} of \\(\\theta_i\\) on the final result.\r\nIn our model, we have many parameters; each weight and bias constitutes\r\nan individually tunable parameter. Thus, our strategy is as follows.\r\nGiven a set of input samples, compute the loss our model produces for\r\neach sample. Then, compute the partial derivatives of that loss with\r\nrespect to \\emph{every parameter} in our model. Finally, adjust each\r\nparameter in proportion to its impact on the final loss, in the direction\r\nthat reduces the loss. Mathematically, this\r\nprocess is described below (note that the superscript \\((i)\\) is used to\r\ndenote the \\(i\\)-th sample):\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\mathrm{Total~Loss} &= \\sum_i J(\\mathbf{x}^{(i)}; \\bm\\theta) \\\\\r\n\\mathrm{Compute}~ &\\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)}; \\bm\\theta)}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in \\bm\\theta \\\\\r\n\\mathrm{Adjust}~ & \\theta_j \\rightarrow \\theta_j - \\eta \\sum_i \\frac{\\partial J(\\mathbf{x}^{(i)}; \\bm\\theta)}{\\partial \\theta_j} ~\\forall ~\\theta_j \\in\\bm\\theta\\\\\r\n\\end{aligned}\r\n\\]\r\n\r\nHere, there is some flexibility in the choice of \\(\\eta\\), often\r\nreferred to as the \\emph{learning rate}. A small \\(\\eta\\) promotes more\r\nconservative and accurate steps, but at the cost of needing more updates\r\nto train the model. A large \\(\\eta\\) on the other hand results in larger\r\nupdates to our model per training cycle, but may result in instability.\r\nUpdating in the above fashion should adjust the model such that it will\r\nproduce a smaller loss given the same inputs.\r\n\r\nIn practice, the size of the input set may be very large, rendering it\r\nintractable to evaluate the model on every single training sample in the\r\nsum above before adjusting parameters. 
Thus, a common strategy is to use\r\n\\emph{stochastic gradient descent} (SGD) and perform\r\nloss-gradient-based adjustments after evaluating smaller batches of\r\nsamples. Concretely, the MNIST handwritten digits database contains\r\n60,000 training samples. If we were to train our model using gradient\r\ndescent in the strictest sense, we would execute the following\r\npseudocode:\r\n\r\n\\begin{verbatim}\r\nmodel.init()\r\n\r\nfor i in num_training_cycles\r\n    loss <- 0\r\n\r\n    for n in 60000\r\n        x <- MNIST.data[n]\r\n        y <- model.predict(x)\r\n        loss += loss(y, MNIST.labels[n])\r\n    \r\n    model.gradient_descent(loss)\r\n\\end{verbatim}\r\n\r\nIn contrast, SGD pseudocode would look like:\r\n\r\n\\begin{verbatim}\r\nmodel.init()\r\n\r\nfor i in num_batches\r\n    loss <- 0\r\n    for j in batch_size\r\n        n <- i * batch_size + j\r\n        x <- MNIST.data[n]\r\n        y <- model.predict(x)\r\n        loss += loss(y, MNIST.labels[n])\r\n    \r\n    model.gradient_descent(loss)\r\n\\end{verbatim}\r\n\r\nSGD is very similar, but the batch size can be much smaller than the\r\namount of training data available. This enables the model to get more\r\nfrequent updates and waste fewer cycles, especially at the start of\r\ntraining when the model is likely wildly inaccurate.\r\n\r\nWhen it comes time to compute the gradients, we are fortunate to have\r\nmade the prescient choice of constructing our model solely with\r\nelementary functions in a manner conducive to relatively painless\r\ndifferentiation. However, we still must exercise care as there is plenty\r\nof bookkeeping involved. 
We will evaluate loss-gradients with respect to\r\nindividual parameters when we walk through the implementation later, but\r\nfor now, let's establish a few preliminary results.\r\n\r\nRecall that our choice of loss function was the categorical\r\ncross-entropy function, reproduced below:\r\n\r\n\\[J_{CE}(\\hat{\\mathbf{y}}, \\mathbf{y}) = -\\sum_{i} y_i \\log{\\hat{y}_i}\\]\r\n\r\nThe index \\(i\\) is enumerated over the set of possible outcomes\r\n(i.e.~the set of digits from 0 to 9). The quantities \\(y_i\\) are the\r\nelements of the one-hot label corresponding to the correct outcome, and\r\n\\(\\hat{\\mathbf{y}}\\) is the discrete probability distribution emitted by\r\nour model. We compute \\(\\partial J_{CE}/\\partial \\hat{y}_i\\) like so:\r\n\r\n\\[\\frac{\\partial J_{CE}}{\\partial \\hat{y}_i} = -\\frac{y_i}{\\hat{y}_i}\\]\r\n\r\nNotice that for a one-hot vector, this partial derivative vanishes\r\nwhenever \\(i\\) corresponds to an incorrect outcome.\r\n\r\nWorking backwards in our model, we next provide the partial derivative\r\nof the softmax function:\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\mathrm{softmax}(\\mathbf{z})_i &= \\frac{\\exp{z_i}}{\\sum_j \\exp{z_j}} \\\\\r\n\\frac{\\partial \\left(\\mathrm{softmax}(\\mathbf{z})_i\\right)}{\\partial z_k} &=\r\n\\begin{dcases}\r\n\\frac{\\left(\\sum_j\\exp{z_j}\\right)\\exp{z_i} - \\exp{2z_i}}{\\left(\\sum_j\\exp{z_j}\\right)^2}& i = k \\\\\r\n\\frac{-\\exp{z_i}\\exp{z_k}}{\\left(\\sum_j\\exp{z_j}\\right)^2}& i \\neq k\r\n\\end{dcases} \\\\\r\n&= \\begin{cases}\r\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = k \\\\\r\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_k & i \\neq k\r\n\\end{cases}\r\n\\end{aligned}\r\n\\]\r\n\r\nThe last set of equations follows from factoring and rearranging the\r\nexpressions preceding it. It's often confusing to newer practitioners\r\nthat the partial derivative of softmax needs this unique treatment. 
The\r\nkey observation is that softmax is a vector-valued function. It accepts a\r\nvector as an input and emits a vector as an output. It also ``mixes''\r\nthe input components, thereby imposing a functional dependence of\r\n\\emph{every output component} on \\emph{every input component}. The lone\r\n\\(\\exp{z_i}\\) in the numerator of the softmax equation creates an\r\nasymmetric dependence of the output component on the input components.\r\n\r\nFinally, let's consider the partial derivative of the linear rectifier.\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\mathrm{ReLU}(z) &= \\max(0, z) \\\\\r\n\\frac{\\partial \\mathrm{ReLU}(z)}{\\partial z} &=\r\n\\begin{cases}\r\n0 & z < 0 \\\\\r\n\\mathrm{undefined} & z = 0 \\\\\r\n1 & z > 0\r\n\\end{cases}\r\n\\end{aligned}\r\n\\]\r\n\r\nWhile the partial derivative \\emph{exactly} at 0 is undefined, in\r\npractice, the derivative there is simply assigned the value 0. Why the\r\nnon-differentiability at 0 isn't an issue has been a subject of\r\npractical debate for a long time. Here is a simple line of thinking to\r\nresolve the apparent issue. Consider a rectifier function that is nudged\r\n\\emph{ever so slightly} to the right such that the kink sits at\r\n\\(\\epsilon / 2\\), where \\(\\epsilon\\) is the smallest positive floating\r\npoint number the machine can represent. In this case, the model will\r\nnever produce a value that sits directly on this kink, and\r\nas far as the computer is concerned, we never encounter a point where\r\nthis function is non-differentiable. We can even imagine an\r\ninfinitesimal curve that smooths out the function at that\r\npoint if we want. 
Either way, experimentally, the linear rectifier\r\nremains one of the most effective activation functions for the reasons\r\nmentioned, so we have no reason to discredit it over a technicality.\r\n\r\nNow that we can compute partial derivatives of all the nonlinear\r\nfunctions in our neural network (and presumably the linear functions as\r\nwell), we are prepared to compute loss gradients with respect to any\r\nparameter in the network. Our tool of choice is the venerable chain rule\r\nof calculus:\r\n\r\n\\[\\left.\\frac{\\partial f(g(x))}{\\partial x}\\right\\rvert_x = \\left.\\frac{\\partial f}{\\partial g}\\right\\rvert_{g(x)} \\left.\\frac{\\partial g}{\\partial x}\\right\\rvert_x\\]\r\n\r\nThis gives us the partial derivative of a composite function\r\n\\(f\\circ g\\) evaluated at a particular value of \\(x\\). Our model itself\r\nis a composition of functions, and as we can now compute the\r\npartials of each individual component in the model, we are ready to\r\nbegin implementation in the next section.\r\n\r\n\\hypertarget{setting-up}{%\r\n\\subsection{Setting up}\\label{setting-up}}\r\n\r\nOur project will leverage \\href{https://cmake.org}{CMake} as the\r\nmeta-build system to support as many operating systems and compilers as\r\npossible. A modern C++ compiler will also be needed to compile the code.\r\nAs of this writing, the code has been tested with GCC 10.1.0 and Clang\r\n10.0.0. You should feel free to simply adapt the code to your compiler\r\nand build system of choice. To emphasize the independent nature of this\r\nproject, \\emph{no further dependencies are needed}. At your discretion,\r\nyou may opt to use external testing frameworks, matrix and math\r\nlibraries, data structures, or any other external dependency as you see\r\nfit. 
If you're a newer C++ practitioner, you are welcome to model your\r\nproject on the structure of the final project hosted on GitHub\r\n\\href{https://github.com/jeremyong/nn_in_a_weekend}{here}.\r\n\r\nIn addition, you will need the data hosted on the MNIST database website\r\nlinked \\href{http://yann.lecun.com/exdb/mnist/}{here}. The four files\r\navailable there consist of training images, training labels, test\r\nimages, and test labels.\r\n\r\nIt is highly recommended that you attempt to clone the repository and\r\nget things running (instructions in the README will always be kept up to\r\ndate). The code presented in this article will not be completely\r\nexhaustive, but will touch on all the major points, eschewing only\r\nvarious rudimentary helper functions or uninteresting details for\r\nbrevity. Alternatively, a valid approach may be to simply follow along\r\nwith the implementation notes below and attempt to blaze your own trail.\r\nBoth are viable approaches for learning.\r\n\r\n\\hypertarget{implementation}{%\r\n\\subsection{Implementation}\\label{implementation}}\r\n\r\n\\hypertarget{the-computational-graph}{%\r\n\\subsubsection{The Computational Graph}\\label{the-computational-graph}}\r\n\r\nThe network we will be constructing is purely sequential. Inputs flow\r\nfrom left to right and the only connections made are between one layer\r\nand the layer immediately succeeding it. In reality, many\r\nproduction-grade neural networks specialized for computer vision,\r\nnatural language processing, and other domains rely on architectures\r\nthat are non-sequential. Examples include ResNet, which introduces\r\nconnections between layers that are not adjacent, and various recurrent\r\nneural networks, which have a cyclic topology (outputs of the model are\r\nfed back as inputs to the model). Thus, it's useful to think of the\r\nmodel as a whole as a \\emph{computational graph}. 
While we won't be\r\nemploying any complicated computational graph topologies here, we will\r\nstill structure the code with this notion in mind. Each layer of our\r\nnetwork will be modeled as a \\texttt{Node} with data flowing forwards\r\nand backwards through the node during training. Providing support for a\r\nfully general computational graph (i.e.~non-sequential) is outside the\r\nscope of this tutorial, but some scaffolding will be provided should you\r\nwant to extend it yourself in the future. For now, here is the interface\r\nwe'll use:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\PreprocessorTok{\\#include }\\ImportTok{\\textless{}cstdint\\textgreater{}}\r\n\\PreprocessorTok{\\#include }\\ImportTok{\\textless{}random\\textgreater{}}\r\n\\PreprocessorTok{\\#include }\\ImportTok{\\textless{}string\\textgreater{}}\r\n\\PreprocessorTok{\\#include }\\ImportTok{\\textless{}vector\\textgreater{}}\r\n\r\n\\KeywordTok{using} \\DataTypeTok{num\\_t}\\NormalTok{ = }\\DataTypeTok{float}\\NormalTok{;}\r\n\\KeywordTok{using} \\DataTypeTok{rne\\_t}\\NormalTok{ = }\\BuiltInTok{std::}\\NormalTok{mt19937;}\r\n\r\n\\CommentTok{// To be defined later. 
This class encapsulates all the nodes in our graph }\r\n\\KeywordTok{class}\\NormalTok{ Model;}\r\n\r\n\\KeywordTok{class}\\NormalTok{ Node}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n\\NormalTok{    Node(Model\\& model, }\\BuiltInTok{std::}\\NormalTok{string name);}\r\n\r\n    \\CommentTok{// Nodes are owned and destroyed through base{-}class pointers}\r\n    \\KeywordTok{virtual}\\NormalTok{ \\textasciitilde{}Node() = }\\KeywordTok{default}\\NormalTok{;}\r\n    \r\n    \\CommentTok{// Nodes must describe how they should be initialized}\r\n    \\KeywordTok{virtual} \\DataTypeTok{void}\\NormalTok{ init(}\\DataTypeTok{rne\\_t}\\NormalTok{\\& rne) = }\\DecValTok{0}\\NormalTok{;}\r\n    \r\n    \\CommentTok{// During forward propagation, nodes transform input data and feed results}\r\n    \\CommentTok{// to all subsequent nodes}\r\n    \\KeywordTok{virtual} \\DataTypeTok{void}\\NormalTok{ forward(}\\DataTypeTok{num\\_t}\\NormalTok{* inputs) = }\\DecValTok{0}\\NormalTok{;}\r\n\r\n    \\CommentTok{// During reverse propagation, nodes receive loss gradients with respect to}\r\n    \\CommentTok{// their previous outputs and compute gradients with respect to each}\r\n    \\CommentTok{// tunable parameter}\r\n    \\KeywordTok{virtual} \\DataTypeTok{void}\\NormalTok{ reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* gradients) = }\\DecValTok{0}\\NormalTok{;}\r\n    \r\n    \\CommentTok{// If the node has tunable parameters, this method should be overridden}\r\n    \\CommentTok{// to reflect the quantity of tunable parameters}\r\n    \\KeywordTok{virtual} \\DataTypeTok{size\\_t}\\NormalTok{ param\\_count() }\\AttributeTok{const} \\KeywordTok{noexcept}\\NormalTok{ \\{ }\\ControlFlowTok{return} \\DecValTok{0}\\NormalTok{; \\}}\r\n    \r\n    \\CommentTok{// Accessor for parameter by index}\r\n    \\KeywordTok{virtual} \\DataTypeTok{num\\_t}\\NormalTok{* param(}\\DataTypeTok{size\\_t}\\NormalTok{ index) \\{ }\\ControlFlowTok{return} \\KeywordTok{nullptr}\\NormalTok{; \\}}\r\n    \r\n    \\CommentTok{// Accessor for loss{-}gradient with respect to a parameter specified by index}\r\n    \\KeywordTok{virtual} \\DataTypeTok{num\\_t}\\NormalTok{* gradient(}\\DataTypeTok{size\\_t}\\NormalTok{ index) 
\\{ }\\ControlFlowTok{return} \\KeywordTok{nullptr}\\NormalTok{; \\}}\r\n    \r\n    \\CommentTok{// Human{-}readable name for debugging purposes}\r\n    \\BuiltInTok{std::}\\NormalTok{string }\\AttributeTok{const}\\NormalTok{\\& name() }\\AttributeTok{const} \\KeywordTok{noexcept}\\NormalTok{ \\{ }\\ControlFlowTok{return} \\VariableTok{name\\_}\\NormalTok{; \\}}\r\n    \r\n    \\CommentTok{// Information dump for debugging purposes}\r\n    \\KeywordTok{virtual} \\DataTypeTok{void}\\NormalTok{ print() }\\AttributeTok{const}\\NormalTok{ = }\\DecValTok{0}\\NormalTok{;}\r\n\r\n\\KeywordTok{protected}\\NormalTok{:}\r\n    \\KeywordTok{friend} \\KeywordTok{class}\\NormalTok{ Model;}\r\n    \r\n\\NormalTok{    Model\\& }\\VariableTok{model\\_}\\NormalTok{;}\r\n    \\BuiltInTok{std::}\\NormalTok{string }\\VariableTok{name\\_}\\NormalTok{;}\r\n    \\CommentTok{// Nodes that precede this node in the computational graph}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}Node*\\textgreater{} }\\VariableTok{antecedents\\_}\\NormalTok{;}\r\n    \\CommentTok{// Nodes that succeed this node in the computational graph}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}Node*\\textgreater{} }\\VariableTok{subsequents\\_}\\NormalTok{;}\r\n\\NormalTok{\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThe bulk of the implementation will consist of implementing this\r\ninterface for all the nodes in our network. 
Concretely, we will need to provide\r\nan implementation for each of the nodes shown in the diagram below.\r\n\r\n\\begin{center}\r\n\r\n\\tikzstyle{block} = [rectangle, draw, text width=6em, text centered, rounded corners, minimum height=4em]\r\n\r\n\\begin{tikzpicture}[node distance = 3cm, auto]\r\n\r\n\\node [block] (MNIST) {MNIST};\r\n\\node [block, right of=MNIST] (hidden) {Hidden (ReLU)};\r\n\\node [block, right of=hidden] (output) {Output (Softmax)};\r\n\\node [block, right of=output, dashed] (loss) {Loss (Cross-entropy)};\r\n\r\n\\draw [->] (MNIST.10) -- (hidden.170);\r\n\\draw [->] (hidden.10) -- (output.170);\r\n\\draw [<-, dashed] (hidden.350) -- (output.190);\r\n\\draw [->] (output.10) -- (loss.170);\r\n\\draw [<-, dashed] (output.350) -- (loss.190);\r\n\\draw [->,dashed] (loss.south) -- ($(loss.south) + (0, -.5cm) $) -- node[below]{Label query} ($(MNIST.south) + (0, -.5cm) $)-- (MNIST.south);\r\n\r\n\\end{tikzpicture}\r\n\\end{center}\r\n\r\nThe first node (\\texttt{MNIST}) will be responsible for acquiring new\r\ntraining samples and feeding them to the next layer for processing. In\r\naddition, it will provide an accessor that the final categorical\r\ncross-entropy loss node will use to query the correct label for the\r\ncurrent sample (the ``label query''). The hidden node will perform the affine\r\ntransform and apply the linear rectification activation. The output node\r\nwill also perform an affine transform, but will then apply the softmax\r\nfunction. Finally, the loss node will compute the loss of the predicted\r\ndistribution based on the queried label for a given sample.\r\n\r\nIn the figure above, solid arrows from left to right indicate data flow\r\nduring the \\emph{feedforward} or \\emph{evaluation} portion of the\r\nmodel's execution. Each solid arrow corresponds to a data vector emitted\r\nby the source and ingested by the destination. 
The dashed arrows from\r\nright to left indicate data flow during the \\emph{backpropagation} or\r\n\\emph{reverse accumulation} portion of the algorithm. These arrows\r\ncorrespond to gradient vectors of the evaluated loss with respect to the\r\noutputs passed during the feedforward phase. For example, as seen above,\r\nthe hidden node is expected to forward data to the output node\r\n(\\(\\mathbf{a}^{[1]}\\)). Later, after the model prediction has been\r\ncomputed and the loss evaluated, the gradient of the loss with respect\r\nto those outputs is expected (\\(\\partial J_{CE}/\\partial a^{[1]}_i\\) for\r\neach \\(a_i^{[1]}\\) in \\(\\mathbf{a}^{[1]}\\)).\r\n\r\nWhen simply evaluating the model (without training), the final loss node\r\nwill simply be omitted from the graph. In addition, no back-propagation\r\nof gradients will occur as the model parameters are ossified during\r\nevaluation.\r\n\r\nThe model class interface shown below will be used to house all the\r\nnodes in the computational graph, and provide various routines that are\r\nuseful for operating over all constituent nodes as a collection.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\KeywordTok{class}\\NormalTok{ Model}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n\\NormalTok{    Model(}\\BuiltInTok{std::}\\NormalTok{string name);}\r\n    \r\n    \\CommentTok{// Add a node to the model, forwarding arguments to the node\\textquotesingle{}s constructor}\r\n    \\KeywordTok{template}\\NormalTok{ \\textless{}}\\KeywordTok{typename} \\DataTypeTok{Node\\_t}\\NormalTok{, }\\KeywordTok{typename}\\NormalTok{... T\\textgreater{}}\r\n    \\DataTypeTok{Node\\_t}\\NormalTok{\\& add\\_node(T\\&\\&... 
args)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{nodes\\_}\\NormalTok{.emplace\\_back(}\r\n            \\BuiltInTok{std::}\\NormalTok{make\\_unique\\textless{}}\\DataTypeTok{Node\\_t}\\NormalTok{\\textgreater{}(*}\\KeywordTok{this}\\NormalTok{, }\\BuiltInTok{std::}\\NormalTok{forward\\textless{}T\\textgreater{}(args)...));}\r\n        \\ControlFlowTok{return} \\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{Node\\_t}\\NormalTok{\\&\\textgreater{}(*}\\VariableTok{nodes\\_}\\NormalTok{.back());}\r\n\\NormalTok{    \\}}\r\n\r\n    \\CommentTok{// Create a dependency between two constituent nodes}\r\n    \\DataTypeTok{void}\\NormalTok{ create\\_edge(Node\\& dst, Node\\& src);}\r\n\r\n    \\CommentTok{// Initialize the parameters of all nodes with the provided seed. If the}\r\n    \\CommentTok{// seed is 0, a new random seed is chosen instead. Returns the seed used.}\r\n    \\DataTypeTok{rne\\_t}\\NormalTok{::}\\DataTypeTok{result\\_type}\\NormalTok{ init(}\\DataTypeTok{rne\\_t}\\NormalTok{::}\\DataTypeTok{result\\_type}\\NormalTok{ seed = }\\DecValTok{0}\\NormalTok{);}\r\n\r\n    \\CommentTok{// Adjust all model parameters of constituent nodes using the}\r\n    \\CommentTok{// provided optimizer (shown later)}\r\n    \\DataTypeTok{void}\\NormalTok{ train(Optimizer\\& optimizer);}\r\n\r\n    \\BuiltInTok{std::}\\NormalTok{string }\\AttributeTok{const}\\NormalTok{\\& name() }\\AttributeTok{const} \\KeywordTok{noexcept}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{return} \\VariableTok{name\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{void}\\NormalTok{ print() }\\AttributeTok{const}\\NormalTok{;}\r\n\r\n    \\CommentTok{// Routines for saving and loading model parameters to and from disk}\r\n    \\DataTypeTok{void}\\NormalTok{ save(}\\BuiltInTok{std::}\\NormalTok{ofstream\\& out);}\r\n    \\DataTypeTok{void}\\NormalTok{ load(}\\BuiltInTok{std::}\\NormalTok{ifstream\\& 
in);}\r\n\r\n\\KeywordTok{private}\\NormalTok{:}\r\n    \\KeywordTok{friend} \\KeywordTok{class}\\NormalTok{ Node;}\r\n\r\n    \\BuiltInTok{std::}\\NormalTok{string }\\VariableTok{name\\_}\\NormalTok{;}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\BuiltInTok{std::}\\NormalTok{unique\\_ptr\\textless{}Node\\textgreater{}\\textgreater{} }\\VariableTok{nodes\\_}\\NormalTok{;}\r\n\\NormalTok{\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\n\\hypertarget{training-data-and-labels}{%\r\n\\subsubsection{Training Data and\r\nLabels}\\label{training-data-and-labels}}\r\n\r\nAll machine learning pipelines must consider how to ingest data and\r\nlabels. Data refers to the information the model is expected to use to\r\nmake inferences and predictions. Labels correspond to the ``correct\r\nanswer'' for each data sample, used to compute losses and train the\r\nmodel. The interface of the MNIST data parser is shown below as a\r\nconcrete \\texttt{Node} subclass.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\KeywordTok{class}\\NormalTok{ MNIST : }\\KeywordTok{public}\\NormalTok{ Node}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n    \\KeywordTok{constexpr} \\AttributeTok{static} \\DataTypeTok{size\\_t}\\NormalTok{ DIM = }\\DecValTok{28}\\NormalTok{ * }\\DecValTok{28}\\NormalTok{;}\r\n    \r\n    \\CommentTok{// The constructor receives input file streams corresponding to the}\r\n    \\CommentTok{// data samples and labels}\r\n\\NormalTok{    MNIST(Model\\& model, }\\BuiltInTok{std::}\\NormalTok{ifstream\\& images, }\\BuiltInTok{std::}\\NormalTok{ifstream\\& labels);}\r\n    \r\n    \\CommentTok{// This is an input node and has no parameters to initialize}\r\n    \\DataTypeTok{void}\\NormalTok{ init(}\\DataTypeTok{rne\\_t}\\NormalTok{\\&) }\\KeywordTok{override}\\NormalTok{ \\{\\}}\r\n    \r\n    \\CommentTok{// Read the next sample and label and forward the data}\r\n    \\DataTypeTok{void}\\NormalTok{ 
forward(}\\DataTypeTok{num\\_t}\\NormalTok{* data = }\\KeywordTok{nullptr}\\NormalTok{) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// No optimization is done in this node so this is a no{-}op}\r\n    \\DataTypeTok{void}\\NormalTok{ reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* gradients = }\\KeywordTok{nullptr}\\NormalTok{) }\\KeywordTok{override}\\NormalTok{ \\{\\}}\r\n    \r\n    \\DataTypeTok{void}\\NormalTok{ print() }\\AttributeTok{const} \\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// Consume the next sample and label from the file streams}\r\n    \\DataTypeTok{void}\\NormalTok{ read\\_next();}\r\n    \r\n    \\CommentTok{// Accessor for the most recently read sample}\r\n    \\DataTypeTok{num\\_t} \\AttributeTok{const}\\NormalTok{* data() }\\AttributeTok{const} \\KeywordTok{noexcept}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{return} \\VariableTok{data\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n    \r\n    \\CommentTok{// Accessor for the most recently read label}\r\n    \\DataTypeTok{num\\_t} \\AttributeTok{const}\\NormalTok{* label() }\\AttributeTok{const} \\KeywordTok{noexcept}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{return} \\VariableTok{label\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n    \r\n    \\CommentTok{// Quick ASCII visualization of the last digit read}\r\n    \\DataTypeTok{void}\\NormalTok{ print\\_last();}\r\n    \r\n\\KeywordTok{private}\\NormalTok{:}\r\n    \\BuiltInTok{std::}\\NormalTok{ifstream\\& }\\VariableTok{images\\_}\\NormalTok{;}\r\n    \\BuiltInTok{std::}\\NormalTok{ifstream\\& }\\VariableTok{labels\\_}\\NormalTok{;}\r\n    \\DataTypeTok{uint32\\_t} \\VariableTok{image\\_count\\_}\\NormalTok{;}\r\n\r\n    \\DataTypeTok{char} \\VariableTok{buf\\_}\\NormalTok{[DIM];}\r\n    \\DataTypeTok{num\\_t} \\VariableTok{data\\_}\\NormalTok{[DIM];}\r\n    \\DataTypeTok{num\\_t} 
\\VariableTok{label\\_}\\NormalTok{[}\\DecValTok{10}\\NormalTok{];}\r\n\\NormalTok{\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nIn the constructor, we must verify that the files passed as arguments\r\nare valid MNIST data and label files. Both files start with distinct\r\n``magic values'' as a quick sanity check. The sample file starts with\r\n2051 encoded as a 4-byte big-endian unsigned integer, whereas the label\r\nfile starts with 2049. For the data file, the magic number is followed\r\nby the image count and image dimensions. The label file magic number is\r\nfollowed by the label count (expected to match the image count).\r\n\r\nTo consume big-endian unsigned integers from the file stream, we'll use\r\na simple routine:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ read\\_be(}\\BuiltInTok{std::}\\NormalTok{ifstream\\& in, }\\DataTypeTok{uint32\\_t}\\NormalTok{* out)}\r\n\\NormalTok{\\{}\r\n    \\DataTypeTok{char}\\NormalTok{* buf = }\\KeywordTok{reinterpret\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{char}\\NormalTok{*\\textgreater{}(out);}\r\n\\NormalTok{    in.read(buf, }\\DecValTok{4}\\NormalTok{);}\r\n\r\n    \\BuiltInTok{std::}\\NormalTok{swap(buf[}\\DecValTok{0}\\NormalTok{], buf[}\\DecValTok{3}\\NormalTok{]);}\r\n    \\BuiltInTok{std::}\\NormalTok{swap(buf[}\\DecValTok{1}\\NormalTok{], buf[}\\DecValTok{2}\\NormalTok{]);}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nIf you happen to be using a big-endian processor, you will not need to\r\nperform the byte swaps, but most desktop and mobile architectures are\r\nlittle-endian.\r\n\r\nThe implementation that parses the magic numbers and various other\r\ndescriptors is produced below:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\NormalTok{MNIST::MNIST(Model\\& model, }\\BuiltInTok{std::}\\NormalTok{ifstream\\& images, }\\BuiltInTok{std::}\\NormalTok{ifstream\\& labels)}\r\n\\NormalTok{    : Node\\{model, }\\StringTok{\"MNIST 
input\"}\\NormalTok{\\}}\r\n\\NormalTok{    , }\\VariableTok{images\\_}\\NormalTok{\\{images\\}}\r\n\\NormalTok{    , }\\VariableTok{labels\\_}\\NormalTok{\\{labels\\}}\r\n\\NormalTok{\\{}\r\n    \\CommentTok{// Confirm that passed input file streams are well{-}formed MNIST data sets}\r\n    \\DataTypeTok{uint32\\_t}\\NormalTok{ image\\_magic;}\r\n\\NormalTok{    read\\_be(images, \\&image\\_magic);}\r\n    \\ControlFlowTok{if}\\NormalTok{ (image\\_magic != }\\DecValTok{2051}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{throw} \\BuiltInTok{std::}\\NormalTok{runtime\\_error\\{}\\StringTok{\"Images file appears to be malformed\"}\\NormalTok{\\};}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{    read\\_be(images, \\&}\\VariableTok{image\\_count\\_}\\NormalTok{);}\r\n\r\n    \\DataTypeTok{uint32\\_t}\\NormalTok{ labels\\_magic;}\r\n\\NormalTok{    read\\_be(labels, \\&labels\\_magic);}\r\n    \\ControlFlowTok{if}\\NormalTok{ (labels\\_magic != }\\DecValTok{2049}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{throw} \\BuiltInTok{std::}\\NormalTok{runtime\\_error\\{}\\StringTok{\"Labels file appears to be malformed\"}\\NormalTok{\\};}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{uint32\\_t}\\NormalTok{ label\\_count;}\r\n\\NormalTok{    read\\_be(labels, \\&label\\_count);}\r\n    \\ControlFlowTok{if}\\NormalTok{ (label\\_count != }\\VariableTok{image\\_count\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{throw} \\BuiltInTok{std::}\\NormalTok{runtime\\_error(}\r\n            \\StringTok{\"Label count did not match the number of images supplied\"}\\NormalTok{);}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{uint32\\_t}\\NormalTok{ rows;}\r\n    \\DataTypeTok{uint32\\_t}\\NormalTok{ columns;}\r\n\\NormalTok{    read\\_be(images, \\&rows);}\r\n\\NormalTok{    read\\_be(images, \\&columns);}\r\n    \\ControlFlowTok{if}\\NormalTok{ (rows != }\\DecValTok{28}\\NormalTok{ || columns != 
}\\DecValTok{28}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{throw} \\BuiltInTok{std::}\\NormalTok{runtime\\_error\\{}\r\n            \\StringTok{\"Expected 28x28 images, non{-}MNIST data supplied\"}\\NormalTok{\\};}\r\n\\NormalTok{    \\}}\r\n\r\n\\NormalTok{    printf(}\\StringTok{\"Loaded images file with }\\SpecialCharTok{\\%u}\\StringTok{ entries}\\SpecialCharTok{\\textbackslash{}n}\\StringTok{\"}\\NormalTok{, }\\VariableTok{image\\_count\\_}\\NormalTok{);}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nNext, let's implement \\texttt{MNIST::read\\_next}, which will consume\r\nthe next sample and label from the file streams:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ MNIST::read\\_next()}\r\n\\NormalTok{\\{}\r\n    \\VariableTok{images\\_}\\NormalTok{.read(}\\VariableTok{buf\\_}\\NormalTok{, DIM);}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{ inv = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\} / }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{255.0}\\NormalTok{\\};}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != DIM; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{data\\_}\\NormalTok{[i] = }\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{uint8\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{buf\\_}\\NormalTok{[i]) * inv;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{char}\\NormalTok{ label;}\r\n    \\VariableTok{labels\\_}\\NormalTok{.read(\\&label, }\\DecValTok{1}\\NormalTok{);}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\DecValTok{10}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{label\\_}\\NormalTok{[i] = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n\\NormalTok{    \\}}\r\n    
\\VariableTok{label\\_}\\NormalTok{[}\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{uint8\\_t}\\NormalTok{\\textgreater{}(label)] = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\};}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nFor the labels, note that the label is encoded as a single unsigned\r\ndigit, but we convert it to a 1-hot encoding for loss computation\r\npurposes later. If your application can assume that the labels will be\r\none-hot encoded, this conversion may not be necessary and a more\r\nefficient implementation is possible.\r\n\r\nTo verify our work, let's write up a quick-and-dirty ASCII printer for\r\nthe last read digit and try our parser out. If you have a rendering\r\nbackend (written in say, Vulkan, D3D12, OpenGL, etc.) at your disposal,\r\nyou may wish to use that instead for a cleaner visualization.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ MNIST::print\\_last()}\r\n\\NormalTok{\\{}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\DecValTok{10}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{label\\_}\\NormalTok{[i] == }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\})}\r\n\\NormalTok{        \\{}\r\n\\NormalTok{            printf(}\\StringTok{\"This is a }\\SpecialCharTok{\\%zu}\\StringTok{:}\\SpecialCharTok{\\textbackslash{}n}\\StringTok{\"}\\NormalTok{, i);}\r\n            \\ControlFlowTok{break}\\NormalTok{;}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\DecValTok{28}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\DataTypeTok{size\\_t}\\NormalTok{ offset = i * }\\DecValTok{28}\\NormalTok{;}\r\n        \\ControlFlowTok{for}\\NormalTok{ 
(}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\DecValTok{28}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n            \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{data\\_}\\NormalTok{[offset + j] \\textgreater{} }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.5}\\NormalTok{\\})}\r\n\\NormalTok{            \\{}\r\n                \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{data\\_}\\NormalTok{[offset + j] \\textgreater{} }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.9}\\NormalTok{\\})}\r\n\\NormalTok{                \\{}\r\n\\NormalTok{                    printf(}\\StringTok{\"\\#\"}\\NormalTok{);}\r\n\\NormalTok{                \\}}\r\n                \\ControlFlowTok{else} \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{data\\_}\\NormalTok{[offset + j] \\textgreater{} }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.7}\\NormalTok{\\})}\r\n\\NormalTok{                \\{}\r\n\\NormalTok{                    printf(}\\StringTok{\"*\"}\\NormalTok{);}\r\n\\NormalTok{                \\}}\r\n                \\ControlFlowTok{else}\r\n\\NormalTok{                \\{}\r\n\\NormalTok{                    printf(}\\StringTok{\".\"}\\NormalTok{);}\r\n\\NormalTok{                \\}}\r\n\\NormalTok{            \\}}\r\n            \\ControlFlowTok{else}\r\n\\NormalTok{            \\{}\r\n\\NormalTok{                printf(}\\StringTok{\" \"}\\NormalTok{);}\r\n\\NormalTok{            \\}}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{        printf(}\\StringTok{\"}\\SpecialCharTok{\\textbackslash{}n}\\StringTok{\"}\\NormalTok{);}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{    printf(}\\StringTok{\"}\\SpecialCharTok{\\textbackslash{}n}\\StringTok{\"}\\NormalTok{);}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nOn my machine, consuming the evaluation data and printing it produces\r\nthe following result (the first sample from the test data is shown):\r\n\r\n\\begin{verbatim}\r\nThis is a 7:\r\n                 
           \r\n       *..                  \r\n      *#####********.       \r\n          .*#*####*##.      \r\n                   ##       \r\n                   #*       \r\n                  ##        \r\n                 .##        \r\n                 ##         \r\n                .#*         \r\n                *#          \r\n                #*          \r\n               ##           \r\n              *#.           \r\n             *#*            \r\n             ##             \r\n            *#              \r\n           .##              \r\n           ###              \r\n           ##*              \r\n           #*\r\n\\end{verbatim}\r\n\r\nso we can be somewhat confident that our MNIST data ingestor is working\r\nproperly. The only remaining routine we need to implement is\r\n\\texttt{MNIST::forward} which should consume the next sample, and\r\nforward the data to all subsequent nodes in the graph.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ MNIST::forward(}\\DataTypeTok{num\\_t}\\NormalTok{* data)}\r\n\\NormalTok{\\{}\r\n\\NormalTok{    read\\_next();}\r\n    \\ControlFlowTok{for}\\NormalTok{ (Node* node : }\\VariableTok{subsequents\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        node{-}\\textgreater{}forward(}\\VariableTok{data\\_}\\NormalTok{);}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nSuch an interface ensures our \\texttt{MNIST} node will be interoperable\r\nwith networks that aren't purely sequential.\r\n\r\n\\hypertarget{the-feedforward-node}{%\r\n\\subsubsection{The Feedforward Node}\\label{the-feedforward-node}}\r\n\r\nThe hidden and output nodes have much in common and so will be\r\nimplemented in terms of a single feedforward node class. 
The feedforward\r\nnode will need a configurable activation function and dimensionality.\r\nHere's the interface for the \\texttt{FFNode}:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\KeywordTok{enum} \\KeywordTok{class}\\NormalTok{ Activation}\r\n\\NormalTok{\\{}\r\n\\NormalTok{    ReLU,}\r\n\\NormalTok{    Softmax}\r\n\\NormalTok{\\};}\r\n\r\n\\KeywordTok{class}\\NormalTok{ FFNode : }\\KeywordTok{public}\\NormalTok{ Node}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n    \\CommentTok{// A feedforward node is defined by the activation}\r\n    \\CommentTok{// function and input/output dimensionality}\r\n\\NormalTok{    FFNode(Model\\& model,}\r\n           \\BuiltInTok{std::}\\NormalTok{string name,}\r\n\\NormalTok{           Activation activation,}\r\n           \\DataTypeTok{uint16\\_t}\\NormalTok{ output\\_size,}\r\n           \\DataTypeTok{uint16\\_t}\\NormalTok{ input\\_size);}\r\n\r\n    \\DataTypeTok{void}\\NormalTok{ init(}\\DataTypeTok{rne\\_t}\\NormalTok{\\& rne) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// The input data should have size input\\_size\\_}\r\n    \\DataTypeTok{void}\\NormalTok{ forward(}\\DataTypeTok{num\\_t}\\NormalTok{* inputs) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// The gradient data should have size output\\_size\\_}\r\n    \\DataTypeTok{void}\\NormalTok{ reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* gradients) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\DataTypeTok{size\\_t}\\NormalTok{ param\\_count() }\\AttributeTok{const} \\KeywordTok{noexcept} \\KeywordTok{override}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// Weight matrix entries + bias entries}\r\n        \\ControlFlowTok{return}\\NormalTok{ (}\\VariableTok{input\\_size\\_}\\NormalTok{ + }\\DecValTok{1}\\NormalTok{) * }\\VariableTok{output\\_size\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{num\\_t}\\NormalTok{* param(}\\DataTypeTok{size\\_t}\\NormalTok{ index);}\r\n    
\\DataTypeTok{num\\_t}\\NormalTok{* gradient(}\\DataTypeTok{size\\_t}\\NormalTok{ index);}\r\n\r\n    \\DataTypeTok{void}\\NormalTok{ print() }\\AttributeTok{const} \\KeywordTok{override}\\NormalTok{;}\r\n\r\n\\KeywordTok{private}\\NormalTok{:}\r\n\\NormalTok{    Activation }\\VariableTok{activation\\_}\\NormalTok{;}\r\n    \\DataTypeTok{uint16\\_t} \\VariableTok{output\\_size\\_}\\NormalTok{;}\r\n    \\DataTypeTok{uint16\\_t} \\VariableTok{input\\_size\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{/////////////////////}\r\n    \\CommentTok{// Node Parameters //}\r\n    \\CommentTok{/////////////////////}\r\n\r\n    \\CommentTok{// weights\\_.size() := output\\_size\\_ * input\\_size\\_}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{weights\\_}\\NormalTok{;}\r\n    \\CommentTok{// biases\\_.size() := output\\_size\\_}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{biases\\_}\\NormalTok{;}\r\n    \\CommentTok{// activations\\_.size() := output\\_size\\_}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{activations\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{////////////////////}\r\n    \\CommentTok{// Loss Gradients //}\r\n    \\CommentTok{////////////////////}\r\n\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{activation\\_gradients\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{// During the training cycle, parameter loss gradients are accumulated in}\r\n    \\CommentTok{// the following buffers.}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{weight\\_gradients\\_}\\NormalTok{;}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} 
}\\VariableTok{bias\\_gradients\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{// This buffer is used to store temporary gradients used in a SINGLE}\r\n    \\CommentTok{// backpropagation pass. Note that this does not accumulate like the weight}\r\n    \\CommentTok{// and bias gradients do.}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{input\\_gradients\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{// The last input is needed to compute loss gradients with respect to the}\r\n    \\CommentTok{// weights during backpropagation}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{* }\\VariableTok{last\\_input\\_}\\NormalTok{;}\r\n\\NormalTok{\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nCompared to the \\texttt{MNIST} node, the \\texttt{FFNode} uses a lot more\r\nstate to track all tunable parameters (weight matrix elements and\r\nbiases), as well as the loss gradients corresponding to each parameter.\r\nThe loss gradients must be kept because the parameters are only\r\nadjusted after \\texttt{N}\r\nsamples have been evaluated, where \\texttt{N} is the chosen batch size\r\nin our stochastic gradient descent algorithm. If the purpose of some of\r\nthe class members here is still opaque, they will show up later when\r\nwe implement backpropagation.\r\n\r\nFirst, we must decide how to initialize the weights and biases of our\r\nnode. When deciding on a scheme, there are a few key principles to keep\r\nin mind. First, the initialization must not exhibit symmetry of any sort.\r\nFor example, if all the parameters are initialized to the same\r\nvalue, the loss gradients with respect to all individual parameters will\r\nbe identical, and our network will be no better than a network with a\r\nsingle parameter. In addition, we do not want the parameters to be\r\ninitialized such that they are too large, or too small. 
Most papers that\r\ndiscuss weight initialization strive to ensure that the loss gradients\r\nremain in a realm where floating point numbers retain precision (in the\r\nrange \\([1, 2)\\)). The other criterion is that parameters should\r\ngenerally be initialized such that they are roughly similar in\r\nmagnitude. Parameters that deviate too far from the mean are likely to\r\neither dominate loss gradients, or produce too small a signal to\r\ncontribute. Proper parameter initialization is but a small part of\r\naddressing the larger problem common in neural networks known as the\r\nproblem of \\emph{exploding and vanishing gradients}. Here, we present\r\nthe implementation with a couple of references if you wish to dig deeper.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ FFNode::init(}\\DataTypeTok{rne\\_t}\\NormalTok{\\& rne)}\r\n\\NormalTok{\\{}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{ sigma;}\r\n    \\ControlFlowTok{switch}\\NormalTok{ (}\\VariableTok{activation\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n    \\ControlFlowTok{case}\\NormalTok{ Activation::ReLU:}\r\n        \\CommentTok{// Kaiming He et al. 
weight initialization for ReLU networks}\r\n        \\CommentTok{// https://arxiv.org/pdf/1502.01852.pdf}\r\n        \\CommentTok{//}\r\n        \\CommentTok{// Suggests using a normal distribution with variance := 2 / n\\_in}\r\n\\NormalTok{        sigma = }\\BuiltInTok{std::}\\NormalTok{sqrt(}\\FloatTok{2.0}\\NormalTok{ / }\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{input\\_size\\_}\\NormalTok{));}\r\n        \\ControlFlowTok{break}\\NormalTok{;}\r\n    \\ControlFlowTok{case}\\NormalTok{ Activation::Softmax:}\r\n    \\ControlFlowTok{default}\\NormalTok{:}\r\n        \\CommentTok{// LeCun initialization as suggested in \"Self{-}Normalizing Neural}\r\n        \\CommentTok{// Networks\"}\r\n        \\CommentTok{// https://arxiv.org/pdf/1706.02515.pdf}\r\n\\NormalTok{        sigma = }\\BuiltInTok{std::}\\NormalTok{sqrt(}\\FloatTok{1.0}\\NormalTok{ / }\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{input\\_size\\_}\\NormalTok{));}\r\n        \\ControlFlowTok{break}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\CommentTok{// }\\AlertTok{NOTE}\\CommentTok{: Unfortunately, the C++ standard does not guarantee that the results}\r\n    \\CommentTok{// obtained from a distribution function will be identical given the same}\r\n    \\CommentTok{// inputs across different compilers and platforms. 
A production ML}\r\n    \\CommentTok{// framework will likely implement its own distributions to provide}\r\n    \\CommentTok{// deterministic results.}\r\n    \\KeywordTok{auto}\\NormalTok{ dist = }\\BuiltInTok{std::}\\NormalTok{normal\\_distribution\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}\\{}\\FloatTok{0.0}\\NormalTok{, sigma\\};}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{num\\_t}\\NormalTok{\\& w : }\\VariableTok{weights\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        w = dist(rne);}\r\n\\NormalTok{    \\}}\r\n\r\n    \\CommentTok{// }\\AlertTok{NOTE}\\CommentTok{: Setting biases to zero is a common practice, as is initializing the}\r\n    \\CommentTok{// bias to a small value (e.g. on the order of 0.01). It is unclear if the}\r\n    \\CommentTok{// latter produces a consistent result over the former, but the thinking is}\r\n    \\CommentTok{// that a non{-}zero bias will ensure that the neuron always \"fires\" at the}\r\n    \\CommentTok{// beginning to produce a signal.}\r\n    \\CommentTok{//}\r\n    \\CommentTok{// Here, we initialize all biases to a small number, but the reader should}\r\n    \\CommentTok{// consider experimenting with other approaches.}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{num\\_t}\\NormalTok{\\& b : }\\VariableTok{biases\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        b = }\\FloatTok{0.01}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThe common theme is that the distribution of random weights scales\r\nroughly as the inverse square root of the input vector size. This way,\r\nthe distribution of the node's output will fall in a ``nice'' range with\r\nrespect to floating-point precision. 
Other initialization schemes are of\r\ncourse possible, and in some cases critical depending on the choice of\r\nactivation function.\r\n\r\nWith weights and biases initialized, it's time to implement\r\n\\texttt{FFNode::forward}. The straightforward plan is, for both the ReLU\r\nand softmax nodes, to first perform the affine transform\r\n\\(\\mathbf{W}\\mathbf{x} + \\mathbf{b}\\), then apply the activation\r\nfunction, which will be either the linear rectifier or the softmax\r\nfunction. Here's what this looks like:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ FFNode::forward(}\\DataTypeTok{num\\_t}\\NormalTok{* inputs)}\r\n\\NormalTok{\\{}\r\n    \\CommentTok{// Remember the last input data for backpropagation later}\r\n    \\VariableTok{last\\_input\\_}\\NormalTok{ = inputs;}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// For each output element, compute the dot product of the input data}\r\n        \\CommentTok{// with the weight vector and add the bias}\r\n\r\n        \\DataTypeTok{num\\_t}\\NormalTok{ z\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n\r\n        \\DataTypeTok{size\\_t}\\NormalTok{ offset = i * }\\VariableTok{input\\_size\\_}\\NormalTok{;}\r\n\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n\\NormalTok{            z += }\\VariableTok{weights\\_}\\NormalTok{[offset + j] * inputs[j];}\r\n\\NormalTok{        \\}}\r\n        \\CommentTok{// Add neuron bias}\r\n\\NormalTok{        z += }\\VariableTok{biases\\_}\\NormalTok{[i];}\r\n\r\n        \\ControlFlowTok{switch}\\NormalTok{ (}\\VariableTok{activation\\_}\\NormalTok{)}\r\n\\NormalTok{        \\{}\r\n        \\ControlFlowTok{case}\\NormalTok{ 
Activation::ReLU:}\r\n            \\VariableTok{activations\\_}\\NormalTok{[i] = }\\BuiltInTok{std::}\\NormalTok{max(z, }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\});}\r\n            \\ControlFlowTok{break}\\NormalTok{;}\r\n        \\ControlFlowTok{case}\\NormalTok{ Activation::Softmax:}\r\n        \\ControlFlowTok{default}\\NormalTok{:}\r\n            \\VariableTok{activations\\_}\\NormalTok{[i] = }\\BuiltInTok{std::}\\NormalTok{exp(z);}\r\n            \\ControlFlowTok{break}\\NormalTok{;}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{activation\\_}\\NormalTok{ == Activation::Softmax)}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// softmax(z)\\_i = exp(z\\_i) / \\textbackslash{}sum\\_j(exp(z\\_j))}\r\n        \\DataTypeTok{num\\_t}\\NormalTok{ sum\\_exp\\_z\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{        \\{}\r\n            \\CommentTok{// }\\AlertTok{NOTE}\\CommentTok{: with exploding gradients, it is quite easy for this}\r\n            \\CommentTok{// exponential function to overflow, which will result in NaNs}\r\n            \\CommentTok{// infecting the network.}\r\n\\NormalTok{            sum\\_exp\\_z += }\\VariableTok{activations\\_}\\NormalTok{[i];}\r\n\\NormalTok{        \\}}\r\n        \\DataTypeTok{num\\_t}\\NormalTok{ inv\\_sum\\_exp\\_z = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\} / sum\\_exp\\_z;}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{activations\\_}\\NormalTok{[i] *= inv\\_sum\\_exp\\_z;}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    
\\CommentTok{// Forward activation data to all subsequent nodes in the computational}\r\n    \\CommentTok{// graph}\r\n    \\ControlFlowTok{for}\\NormalTok{ (Node* subsequent : }\\VariableTok{subsequents\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        subsequent{-}\\textgreater{}forward(}\\VariableTok{activations\\_}\\NormalTok{.data());}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nAs before, we forward all final results to all subsequent nodes even\r\nthough there will only be a single subsequent node in this case.\r\nWhenever writing code as above, it is prudent to consider all potential\r\ncorner cases which could result in the myriad issues that arise in\r\nfloating-point computation:\r\n\r\n\\begin{itemize}\r\n\\tightlist\r\n\\item\r\n  Loss of precision\r\n\\item\r\n  Floating point overflow and underflow\r\n\\item\r\n  Divide by zero\r\n\\end{itemize}\r\n\r\nLoss of precision easily occurs in a number of situations, such as\r\nsubtracting two quantities of similar size, or adding\r\nquantities with greatly different magnitudes. Floating point overflow\r\nand underflow typically occur when repeatedly performing an operation\r\nsuch that an accumulator explodes to \\(\\infty\\) or \\(-\\infty\\), or\r\ncollapses toward zero. In this\r\ncase, the use of \\texttt{std::exp} is one operation that sticks out. We\r\nwill not implement a stable softmax here, but the following identity can\r\nbe used to improve its stability should you need it:\r\n\r\n\\[\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i = \\mathrm{softmax}(\\mathbf{z})_i\\]\r\n\r\nIn this expression, \\(\\mathbf{C}\\) is a constant vector where all its\r\nelements are equal in value. 
Expanding the definition of softmax in the\r\nLHS gives:\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\mathrm{softmax}(\\mathbf{z} + \\mathbf{C})_i &= \\frac{\\exp{(z_i + C)}}{\\sum_j\\exp{(z_j + C)}} \\\\\r\n&= \\frac\r\n    {\\exp{z_i}\\exp{C}}\r\n    {\\left(\\sum_j\\exp{z_j}\\right)\\exp C} \\\\\r\n&= \\mathrm{softmax}(\\mathbf{z})_i && \\blacksquare\r\n\\end{aligned}\r\n\\]\r\n\r\nThus, if we are concerned about saturating \\texttt{std::exp} with a\r\nlarge argument, we can simply set \\(C\\) to be the additive inverse of\r\nthe largest element of \\(\\mathbf{z}\\).\r\nEvery argument passed to \\texttt{std::exp} is then at most zero, so the\r\nexponentials cannot overflow, and because the largest element\r\ncontributes \\(\\exp 0 = 1\\), the denominator is at least one.\r\n\r\nAs a practical implementor's trick, it is possible to enable floating\r\npoint exception traps so that the program is interrupted (on Linux, by a\r\n\\texttt{SIGFPE} signal) when a \\texttt{NaN} is\r\ngenerated in a floating point register. Using glibc, for example, we can\r\ntrap floating point exceptions using the GNU extension\r\n\\texttt{feenableexcept}:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\PreprocessorTok{\\#include }\\ImportTok{\\textless{}cfenv\\textgreater{}}\r\n\r\n\\NormalTok{feenableexcept(FE\\_INVALID | FE\\_OVERFLOW);}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nIt is also possible to trap exceptions specifically in regions where you\r\nanticipate a potential issue (which enhances the overall throughput of\r\nthe network). In the interest of brevity, please consult your compiler's\r\ndocumentation for how to do this.\r\n\r\nYou may have noticed the first line of our routine:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\VariableTok{last\\_input\\_}\\NormalTok{ = inputs;}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nHere, we retain a pointer to the data ingested by the feedforward node\r\nfor a full training cycle. 
Before delving into any derivations, let's\r\nfirst present the code for the backpropagation of gradients through our\r\nfeedforward node and dissect it immediately afterwards.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ FFNode::reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* gradients)}\r\n\\NormalTok{\\{}\r\n    \\CommentTok{// First, we compute dJ/dz as dJ/dg(z) * dg(z)/dz and store it in our}\r\n    \\CommentTok{// activation\\_gradients\\_ array}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// dg(z)/dz}\r\n        \\DataTypeTok{num\\_t}\\NormalTok{ activation\\_grad\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n        \\ControlFlowTok{switch}\\NormalTok{ (}\\VariableTok{activation\\_}\\NormalTok{)}\r\n\\NormalTok{        \\{}\r\n        \\ControlFlowTok{case}\\NormalTok{ Activation::ReLU:}\r\n            \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{activations\\_}\\NormalTok{[i] \\textgreater{} }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\})}\r\n\\NormalTok{            \\{}\r\n\\NormalTok{                activation\\_grad = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\};}\r\n\\NormalTok{            \\}}\r\n            \\ControlFlowTok{else}\r\n\\NormalTok{            \\{}\r\n\\NormalTok{                activation\\_grad = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n\\NormalTok{            \\}}\r\n            \\CommentTok{// dJ/dz = dJ/dg(z) * dg(z)/dz}\r\n            \\VariableTok{activation\\_gradients\\_}\\NormalTok{[i] = gradients[i] * activation\\_grad;}\r\n            \\ControlFlowTok{break}\\NormalTok{;}\r\n        \\ControlFlowTok{case}\\NormalTok{ Activation::Softmax:}\r\n        \\ControlFlowTok{default}\\NormalTok{:}\r\n            \\ControlFlowTok{for}\\NormalTok{ 
(}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{            \\{}\r\n                \\ControlFlowTok{if}\\NormalTok{ (i == j)}\r\n\\NormalTok{                \\{}\r\n\\NormalTok{                    activation\\_grad += }\\VariableTok{activations\\_}\\NormalTok{[i]}\r\n\\NormalTok{                                       * (}\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{1.0}\\NormalTok{\\} {-} }\\VariableTok{activations\\_}\\NormalTok{[i])}\r\n\\NormalTok{                                       * gradients[j];}\r\n\\NormalTok{                \\}}\r\n                \\ControlFlowTok{else}\r\n\\NormalTok{                \\{}\r\n\\NormalTok{                    activation\\_grad}\r\n\\NormalTok{                        += {-}}\\VariableTok{activations\\_}\\NormalTok{[i] * }\\VariableTok{activations\\_}\\NormalTok{[j] * gradients[j];}\r\n\\NormalTok{                \\}}\r\n\\NormalTok{            \\}}\r\n\r\n            \\VariableTok{activation\\_gradients\\_}\\NormalTok{[i] = activation\\_grad;}\r\n            \\ControlFlowTok{break}\\NormalTok{;}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// dJ/db\\_i = dJ/dz\\_i, since dz\\_i/db\\_i = 1}\r\n        \\VariableTok{bias\\_gradients\\_}\\NormalTok{[i] += }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[i];}\r\n\\NormalTok{    \\}}\r\n\r\n    \\BuiltInTok{std::}\\NormalTok{fill(}\\VariableTok{input\\_gradients\\_}\\NormalTok{.begin(), }\\VariableTok{input\\_gradients\\_}\\NormalTok{.end(), }\\DecValTok{0}\\NormalTok{);}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; 
++i)}\r\n\\NormalTok{    \\{}\r\n        \\DataTypeTok{size\\_t}\\NormalTok{ offset = i * }\\VariableTok{input\\_size\\_}\\NormalTok{;}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{input\\_gradients\\_}\\NormalTok{[j]}\r\n\\NormalTok{                += }\\VariableTok{weights\\_}\\NormalTok{[offset + j] * }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[i];}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{weight\\_gradients\\_}\\NormalTok{[j * }\\VariableTok{input\\_size\\_}\\NormalTok{ + i]}\r\n\\NormalTok{                += }\\VariableTok{last\\_input\\_}\\NormalTok{[i] * }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[j];}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (Node* node : }\\VariableTok{antecedents\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        node{-}\\textgreater{}reverse(}\\VariableTok{input\\_gradients\\_}\\NormalTok{.data());}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThis code is likely more difficult to digest, so let's break it down\r\ninto parts. During reverse accumulation (aka backpropagation), we will\r\nbe given the loss gradients with respect to all of the outputs from the\r\nmost recent forward pass, written mathematically as\r\n\\(\\partial J_{CE}/\\partial a_i\\) for each output scalar \\(a_i\\). 
Given\r\nthat information, we need to perform the following tasks:\r\n\r\n\\begin{enumerate}\r\n\\def\\labelenumi{\\arabic{enumi}.}\r\n\\tightlist\r\n\\item\r\n  Compute \\(\\partial J_{CE}/\\partial w_{ij}\\) for each weight in our\r\n  weight matrix\r\n\\item\r\n  Compute \\(\\partial J_{CE}/\\partial b_i\\) for each bias in our bias\r\n  vector\r\n\\item\r\n  Compute \\(\\partial J_{CE}/\\partial x_i\\) for each input scalar in the\r\n  most recent forward pass\r\n\\item\r\n  Propagate all the loss gradients with respect to the inputs in step 3\r\n  back to the antecedent nodes\r\n\\end{enumerate}\r\n\r\nAs all outputs pass through an activation function, we will first need\r\nthe derivatives of that activation, \\(\\partial g(\\mathbf{z})_j/\\partial z_i\\), where \\(g\\) is\r\neither the linear rectifier or the softmax function. Both derivatives are computed\r\nin the background section, so we'll just recite the results here. For\r\nthe linear rectifier, \\(\\partial g(\\mathbf{z})_i/\\partial z_i\\) is\r\nsimply 1 if \\(a_i > 0\\), and 0 otherwise. The softmax gradient is\r\nslightly more involved: because every output of the softmax depends on\r\nevery component of \\(\\mathbf{z}\\), we require a sum over all outputs here:\r\n\r\n\\[\\frac{\\partial J_{CE}}{\\partial z_i} = \\sum_{j} \\frac{\\partial J_{CE}}{\\partial a_j} \\begin{cases}\r\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = j \\\\\r\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j & i \\neq j\r\n\\end{cases}\\]\r\n\r\nEach factor \\(\\partial J_{CE}/\\partial a_j\\) comes from the chain rule\r\nand is passed in from the subsequent node. These incoming gradients are\r\nscaled by the activation derivatives \\(\\partial a_j/\\partial z_i\\), and\r\nthe results are stored in \\texttt{activation\\_gradients\\_} in the top portion of\r\n\\texttt{FFNode::reverse}. 
Equivalently by the chain rule, we are caching\r\nin \\texttt{activation\\_gradients\\_} \\(\\partial J_{CE}/\\partial z_i\\) for\r\neach \\(i\\). Because the loss gradients with respect to every parameter\r\nand input have a functional dependence on the activation function\r\ngradients, all results computed in tasks 1 through 4 above will depend\r\non \\texttt{activation\\_gradients\\_}.\r\n\r\n\\hypertarget{computing-bias-gradients}{%\r\n\\paragraph{Computing bias gradients}\\label{computing-bias-gradients}}\r\n\r\nThe bias gradients are the easiest to compute due to how they show up in\r\nthe expression. Since a node's output is given as\r\n\r\n\\[a_i = g\\left(\\mathbf{W}_i \\cdot \\mathbf{x} + b_i = z_i\\right)\\]\r\n\r\nfor some activation function \\(g\\), the derivative with respect to\r\n\\(b_i\\) is just\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\frac{\\partial{a_i}}{\\partial b_i} &= \\frac{\\partial g}{\\partial z_i}\\frac{\\partial z_i}{\\partial b_i} \\\\\r\n&= \\frac{\\partial g}{\\partial z_i}\r\n\\end{aligned}\r\n\\]\r\n\r\nThus we can simply accumulate the result stored in\r\n\\texttt{activation\\_gradients\\_} as the loss gradient with respect to\r\neach bias. Please take note! 
The code that performs this update is\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{bias\\_gradients\\_}\\NormalTok{[i] += }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[i];}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThe following code would \\emph{not} be correct:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\CommentTok{// }\\AlertTok{NOTE}\\CommentTok{: WRONG! Only correct for batch sizes of 1}\r\n        \\VariableTok{bias\\_gradients\\_}\\NormalTok{[i] = }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[i];}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nAs the admonition in the comment suggests, while it's helpful to\r\nconceptualize the loss gradient as something that resets every time we\r\nperform a forward and reverse pass of a training sample, in actuality,\r\nwe require the gradients with respect to the \\emph{cumulative mean loss\r\naccrued while evaluating the entire batch} for stochastic gradient\r\ndescent. Luckily, because the losses per sample accumulate additively,\r\nthe gradients of the loss with respect to all parameters in the model\r\nalso accumulate additively.\r\n\r\n\\hypertarget{computing-the-weight-gradients}{%\r\n\\paragraph{Computing the weight\r\ngradients}\\label{computing-the-weight-gradients}}\r\n\r\nThe weight gradients are slightly more involved than the bias gradients,\r\nbut are still relatively easy to compute with a bit of bookkeeping. 
For\r\nany given weight \\(w_{ij}\\), we can observe that such a weight\r\nparticipates only in the evaluation of \\(z_i\\). That is:\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\frac{\\partial \\mathbf{z}}{\\partial w_{ij}} &= \\frac{\\partial z_i}{\\partial w_{ij}} \\\\\r\n&= \\frac{\\partial (\\mathbf{w}_{i} \\cdot \\mathbf{x} + b_i)}{\\partial w_{ij}} \\\\\r\n&= x_j \\\\\r\n\\end{aligned}\r\n\\]\r\n\r\n\\[\r\n\\boxed{\\frac{\\partial J_{CE}}{\\partial w_{ij}} = \\frac{\\partial J_{CE}}{\\partial a_i}\\frac{\\partial a_i}{\\partial z_i}x_j}\r\n\\]\r\n\r\nThe boxed result shows the final loss gradient with respect to a weight\r\nparameter. The weight gradient accumulation appears in the following\r\ncode, where all \\(N \\times M\\) weights are updated in a couple of nested\r\nloops:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{weight\\_gradients\\_}\\NormalTok{[j * }\\VariableTok{input\\_size\\_}\\NormalTok{ + i]}\r\n\\NormalTok{                += }\\VariableTok{last\\_input\\_}\\NormalTok{[i] * }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[j];}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nNote that the loop variables are swapped relative to the subscripts in\r\nthe equations: in this loop, \\(i\\) runs over the inputs and \\(j\\) runs over\r\nthe outputs.\r\n\r\n\\hypertarget{computing-the-input-gradients}{%\r\n\\paragraph{Computing the input\r\ngradients}\\label{computing-the-input-gradients}}\r\n\r\nThe last set of gradients we need to compute are the loss gradients with\r\nrespect to the inputs, to be forwarded to the antecedent node. This\r\ncalculation is similar to the calculation of the weight gradients in\r\nterms of the linear dependence. 
However, it is important to note that a\r\ngiven input participates in the computation of \\emph{all} output\r\nscalars. Thus, we expect each individual input gradient to be a\r\nsummation over the outputs, where \\(w_{ji}\\) is the weight connecting input\r\n\\(i\\) to output \\(j\\).\r\n\r\n\\[\r\n\\frac{\\partial J_{CE}}{\\partial x_i} = \\sum_j \\frac{\\partial J_{CE}}{\\partial a_j}\\frac{\\partial a_j}{\\partial z_j}w_{ji}\r\n\\]\r\n\r\nThe code that computes the input gradients is defined here:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\BuiltInTok{std::}\\NormalTok{fill(}\\VariableTok{input\\_gradients\\_}\\NormalTok{.begin(), }\\VariableTok{input\\_gradients\\_}\\NormalTok{.end(), }\\DecValTok{0}\\NormalTok{);}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{output\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\DataTypeTok{size\\_t}\\NormalTok{ offset = i * }\\VariableTok{input\\_size\\_}\\NormalTok{;}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{input\\_gradients\\_}\\NormalTok{[j]}\r\n\\NormalTok{                += }\\VariableTok{weights\\_}\\NormalTok{[offset + j] * }\\VariableTok{activation\\_gradients\\_}\\NormalTok{[i];}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nNote that unlike the weight and bias gradients, which accumulate while\r\ntraining an entire batch of samples, the input gradients here are\r\nephemeral and reset every pass since they only depend on the evaluation\r\nof an individual sample.\r\n\r\nFinally, to complete the \\texttt{FFNode::reverse} method, the input\r\ngradients computed are passed backwards for use in an antecedent node's\r\ngradient update (reproduced below). 
The code as presented \\emph{does not\r\nwork} with non-sequential computational graphs, but is meant to provide\r\na starting point for further experimentation.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\ControlFlowTok{for}\\NormalTok{ (Node* node : }\\VariableTok{antecedents\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        node{-}\\textgreater{}reverse(}\\VariableTok{input\\_gradients\\_}\\NormalTok{.data());}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\n\\hypertarget{the-categorical-cross-entropy-loss-node}{%\r\n\\subsubsection{The Categorical Cross-Entropy Loss\r\nNode}\\label{the-categorical-cross-entropy-loss-node}}\r\n\r\nThe last node we need to implement is the node which computes the\r\ncategorical cross-entropy of the prediction. A possible class definition\r\nfor such a node is shown below:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\KeywordTok{class}\\NormalTok{ CCELossNode : }\\KeywordTok{public}\\NormalTok{ Node}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n\\NormalTok{    CCELossNode(Model\\& model,}\r\n                \\BuiltInTok{std::}\\NormalTok{string name,}\r\n                \\DataTypeTok{uint16\\_t}\\NormalTok{ input\\_size,}\r\n                \\DataTypeTok{size\\_t}\\NormalTok{ batch\\_size);}\r\n\r\n    \\CommentTok{// No initialization is needed for this node}\r\n    \\DataTypeTok{void}\\NormalTok{ init(}\\DataTypeTok{rne\\_t}\\NormalTok{\\&) }\\KeywordTok{override}\\NormalTok{ \\{\\}}\r\n\r\n    \\DataTypeTok{void}\\NormalTok{ forward(}\\DataTypeTok{num\\_t}\\NormalTok{* inputs) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// As a loss node, the argument to this method is ignored (the gradient of}\r\n    \\CommentTok{// the loss with respect to itself is unity)}\r\n    \\DataTypeTok{void}\\NormalTok{ reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* gradients = }\\KeywordTok{nullptr}\\NormalTok{) 
}\\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\DataTypeTok{void}\\NormalTok{ print() }\\AttributeTok{const} \\KeywordTok{override}\\NormalTok{;}\r\n\r\n    \\CommentTok{// During training, this must be set to the expected target distribution}\r\n    \\CommentTok{// for a given sample}\r\n    \\DataTypeTok{void}\\NormalTok{ set\\_target(}\\DataTypeTok{num\\_t} \\AttributeTok{const}\\NormalTok{* target)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{target\\_}\\NormalTok{ = target;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\DataTypeTok{num\\_t}\\NormalTok{ accuracy() }\\AttributeTok{const}\\NormalTok{;}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{ avg\\_loss() }\\AttributeTok{const}\\NormalTok{;}\r\n    \\DataTypeTok{void}\\NormalTok{ reset\\_score();}\r\n\r\n\\KeywordTok{private}\\NormalTok{:}\r\n    \\DataTypeTok{uint16\\_t} \\VariableTok{input\\_size\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{// We minimize the average loss, not the net loss so that the losses}\r\n    \\CommentTok{// produced do not scale with batch size (which allows us to keep training}\r\n    \\CommentTok{// parameters constant)}\r\n    \\DataTypeTok{num\\_t} \\VariableTok{inv\\_batch\\_size\\_}\\NormalTok{;}\r\n    \\DataTypeTok{num\\_t} \\VariableTok{loss\\_}\\NormalTok{;}\r\n    \\DataTypeTok{num\\_t} \\AttributeTok{const}\\NormalTok{* }\\VariableTok{target\\_}\\NormalTok{;}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{* }\\VariableTok{last\\_input\\_}\\NormalTok{;}\r\n    \\CommentTok{// Stores the last active classification in the target one{-}hot encoding}\r\n    \\DataTypeTok{size\\_t} \\VariableTok{active\\_}\\NormalTok{;}\r\n    \\DataTypeTok{num\\_t} \\VariableTok{cumulative\\_loss\\_}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n    \\CommentTok{// Store running counts of correct and incorrect predictions}\r\n    \\DataTypeTok{size\\_t} \\VariableTok{correct\\_}\\NormalTok{   = }\\DecValTok{0}\\NormalTok{;}\r\n    \\DataTypeTok{size\\_t} \\VariableTok{incorrect\\_}\\NormalTok{ = 
}\\DecValTok{0}\\NormalTok{;}\r\n    \\BuiltInTok{std::}\\NormalTok{vector\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{} }\\VariableTok{gradients\\_}\\NormalTok{;}\r\n\\NormalTok{\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThe \\texttt{CCELossNode} is similar to other nodes in that it implements\r\na forward pass for computing the loss of a given sample, and a reverse\r\npass to compute gradients of that loss and pass them back to the\r\nantecedent node. Distinct from the previous nodes is that the argument\r\nto \\texttt{CCELossNode::reverse} is ignored as the loss node is not\r\nexpected to have any subsequents.\r\n\r\nThe implementation of \\texttt{CCELossNode::forward} follows from the\r\ndefinition of cross-entropy, recalled here with some modifications:\r\n\r\n\\[J_{CE}(\\hat{\\mathbf{y}}, \\mathbf{y}) = -\\sum_j y_j \\log{\\left(\\max(\\hat{y}_j, \\epsilon) \\right)} \\]\r\n\r\n\\(J\\) is the common symbol ascribed to the cost or objective function,\r\nwhile \\(\\hat{y}\\) and \\(y\\) refer to the predicted distribution and\r\ncorrect distribution respectively. In addition, the argument of the\r\nlogarithm is clamped with a small \\(\\epsilon\\) to avoid a numerical\r\nsingularity. 
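\r\n\r\nAs a quick sanity check on this definition, here is a small worked example. The probabilities below are made up for illustration (they are not outputs of the network in this article), and \\(\\log\\) is assumed to be the natural logarithm, matching \\texttt{std::log}. For a one-hot target with \\(y_k = 1\\), every term of the sum except \\(j = k\\) vanishes:\r\n\r\n\\[\r\n\\begin{aligned}\r\nJ_{CE} &= -\\log{\\left(\\max(\\hat{y}_k, \\epsilon)\\right)} \\\\\r\n\\hat{y}_k = 0.9 &\\implies J_{CE} \\approx 0.105 \\\\\r\n\\hat{y}_k = 0.1 &\\implies J_{CE} \\approx 2.303 \\\\\r\n\\hat{y}_k = 0 &\\implies J_{CE} = -\\log{\\epsilon}\r\n\\end{aligned}\r\n\\]\r\n\r\nA confident correct prediction is therefore penalized lightly and a confident wrong one heavily, while the clamp keeps the loss finite even if the predicted probability underflows to zero.\r\n\r\n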
The implementation is as follows:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ CCELossNode::forward(}\\DataTypeTok{num\\_t}\\NormalTok{* data)}\r\n\\NormalTok{\\{}\r\n    \\DataTypeTok{num\\_t}\\NormalTok{ max\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n    \\DataTypeTok{size\\_t}\\NormalTok{ max\\_index;}\r\n\r\n    \\VariableTok{loss\\_}\\NormalTok{ = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{if}\\NormalTok{ (data[i] \\textgreater{} max)}\r\n\\NormalTok{        \\{}\r\n\\NormalTok{            max\\_index = i;}\r\n\\NormalTok{            max       = data[i];}\r\n\\NormalTok{        \\}}\r\n\r\n        \\VariableTok{loss\\_}\\NormalTok{ {-}= }\\VariableTok{target\\_}\\NormalTok{[i]}\r\n\\NormalTok{                 * }\\BuiltInTok{std::}\\NormalTok{log(}\r\n                     \\BuiltInTok{std::}\\NormalTok{max(data[i], }\\BuiltInTok{std::}\\NormalTok{numeric\\_limits\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}::epsilon()));}\r\n\r\n        \\ControlFlowTok{if}\\NormalTok{ (}\\VariableTok{target\\_}\\NormalTok{[i] != }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\})}\r\n\\NormalTok{        \\{}\r\n            \\VariableTok{active\\_}\\NormalTok{ = i;}\r\n\\NormalTok{        \\}}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{if}\\NormalTok{ (max\\_index == }\\VariableTok{active\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        ++}\\VariableTok{correct\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n    \\ControlFlowTok{else}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        ++}\\VariableTok{incorrect\\_}\\NormalTok{;}\r\n\\NormalTok{    \\}}\r\n\r\n    \\VariableTok{cumulative\\_loss\\_}\\NormalTok{ += 
}\\VariableTok{loss\\_}\\NormalTok{;}\r\n\r\n    \\CommentTok{// Store the data pointer to compute gradients later}\r\n    \\VariableTok{last\\_input\\_}\\NormalTok{ = data;}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nAs with the feedforward node, a pointer to the inputs to the node is\r\npreserved to compute gradients later. A bit of bookkeeping is also done\r\nso we can track accuracy and accumulate loss during batch. The\r\nderivative of the loss of an individual sample with respect to the\r\ninputs is also fairly straightforward.\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\frac{\\partial J_{CE}}{\\partial{\\hat{y}_i}} &= \\frac{\\partial \\left(-\\sum_j y_j\\log{\\left(\\max(\\hat{y}_j, \\epsilon)\\right)}\\right)}{\\partial \\hat{y}_i} \\\\\r\n&= -\\frac{y_i}{\\max(\\hat{y}_i, \\epsilon)}\r\n\\end{aligned}\r\n\\]\r\n\r\nThe implementation is similarly straightforward. As with the other nodes\r\nwith loss gradients, the loss gradients with respect to all inputs are\r\nforwarded to antecedent nodes.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ CCELossNode::reverse(}\\DataTypeTok{num\\_t}\\NormalTok{* data)}\r\n\\NormalTok{\\{}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\VariableTok{input\\_size\\_}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\VariableTok{gradients\\_}\\NormalTok{[i] = {-}}\\VariableTok{inv\\_batch\\_size\\_}\\NormalTok{ * }\\VariableTok{target\\_}\\NormalTok{[i]}\r\n\\NormalTok{            / }\\BuiltInTok{std::}\\NormalTok{max(}\\VariableTok{last\\_input\\_}\\NormalTok{[i], }\\BuiltInTok{std::}\\NormalTok{numeric\\_limits\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}::epsilon());}\r\n\\NormalTok{    \\}}\r\n\r\n    \\ControlFlowTok{for}\\NormalTok{ (Node* node : }\\VariableTok{antecedents\\_}\\NormalTok{)}\r\n\\NormalTok{    \\{}\r\n\\NormalTok{        
node{-}\\textgreater{}reverse(}\\VariableTok{gradients\\_}\\NormalTok{.data());}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nOne thing to keep in mind here is that this implementation is \\emph{not}\r\nthe most efficient implementation possible for a softmax layer feeding\r\nto a cross-entropy loss function by any stretch. The code and derivation\r\nhere are completely general for arbitrary sample probability\r\ndistributions. If, however, we can assume that the target distribution\r\nis one-hot encoded, then all gradients in this node will either be 0 or\r\n\\(-1/\\hat{y}_k\\) where \\(k\\) is the active label in the one-hot target.\r\nUpon substitution in the previous layer, it should be clear that\r\nimportant cancellations are possible that dramatically simplify the\r\ngradient computations in the softmax layer. Here's the simplification,\r\nagain assuming that the \\(k\\)th index is the correct label:\r\n\r\n\\[\r\n\\begin{aligned}\r\n\\frac{\\partial J_{CE}}{\\partial z_i} &= \\sum_{j} \\frac{\\partial J_{CE}}{\\partial a_j} \\begin{cases}\r\n\\mathrm{softmax}(\\mathbf{z})_i\\left(1 - \\mathrm{softmax}(\\mathbf{z})_i\\right) & i = j \\\\\r\n-\\mathrm{softmax}(\\mathbf{z})_i \\mathrm{softmax}(\\mathbf{z})_j & i \\neq j\r\n\\end{cases} \\\\\r\n&= \\begin{dcases}\r\n-\\frac{\\mathrm{softmax}(\\mathbf{z})_k(1 - \\mathrm{softmax}(\\mathbf{z})_k)}{\\mathrm{softmax}(\\mathbf{z})_k} & i = k \\\\\r\n\\frac{\\mathrm{softmax}(\\mathbf{z})_i\\mathrm{softmax}(\\mathbf{z})_k}{\\mathrm{softmax}(\\mathbf{z})_k} & i \\neq k\\\\\r\n\\end{dcases} \\\\\r\n&= \\begin{dcases}\r\n\\mathrm{softmax}(\\mathbf{z})_k - 1 & i = k \\\\\r\n\\mathrm{softmax}(\\mathbf{z})_i & i \\neq k\r\n\\end{dcases}\r\n\\end{aligned}\r\n\\]\r\n\r\nWhen following the computation above, remember that\r\n\\(\\partial J_{CE} / \\partial a_j\\) is 0 for all \\(j \\neq k\\). 
Thus, the\r\nonly term in the sum that survives is the term corresponding to\r\n\\(j = k\\), at which point we break out the differentiation depending on\r\nwhether \\(i = k\\) or \\(i \\neq k\\).\r\n\r\nThis is an elegant result! Essentially, the gradient of the loss with\r\nrespect to the softmax input behind an emitted probability \\(p(x)\\) is\r\nsimply \\(p(x)\\) if \\(x\\)\r\nwas not the correct label, and \\(p(x) - 1\\) if it was. Considering the\r\neffect of gradient descent, this should check out with our intuition.\r\nThe optimizer seeks to suppress probabilities predicted that should have\r\nbeen 0, and increase probabilities predicted that should have been 1.\r\nCheck for yourself that after gradient descent is performed, the\r\ngradients derived here will nudge the model in the appropriate\r\ndirection.\r\n\r\nThis sort of optimization highlights an important observation about\r\nbackpropagation, namely, that backpropagation does not guarantee any\r\nsort of optimality beyond a worst-case performance ceiling. Several\r\nproduction neural networks have architectures that employ heuristics to\r\nidentify optimizations such as this one, but the problem of generating a\r\nperfect computational strategy is NP-hard and so not covered here. 
The code\r\nprovided here will remain in the general form, despite being slower in\r\nthe interest of maintaining generality and not adding complexity, but\r\nyou are encouraged to consider abstractions to permit this type of\r\noptimization in your own architecture (a useful keyword to aid your\r\nresearch is \\emph{common subexpression elimination} or \\emph{CSE} for\r\nshort).\r\n\r\nThe last thing we need to provide for \\texttt{CCELossNode} are a few\r\nhelper routines:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{void}\\NormalTok{ CCELossNode::print() }\\AttributeTok{const}\r\n\\NormalTok{\\{}\r\n    \\BuiltInTok{std::}\\NormalTok{printf(}\\StringTok{\"Avg Loss: }\\SpecialCharTok{\\%f\\textbackslash{}t\\%f\\%\\%}\\StringTok{ correct}\\SpecialCharTok{\\textbackslash{}n}\\StringTok{\"}\\NormalTok{, avg\\_loss(), accuracy() * }\\FloatTok{100.0}\\NormalTok{);}\r\n\\NormalTok{\\}}\r\n\r\n\\DataTypeTok{num\\_t}\\NormalTok{ CCELossNode::accuracy() }\\AttributeTok{const}\r\n\\NormalTok{\\{}\r\n    \\ControlFlowTok{return} \\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{correct\\_}\\NormalTok{)}\r\n\\NormalTok{           / }\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{correct\\_}\\NormalTok{ + }\\VariableTok{incorrect\\_}\\NormalTok{);}\r\n\\NormalTok{\\}}\r\n\\DataTypeTok{num\\_t}\\NormalTok{ CCELossNode::avg\\_loss() }\\AttributeTok{const}\r\n\\NormalTok{\\{}\r\n    \\ControlFlowTok{return} \\VariableTok{cumulative\\_loss\\_}\\NormalTok{ / }\\KeywordTok{static\\_cast}\\NormalTok{\\textless{}}\\DataTypeTok{num\\_t}\\NormalTok{\\textgreater{}(}\\VariableTok{correct\\_}\\NormalTok{ + }\\VariableTok{incorrect\\_}\\NormalTok{);}\r\n\\NormalTok{\\}}\r\n\r\n\\DataTypeTok{void}\\NormalTok{ CCELossNode::reset\\_score()}\r\n\\NormalTok{\\{}\r\n    \\VariableTok{cumulative\\_loss\\_}\\NormalTok{ = 
}\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n    \\VariableTok{correct\\_}\\NormalTok{         = }\\DecValTok{0}\\NormalTok{;}\r\n    \\VariableTok{incorrect\\_}\\NormalTok{       = }\\DecValTok{0}\\NormalTok{;}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThese routines let us observe the performance of our network during\r\ntraining in terms of both loss and accuracy.\r\n\r\n\\hypertarget{gradient-descent-optimizer}{%\r\n\\subsubsection{Gradient Descent\r\nOptimizer}\\label{gradient-descent-optimizer}}\r\n\r\nAt some point after loss gradients with respect to model parameters have\r\naccumulated, the gradients will need to be used to actually adjust the\r\nparameters themselves. This is provided by the \\texttt{GDOptimizer}\r\nclass implemented as below:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\KeywordTok{class}\\NormalTok{ GDOptimizer : }\\KeywordTok{public}\\NormalTok{ Optimizer}\r\n\\NormalTok{\\{}\r\n\\KeywordTok{public}\\NormalTok{:}\r\n    \\CommentTok{// \"Eta\" is the commonly accepted character used to denote the learning}\r\n    \\CommentTok{// rate. Given a loss gradient dJ/dp for some parameter p, during gradient}\r\n    \\CommentTok{// descent, p will be adjusted such that p\\textquotesingle{} = p {-} eta * dJ/dp.}\r\n\\NormalTok{    GDOptimizer(}\\DataTypeTok{num\\_t}\\NormalTok{ eta) : }\\VariableTok{eta\\_}\\NormalTok{\\{eta\\} \\{\\}}\r\n\r\n    \\CommentTok{// This should be invoked at the end of each batch\\textquotesingle{}s evaluation. 
The}\r\n    \\CommentTok{// interface technically permits the use of different optimizers for}\r\n    \\CommentTok{// different segments of the computational graph.}\r\n    \\DataTypeTok{void}\\NormalTok{ train(Node\\& node) }\\KeywordTok{override}\\NormalTok{;}\r\n\r\n\\KeywordTok{private}\\NormalTok{:}\r\n    \\DataTypeTok{num\\_t} \\VariableTok{eta\\_}\\NormalTok{;}\r\n\\NormalTok{\\};}\r\n\r\n\\DataTypeTok{void}\\NormalTok{ GDOptimizer::train(Node\\& node)}\r\n\\NormalTok{\\{}\r\n    \\DataTypeTok{size\\_t}\\NormalTok{ param\\_count = node.param\\_count();}\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != param\\_count; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\DataTypeTok{num\\_t}\\NormalTok{\\& param    = *node.param(i);}\r\n        \\DataTypeTok{num\\_t}\\NormalTok{\\& gradient = *node.gradient(i);}\r\n\r\n\\NormalTok{        param = param {-} }\\VariableTok{eta\\_}\\NormalTok{ * gradient;}\r\n\r\n        \\CommentTok{// Reset the gradient which will be accumulated again in the next}\r\n        \\CommentTok{// training epoch}\r\n\\NormalTok{        gradient = }\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.0}\\NormalTok{\\};}\r\n\\NormalTok{    \\}}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nNot shown is the \\texttt{Optimizer} class interface which simply\r\nprovides a virtual \\texttt{train} method. As you implement more\r\nsophisticated optimizers, you will find that more state may be needed to\r\nperform necessary tasks (e.g.~computing gradient moving averages). Also\r\nimplicit in this implementation is that our \\texttt{Node} classes need\r\nto provide an indexing scheme for each parameter as well as an accessor\r\nfor the total number of parameters. 
For example, accessing the\r\n\\texttt{FFNode} parameters is a fairly simple matter:\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\DataTypeTok{num\\_t}\\NormalTok{* FFNode::param(}\\DataTypeTok{size\\_t}\\NormalTok{ index)}\r\n\\NormalTok{\\{}\r\n    \\ControlFlowTok{if}\\NormalTok{ (index \\textless{} }\\VariableTok{weights\\_}\\NormalTok{.size())}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{return}\\NormalTok{ \\&}\\VariableTok{weights\\_}\\NormalTok{[index];}\r\n\\NormalTok{    \\}}\r\n    \\ControlFlowTok{return}\\NormalTok{ \\&}\\VariableTok{biases\\_}\\NormalTok{[index {-} }\\VariableTok{weights\\_}\\NormalTok{.size()];}\r\n\\NormalTok{\\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nThe parameters are indexed 0 through the return value of\r\n\\texttt{Node::param\\_count()} minus one. Note that the optimizer doesn't\r\ncare whether the parameter accessed in this way is a weight, bias,\r\naverage, etc. As a trainable parameter, the only thing that matters\r\nduring gradient descent is the current value and the loss gradient.\r\n\r\n\\hypertarget{tying-it-all-together}{%\r\n\\subsection{Tying it all Together}\\label{tying-it-all-together}}\r\n\r\nNow that we have the individual nodes implemented, all that remains is\r\nto wire things up and start training! 
This is how we can construct a\r\nmodel with input, hidden, output, and loss nodes, all wired\r\nsequentially.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n\\NormalTok{    Model model\\{}\\StringTok{\"ff\"}\\NormalTok{\\};}\r\n\r\n\\NormalTok{    MNIST\\& mnist = model.add\\_node\\textless{}MNIST\\textgreater{}(images, labels);}\r\n\r\n\\NormalTok{    FFNode\\& hidden = model.add\\_node\\textless{}FFNode\\textgreater{}(}\\StringTok{\"hidden\"}\\NormalTok{, Activation::ReLU, }\\DecValTok{32}\\NormalTok{, }\\DecValTok{784}\\NormalTok{);}\r\n\r\n\\NormalTok{    FFNode\\& output}\r\n\\NormalTok{        = model.add\\_node\\textless{}FFNode\\textgreater{}(}\\StringTok{\"output\"}\\NormalTok{, Activation::Softmax, }\\DecValTok{10}\\NormalTok{, }\\DecValTok{32}\\NormalTok{);}\r\n\r\n\\NormalTok{    CCELossNode\\& loss = model.add\\_node\\textless{}CCELossNode\\textgreater{}(}\\StringTok{\"loss\"}\\NormalTok{, }\\DecValTok{10}\\NormalTok{, batch\\_size);}\r\n\\NormalTok{    loss.set\\_target(mnist.label());}\r\n\r\n\\NormalTok{    model.create\\_edge(hidden, mnist);}\r\n\\NormalTok{    model.create\\_edge(output, hidden);}\r\n\\NormalTok{    model.create\\_edge(loss, output);}\r\n    \r\n    \\CommentTok{// This function should visit all constituent nodes and initialize}\r\n    \\CommentTok{// their parameters}\r\n\\NormalTok{    model.init();}\r\n    \r\n    \\CommentTok{// Create a gradient descent optimizer with a hardcoded learning rate}\r\n\\NormalTok{    GDOptimizer optimizer\\{}\\DataTypeTok{num\\_t}\\NormalTok{\\{}\\FloatTok{0.3}\\NormalTok{\\}\\};}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nAs mentioned before, the ``edges'' are somewhat cosmetic as none of our\r\nnodes actually support multiple node inputs or outputs. An actual\r\nimplementation that would support such a non-sequential topology will\r\nlikely need a sort of signals and slots abstraction. 
The interface\r\nprovided here is strictly to impress on you the importance of the\r\nabstraction of our neural network as a computational graph, which is\r\ncritical when additional complexity is added later.\r\n\r\nWith this, we are ready to implement the core loop of the training\r\nalgorithm.\r\n\r\n\\begin{Shaded}\r\n\\begin{Highlighting}[]\r\n    \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ i = }\\DecValTok{0}\\NormalTok{; i != }\\DecValTok{256}\\NormalTok{; ++i)}\r\n\\NormalTok{    \\{}\r\n        \\ControlFlowTok{for}\\NormalTok{ (}\\DataTypeTok{size\\_t}\\NormalTok{ j = }\\DecValTok{0}\\NormalTok{; j != }\\DecValTok{64}\\NormalTok{; ++j)}\r\n\\NormalTok{        \\{}\r\n\\NormalTok{            mnist.forward();}\r\n\\NormalTok{            loss.reverse();}\r\n\\NormalTok{        \\}}\r\n\r\n\\NormalTok{        model.train(optimizer);}\r\n\\NormalTok{    \\}}\r\n\\end{Highlighting}\r\n\\end{Shaded}\r\n\r\nHere, we train our model over 256 batches. Each batch consists of 64\r\nsamples, and for each sample, we invoke \\texttt{MNIST::forward} and\r\n\\texttt{CCELossNode::reverse}. During the forward pass, our\r\n\\texttt{MNIST} node extracts a new sample and label and forwards the\r\nsample data to the next node. This data propagates through the network\r\nuntil the final output distribution is passed to the loss node and\r\nlosses are computed. All this occurs within the single line:\r\n\\texttt{mnist.forward()}. In the subsequent line,\r\ngradients are computed and passed back until the reverse accumulation\r\nterminates at the \\texttt{MNIST} node again. 
After all gradients for the\r\nbatch are accumulated, the model can \\texttt{train}, which invokes the\r\noptimizer on each node to adjust all of the model's parameters\r\nsimultaneously.\r\n\r\nAfter adding some additional logging, the results of the network look\r\nlike this:\r\n\r\n\\begin{verbatim}\r\nExecuting training routine\r\nLoaded images file with 60000 entries\r\nhidden: 784 -> 32\r\noutput: 32 -> 10\r\nInitializing model parameters with seed: 116726080\r\nAvg Loss: 0.254111  96.875000% correct\r\n\\end{verbatim}\r\n\r\nTo evaluate the efficacy of the model, we can serialize all the\r\nparameters to disk, load them up, disable the training step, and\r\nevaluate the model on the test data. For this particular run, the\r\nresults were as follows:\r\n\r\n\\begin{verbatim}\r\nExecuting evaluation routine\r\nLoaded images file with 10000 entries\r\nhidden: 784 -> 32\r\noutput: 32 -> 10\r\nAvg Loss: 0.292608  91.009998% correct\r\n\\end{verbatim}\r\n\r\nAs you can see, the accuracy dropped on the test data relative to the\r\ntraining data. This is a hallmark characteristic of \\emph{overfitting},\r\nwhich is to be expected given that we haven't implemented any\r\nregularization whatsoever! That said, 91\\% accuracy isn't all that bad\r\nwhen we consider the fact that our model has no notion of\r\npixel-adjacency whatsoever. For image data, convolutional networks are a\r\nfar more apt architecture than the one chosen for this demonstration.\r\n\r\n\\hypertarget{regularization}{%\r\n\\subsubsection{Regularization}\\label{regularization}}\r\n\r\nRegularization will not be implemented as part of this self-contained\r\nneural network, but it is such a fundamental part of most deep learning\r\nframeworks that we'll discuss it here.\r\n\r\nOften, the dimensionality of our model will be much higher than what is\r\nstrictly needed to make accurate predictions. 
This stems from the fact\r\nthat we seldom know a priori how many features are needed for the model to\r\nbe successful. Thus, the likelihood of overfitting increases as more\r\ntraining data is fed into the model. The primary tool to combat\r\noverfitting is \\emph{regularization}. Loosely speaking, regularization\r\nis any strategy employed to restrict the hypothesis space of\r\nfit-functions the model can occupy to prevent overfitting.\r\n\r\nWhat is meant by restricting the hypothesis space, you might ask? The\r\nidea is to consider the entire family of possible functions spanned by\r\nthe model's entire parameter vector. If our model has 10000 parameters\r\n(many networks will easily exceed this), each unique 10000-dimensional\r\nvector corresponds to a possible solution. However, we know it's\r\nunlikely that certain parameters should be vastly greater in magnitude\r\nthan others in a theoretically \\emph{optimal} condition. Models with\r\n``strange'' parameter vectors that are unlikely to be the optimal\r\nsolution are likely converged on as a result of overfitting. Therefore,\r\nit makes sense to consider ways to constrain the space this parameter\r\nvector may occupy.\r\n\r\nThe most common approach to achieve this is to add an additional penalty\r\nterm to the loss function which is a function of the weights. For\r\nexample, here is the cross-entropy loss with the so-called \\(L^2\\)\r\nregularizer (also known as the ridge regularizer) added:\r\n\r\n\\[-\\sum_{x\\in X} y_x \\log{\\hat{y}_x} + \\frac{\\lambda}{2} \\mathbf{w}^{T}\\mathbf{w}\\]\r\n\r\nIn a slight abuse of notation, \\(\\mathbf{w}\\) here corresponds to a\r\nvector containing every weight in our network. The factor \\(\\lambda\\) is\r\na constant we can choose to adjust the penalty size. Note that when a\r\nregularizer is used, we \\emph{expect training loss to increase}. 
The\r\ntradeoff is that we simultaneously \\emph{expect test loss to decrease}.\r\nTuning the regularization rate \\(\\lambda\\) is a routine problem for\r\nmodel fitting in the wild.\r\n\r\nBecause we have modified the loss function, in principle, all loss gradients must\r\nchange as well. Fortunately, as we've only added a quadratic term to the\r\nloss, the only change to the gradient will be an additional linear\r\nadditive term \\(\\lambda\\mathbf{w}\\). This means we don't have to add a\r\nton of code to modify all the gradient calculations thus far. Instead,\r\nwe can simply \\emph{decay} the weight based on a percentage of the\r\nweight's magnitude when we adjust the weight after each batch is\r\nprocessed. You will often hear this type of regularization referred to\r\nas simply \\emph{weight decay} for this reason.\r\n\r\nTo implement \\(L^2\\) regularization, simply add a percentage of a\r\nweight's value to its loss gradient. Crucially, do not adjust bias\r\nparameters in the same way. We only wish to penalize parameters for\r\nwhich increased magnitude corresponds with more complex models. Bias\r\nparameters are simply scalar offsets that, regardless of their value, do\r\nnot scale the inputs. Thus, attempting to regularize them will likely\r\nincrease \\emph{both} training and test error.\r\n\r\n\\hypertarget{where-to-go-from-here}{%\r\n\\subsection{Where to go from here}\\label{where-to-go-from-here}}\r\n\r\nAt this point, our toy network is complete. With any luck, you've taken\r\naway a few key patterns that will aid in both your intuition about how\r\ndeep learning techniques work, and your efforts to actually implement\r\nthem. The implementation presented here is both far from complete, and\r\nfar from ideal. Critically missing is adequate visualization for the\r\nerror rate as a function of training time, mis-predicted samples, and\r\nthe model parameters themselves. Without visualization, model tuning can\r\nbe time consuming, verging on impossible. 
In addition, our model\r\ntraining samples are always ingested in the order they are provided in\r\nthe training file. In practice, this sequence should be shuffled to\r\navoid introducing training bias.\r\n\r\nHere are a few additional things you can try, in no particular order.\r\n\r\n\\begin{itemize}\r\n\\tightlist\r\n\\item\r\n  Add various regularization modes such as \\(L^2\\), \\(L^1\\), or dropout.\r\n\\item\r\n  Track loss reduction momentum to implement \\emph{early stopping},\r\n  thereby reducing wasted training cycles\r\n\\item\r\n  Implement a convolution node with a variable sized weight filter. You\r\n  will likely need to implement the max-pooling operation as well.\r\n\\item\r\n  Implement a batch-normalization node.\r\n\\item\r\n  Modify the interfaces provided here so that \\texttt{Node::forward} and\r\n  \\texttt{Node::reverse} also pass slot ids to handle nodes with\r\n  multiple inputs and outputs.\r\n\\item\r\n  Leverage the slots abstraction above to implement a residual network.\r\n\\item\r\n  Improve efficiency by adding support for SIMD or GPU-based compute\r\n  kernels.\r\n\\item\r\n  Add multithreading to allow separate batches to be trained\r\n  simultaneously.\r\n\\item\r\n  Provide alternative optimizers that decay the learning rate over time,\r\n  or decay the learning rate as a function of loss momentum.\r\n\\item\r\n  Add a ``meta-training'' feature that can tune \\emph{hyperparameters}\r\n  used to configure your model (e.g.~learning rate, regularization rate,\r\n  network depth, layer dimension).\r\n\\item\r\n  Pick a research paper you're interested in and endeavor to implement\r\n  it end to end.\r\n\\end{itemize}\r\n\r\nAs you can see, the sky's the limit and there is simply no end to the\r\namount of work possible to improve a neural network's ability to learn\r\nand make inferences. 
There is also plenty of work to be done to improve\r\ntooling around data ingestion, model configuration serialization,\r\nautomated testing, continuous learning in the cloud, etc. Crucially\r\nthough, new research and development is constantly in the works in this\r\never-changing field. On top of studying deep learning as a discipline in\r\nand of itself, there is plenty of room for specialization in particular\r\ndomains, be it computer vision, NLP, epidemiology, or something else. My\r\nhope is that for some of you, the neural network in a weekend may take\r\nthe form of a neural network in a fulfilling career or lifetime.\r\n\r\n\\hypertarget{further-reading}{%\r\n\\paragraph{Further Reading}\\label{further-reading}}\r\n\r\nIf you get a single book, \\emph{Deep Learning} (listed first in the\r\nfollowing table) is highly recommended as a relatively self-contained\r\ntext with cogent explanations written in a readable style. As you\r\nventure into attempting to perform ML tasks in a particular domain,\r\nsearch for a relatively recent, highly cited ``survey'' paper, which\r\nshould introduce you to the main ideas and give you a starting point for\r\nfurther research. 
\\href{https://arxiv.org/pdf/1907.09408.pdf}{Here} is\r\nan example of one such survey paper, in this case with an emphasis on\r\nobject detection.\r\n\r\n\\begin{longtable}[]{@{}lll@{}}\r\n\\toprule\r\n\\begin{minipage}[b]{0.30\\columnwidth}\\raggedright\r\nTitle\\strut\r\n\\end{minipage} & \\begin{minipage}[b]{0.30\\columnwidth}\\raggedright\r\nAuthors\\strut\r\n\\end{minipage} & \\begin{minipage}[b]{0.30\\columnwidth}\\raggedright\r\nDescription\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\midrule\r\n\\endhead\r\n\\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\n\\emph{Deep Learning}\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nIan Goodfellow, Yoshua Bengio, and Aaron Courville\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nSeminal text on the theory and practice of using neural networks to\r\nlearn and perform tasks\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\n\\emph{Numerical Methods for Scientists and Engineers}\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nR. W. 
Hamming\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nExcellent general text covering important topics such as floating-point\r\nprecision and various approximation methods\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\n\\emph{Standard notations for Deep Learning}\r\n(\\href{https://cs230.stanford.edu/files/Notation.pdf}{link})\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nStanford CS230 Course Notes\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nCheatsheet covering standard notation used by many texts and\r\npapers\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\n\\emph{Neural Networks and Deep Learning}\r\n(\\href{http://neuralnetworksanddeeplearning.com/index.html}{link})\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nMichael Nielsen\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nA gentler introduction to the theory and practice of neural\r\nnetworks\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\n\\emph{Automatic Differentiation in Machine Learning: a Survey}\r\n(\\href{https://arxiv.org/pdf/1502.05767.pdf}{link})\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nAtılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul,\r\nJeffrey Mark Siskind\\strut\r\n\\end{minipage} & \\begin{minipage}[t]{0.30\\columnwidth}\\raggedright\r\nExcellent survey paper documenting the various algorithms used for\r\ncomputational differentiation including viable alternatives to\r\nbackpropagation\\strut\r\n\\end{minipage}\\tabularnewline\r\n\\bottomrule\r\n\\end{longtable}\r\n\r\n\\end{document}\r\n"
  },
  {
    "path": "doc/Makefile",
    "content": "all: pdf html\r\n\r\nclean:\r\n\trm -rf *.svg plots DOC.pdf DOC.html\r\n\r\ntex:\r\n\tpandoc -F pandoc-plot -s DOC.md -o DOC.tex\r\n\r\npdf:\r\n\tpandoc -F pandoc-plot -s --katex DOC.md -o DOC.pdf\r\n\r\nhtml:\r\n\tpandoc -L tikz.lua -F pandoc-plot -s --katex DOC.md -o DOC.html\r\n\r\nepub:\r\n\tpandoc -t epub3 --webtex -L tikz.lua -F pandoc-plot -s DOC.md -o DOC.epub\r\n"
  },
  {
    "path": "doc/plots/-1637788021081228918.txt",
    "content": "# Generated by pandoc-plot 0.8.0.0\r\n\r\nimport matplotlib.pyplot as plt\r\nimport array as arr\r\nimport math as math\r\n\r\nf = arr.array('f')\r\nf.append(0)\r\nf.append(0)\r\nf.append(1)\r\nx = arr.array('f')\r\nx.append(-1)\r\nx.append(-0)\r\nx.append(1)\r\n\r\nplt.figure()\r\nplt.plot(x, f)\r\nplt.xlabel('$x$')\r\nplt.ylabel('$\\max(0, x)$')\r\nplt.title('Rectifier function')"
  },
  {
    "path": "doc/plots/-6767785830879840565.txt",
    "content": "# Generated by pandoc-plot 0.8.0.0\r\n\r\nimport matplotlib.pyplot as plt\r\nimport array as arr\r\nimport math as math\r\n\r\ns = arr.array('f')\r\nh = arr.array('f')\r\n\r\nlast = 0\r\nn = 30\r\nfor i in range(0, n):\r\n    last += 1 / (n + 1)\r\n    s.append(last)\r\n    h.append(-(1 - last) * math.log(last) - last * math.log(1 - last))\r\n\r\nplt.figure()\r\nplt.plot(s, h)\r\nplt.xlabel('$S$')\r\nplt.ylabel('$-(1-S)\\log S - S\\log (1 - S)$')\r\nplt.title('Cross entropy with mismatched distribution')"
  },
  {
    "path": "doc/plots/6094492350593652429.txt",
    "content": "# Generated by pandoc-plot 0.8.0.0\r\n\r\nimport matplotlib.pyplot as plt\r\nimport array as arr\r\nimport math as math\r\n\r\ns = arr.array('f')\r\ns.append(0)\r\nh = arr.array('f')\r\nh.append(0)\r\n\r\nlast = 0\r\nn = 30\r\nfor i in range(0, n):\r\n    last += 1 / (n + 1)\r\n    s.append(last)\r\n    h.append(-last * math.log(last) - (1 - last) * math.log(1 - last))\r\n\r\ns.append(1.0)\r\nh.append(0)\r\n\r\nplt.figure()\r\nplt.plot(s, h)\r\nplt.xlabel('$S$')\r\nplt.ylabel('$H(S) = -S\\log S - (1 - S)\\log (1 - S)$')\r\nplt.title('Binary Entropy')"
  },
  {
    "path": "doc/tikz.lua",
    "content": "local system = require 'pandoc.system'\n\nlocal tikz_doc_template = [[\n\\documentclass{standalone}\n\\usepackage{xcolor}\n\\usepackage{tikz}\n\\usetikzlibrary{positioning,calc,arrows}\n\\renewenvironment{center} {} {}\n\\begin{document}\n\\nopagecolor\n%s\n\\end{document}\n]]\n\nlocal function tikz2image(src, filetype, outfile)\n  system.with_temporary_directory('tikz2image', function (tmpdir)\n    system.with_working_directory(tmpdir, function()\n      local f = io.open('tikz.tex', 'w')\n      f:write(tikz_doc_template:format(src))\n      f:close()\n      os.execute('pdflatex tikz.tex')\n      if filetype == 'pdf' then\n        os.rename('tikz.pdf', outfile)\n      else\n        os.execute('pdf2svg tikz.pdf ' .. outfile)\n      end\n    end)\n  end)\nend\n\nextension_for = {\n  html = 'svg',\n  html4 = 'svg',\n  html5 = 'svg',\n  latex = 'pdf',\n  beamer = 'pdf' }\n\nlocal function file_exists(name)\n  local f = io.open(name, 'r')\n  if f ~= nil then\n    io.close(f)\n    return true\n  else\n    return false\n  end\nend\n\nlocal function starts_with(start, str)\n  return str:sub(1, #start) == start\nend\n\n\nfunction RawBlock(el)\n  if starts_with('\\\\begin{center}', el.text) then\n    local filetype = extension_for[FORMAT] or 'svg'\n    local fname = system.get_working_directory() .. '/' ..\n        pandoc.sha1(el.text) .. '.' .. filetype\n    if not file_exists(fname) then\n      tikz2image(el.text, filetype, fname)\n    end\n    return pandoc.Para({pandoc.Image({}, fname)})\n  else\n   return el\n  end\nend\n"
  },
  {
    "path": "src/CCELossNode.cpp",
    "content": "#include \"CCELossNode.hpp\"\n\n#include <algorithm>\n#include <cmath>\n#include <cstdio>\n#include <limits>\n#include <utility>\n\nCCELossNode::CCELossNode(Model& model,\n                         std::string name,\n                         uint16_t input_size,\n                         size_t batch_size)\n    : Node{model, std::move(name)}\n    , input_size_{input_size}\n    , inv_batch_size_{num_t{1.0} / static_cast<num_t>(batch_size)}\n{\n    // When we deliver a gradient back, we deliver just the loss gradient with\n    // respect to any input and the index that was \"hot\" in the second argument.\n    gradients_.resize(input_size_);\n}\n\nvoid CCELossNode::forward(num_t* data)\n{\n    // The cross-entropy categorical loss is defined as -\\sum_i(q_i * log(p_i))\n    // where p_i is the predicted probability and q_i is the expected\n    // probability\n    //\n    // In information theory, by convention, lim_{x approaches 0}(x log(x)) = 0\n\n    num_t max{0.0};\n    // Default to index 0 so the comparison below never reads an\n    // uninitialized value if no input exceeds zero\n    size_t max_index{0};\n\n    loss_ = num_t{0.0};\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        if (data[i] > max)\n        {\n            max_index = i;\n            max       = data[i];\n        }\n\n        // Because the target vector is one-hot encoded, most of these terms\n        // will be zero, but we leave the full calculation here to be explicit\n        // and in the event we want to compute losses against probability\n        // distributions that aren't one-hot. 
In practice, a faster code path\n        // should be employed if the targets are known to be one-hot\n        // distributions.\n        loss_ -= target_[i]\n                 * std::log(\n                     // Prevent undefined results when taking the log of 0\n                     std::max(data[i], std::numeric_limits<num_t>::epsilon()));\n\n        if (target_[i] != num_t{0.0})\n        {\n            active_ = i;\n        }\n\n        // NOTE: The astute reader may notice that the gradients associated with\n        // many of the loss node's input signals will be zero because the\n        // cross-entropy is performed with respect to a one-hot vector.\n        // Fortunately, because the layer preceding the loss node is a\n        // softmax layer, the gradient from the single term contributing in the\n        // above expression has a dependency on *every* softmax output unit (all\n        // outputs show up in the summation in the softmax denominator).\n    }\n\n    if (max_index == active_)\n    {\n        ++correct_;\n    }\n    else\n    {\n        ++incorrect_;\n    }\n\n    cumulative_loss_ += loss_;\n\n    // Store the data pointer to compute gradients later\n    last_input_ = data;\n}\n\nvoid CCELossNode::reverse(num_t* data)\n{\n    // With p the predicted distribution and q the target (as in forward),\n    // dJ/dp_i = d(-\\sum_k(q_k log(p_k)))/dp_i = -q_i / p_i. For a one-hot\n    // target, this is -1 / p_j where j is the index of the correct\n    // classification, and 0 elsewhere (loss gradient for a single sample).\n    //\n    // Note the normalization factor where we multiply by the inverse batch\n    // size. 
This ensures that losses computed by the network are similar in\n    // scale irrespective of batch size.\n\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        gradients_[i] = -inv_batch_size_ * target_[i] / last_input_[i];\n    }\n\n    for (Node* node : antecedents_)\n    {\n        node->reverse(gradients_.data());\n    }\n}\n\nvoid CCELossNode::print() const\n{\n    std::printf(\"Avg Loss: %f\\t%f%% correct\\n\", avg_loss(), accuracy() * 100.0);\n}\n\nnum_t CCELossNode::accuracy() const\n{\n    return static_cast<num_t>(correct_)\n           / static_cast<num_t>(correct_ + incorrect_);\n}\nnum_t CCELossNode::avg_loss() const\n{\n    return cumulative_loss_ / static_cast<num_t>(correct_ + incorrect_);\n}\n\nvoid CCELossNode::reset_score()\n{\n    cumulative_loss_ = num_t{0.0};\n    correct_         = 0;\n    incorrect_       = 0;\n}\n"
  },
  {
    "path": "src/CCELossNode.hpp",
    "content": "#pragma once\n\n#include \"Model.hpp\"\n\n// Categorical Cross-Entropy Loss Node\n// Assumes input data is \"one-hot encoded,\" with size equal to the number of\n// possible classifications, where the \"answer\" has a single \"1\" (aka hot value)\n// in one of the classification positions and zero everywhere else.\n\nclass CCELossNode : public Node\n{\npublic:\n    CCELossNode(Model& model,\n                std::string name,\n                uint16_t input_size,\n                size_t batch_size);\n\n    // No initialization is needed for this node\n    void init(rne_t&) override\n    {}\n\n    void forward(num_t* inputs) override;\n    // As a loss node, the argument to this method is ignored (the gradient of\n    // the loss with respect to itself is unity)\n    void reverse(num_t* gradients = nullptr) override;\n\n    void print() const override;\n\n    void set_target(num_t const* target)\n    {\n        target_ = target;\n    }\n\n    num_t accuracy() const;\n    num_t avg_loss() const;\n    void reset_score();\n\nprivate:\n    uint16_t input_size_;\n\n    // We minimize the average loss, not the net loss so that the losses\n    // produced do not scale with batch size (which allows us to keep training\n    // parameters constant)\n    num_t inv_batch_size_;\n    num_t loss_;\n    num_t const* target_;\n    num_t* last_input_;\n    // Stores the last active classification in the target one-hot encoding\n    size_t active_;\n    num_t cumulative_loss_{0.0};\n    // Store running counts of correct and incorrect predictions\n    size_t correct_   = 0;\n    size_t incorrect_ = 0;\n    std::vector<num_t> gradients_;\n};\n"
  },
  {
    "path": "src/CMakeLists.txt",
    "content": "add_executable(\n    nn\n    main.cpp\n    CCELossNode.cpp\n    FFNode.cpp\n    GDOptimizer.cpp\n    MNIST.cpp\n    Model.cpp\n)\n\ntarget_compile_features(nn PUBLIC cxx_std_17)\n"
  },
  {
    "path": "src/Dual.hpp",
    "content": "#pragma once\n\ntemplate <typename T = float>\nstruct Dual\n{\n    T real_ = T{0.0};\n    T dual_ = T{1.0};\n};\n\ntemplate <typename T>\n[[nodiscard]] Dual<T> operator+(Dual<T>&& a, Dual<T>&& b) noexcept\n{\n    return {a.real_ + b.real_, a.dual_ + b.dual_};\n}\n\ntemplate <typename T>\n[[nodiscard]] Dual<T> operator-(Dual<T>&& a, Dual<T>&& b) noexcept\n{\n    return {a.real_ - b.real_, a.dual_ - b.dual_};\n}\n\n// (a + eb) * (c + ed) = ac + ebc + ead + e^2bd = ac + e(bc + ad)\ntemplate <typename T>\n[[nodiscard]] constexpr Dual<T> operator*(Dual<T>&& a, Dual<T>&& b) noexcept\n{\n    return {\n        a.real_ * b.real_,\n        a.real_ * b.dual_ + b.real_ * a.dual_,\n    };\n}\n"
  },
  {
    "path": "src/FFNode.cpp",
    "content": "#include \"FFNode.hpp\"\n\n#include <algorithm>\n#include <cmath>\n#include <cstdio>\n#include <random>\n\nFFNode::FFNode(Model& model,\n               std::string name,\n               Activation activation,\n               uint16_t output_size,\n               uint16_t input_size)\n    : Node{model, std::move(name)}\n    , activation_{activation}\n    , output_size_{output_size}\n    , input_size_{input_size}\n{\n    std::printf(\"%s: %d -> %d\\n\", name_.c_str(), input_size_, output_size_);\n\n    // The weight parameters of a FF-layer are an NxM matrix\n    weights_.resize(output_size_ * input_size_);\n\n    // Each node in this layer is assigned a bias (so that zero is not\n    // necessarily mapped to zero)\n    biases_.resize(output_size_);\n\n    // The outputs of each neuron within the layer is an \"activation\" in\n    // neuroscience parlance\n    activations_.resize(output_size_);\n\n    activation_gradients_.resize(output_size_);\n    weight_gradients_.resize(output_size_ * input_size_);\n    bias_gradients_.resize(output_size_);\n    input_gradients_.resize(input_size_);\n}\n\nvoid FFNode::init(rne_t& rne)\n{\n    num_t sigma;\n    switch (activation_)\n    {\n    case Activation::ReLU:\n        // Kaiming He, et. al. weight initialization for ReLU networks\n        // https://arxiv.org/pdf/1502.01852.pdf\n        //\n        // Suggests using a normal distribution with variance := 2 / n_in\n        sigma = std::sqrt(2.0 / static_cast<num_t>(input_size_));\n        break;\n    case Activation::Softmax:\n    default:\n        sigma = std::sqrt(1.0 / static_cast<num_t>(input_size_));\n        break;\n    }\n\n    // NOTE: Unfortunately, the C++ standard does not guarantee that the results\n    // obtained from a distribution function will be identical given the same\n    // inputs across different compilers and platforms. 
A production ML\n    // framework will likely implement its own distributions to provide\n    // deterministic results.\n    auto dist = std::normal_distribution<num_t>{0.0, sigma};\n\n    for (num_t& w : weights_)\n    {\n        w = dist(rne);\n    }\n\n    // NOTE: Setting biases to zero is a common practice, as is initializing the\n    // bias to a small value (e.g. on the order of 0.01). It is unclear if the\n    // latter produces a consistent result over the former, but the thinking is\n    // that a non-zero bias will ensure that the neuron always \"fires\" at the\n    // beginning to produce a signal.\n    //\n    // Here, we initialize all biases to a small number, but the reader should\n    // consider experimenting with other approaches.\n    for (num_t& b : biases_)\n    {\n        b = 0.01;\n    }\n}\n\nvoid FFNode::forward(num_t* inputs)\n{\n    // Remember the last input data for backpropagation later\n    last_input_ = inputs;\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // For each output vector, compute the dot product of the input data\n        // with the weight vector add the bias\n\n        num_t z{0.0};\n\n        size_t offset = i * input_size_;\n\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            z += weights_[offset + j] * inputs[j];\n        }\n        // Add neuron bias\n        z += biases_[i];\n\n        switch (activation_)\n        {\n        case Activation::ReLU:\n            activations_[i] = std::max(z, num_t{0.0});\n            break;\n        case Activation::Softmax:\n        default:\n            activations_[i] = std::exp(z);\n            break;\n        }\n    }\n\n    if (activation_ == Activation::Softmax)\n    {\n        // softmax(z)_i = exp(z_i) / \\sum_j(exp(z_j))\n        num_t sum_exp_z{0.0};\n        for (size_t i = 0; i != output_size_; ++i)\n        {\n            // NOTE: with exploding gradients, it is quite easy for this\n            // exponential function to 
overflow, which will result in NaNs\n            // infecting the network.\n            sum_exp_z += activations_[i];\n        }\n        num_t inv_sum_exp_z = num_t{1.0} / sum_exp_z;\n        for (size_t i = 0; i != output_size_; ++i)\n        {\n            activations_[i] *= inv_sum_exp_z;\n        }\n    }\n\n    // Forward activation data to all subsequent nodes in the computational\n    // graph\n    for (Node* subsequent : subsequents_)\n    {\n        subsequent->forward(activations_.data());\n    }\n}\n\nvoid FFNode::reverse(num_t* gradients)\n{\n    // We receive a vector of output_size_ gradients of the loss function with\n    // respect to the activations of this node.\n\n    // We need to compute the gradients of the loss function with respect to\n    // each parameter in the node (all weights and biases). In addition, we need\n    // to compute the gradients with respect to the inputs in order to propagate\n    // the gradients further.\n\n    // Notation:\n    //\n    // Subscripts on any of the following vector and matrix quantities are used\n    // to specify a specific element of the vector or matrix.\n    //\n    //   - I is the input vector\n    //   - W is the weight matrix\n    //   - B is the bias vector\n    //   - Z = W*I + B\n    //   - g is our activation function (ReLU or Softmax in this case)\n    //   - J is the total loss (cost)\n    //\n    // The gradient we receive from the subsequent is dJ/dg(Z) which we can use\n    // to compute dJ/dW_{i, j}, dJ/dB_i, and dJ/dI_i\n\n    // First, we compute dJ/dz as dJ/dg(z) * dg(z)/dz and store it in our\n    // activation_gradients_ array\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // dg(z)/dz\n        num_t activation_grad{0.0};\n        switch (activation_)\n        {\n        case Activation::ReLU:\n            // For a ReLU function, the gradient is unity when the activation\n            // exceeds 0.0, and 0.0 otherwise. 
Technically, the gradient is\n            // undefined at 0, but in practice, defining the gradient at this\n            // point to be 0 isn't an issue\n            if (activations_[i] > num_t{0.0})\n            {\n                activation_grad = num_t{1.0};\n            }\n            else\n            {\n                activation_grad = num_t{0.0};\n            }\n            // dJ/dz = dJ/dg(z) * dg(z)/dz\n            activation_gradients_[i] = gradients[i] * activation_grad;\n            break;\n        case Activation::Softmax:\n        default:\n            // F.T.R. The implementation here correctly computes gradients for\n            // the general softmax function accounting for all received\n            // gradients. However, this step can be optimized significantly if\n            // it is known that the softmax output is being compared to a\n            // one-hot distribution. The softmax output of a given unit is\n            // exp(z_i) / \\sum_j exp(z_j). When the loss gradient with respect\n            // to the softmax outputs is returned, a single i is selected from\n            // among the softmax outputs in a 1-hot encoding, corresponding to\n            // the correct classification for this training sample. 
Complete the\n            // derivation for the gradient of the softmax assuming a one-hot\n            // distribution and implement the optimized routine.\n\n            for (size_t j = 0; j != output_size_; ++j)\n            {\n                if (i == j)\n                {\n                    activation_grad += activations_[i]\n                                       * (num_t{1.0} - activations_[i])\n                                       * gradients[j];\n                }\n                else\n                {\n                    activation_grad\n                        += -activations_[i] * activations_[j] * gradients[j];\n                }\n            }\n\n            activation_gradients_[i] = activation_grad;\n            break;\n        }\n    }\n\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        // Next, let's compute the partial dJ/db_i. If we hold all the weights\n        // and inputs constant, it's clear that dz/db_i is just 1 (consider\n        // differentiating the line mx + b with respect to b). Thus, dJ/db_i =\n        // dJ/dg(z_i) * dg(z_i)/dz_i.\n        bias_gradients_[i] += activation_gradients_[i];\n    }\n\n    // CAREFUL! Unlike the other gradients, we reset input gradients to 0. These\n    // values are used primarily as a subexpression in computing upstream\n    // gradients and do not participate in the network optimization step (aka\n    // Stochastic Gradient Descent) later.\n    std::fill(input_gradients_.begin(), input_gradients_.end(), num_t{0.0});\n\n    // To compute dz/dI_j, recall z_i = \\sum_j W_ij*I_j + B_i. That is, the\n    // precursor to each activation is a dot-product between a weight vector and\n    // the input plus a bias. 
Thus, dJ/dI_j is the sum of every weight that\n    // scaled I_j during the forward pass times the matching dJ/dz_i.\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        size_t offset = i * input_size_;\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            input_gradients_[j]\n                += weights_[offset + j] * activation_gradients_[i];\n        }\n    }\n\n    for (size_t i = 0; i != input_size_; ++i)\n    {\n        for (size_t j = 0; j != output_size_; ++j)\n        {\n            // Each individual weight shows up in the equation for z once and is\n            // scaled by the corresponding input. Thus, dJ/dw_ij = dJ/dg(z_i)\n            // * dg(z_i)/dz_i * dz_i/dw_ij where the last factor is equal to\n            // the input scaled by w_ij.\n\n            weight_gradients_[j * input_size_ + i]\n                += last_input_[i] * activation_gradients_[j];\n        }\n    }\n\n    for (Node* node : antecedents_)\n    {\n        // Forward loss gradients with respect to the inputs to the previous\n        // node.\n        //\n        // F.T.R. Technically, if the antecedent node has no learnable\n        // parameters, there is no point forwarding gradients to that node.\n        // Furthermore, if no antecedent nodes required any gradients, we could\n        // have skipped computing the gradients for this node altogether. A\n        // simple way to implement this is to add a `parameter_count` virtual\n        // method on the Node interface and leverage it to save some work\n        // whenever possible here.\n        node->reverse(input_gradients_.data());\n    }\n}\n\n// F.T.R. 
It is more efficient to store parameters contiguously so they can be\n// accessed without branching or arithmetic.\nnum_t* FFNode::param(size_t index)\n{\n    if (index < weights_.size())\n    {\n        return &weights_[index];\n    }\n    return &biases_[index - weights_.size()];\n}\n\nnum_t* FFNode::gradient(size_t index)\n{\n    if (index < weights_.size())\n    {\n        return &weight_gradients_[index];\n    }\n    return &bias_gradients_[index - weights_.size()];\n}\n\nvoid FFNode::print() const\n{\n    std::printf(\"%s\\n\", name_.c_str());\n\n    // Consider the input samples as column vectors, and visualize the weights\n    // as a matrix transforming vectors with input_size_ dimension to size_\n    // dimension\n    std::printf(\"Weights (%d x %d)\\n\", output_size_, input_size_);\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        size_t offset = i * input_size_;\n        for (size_t j = 0; j != input_size_; ++j)\n        {\n            std::printf(\"\\t[%zu]%f\", offset + j, weights_[offset + j]);\n        }\n        std::printf(\"\\n\");\n    }\n    std::printf(\"Biases (%d x 1)\\n\", output_size_);\n    for (size_t i = 0; i != output_size_; ++i)\n    {\n        std::printf(\"\\t%f\\n\", biases_[i]);\n    }\n    std::printf(\"\\n\");\n}\n"
  },
  {
    "path": "src/FFNode.hpp",
    "content": "#pragma once\n\n#include \"Model.hpp\"\n\n#include <cstdint>\n#include <vector>\n\n// Fully-connected, feedforward Layer\n\n// A feedforward layer is parameterized by the number of neurons it possesses\n// and the number of neurons in the layer preceding it\nclass FFNode : public Node\n{\npublic:\n    FFNode(Model& model,\n           std::string name,\n           Activation activation,\n           uint16_t output_size,\n           uint16_t input_size);\n\n    // Initialize the parameters of the layer\n    // F.T.R.\n    // Experiment with alternative weight and bias initialization schemes:\n    // 1. Try different distributions for the weight\n    // 2. Try initializing all weights to zero (why is this suboptimal?)\n    // 3. Try initializing all the biases to zero\n    void init(rne_t& rne) override;\n\n    // The input vector should have size input_size_\n    void forward(num_t* inputs) override;\n    // The output vector should have size output_size_\n    void reverse(num_t* gradients) override;\n\n    size_t param_count() const noexcept override\n    {\n        // Weight matrix entries + bias entries\n        return (input_size_ + 1) * output_size_;\n    }\n\n    num_t* param(size_t index) override;\n    num_t* gradient(size_t index) override;\n\n    void print() const override;\n\nprivate:\n    Activation activation_;\n    uint16_t output_size_;\n    uint16_t input_size_;\n\n    /////////////////////\n    // Node Parameters //\n    /////////////////////\n\n    // weights_.size() := output_size_ * input_size_\n    std::vector<num_t> weights_;\n    // biases_.size() := output_size_\n    std::vector<num_t> biases_;\n    // activations_.size() := output_size_\n    std::vector<num_t> activations_;\n\n    ////////////////////\n    // Loss Gradients //\n    ////////////////////\n\n    std::vector<num_t> activation_gradients_;\n\n    // During the training cycle, parameter loss gradients are accumulated in\n    // the following buffers.\n    std::vector<num_t> 
weight_gradients_;\n    std::vector<num_t> bias_gradients_;\n\n    // This buffer is used to store temporary gradients used in a SINGLE\n    // backpropagation pass. Note that this does not accumulate like the weight\n    // and bias gradients do.\n    std::vector<num_t> input_gradients_;\n\n    // The last input is needed to compute loss gradients with respect to the\n    // weights during backpropagation\n    num_t* last_input_;\n};\n"
  },
  {
    "path": "src/GDOptimizer.cpp",
    "content": "#include \"GDOptimizer.hpp\"\n#include \"Model.hpp\"\n#include <cmath>\n\nGDOptimizer::GDOptimizer(num_t eta)\n    : eta_{eta}\n{}\n\nvoid GDOptimizer::train(Node& node)\n{\n    size_t param_count = node.param_count();\n    for (size_t i = 0; i != param_count; ++i)\n    {\n        num_t& param    = *node.param(i);\n        num_t& gradient = *node.gradient(i);\n\n        param = param - eta_ * gradient;\n\n        // Reset the gradient which will be accumulated again in the next\n        // training batch\n        gradient = num_t{0.0};\n    }\n}\n"
  },
  {
    "path": "src/GDOptimizer.hpp",
    "content": "#pragma once\n\n#include \"Model.hpp\"\n\n// Note that this class defines the general gradient descent algorithm. It can\n// be used as part of the *Stochastic* gradient descent algorithm (aka SGD) by\n// invoking it after smaller batches of training data are evaluated.\nclass GDOptimizer : public Optimizer\n{\npublic:\n    // \"Eta\" is the commonly accepted character used to denote the learning\n    // rate. Given a loss gradient dL/dp for some parameter p, during gradient\n    // descent, p will be adjusted such that p' = p - eta * dL/dp.\n    GDOptimizer(num_t eta);\n\n    // This should be invoked at the end of each batch's evaluation. The\n    // interface technically permits the use of different optimizers for\n    // different segments of the computational graph.\n    void train(Node& node) override;\n\nprivate:\n    num_t eta_;\n};\n"
  },
  {
    "path": "src/MNIST.cpp",
    "content": "#include \"MNIST.hpp\"\n\n#include <cstdio>\n#include <stdexcept>\n\n// Read 4 bytes and reverse them to return an unsigned integer on LE\n// architectures\nvoid read_be(std::ifstream& in, uint32_t* out)\n{\n    char* buf = reinterpret_cast<char*>(out);\n    in.read(buf, 4);\n\n    std::swap(buf[0], buf[3]);\n    std::swap(buf[1], buf[2]);\n}\n\nMNIST::MNIST(Model& model, std::ifstream& images, std::ifstream& labels)\n    : Node{model, \"MNIST input\"}\n    , images_{images}\n    , labels_{labels}\n{\n    // Confirm that passed input file streams are well-formed MNIST data sets\n    uint32_t image_magic;\n    read_be(images, &image_magic);\n    if (image_magic != 2051)\n    {\n        throw std::runtime_error{\"Images file appears to be malformed\"};\n    }\n    read_be(images, &image_count_);\n\n    uint32_t labels_magic;\n    read_be(labels, &labels_magic);\n    if (labels_magic != 2049)\n    {\n        throw std::runtime_error{\"Labels file appears to be malformed\"};\n    }\n\n    uint32_t label_count;\n    read_be(labels, &label_count);\n    if (label_count != image_count_)\n    {\n        throw std::runtime_error(\n            \"Label count did not match the number of images supplied\");\n    }\n\n    uint32_t rows;\n    uint32_t columns;\n    read_be(images, &rows);\n    read_be(images, &columns);\n    if (rows != 28 || columns != 28)\n    {\n        throw std::runtime_error{\n            \"Expected 28x28 images, non-MNIST data supplied\"};\n    }\n\n    printf(\"Loaded images file with %d entries\\n\", image_count_);\n}\n\nvoid MNIST::forward(num_t* data)\n{\n    read_next();\n    for (Node* node : subsequents_)\n    {\n        node->forward(data_);\n    }\n}\n\nvoid MNIST::print() const\n{\n    // No learned parameters to display for an MNIST input node\n}\n\nvoid MNIST::read_next()\n{\n    images_.read(buf_, DIM);\n    num_t inv = num_t{1.0} / num_t{255.0};\n    for (size_t i = 0; i != DIM; ++i)\n    {\n        data_[i] = 
static_cast<uint8_t>(buf_[i]) * inv;\n    }\n\n    char label;\n    labels_.read(&label, 1);\n\n    for (size_t i = 0; i != 10; ++i)\n    {\n        label_[i] = num_t{0.0};\n    }\n    label_[static_cast<uint8_t>(label)] = num_t{1.0};\n}\n\nvoid MNIST::print_last()\n{\n    for (size_t i = 0; i != 10; ++i)\n    {\n        if (label_[i] == num_t{1.0})\n        {\n            printf(\"This is a %zu:\\n\", i);\n            break;\n        }\n    }\n\n    for (size_t i = 0; i != 28; ++i)\n    {\n        size_t offset = i * 28;\n        for (size_t j = 0; j != 28; ++j)\n        {\n            if (data_[offset + j] > num_t{0.5})\n            {\n                if (data_[offset + j] > num_t{0.9})\n                {\n                    printf(\"#\");\n                }\n                else if (data_[offset + j] > num_t{0.7})\n                {\n                    printf(\"*\");\n                }\n                else\n                {\n                    printf(\".\");\n                }\n            }\n            else\n            {\n                printf(\" \");\n            }\n        }\n        printf(\"\\n\");\n    }\n    printf(\"\\n\");\n}\n"
  },
  {
    "path": "src/MNIST.hpp",
    "content": "#pragma once\n\n#include \"Model.hpp\"\n#include <fstream>\n\nclass MNIST : public Node\n{\npublic:\n    constexpr static size_t DIM = 28 * 28;\n\n    MNIST(Model& model, std::ifstream& images, std::ifstream& labels);\n\n    void init(rne_t&) override\n    {}\n\n    // As this is an input node, the argument to this function is ignored\n    void forward(num_t* data = nullptr) override;\n    // Backpropagation is a no-op for input nodes as there are no parameters to\n    // update\n    void reverse(num_t* data = nullptr) override\n    {}\n\n    // Parse the next image and label into memory\n    void read_next();\n\n    void print() const override;\n\n    [[nodiscard]] size_t size() const noexcept\n    {\n        return image_count_;\n    }\n\n    [[nodiscard]] num_t const* data() const noexcept\n    {\n        return data_;\n    }\n\n    [[nodiscard]] num_t* data() noexcept\n    {\n        return data_;\n    }\n\n    [[nodiscard]] num_t* label() noexcept\n    {\n        return label_;\n    }\n\n    [[nodiscard]] num_t const* label() const noexcept\n    {\n        return label_;\n    }\n\n    // Quick ASCII visualization of the last read image. For best results,\n    // ensure that your terminal font is a monospace font.\n    void print_last();\n\nprivate:\n    std::ifstream& images_;\n    std::ifstream& labels_;\n    uint32_t image_count_;\n    // Data from the images file is read as one-byte unsigned values which are\n    // converted to num_t after\n    char buf_[DIM];\n    // All images are resized (with antialiasing) to a 28 x 28 row-major raster\n    num_t data_[DIM];\n    // One-hot encoded label\n    num_t label_[10];\n};\n"
  },
  {
    "path": "src/Model.cpp",
    "content": "#include \"Model.hpp\"\n\n#include <cstdio>\n#include <utility>\n\nNode::Node(Model& model, std::string name)\n    : model_(model)\n    , name_{std::move(name)}\n{}\n\nModel::Model(std::string name)\n    : name_{std::move(name)}\n{}\n\nvoid Model::create_edge(Node& dst, Node& src)\n{\n    // NOTE: No validation is done to ensure the edge doesn't already exist\n    dst.antecedents_.push_back(&src);\n    src.subsequents_.push_back(&dst);\n}\n\nrne_t::result_type Model::init(rne_t::result_type seed)\n{\n    if (seed == 0)\n    {\n        // Generate a new random seed from the host random device\n        std::random_device rd{};\n        seed = rd();\n    }\n    // result_type may be wider than unsigned int, so cast to match %lu\n    std::printf(\"Initializing model parameters with seed: %lu\\n\",\n                static_cast<unsigned long>(seed));\n\n    rne_t rne{seed};\n\n    for (auto& node : nodes_)\n    {\n        node->init(rne);\n    }\n\n    return seed;\n}\n\nvoid Model::train(Optimizer& optimizer)\n{\n    for (auto&& node : nodes_)\n    {\n        optimizer.train(*node);\n    }\n}\n\nvoid Model::print() const\n{\n    // Invoke \"print\" on each node in the order added\n    for (auto&& node : nodes_)\n    {\n        node->print();\n    }\n}\n\nvoid Model::save(std::ofstream& out)\n{\n    // To save the model to disk, we employ a very simple scheme. All nodes are\n    // looped through in the order they were added to the model. Then, all\n    // advertised learnable parameters are serialized in host byte-order to the\n    // supplied output stream.\n    //\n    // F.T.R. This simplistic method of saving the model to disk isn't very\n    // robust or practical in the real world. For one thing, it contains no\n    // reflection data about the topology of the model. Loading the data relies\n    // on the model being constructed in the same manner it was trained on.\n    // Furthermore, the data will be parsed incorrectly if the program is\n    // recompiled to operate with a different precision. 
Adopting a more\n    // sensible serialization scheme is left as an exercise.\n    for (auto& node : nodes_)\n    {\n        size_t param_count = node->param_count();\n        for (size_t i = 0; i != param_count; ++i)\n        {\n            out.write(\n                reinterpret_cast<char const*>(node->param(i)), sizeof(num_t));\n        }\n    }\n}\n\nvoid Model::load(std::ifstream& in)\n{\n    for (auto& node : nodes_)\n    {\n        size_t param_count = node->param_count();\n        for (size_t i = 0; i != param_count; ++i)\n        {\n            in.read(reinterpret_cast<char*>(node->param(i)), sizeof(num_t));\n        }\n    }\n}\n"
  },
  {
    "path": "src/Model.hpp",
    "content": "#pragma once\n\n#include <cstdint>\n#include <fstream>\n#include <memory>\n#include <random>\n#include <string>\n#include <vector>\n\n// Default precision: single\nusing num_t = float;\n// Default random number engine: 32-bit Mersenne Twister by Matsumoto and\n// Nishimura, 1998. For generating random numbers with double precision, the\n// 64-bit Mersenne Twister should be used.\nusing rne_t = std::mt19937;\n\nenum class Activation\n{\n    ReLU,\n    Softmax\n};\n\nclass Model;\n\n// Base class of computational nodes in a model\nclass Node\n{\npublic:\n    Node(Model& model, std::string name);\n    virtual ~Node(){};\n\n    // Initialize the parameters of the node with a provided random number\n    // engine.\n    virtual void init(rne_t& rne) = 0;\n\n    // Data is fed forward through the network using a simple generic interface.\n    // We do this to avoid requiring an involved N-dimensional matrix\n    // abstraction. Here, the \"shape\" of the data is dependent on the Node's\n    // implementation and the way a given Node is initialized.\n    //\n    // In practice, this should be replaced with an actual type with a shape\n    // defined by data to permit additional validation. It is also common for\n    // the data object passed here to not contain the data directly (the data\n    // may be located on a GPU for example)\n    virtual void forward(num_t* inputs) = 0;\n\n    // Expected inputs during the reverse accumulation phase are the loss\n    // gradients with respect to each output\n    //\n    // The node is expected to compute the loss gradient with respect to each\n    // parameter and update the parameter according to the model's optimizer,\n    // after which, the gradients with respect to the node inputs are propagated\n    // backwards again.\n    virtual void reverse(num_t* gradients) = 0;\n\n    // Returns the number of learnable parameters in this node. 
Nodes that are\n    // input or loss nodes have no learnable parameters.\n    virtual size_t param_count() const noexcept\n    {\n        return 0;\n    }\n\n    // Indexing operator for learnable parameters that are mutated during\n    // training. Nodes without learnable parameters should keep this\n    // unimplemented.\n    virtual num_t* param(size_t index)\n    {\n        return nullptr;\n    }\n\n    // Indexing operator for the loss gradient with respect to a learnable\n    // parameter. Used by an optimizer to adjust the corresponding parameter and\n    // potentially for tracking gradient histories (done in more sophisticated\n    // optimizers, e.g. AdaGrad)\n    virtual num_t* gradient(size_t index)\n    {\n        return nullptr;\n    }\n\n    [[nodiscard]] std::string const& name() const noexcept\n    {\n        return name_;\n    }\n\n    // Generic function that displays the contents of the node in some fashion\n    virtual void print() const = 0;\n\nprotected:\n    friend class Model;\n\n    Model& model_;\n    std::string name_;\n    std::vector<Node*> antecedents_;\n    std::vector<Node*> subsequents_;\n};\n\n// Base class of optimizer used to train a model\nclass Optimizer\n{\npublic:\n    virtual void train(Node& node) = 0;\n};\n\nclass Model\n{\npublic:\n    Model(std::string name);\n\n    template <typename Node_t, typename... T>\n    Node_t& add_node(T&&... args)\n    {\n        nodes_.emplace_back(\n            std::make_unique<Node_t>(*this, std::forward<T>(args)...));\n        return static_cast<Node_t&>(*nodes_.back());\n    }\n\n    void create_edge(Node& dst, Node& src);\n\n    // Initialize the parameters of all nodes with the provided seed. If the\n    // seed is 0, a new random seed is chosen instead. 
Returns the seed used.\n    rne_t::result_type init(rne_t::result_type seed = 0);\n\n    void train(Optimizer& optimizer);\n\n    [[nodiscard]] std::string const& name() const noexcept\n    {\n        return name_;\n    }\n\n    void print() const;\n\n    void save(std::ofstream& out);\n    void load(std::ifstream& in);\n\nprivate:\n    friend class Node;\n\n    std::string name_;\n    std::vector<std::unique_ptr<Node>> nodes_;\n};\n"
  },
  {
    "path": "src/main.cpp",
    "content": "#include \"CCELossNode.hpp\"\n#include \"FFNode.hpp\"\n#include \"GDOptimizer.hpp\"\n#include \"MNIST.hpp\"\n#include \"Model.hpp\"\n#include <cfenv>\n#include <cstdio>\n#include <cstring>\n#include <filesystem>\n\nstatic constexpr size_t batch_size = 80;\n\nModel create_model(std::ifstream& images,\n                   std::ifstream& labels,\n                   MNIST** mnist,\n                   CCELossNode** loss)\n{\n    // Here we create a simple fully-connected feedforward neural network\n    Model model{\"ff\"};\n\n    *mnist = &model.add_node<MNIST>(images, labels);\n\n    FFNode& hidden = model.add_node<FFNode>(\"hidden\", Activation::ReLU, 32, 784);\n\n    FFNode& output\n        = model.add_node<FFNode>(\"output\", Activation::Softmax, 10, 32);\n\n    *loss = &model.add_node<CCELossNode>(\"loss\", 10, batch_size);\n    (*loss)->set_target((*mnist)->label());\n\n    // F.T.R. The structure of our computational graph is completely sequential.\n    // In fact, the fully connected node and loss node we've implemented here do\n    // not support multiple inputs. 
Consider adding nodes that support \"skip\"\n    // connections that forward outputs from earlier nodes to downstream nodes\n    // that aren't directly adjacent (such skip nodes are used in the ResNet\n    // architecture)\n    model.create_edge(hidden, **mnist);\n    model.create_edge(output, hidden);\n    model.create_edge(**loss, output);\n    return model;\n}\n\nvoid train(char* argv[])\n{\n    // Uncomment to debug floating point instability in the network\n    // feenableexcept(FE_INVALID | FE_OVERFLOW);\n\n    std::printf(\"Executing training routine\\n\");\n\n    std::ifstream images{\n        std::filesystem::path{argv[0]} / \"train-images-idx3-ubyte\",\n        std::ios::binary};\n\n    std::ifstream labels{\n        std::filesystem::path{argv[0]} / \"train-labels-idx1-ubyte\",\n        std::ios::binary};\n\n    MNIST* mnist;\n    CCELossNode* loss;\n    Model model = create_model(images, labels, &mnist, &loss);\n\n    model.init();\n\n    // The gradient descent optimizer is stateless, but other optimizers may not\n    // be. Some optimizers need to track \"momentum\" or gradient histories.\n    // Others may slow the learning rate for each parameter at different rates\n    // depending on various factors.\n    //\n    // F.T.R. Implement an alternative SGDOptimizer that decays the learning\n    // rate over time and compare the results against this optimizer that learns\n    // at a fixed rate.\n    GDOptimizer optimizer{num_t{0.3}};\n\n    // F.T.R. Here, we've hardcoded the number of batches to train on. 
In\n    // practice, training should halt when the average loss begins to\n    // vacillate, indicating that the model is starting to overfit the data.\n    // Implement some form of loss-improvement measure to determine when this\n    // inflection point occurs and stop accordingly.\n    size_t i = 0;\n    for (; i != 256; ++i)\n    {\n        loss->reset_score();\n\n        for (size_t j = 0; j != batch_size; ++j)\n        {\n            mnist->forward();\n            loss->reverse();\n        }\n\n        model.train(optimizer);\n    }\n\n    std::printf(\"Ran %zu batches (%zu samples each)\\n\", i, batch_size);\n\n    // Print the average loss computed in the final batch\n    loss->print();\n\n    std::ofstream out{\n        std::filesystem::current_path() / (model.name() + \".params\"),\n        std::ios::binary};\n    model.save(out);\n}\n\nvoid evaluate(char* argv[])\n{\n    std::printf(\"Executing evaluation routine\\n\");\n\n    std::ifstream images{\n        std::filesystem::path{argv[0]} / \"t10k-images-idx3-ubyte\",\n        std::ios::binary};\n\n    std::ifstream labels{\n        std::filesystem::path{argv[0]} / \"t10k-labels-idx1-ubyte\",\n        std::ios::binary};\n\n    MNIST* mnist;\n    CCELossNode* loss;\n    // For the data to be loaded properly, the model must be constructed in the\n    // same manner as it was constructed during training.\n    Model model = create_model(images, labels, &mnist, &loss);\n\n    // Instead of initializing the parameters randomly, here we load them from\n    // disk (saved from a previous training run).\n    std::ifstream params_file{std::filesystem::path{argv[1]}, std::ios::binary};\n    model.load(params_file);\n\n    // Evaluate all 10000 images in the test set and compute the loss average\n    for (size_t i = 0; i != mnist->size(); ++i)\n    {\n        mnist->forward();\n    }\n    loss->print();\n}\n\nint main(int argc, char* argv[])\n{\n    if (argc < 2)\n    {\n        std::printf(\"Supported commands 
include:\\ntrain\\nevaluate\\n\");\n        return 1;\n    }\n\n    if (strcmp(argv[1], \"train\") == 0)\n    {\n        train(argv + 2);\n    }\n    else if (strcmp(argv[1], \"evaluate\") == 0)\n    {\n        evaluate(argv + 2);\n    }\n    else\n    {\n        std::printf(\"Argument %s is an unrecognized directive.\\n\", argv[1]);\n    }\n\n    return 0;\n}\n"
  }
]