How to Build a Three-Layer Neural Network from Scratch

Neural networks are a powerful class of machine learning models that have achieved remarkable results on a wide variety of tasks, from computer vision to natural language processing. While popular deep learning frameworks like TensorFlow and PyTorch have made it easier than ever to quickly build and train neural networks, I believe there is still immense value in implementing one from scratch. By building a neural network step-by-step from the ground up, you can demystify the "black box", solidify your understanding of the underlying concepts, and gain a deeper appreciation for these incredible models.

In this post, I'll walk you through how to build a simple three-layer feedforward neural network in Python from scratch, using NumPy for the array math but no machine learning frameworks. We'll cover all the key components, including the network architecture, activation functions, loss function, and gradient descent training. Along the way, I'll break down the relevant mathematics and share helpful visualizations. By the end, you'll have a working neural network and the foundation to go even deeper into this exciting field.

The Building Blocks of a Neural Network

Before we dive into the code, let's define what we mean by a "three-layer neural network" and introduce the key mathematical building blocks. Our network will have the following architecture:

  • Input layer with n neurons, where n is the number of features in our data
  • One hidden layer with m neurons, where m is a hyperparameter we choose
  • Output layer with k neurons, where k depends on the task: one neuron per class in general, or a single sigmoid neuron (k = 1) for the binary classifier we build here
  • Fully-connected (dense) layers, meaning each neuron in a layer is connected to every neuron in the previous layer

Visually, it looks like this:

[Figure showing a diagram of the 3-layer neural network architecture]

The input layer takes in our data, x, as a vector. Each input neuron simply passes its value on to the next layer. The hidden and output layers each have a weight matrix W and a bias vector b that transform the data as it flows through the network. Specifically, each layer calculates z = Wx + b, then applies a non-linear activation function to z to get the layer's output a. This a then becomes the input to the next layer.
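As a minimal sketch of what a single layer computes, the snippet below evaluates z = Wx + b and a = tanh(z) for one input vector. The sizes and numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical sizes for illustration: 3 input features, 2 neurons in the layer.
x = np.array([0.5, -1.0, 2.0])        # input vector, shape (3,)
W = np.array([[0.1, 0.2, 0.3],
              [-0.3, 0.1, 0.0]])      # weight matrix, shape (2, 3)
b = np.array([0.0, 0.1])              # bias vector, shape (2,)

z = W @ x + b                         # linear transformation, shape (2,)
a = np.tanh(z)                        # non-linear activation, shape (2,)

print(z)
print(a)
```

Each row of W holds the weights of one neuron, so the layer produces one z and one a per neuron.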

Common activation functions include:

  • Sigmoid: squashes any real input to a value between 0 and 1, which can be read as a probability
  • Tanh: similar to sigmoid but maps to a value between -1 and 1
  • ReLU: outputs the input directly if positive, otherwise outputs 0
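To make the three functions concrete, here is a small sketch that evaluates each of them on the same inputs. The helper names sigmoid and relu are my own choices for this example; tanh comes straight from NumPy.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

def relu(z):
    # Keeps positive values and replaces negative values with 0
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])

print(sigmoid(z))   # values in (0, 1); sigmoid(0) is exactly 0.5
print(np.tanh(z))   # values in (-1, 1); tanh(0) is 0
print(relu(z))      # negatives become 0, positives pass through unchanged
```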

For our 3-layer network, we'll use tanh for the hidden layer and sigmoid for the output layer, which is a common choice for binary classification problems.

[Math notation and equations for activation functions]
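For reference, the standard textbook definitions of these functions are written out below in LaTeX, along with the derivatives of sigmoid and tanh that backpropagation relies on. The notation may differ from the original figure.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)

\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad \tanh'(z) = 1 - \tanh^{2}(z)

\mathrm{ReLU}(z) = \max(0, z)
```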

To quantify how well our model fits the training data, we define a loss function. A common one for classification is binary cross-entropy:

[Equation for binary cross entropy]
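In its standard form, averaged over N training examples with true labels y_i and predicted probabilities ŷ_i (the symbols N, y_i and ŷ_i are my notation here), the loss can be written as:

```latex
L(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr]
```

For a single example with y = 1, a confident correct prediction of ŷ = 0.9 costs -ln(0.9) ≈ 0.105, while a confident wrong prediction of ŷ = 0.1 costs -ln(0.1) ≈ 2.303.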

This measures the dissimilarity between the model's predicted probability and the true label, which is 0 or 1. To fit the model, we minimize this loss function using gradient descent, an optimization algorithm that iteratively adjusts the model parameters (the weights and biases) in the direction that decreases the loss. The "learning rate" alpha controls the size of these adjustments.
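To see the idea in isolation, here is a toy sketch of gradient descent on a one-parameter function, f(w) = (w - 3)^2, which is unrelated to our network and chosen only because its minimum is obviously at w = 3.

```python
# Toy example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0        # starting guess
alpha = 0.1    # learning rate

for step in range(100):
    grad = 2 * (w - 3)    # slope of f at the current w
    w -= alpha * grad     # step against the gradient

print(w)  # ends very close to the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor of 0.8 here. A much larger alpha would overshoot, and a much smaller one would take far more steps.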

[Equations for gradient descent update rule]
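Written in the usual notation, with L the loss and α the learning rate, each update has the form:

```latex
W \leftarrow W - \alpha \, \frac{\partial L}{\partial W}, \qquad b \leftarrow b - \alpha \, \frac{\partial L}{\partial b}
```

The same rule is applied to the weights and biases of every layer.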

With these foundations in place, let's move on to the implementation.

Coding Our Neural Network Step-by-Step

We'll build our neural network as a Python class. The full code is available on my [GitHub repo], but we'll go through it piece-by-piece here. I'll assume basic familiarity with Python and NumPy, but explain the key bits.

First, we initialize the weight matrices and bias vectors in the constructor:

import numpy as np

class NeuralNetwork:
  def __init__(self, input_size, hidden_size, output_size):
    # Weights start as standard-normal random values, biases as zeros
    self.W1 = np.random.randn(input_size, hidden_size)   # input -> hidden
    self.b1 = np.zeros(hidden_size)
    self.W2 = np.random.randn(hidden_size, output_size)  # hidden -> output
    self.b2 = np.zeros(output_size)

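We use NumPy's random.randn to initialize the weights as matrices of random values drawn from a standard normal distribution, and zeros for the biases. The random weights break the symmetry between neurons, so they can learn different features. The shapes are determined by the layer sizes we specify.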

Next, we implement the forward pass, which calculates the model's predictions:

  def forward(self, x):
    self.z1 = np.dot(x, self.W1) + self.b1
    self.a1 = np.tanh(self.z1)
    self.z2 = np.dot(self.a1, self.W2) + self.b2
    self.a2 = self._sigmoid(self.z2)
    return self.a2

  def _sigmoid(self, x):
    return 1 / (1 + np.exp(-x))

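For each layer, we calculate the linear transformation z, apply the activation function to get a, then use that as input to the next layer. np.dot performs matrix multiplication. Because each row of x holds one example, the code computes xW + b, which is the batched, row-vector form of z = Wx + b. We implement sigmoid ourselves but use NumPy's built-in tanh.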
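A quick way to convince yourself that the matrix shapes line up is to trace them through the forward pass. The sketch below repeats the same computation outside the class, with hypothetical sizes (5 examples, 2 features, 4 hidden units, 1 output).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a batch of 5 examples, 2 features, 4 hidden units, 1 output.
X = rng.standard_normal((5, 2))
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)

z1 = X @ W1 + b1              # (5, 2) @ (2, 4) -> (5, 4); b1 broadcasts across rows
a1 = np.tanh(z1)              # (5, 4), values in (-1, 1)
z2 = a1 @ W2 + b2             # (5, 4) @ (4, 1) -> (5, 1)
a2 = 1 / (1 + np.exp(-z2))    # (5, 1), values in (0, 1)

print(z1.shape, a1.shape, z2.shape, a2.shape)
```

The output has one row per example and one column per output neuron, so the labels used for training should have that same shape.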

The backward pass is more involved—this is where we calculate the gradients of the loss with respect to each parameter, using the chain rule:

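  def backward(self, x, y):
    m = len(x)  # number of examples; y must have the same shape as self.a2

    # Output layer: sigmoid + binary cross-entropy gives dL/dz2 = a2 - y
    self.dz2 = self.a2 - y
    self.dW2 = np.dot(self.a1.T, self.dz2) / m
    self.db2 = np.sum(self.dz2, axis=0) / m

    # Hidden layer: backpropagate through W2, then through tanh (derivative 1 - a1^2)
    self.da1 = np.dot(self.dz2, self.W2.T)
    self.dz1 = self.da1 * (1 - np.square(self.a1))
    self.dW1 = np.dot(x.T, self.dz1) / m
    self.db1 = np.sum(self.dz1, axis=0) / m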

  def update_params(self, alpha):
    self.W1 -= alpha * self.dW1  
    self.b1 -= alpha * self.db1
    self.W2 -= alpha * self.dW2
    self.b2 -= alpha * self.db2

We compute the gradient of the loss with respect to each parameter (W and b) at each layer, working backwards. For the output layer, the derivative of binary cross-entropy through the sigmoid simplifies to dz2 = a2 - y, and dW2 and db2 follow directly from it. For the hidden layer, we use the chain rule to backpropagate the error through W2 and the tanh derivative, 1 - a1^2, which gives dz1 and from it dW1 and db1. The gradients are divided by m, the number of examples, to get the average.
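Backpropagation formulas are easy to get subtly wrong, so a common sanity check is to compare them against numerical gradients from finite differences. The sketch below is one way to do that, not part of the original class. It re-implements the same forward and backward formulas as standalone functions, and the names loss_fn, analytic_grads and numerical_grads are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(params, X, y):
    """Binary cross-entropy of the tanh/sigmoid forward pass."""
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def analytic_grads(params, X, y):
    """Gradients from the same formulas as the backward method above."""
    W1, b1, W2, b2 = params
    m = len(X)
    a1 = np.tanh(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    dz2 = a2 - y
    dW2 = a1.T @ dz2 / m
    db2 = dz2.sum(axis=0) / m
    dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
    dW1 = X.T @ dz1 / m
    db1 = dz1.sum(axis=0) / m
    return [dW1, db1, dW2, db2]

def numerical_grads(params, X, y, eps=1e-6):
    """Central finite differences, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss_fn(params, X, y)
            p[idx] = old - eps
            loss_minus = loss_fn(params, X, y)
            p[idx] = old  # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

# Small made-up problem: 8 examples, 2 features, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
y = (X[:, :1] > 0).astype(float)   # labels as a column, shape (8, 1)
params = [rng.standard_normal((2, 4)), np.zeros(4),
          rng.standard_normal((4, 1)), np.zeros(1)]

max_diff = max(np.max(np.abs(a - n))
               for a, n in zip(analytic_grads(params, X, y),
                               numerical_grads(params, X, y)))
print(f"largest gradient mismatch: {max_diff:.2e}")
```

If the backward formulas are right, the mismatch should be tiny compared with the gradients themselves. A large value usually points to a shape or sign error.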

Finally, we use these gradients to update the parameters, scaled by the learning rate alpha, in the direction that decreases the loss. This is the gradient descent step.

To train the model, we simply call these methods iteratively:

  def train(self, X, y, alpha=0.1, epochs=1000):
    self.losses = []
    for i in range(epochs):
      self.forward(X)
      self.backward(X, y)  
      self.update_params(alpha)

      loss = self._binary_cross_entropy(y, self.a2)
      self.losses.append(loss)
      if i % 100 == 0:
        print(f"Epoch {i}: loss {loss:.3f}")

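  def _binary_cross_entropy(self, y, y_pred):
    # Clip predictions away from 0 and 1 so np.log never receives 0
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

  def predict(self, x):
    # Threshold the predicted probability at 0.5 and return labels as 0 or 1
    return (self.forward(x) >= 0.5).astype(int)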

We perform a forward pass to get the predictions, backward pass to calculate gradients, then update parameters, repeating this for a set number of epochs. At each step, we also calculate the loss and print it every 100 epochs to monitor progress. The binary cross-entropy loss is implemented in the _binary_cross_entropy method. Finally, we add a predict method that returns the model's predicted class label (0 or 1) based on the forward pass output.

Training the Model and Evaluating Performance

To test our neural network, let's create a simple synthetic dataset:

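# Generate synthetic data
np.random.seed(0)
X = np.random.randn(100, 2)
# Labels as a column of shape (100, 1) so they line up with the network's (100, 1) output
y = (X[:, 0] > 0).astype(int).reshape(-1, 1)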

This creates 100 random 2D points and labels each one 1 if its first feature is positive and 0 otherwise. We can now instantiate our NeuralNetwork class, train it on this data, and evaluate its performance:
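One detail worth checking before training is the shape of the labels. The network's output for 100 examples has shape (100, 1), and NumPy broadcasting will silently combine it with 1-D labels of shape (100,) into a (100, 100) array instead of raising an error. The arrays below are placeholders used only to show the shapes.

```python
import numpy as np

preds = np.zeros((100, 1))                # same shape as the network output for 100 examples
labels_flat = np.zeros(100)               # 1-D labels, shape (100,)
labels_col = labels_flat.reshape(-1, 1)   # column labels, shape (100, 1)

# Broadcasting pairs every prediction with every label
print((preds - labels_flat).shape)   # (100, 100)

# Matching shapes give one difference per example
print((preds - labels_col).shape)    # (100, 1)
```

The same applies to the loss and to the accuracy comparison, so the labels passed to train and compared with predict should be a column of shape (100, 1).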

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=1000)

print(f"Training accuracy: {np.mean(nn.predict(X) == y):.3f}")

import matplotlib.pyplot as plt

plt.plot(nn.losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

After 1000 epochs of training, our model achieves 100% accuracy on this simple dataset. We can visualize how the loss decreased during training:

[Plot of training loss over time]

This plot provides insight into the model's learning dynamics. We see the loss rapidly decrease initially, then plateau as the model converges.

To find the best model, we can tune the hyperparameters like the hidden layer size, learning rate, and number of epochs, using techniques like grid search and cross-validation. We can also experiment with different activation functions and loss functions for different types of problems.
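As a rough sketch of what such a search could look like, the snippet below tries a small grid of hidden layer sizes and learning rates and scores each combination on a held-out validation split. It uses a single split rather than full cross-validation, and it re-implements the network as one compact function (train_and_score, a name I made up) so the example runs on its own. The grid values are arbitrary.

```python
import numpy as np

def train_and_score(X_train, y_train, X_val, y_val, hidden_size, alpha,
                    epochs=500, seed=0):
    """Train a tanh/sigmoid network with full-batch gradient descent
    and return its accuracy on the validation set."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X_train.shape[1], hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = rng.standard_normal((hidden_size, 1))
    b2 = np.zeros(1)
    m = len(X_train)

    for _ in range(epochs):
        # Forward pass
        a1 = np.tanh(X_train @ W1 + b1)
        a2 = 1 / (1 + np.exp(-(a1 @ W2 + b2)))
        # Backward pass (all gradients use the pre-update weights)
        dz2 = a2 - y_train
        dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
        # Gradient descent updates
        W2 -= alpha * a1.T @ dz2 / m
        b2 -= alpha * dz2.sum(axis=0) / m
        W1 -= alpha * X_train.T @ dz1 / m
        b1 -= alpha * dz1.sum(axis=0) / m

    val_prob = 1 / (1 + np.exp(-(np.tanh(X_val @ W1 + b1) @ W2 + b2)))
    return np.mean((val_prob >= 0.5) == y_val)

# Synthetic data in the same spirit as above, split into train and validation sets
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, :1] > 0).astype(int)   # labels as a column, shape (200, 1)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Try every combination in a small, arbitrary grid
results = {}
for hidden_size in (2, 4, 8):
    for alpha in (0.01, 0.1, 1.0):
        results[(hidden_size, alpha)] = train_and_score(
            X_train, y_train, X_val, y_val, hidden_size, alpha)

best = max(results, key=results.get)
print(f"best (hidden_size, alpha): {best}, validation accuracy: {results[best]:.3f}")
```

Scoring on data the model never trained on matters here: picking hyperparameters by training accuracy alone tends to reward overfitting.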

Comparison to ML Frameworks and Key Takeaways

Implementing a neural network from scratch like this is a valuable learning experience, but for most practical applications, you'll want to leverage the power and convenience of modern deep learning frameworks like TensorFlow and PyTorch. These tools provide highly optimized, scalable, GPU-accelerated implementations of all the components we built here, plus a lot more. They also offer high-level APIs that abstract away much of the nitty-gritty math, making it quicker and easier to build complex models.

However, I hope this exercise has given you a deeper understanding of what's really going on under the hood of these impressive frameworks. Some key learnings:

  • Neural networks, at their core, are just a series of linear transformations (matrix multiplications) and non-linear activations, stacked together in layers.
  • The weights and biases are the learnable parameters that determine the model's predictions. We start with random initialization.
  • During training, we use forward propagation to calculate the model's outputs and loss, backpropagation (with the chain rule) to calculate gradients, and gradient descent to update the parameters to minimize the loss.
  • Hyperparameters like the learning rate, layer sizes, and number of epochs must be tuned to get the best performance.

With a solid grasp of these fundamentals, you're well-equipped to dive into more advanced architectures and techniques, like convolutional networks for image data, recurrent networks for sequences, regularization methods like dropout, and optimization algorithms like Adam. I encourage you to extend the code here to build your own versions of these. Happy learning!
