Welcome back! In Part 1, we saw how a neural network could add two numbers using a basic feedforward structure. At the end of the article, we even showed you a full C++ implementation that included the forward pass, training loop, loss calculation, and weight updates—all in code.
This naturally leads to the next question: how exactly does a neural network learn to adjust those weights on its own through training? That's exactly what we'll explore here in Part 2.
Before we go further, here’s the full C++ code from Part 1 that implements a simple neural network to learn how to add two numbers. It includes the forward pass, loss calculation, backpropagation, and weight updates:
#include <iostream>
#include <cmath>
using namespace std;
// Sigmoid activation function
float sigmoid(float x) {
return 1.0f / (1.0f + exp(-x));
}
// Derivative of sigmoid (used in backpropagation)
float sigmoid_derivative(float y) {
return y * (1 - y); // where y = sigmoid(x)
}
int main() {
// Initialize weights and biases for 3 hidden nodes
// These values are chosen manually to show the learning process more clearly.
// Using both positive and negative weights helps prevent symmetry and provides a diverse starting point.
float weight1_node1 = 0.5f, weight2_node1 = 0.5f, bias_node1 = 0.0f; // Node 1 starts with positive influence from both inputs
float weight1_node2 = -0.5f, weight2_node2 = -0.5f, bias_node2 = 0.0f; // Node 2 starts with negative influence, encouraging balance
float weight1_node3 = 0.25f, weight2_node3 = -0.25f, bias_node3 = 0.0f; // Node 3 is mixed, useful for capturing interactions
// Initialize output layer weights and bias
float weight_out_node1 = 0.5f, weight_out_node2 = 0.5f, weight_out_node3 = 0.5f, bias_out = 0.0f;
float learning_rate = 0.01f;
int epochs = 1000; // Number of times we loop over the training data
// Training loop
for (int epoch = 0; epoch < epochs; ++epoch) {
float total_loss = 0.0f;
for (int a = 0; a < 100; ++a) {
for (int b = 0; b < 100; ++b) {
float x1 = a / 100.0f;
float x2 = b / 100.0f;
float target = (a + b) / 200.0f; // Normalize target sum to [0, 1]
// ===== Forward Pass =====
float z1 = x1 * weight1_node1 + x2 * weight2_node1 + bias_node1;
float h1 = sigmoid(z1);
float z2 = x1 * weight1_node2 + x2 * weight2_node2 + bias_node2;
float h2 = sigmoid(z2);
float z3 = x1 * weight1_node3 + x2 * weight2_node3 + bias_node3;
float h3 = sigmoid(z3);
float y_pred = h1 * weight_out_node1 + h2 * weight_out_node2 + h3 * weight_out_node3 + bias_out;
float error = target - y_pred;
total_loss += error * error; // Squared error (loss)
// ===== Backpropagation - Output Layer =====
float delta_out = error; // Error term; the gradient of the squared loss wrt y_pred is proportional to this
float grad_out_w1 = delta_out * h1; // Gradient for output weight 1
float grad_out_w2 = delta_out * h2; // Gradient for output weight 2
float grad_out_w3 = delta_out * h3; // Gradient for output weight 3
float grad_out_b = delta_out; // Gradient for output bias
// ===== Backpropagation - Hidden Layer =====
// Calculate how each hidden node contributed to the error
float delta_h1 = delta_out * weight_out_node1 * sigmoid_derivative(h1);
float delta_h2 = delta_out * weight_out_node2 * sigmoid_derivative(h2);
float delta_h3 = delta_out * weight_out_node3 * sigmoid_derivative(h3);
float grad_w1_node1 = delta_h1 * x1; // Gradient for w1 of node1
float grad_w2_node1 = delta_h1 * x2; // Gradient for w2 of node1
float grad_b_node1 = delta_h1; // Gradient for bias of node1
float grad_w1_node2 = delta_h2 * x1;
float grad_w2_node2 = delta_h2 * x2;
float grad_b_node2 = delta_h2;
float grad_w1_node3 = delta_h3 * x1;
float grad_w2_node3 = delta_h3 * x2;
float grad_b_node3 = delta_h3;
// ===== Update Output Layer Weights =====
weight_out_node1 += learning_rate * grad_out_w1;
weight_out_node2 += learning_rate * grad_out_w2;
weight_out_node3 += learning_rate * grad_out_w3;
bias_out += learning_rate * grad_out_b;
// ===== Update Hidden Layer Weights =====
weight1_node1 += learning_rate * grad_w1_node1;
weight2_node1 += learning_rate * grad_w2_node1;
bias_node1 += learning_rate * grad_b_node1;
weight1_node2 += learning_rate * grad_w1_node2;
weight2_node2 += learning_rate * grad_w2_node2;
bias_node2 += learning_rate * grad_b_node2;
weight1_node3 += learning_rate * grad_w1_node3;
weight2_node3 += learning_rate * grad_w2_node3;
bias_node3 += learning_rate * grad_b_node3;
}
}
// Log loss every 100 epochs
if ((epoch + 1) % 100 == 0 || epoch == 0)
cout << "[Summary] Epoch " << epoch + 1 << ": Loss = " << total_loss << endl;
}
// ===== Test the model with user input =====
float a, b;
cout << "
Test sum prediction (a + b)
Enter a: ";
cin >> a;
cout << "Enter b: ";
cin >> b;
float x1 = a / 100.0f;
float x2 = b / 100.0f;
// Forward pass again with trained weights
float h1 = sigmoid(x1 * weight1_node1 + x2 * weight2_node1 + bias_node1);
float h2 = sigmoid(x1 * weight1_node2 + x2 * weight2_node2 + bias_node2);
float h3 = sigmoid(x1 * weight1_node3 + x2 * weight2_node3 + bias_node3);
float y_pred = h1 * weight_out_node1 + h2 * weight_out_node2 + h3 * weight_out_node3 + bias_out;
float predicted_sum = y_pred * 200.0f;
// Output result
cout << "Predicted sum: " << predicted_sum << "
Actual sum: " << (a + b) << endl;
return 0;
}
In this part, we'll break down how a neural network learns: from setting up layers and activation functions to adjusting weights through backpropagation. You'll also learn how we train the model using learning rates and epochs, and whether a network can go beyond simple tasks like addition. Here's what we'll cover:
- What nodes and layers actually do in a neural network.
- How we set initial weights and adjust them during training.
- The role of learning rate and epochs in shaping the learning process.
- What activation functions are, how they work, and when to use sigmoid, tanh, or ReLU.
- How backpropagation and derivatives enable the network to learn from mistakes.
- Whether a neural network can go beyond simple tasks like addition and learn multiple things.
Ready? Let's go!
Nodes and Layers: What's really going on?
A neuron (or node) is the basic building block of a neural network. Each neuron receives input values, multiplies them by weights, adds a bias, and then passes the result through an activation function to produce an output. It’s like a tiny decision-maker that transforms data step-by-step.
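To make that concrete, here's a minimal sketch of a single neuron in C++. The helper name and parameters are ours, just for illustration; it uses the same sigmoid idea as the Part 1 code:
#include <cmath>
// One neuron: weighted sum of the inputs, plus a bias, passed through an activation
float neuron_output(float x1, float x2, float w1, float w2, float bias) {
float z = x1 * w1 + x2 * w2 + bias; // weighted sum plus bias
return 1.0f / (1.0f + std::exp(-z)); // sigmoid squashes z into (0, 1)
}
For instance, neuron_output(0.3f, 0.7f, 0.5f, 0.5f, 0.0f) computes sigmoid(0.5), roughly 0.62.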
A layer is a collection of neurons that operate at the same level of the network. Information flows from one layer to the next. The three main types of layers are:
- Input layer: receives the raw data (e.g., the numbers you want to add).
- Hidden layers: perform transformations on the input, learning patterns through weighted connections.
- Output layer: produces the final prediction or result.

Each neuron in a layer is connected to the neurons in the next layer, and each of those connections has a weight that the network learns and adjusts during training.
As the sketch above shows, each node is a tiny decision-maker: it takes inputs, multiplies them by its weights, adds a bias, then applies an activation function (like sigmoid, tanh, or ReLU).
More nodes = more brainpower:
- A small number of nodes might quickly solve simple problems.
- More nodes mean your network can understand more complex patterns, but too many can slow it down or cause it to overthink (overfit)!
What if we stack layers? You can totally stack multiple hidden layers; that's a "deep neural network". Each layer learns something slightly more abstract (see the sketch after this list):
- First layer learns simple patterns.
- Second layer combines those patterns into more complex ideas.
- Third and further layers build increasingly abstract understanding.
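Here's a rough sketch of that stacking in C++, reusing the sigmoid function from the code above. The weights are arbitrary placeholders, not trained values; the point is just how each layer feeds the next:
// Hypothetical two-hidden-layer forward pass (placeholder weights)
float forward_deep(float x1, float x2) {
// First hidden layer: simple patterns from the raw inputs
float h1 = sigmoid(x1 * 0.4f + x2 * 0.3f + 0.1f);
float h2 = sigmoid(x1 * -0.2f + x2 * 0.5f + 0.0f);
// Second hidden layer: combinations of first-layer outputs
float g1 = sigmoid(h1 * 0.6f + h2 * -0.3f + 0.05f);
float g2 = sigmoid(h1 * -0.4f + h2 * 0.7f + 0.0f);
// Output layer: a prediction built from the more abstract features
return g1 * 0.5f + g2 * 0.5f + 0.0f;
}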
Weights and Biases in Neural Networks
What is a weight in a neural network?
In a neural network, a weight is a value that represents the strength of the connection between two neurons. When input data flows through the network, it gets multiplied by these weights. A higher weight means the input has a stronger influence on the output of the next neuron. During training, these weights are adjusted to help the network make better predictions.
What is a bias in a neural network?
A bias is an additional parameter in a neural network that allows the activation of a neuron to be shifted. It acts like an offset or threshold that helps the model fit the data better by enabling it to learn patterns that don't pass through the origin. For example, even when both inputs are zero, a neuron with bias b still outputs sigmoid(b), so it can fire without any input signal. Just like weights, biases are adjusted during training to improve predictions.
How do we set the starting weights?
When we start training, weights must be initialized to some values. In our code example, we simply set them manually at first:
float weight1_node1 = 0.5f, weight2_node1 = 0.5f, bias_node1 = 0.0f;
In more complex systems, weights are typically initialized randomly using small values (like between -0.5 and 0.5). This randomness helps prevent symmetry—where all neurons learn the same thing—and gives the network a better chance at discovering useful patterns. Common initialization strategies include Xavier (Glorot) and He initialization, which are designed to maintain a stable signal as it flows through the network. For simple experiments like ours, manual or small random values work well enough.
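For example, here's one way to draw small random starting weights in C++ with the standard <random> header (a minimal sketch; the seed and the [-0.5, 0.5] range are our choices, not requirements):
#include <random>
std::mt19937 rng(42); // fixed seed so runs are reproducible
std::uniform_real_distribution<float> dist(-0.5f, 0.5f);
// Each weight gets its own small random starting value
float weight1_node1 = dist(rng);
float weight2_node1 = dist(rng);
float bias_node1 = 0.0f; // biases are commonly started at zero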
How do we adjust the weights?
During training, here's how the adjustment works:
- The network makes a prediction using current weights.
- It compares the prediction with the correct answer (target output).
- It calculates how far off it was — this is the error or "loss".
- It uses backpropagation to calculate how each weight affected the error.
- Each weight is adjusted slightly to reduce the error for the next time — this step is done using gradients (derivatives) and a learning rate.
In our example C++ code, this process happens inside the nested training loops. Here's a quick reference:
float error = target - y_pred;
float delta_out = error;
float grad_out_w1 = delta_out * h1;
float grad_out_w2 = delta_out * h2;
float grad_out_w3 = delta_out * h3;
float grad_out_b = delta_out;
float delta_h1 = delta_out * weight_out_node1 * sigmoid_derivative(h1);
float delta_h2 = delta_out * weight_out_node2 * sigmoid_derivative(h2);
float delta_h3 = delta_out * weight_out_node3 * sigmoid_derivative(h3);
float grad_w1_node1 = delta_h1 * x1;
float grad_w2_node1 = delta_h1 * x2;
float grad_b_node1 = delta_h1;
// Update weights
weight_out_node1 += learning_rate * grad_out_w1;
weight_out_node2 += learning_rate * grad_out_w2;
weight_out_node3 += learning_rate * grad_out_w3;
bias_out += learning_rate * grad_out_b;
weight1_node1 += learning_rate * grad_w1_node1;
weight2_node1 += learning_rate * grad_w2_node1;
bias_node1 += learning_rate * grad_b_node1;
This snippet demonstrates how we calculate gradients and apply them to update the weights in both the output and hidden layers. Each iteration uses the error and the derivative of the activation function to nudge the weights in the direction that reduces the loss. Over thousands of iterations, the weights gradually settle into values that make the predictions more accurate.
Activation Functions: Giving your nodes personality
Activation functions decide whether a neuron should "fire" or not. They're applied to the result of the weighted sum and bias in each node, introducing non-linearity to the network—without which, the network could only learn linear relationships (not very useful for complex problems).
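To see why, drop the activation for a moment and chain two layers directly: out = w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2). That is still just a linear function of x, so however many layers you stack, without an activation function the whole network collapses into one linear transformation.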
Here are some common activation functions:
1. Sigmoid
- Formula: 1 / (1 + e^(-x))
- Output range: 0 to 1
- Use case: Great for binary classification or output layers where you need values between 0 and 1.
- Pros: Smooth, bounded output.
- Cons: Saturates at extremes (near 0 or 1), slows down learning due to vanishing gradients.

2. Tanh
- Formula: (e^x - e^(-x)) / (e^x + e^(-x))
- Output range: -1 to 1
- Use case: When you want zero-centered output.
- Pros: Stronger gradients than sigmoid; faster learning in hidden layers.
- Cons: Still suffers from vanishing gradients at the extremes.

3. ReLU (Rectified Linear Unit)
- Formula: max(0, x)
- Output range: 0 to infinity
- Use case: Default for most hidden layers in deep learning.
- Pros: Fast computation; helps with sparse activation.
- Cons: Neurons can "die" if their inputs stay negative (the "dying ReLU" problem).

When to use which activation function?
| Activation | Output Range | Good For | Common Layer Type | Notes |
|---|---|---|---|---|
| Sigmoid | 0 to 1 | Binary outputs | Output layer | Can cause vanishing gradients |
| Tanh | -1 to 1 | Zero-centered hidden layers | Hidden layer | Stronger gradients than sigmoid |
| ReLU | 0 to ∞ | Most deep networks and CNNs | Hidden layer | Fast and efficient, but can "die" on negatives |
In our example code, we used sigmoid for hidden layers. It works for small demos, but in larger networks, ReLU is often preferred for better performance and faster training.
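If you want to experiment with swapping activations, here's a minimal sketch of tanh and ReLU with their derivatives in C++ (std::tanh comes from <cmath>; keep in mind that changing the activation also means changing the derivative used in the backward pass):
#include <cmath>
// Tanh: zero-centered output in (-1, 1)
float tanh_act(float x) { return std::tanh(x); }
float tanh_derivative(float y) { return 1.0f - y * y; } // where y = tanh(x)
// ReLU: passes positives through unchanged, zeroes out negatives
float relu(float x) { return x > 0.0f ? x : 0.0f; }
float relu_derivative(float x) { return x > 0.0f ? 1.0f : 0.0f; } // takes x itself, not the activation output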
Derivatives and Backpropagation: How networks learn from mistakes
Imagine you're blindfolded, standing on a hill, and need to find your way downhill. You’d feel around with your foot to sense which direction slopes down, then move accordingly.
That's what neural networks do using derivatives—they check which small adjustments make predictions more accurate.
After making a prediction, the network calculates how wrong it was (we call this "loss"). It then uses something called backpropagation to trace that error backward through the network, figuring out exactly how each weight contributed to the mistake.
Here's the simplified backpropagation loop:
- Forward Pass: Guess the answer.
- Calculate Loss: How wrong was that guess?
- Backward Pass: Find out how each weight affected that error.
- Update Weights: Slightly adjust the weights to reduce future mistakes.
This process repeats thousands of times, gradually making the network smarter and smarter.
Let’s look at a simplified C++ implementation of this idea:
float sigmoid(float x) {
return 1.0f / (1.0f + exp(-x));
}
float sigmoid_derivative(float y) {
return y * (1 - y); // y = sigmoid(x)
}
This is our activation function and its derivative. They're used in both the forward and backward passes.
The training loop uses two inputs (a and b), normalizes them between 0 and 1, and trains the network to predict their sum. We use 3 hidden nodes and 1 output node. Here's a key part of the loop:
float error = target - y_pred;
total_loss += error * error;
float delta_out = error;
...
float delta_h1 = delta_out * weight_out_node1 * sigmoid_derivative(h1);
This is where we calculate the error, and use derivatives to push adjustments back through the network—backpropagation in action.
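Tracing one hidden weight makes the chain rule concrete: the gradient for weight1_node1 is delta_out * weight_out_node1 * sigmoid_derivative(h1) * x1, because y_pred depends on h1 through weight_out_node1, h1 depends on z1 through the sigmoid (hence sigmoid_derivative(h1)), and z1 depends on weight1_node1 through x1. That product is exactly delta_h1 * x1 in the code.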
At the end of training, we allow users to input two numbers, and the network predicts the sum. It’s not memorizing—it’s generalizing from what it’s learned.
Learning Rate and Epochs
As you may have noticed in the code, we use two important hyperparameters to control the training process: learning_rate and epochs.
What is a learning rate?
The learning rate determines how big each step is when updating the weights. A small learning rate (like 0.01) means the network makes tiny adjustments each time, which is safer but slower. A large learning rate might speed things up, but if it's too large, the network could overshoot and fail to learn properly.
In our code, we use:
float learning_rate = 0.01f;
This gives us a balance between speed and stability.
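To build intuition, here's a tiny standalone toy, separate from our network, that minimizes f(w) = (w - 3)^2 by gradient descent. With lr = 0.1f, w glides toward 3; change lr to 1.1f and each step overshoots further, so w diverges:
#include <iostream>
int main() {
float w = 0.0f; // starting guess
float lr = 0.1f; // try 1.1f to watch the updates overshoot and diverge
for (int step = 0; step < 20; ++step) {
float grad = 2.0f * (w - 3.0f); // derivative of (w - 3)^2 with respect to w
w -= lr * grad; // step against the gradient
std::cout << "step " << step << ": w = " << w << std::endl;
}
return 0;
}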
What is an epoch?
An epoch is one complete pass through the entire training dataset. In our example, for each epoch, we train the network using every combination of a and b from 0 to 99. Then we do it again for the next epoch.
We train over multiple epochs to allow the network to refine its weights again and again.
- If you increase the number of epochs, the network gets more chances to improve, potentially reducing the loss further. However, too many epochs can lead to overfitting, where the model becomes too tailored to the training data and performs poorly on new inputs.
- If you decrease the number of epochs, training will be faster, but the network might not learn enough and could underperform.
In the code, we use:
int epochs = 1000;
This is typically a good start for simple problems like learning addition, but you can adjust based on how quickly or slowly the loss goes down.
Can a neural network learn multiple things?
Absolutely. Neural networks are not limited to learning just one type of task. In fact, the same network architecture can be trained to do a variety of things—such as recognizing handwritten digits, translating languages, or even generating music—depending on the data it’s given and how it's trained.
In our example, the network learned to add two numbers, but with different training data and a modified output layer, it could also learn to:
- Subtract numbers
- Classify images
- Predict future values in a time series
The more tasks you want the network to handle—or the more complex the task—the more neurons and layers you may need. A simple problem like addition might only need a few nodes, while tasks like image recognition or language translation may require much deeper and wider networks. Just keep in mind: more complexity isn't always better—adding too many nodes can lead to overfitting, where the network memorizes rather than learns to generalize.
Here is the enhanced version of our neural network, where a single shared hidden layer feeds two output nodes: one trained to add the input numbers and one to subtract them:
#include <iostream>
#include <cmath>
using namespace std;
// Sigmoid activation function (squashes input into range [0, 1])
float sigmoid(float x) {
return 1.0f / (1.0f + exp(-x));
}
// Derivative of sigmoid (used for calculating gradients during backpropagation)
float sigmoid_derivative(float y) {
return y * (1 - y); // y = sigmoid(x)
}
int main() {
// === Hidden Layer Weights and Biases (3 hidden neurons, 2 inputs each) ===
float w1_n1 = 0.5f, w2_n1 = 0.5f, b1 = 0.0f;
float w1_n2 = -0.5f, w2_n2 = -0.5f, b2 = 0.0f;
float w1_n3 = 0.25f, w2_n3 = -0.25f, b3 = 0.0f;
// === Output Layer Weights and Biases (2 output neurons: sum and subtract) ===
float out_w1_sum = 0.5f, out_w2_sum = 0.5f, out_w3_sum = 0.5f, out_b_sum = 0.0f;
float out_w1_sub = -0.5f, out_w2_sub = -0.5f, out_w3_sub = 0.5f, out_b_sub = 0.0f;
float learning_rate = 0.01f;
int epochs = 1000;
// === Training Loop ===
for (int epoch = 0; epoch < epochs; ++epoch) {
float total_loss = 0.0f;
for (int a = 0; a < 100; ++a) {
for (int b = 0; b < 100; ++b) {
// === Normalize input values ===
float x1 = a / 100.0f;
float x2 = b / 100.0f;
// === Normalize targets ===
float target_sum = (a + b) / 200.0f; // Sum ranges from 0 to 198 → [0, ~1]
float target_sub = (a - b + 100.0f) / 200.0f; // Sub ranges from -99 to +99 → [0, ~1]
// === Forward Pass (Input → Hidden Layer) ===
float z1 = x1 * w1_n1 + x2 * w2_n1 + b1;
float h1 = sigmoid(z1);
float z2 = x1 * w1_n2 + x2 * w2_n2 + b2;
float h2 = sigmoid(z2);
float z3 = x1 * w1_n3 + x2 * w2_n3 + b3;
float h3 = sigmoid(z3);
// === Forward Pass (Hidden → Output Layer) ===
float y_pred_sum = h1 * out_w1_sum + h2 * out_w2_sum + h3 * out_w3_sum + out_b_sum;
float y_pred_sub = h1 * out_w1_sub + h2 * out_w2_sub + h3 * out_w3_sub + out_b_sub;
// === Compute Loss (Squared Error) ===
float error_sum = target_sum - y_pred_sum;
float error_sub = target_sub - y_pred_sub;
total_loss += error_sum * error_sum + error_sub * error_sub;
// === Backpropagation - Output Layer (Gradients for each output node) ===
float delta_out_sum = error_sum;
float grad_out_w1_sum = delta_out_sum * h1;
float grad_out_w2_sum = delta_out_sum * h2;
float grad_out_w3_sum = delta_out_sum * h3;
float grad_out_b_sum = delta_out_sum;
float delta_out_sub = error_sub;
float grad_out_w1_sub = delta_out_sub * h1;
float grad_out_w2_sub = delta_out_sub * h2;
float grad_out_w3_sub = delta_out_sub * h3;
float grad_out_b_sub = delta_out_sub;
// === Backpropagation - Hidden Layer ===
float delta_h1 = (delta_out_sum * out_w1_sum + delta_out_sub * out_w1_sub) * sigmoid_derivative(h1);
float delta_h2 = (delta_out_sum * out_w2_sum + delta_out_sub * out_w2_sub) * sigmoid_derivative(h2);
float delta_h3 = (delta_out_sum * out_w3_sum + delta_out_sub * out_w3_sub) * sigmoid_derivative(h3);
float grad_w1_n1 = delta_h1 * x1;
float grad_w2_n1 = delta_h1 * x2;
float grad_b1 = delta_h1;
float grad_w1_n2 = delta_h2 * x1;
float grad_w2_n2 = delta_h2 * x2;
float grad_b2 = delta_h2;
float grad_w1_n3 = delta_h3 * x1;
float grad_w2_n3 = delta_h3 * x2;
float grad_b3 = delta_h3;
// === Update Output Weights ===
out_w1_sum += learning_rate * grad_out_w1_sum;
out_w2_sum += learning_rate * grad_out_w2_sum;
out_w3_sum += learning_rate * grad_out_w3_sum;
out_b_sum += learning_rate * grad_out_b_sum;
out_w1_sub += learning_rate * grad_out_w1_sub;
out_w2_sub += learning_rate * grad_out_w2_sub;
out_w3_sub += learning_rate * grad_out_w3_sub;
out_b_sub += learning_rate * grad_out_b_sub;
// === Update Hidden Weights ===
w1_n1 += learning_rate * grad_w1_n1;
w2_n1 += learning_rate * grad_w2_n1;
b1 += learning_rate * grad_b1;
w1_n2 += learning_rate * grad_w1_n2;
w2_n2 += learning_rate * grad_w2_n2;
b2 += learning_rate * grad_b2;
w1_n3 += learning_rate * grad_w1_n3;
w2_n3 += learning_rate * grad_w2_n3;
b3 += learning_rate * grad_b3;
}
}
// === Print Loss Every 100 Epochs ===
if ((epoch + 1) % 100 == 0 || epoch == 0)
cout << "[Epoch " << epoch + 1 << "] Loss: " << total_loss << endl;
}
// === Test Phase ===
float a, b;
cout << "\nTest prediction\nEnter a: ";
cin >> a;
cout << "Enter b: ";
cin >> b;
float x1 = a / 100.0f;
float x2 = b / 100.0f;
// Forward pass again for user input
float h1 = sigmoid(x1 * w1_n1 + x2 * w2_n1 + b1);
float h2 = sigmoid(x1 * w1_n2 + x2 * w2_n2 + b2);
float h3 = sigmoid(x1 * w1_n3 + x2 * w2_n3 + b3);
float y_sum = h1 * out_w1_sum + h2 * out_w2_sum + h3 * out_w3_sum + out_b_sum;
float y_sub = h1 * out_w1_sub + h2 * out_w2_sub + h3 * out_w3_sub + out_b_sub;
// Denormalize outputs
float predicted_sum = y_sum * 200.0f;
float predicted_sub = y_sub * 200.0f - 100.0f;
cout << "Predicted sum: " << predicted_sum << endl;
cout << "Actual sum: " << (a + b) << endl;
cout << "Predicted difference: " << predicted_sub << endl;
cout << "Actual difference: " << (a - b) << endl;
return 0;
}
Wrapping up
That was a lot to take in—but kind of awesome, right? We didn’t just talk theory; we watched a neural network learn how to do something real: adding and subtracting numbers. No rules hardcoded—just weights adjusting themselves through training. Along the way, we got to know learning rates, activation functions, and the backpropagation magic that ties it all together.
In the next part, we'll take a step back from the code and dive into the story behind all this. Where did neural networks come from? Why has AI exploded in just the last few years? What changed? We'll explore how it all started, and why it's just getting started.