The Math Behind AI: Let’s Start Simple

How would you normally write a program to add two numbers under 100? Easy, right? You'd typically write something straightforward like:

#include <iostream>
using namespace std;

int main() {
    int a, b, sum;

    cout << "Enter first number: ";
    cin >> a;

    cout << "Enter second number: ";
    cin >> b;

    sum = a + b;

    cout << "The sum is: " << sum << endl;

    return 0;
}

But let's take a more exciting route—let's teach a machine to learn addition all on its own, using a neural network. Neural networks are at the heart of modern AI, powering everything from image recognition and language translation to advanced systems like ChatGPT.

In this article, we’ll show how a neural network can "learn" to add two numbers through training rather than by following a hardcoded rule: it figures out the pattern from examples. We’ll even reveal the actual numbers (the magic behind the curtain) that make this happen.

Here's a peek at a trained neural network's internal structure:

// Hidden layer weights and biases
double weight1_node1 = 0.658136;
double weight2_node1 = 0.840666;
double bias_node1 = -0.893218;

double weight1_node2 = -0.720667;
double weight2_node2 = -0.369172;
double bias_node2 = 0.036762;

double weight1_node3 = 0.512252;
double weight2_node3 = 0.292342;
double bias_node3 = -0.0745917;

// Output layer weights and bias
double weight_out_node1 = 1.93108;
double weight_out_node2 = -0.718584;
double weight_out_node3 = 0.589741;
double bias_out = -0.467899;

At first glance, these values might look random, but they represent a tiny artificial brain capable of learning how to add two numbers.

To compute the output, we run the inputs through the network like this. Don't worry if it seems complicated right now; just follow along:

Here, x1 and x2 are the two numbers you want to add (normalized by dividing by 100):

double x1 = 0.3; // 30 / 100
double x2 = 0.5; // 50 / 100

// Hidden layer
double z1 = x1 * weight1_node1 + x2 * weight2_node1 + bias_node1;
double h1 = sigmoid(z1);

double z2 = x1 * weight1_node2 + x2 * weight2_node2 + bias_node2;
double h2 = sigmoid(z2);

double z3 = x1 * weight1_node3 + x2 * weight2_node3 + bias_node3;
double h3 = sigmoid(z3);

// Output layer
double sum = (h1 * weight_out_node1 + h2 * weight_out_node2 + h3 * weight_out_node3 + bias_out) * 200.0;

Let's break down why we multiply by 200: we scaled our inputs and outputs to fit between 0 and 1 for easier training. Since each input stays below 100, the sum of the two can never exceed 200, so we scale the network's output back up by multiplying it by 200.
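If it helps to see that scaling spelled out, here's a tiny illustrative pair of helpers (the names are made up for this article; they're not part of the trained network):

// Purely illustrative helpers for the scaling described above
double normalize_input(double x)  { return x / 100.0; }  // maps 0..100 down to 0..1
double denormalize_sum(double y)  { return y * 200.0; }  // maps 0..1 back up to 0..200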

Sigmoid function:

double sigmoid(double x) {
    return 1.0 / (1.0 + exp(-x));
}  

Let’s try it out with some real numbers:

Let’s use x1 = 30 and x2 = 50. We normalize them first:

x1 = 30 / 100 = 0.3
x2 = 50 / 100 = 0.5

Now we run through the math:

z1 = (0.3 × 0.658136) + (0.5 × 0.840666) - 0.893218
   ≈ 0.1974408 + 0.420333 - 0.893218
   ≈ -0.2754442
h1 = sigmoid(z1) ≈ 0.431844

z2 = (0.3 × -0.720667) + (0.5 × -0.369172) + 0.036762
   ≈ -0.2162001 - 0.184586 + 0.036762
   ≈ -0.3640241
h2 = sigmoid(z2) ≈ 0.409872

z3 = (0.3 × 0.512252) + (0.5 × 0.292342) - 0.0745917
   ≈ 0.1536756 + 0.146171 - 0.0745917
   ≈ 0.2252549
h3 = sigmoid(z3) ≈ 0.556078

// Final output layer
output = (h1 × 1.93108) + (h2 × -0.718584) + (h3 × 0.589741) - 0.467899
       ≈ (0.834389) + (-0.294320) + (0.328013) - 0.467899
       ≈ 0.400183

Predicted sum = 0.400183 × 200 ≈ 80.0366
// Multiply by 200 to undo the normalization: inputs and outputs were scaled to the 0-1 range during training, and the largest possible sum is 200.

Boom! The result is approximately 80, which is the correct sum of 30 and 50.

If you don’t believe this little network can really add, let’s test it again with different numbers: 20 and 30.

We normalize the inputs first:

x1 = 20 / 100 = 0.2
x2 = 30 / 100 = 0.3

Then the calculations:

z1 = (0.2 × 0.658136) + (0.3 × 0.840666) - 0.893218
   ≈ 0.1316272 + 0.2521998 - 0.893218
   ≈ -0.509391
h1 = sigmoid(z1) ≈ 0.375241

z2 = (0.2 × -0.720667) + (0.3 × -0.369172) + 0.036762
   ≈ -0.1441334 - 0.1107516 + 0.036762
   ≈ -0.218123
h2 = sigmoid(z2) ≈ 0.445714

z3 = (0.2 × 0.512252) + (0.3 × 0.292342) - 0.0745917
   ≈ 0.1024504 + 0.0877026 - 0.0745917
   ≈ 0.1155613
h3 = sigmoid(z3) ≈ 0.528860

// Final output layer
output = (h1 × 1.93108) + (h2 × -0.718584) + (h3 × 0.589741) - 0.467899
       ≈ (0.724688) + (-0.320095) + (0.311644) - 0.467899
       ≈ 0.248338

Predicted sum = 0.248338 × 200 ≈ 49.6676

It predicted about 49.67. That’s super close to the real sum: 50!

Pretty neat, huh?
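
If you want to check these numbers yourself, here's a small, self-contained program that hard-codes the trained weights from above and runs the same forward pass (a minimal sketch just for verification; the predict_sum name is ours, not part of the training code):

#include <iostream>
#include <cmath>
using namespace std;

double sigmoid(double x) {
    return 1.0 / (1.0 + exp(-x));
}

// Forward pass using the trained weights listed earlier in the article
double predict_sum(double a, double b) {
    double x1 = a / 100.0, x2 = b / 100.0; // normalize the inputs

    double h1 = sigmoid(x1 * 0.658136 + x2 * 0.840666 - 0.893218);
    double h2 = sigmoid(x1 * -0.720667 + x2 * -0.369172 + 0.036762);
    double h3 = sigmoid(x1 * 0.512252 + x2 * 0.292342 - 0.0745917);

    double out = h1 * 1.93108 + h2 * -0.718584 + h3 * 0.589741 - 0.467899;
    return out * 200.0; // scale back up to the 0-200 range
}

int main() {
    cout << "30 + 50 ~ " << predict_sum(30, 50) << endl; // prints roughly 80
    cout << "20 + 30 ~ " << predict_sum(20, 30) << endl; // prints roughly 50
    return 0;
}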

So... What's a Neural Network Anyway?

Quick intro: A neural network is like a mathematical version of your brain. It consists of units called neurons that process inputs by multiplying them by weights, adding a bias, and then applying an activation function. A common activation function is the sigmoid:

sigmoid(x) = 1 / (1 + e^(-x))

This squashes any input into a value between 0 and 1, enabling the network to capture complex patterns.
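
You can get a feel for that squashing by plugging in a few values (a quick standalone check using the same sigmoid as above):

#include <iostream>
#include <cmath>
using namespace std;

double sigmoid(double x) {
    return 1.0 / (1.0 + exp(-x));
}

int main() {
    // Large negative inputs land near 0, zero lands at 0.5, large positive inputs land near 1
    cout << sigmoid(-5.0) << endl; // about 0.0067
    cout << sigmoid(0.0) << endl;  // exactly 0.5
    cout << sigmoid(5.0) << endl;  // about 0.9933
    return 0;
}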

This is the graph of the sigmoid function, clearly showing how inputs are smoothly transformed into values between 0 and 1:

There are other activation functions too, like ReLU (Rectified Linear Unit), tanh, and softmax, each with its own use cases and behaviors. But sigmoid is a great place to start because it’s simple and smooth—perfect for small neural networks like ours.
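
For the curious, here's roughly what those alternatives look like in C++ (a sketch for reference only; we stick with sigmoid in this article):

#include <cmath>
#include <vector>

// ReLU: passes positive values through unchanged and clamps negatives to zero
double relu(double x) { return x > 0.0 ? x : 0.0; }

// tanh: squashes inputs into (-1, 1); the standard library provides it directly
double tanh_activation(double x) { return std::tanh(x); }

// Softmax: turns a vector of scores into probabilities that sum to 1
std::vector<double> softmax(const std::vector<double>& scores) {
    double max_score = scores[0];
    for (double s : scores) if (s > max_score) max_score = s; // subtract the max for numerical stability
    std::vector<double> probs;
    double total = 0.0;
    for (double s : scores) { probs.push_back(std::exp(s - max_score)); total += probs.back(); }
    for (double& p : probs) p /= total;
    return probs;
}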

The Setup: Teaching Our Network to Add

For this experiment, we created a small network with:

  • 2 input nodes (for the two numbers to add)
  • 3 hidden nodes (doing the heavy lifting with sigmoid magic)
  • 1 output node (giving us the sum)

Here’s a diagram showing the structure of the neural network we’re using to add two numbers. You can see how each input node connects to all hidden layer nodes, how weights and biases are applied, and how everything flows toward the final output:

And yes, we scaled everything. Since our numbers are under 100, we divide inputs by 100 and scale the output back up by 200 at the end.

Here's the full code that trains the network and then lets you test it with your own numbers:

#include <iostream>
#include <cmath>
using namespace std;

float sigmoid(float x) {
    return 1.0f / (1.0f + exp(-x));
}

float sigmoid_derivative(float y) {
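    // Uses the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)),
    // which is why this takes the already-computed activation y instead of x.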
    return y * (1 - y); // y = sigmoid(x)
}

int main() {
    float weight1_node1 = 0.5f, weight2_node1 = 0.5f, bias_node1 = 0.0f;
    float weight1_node2 = -0.5f, weight2_node2 = -0.5f, bias_node2 = 0.0f;
    float weight1_node3 = 0.25f, weight2_node3 = -0.25f, bias_node3 = 0.0f;

    float weight_out_node1 = 0.5f, weight_out_node2 = 0.5f, weight_out_node3 = 0.5f, bias_out = 0.0f;

    float learning_rate = 0.01f;
    int epochs = 1000;

    for (int epoch = 0; epoch < epochs; ++epoch) {
        float total_loss = 0.0f;
        for (int a = 0; a < 100; ++a) {
            for (int b = 0; b < 100; ++b) {
                float x1 = a / 100.0f;
                float x2 = b / 100.0f;
                float target = (a + b) / 200.0f; // Normalize target to 0-1 range

                // Forward pass for hidden layer
                float z1 = x1 * weight1_node1 + x2 * weight2_node1 + bias_node1;
                float h1 = sigmoid(z1);

                float z2 = x1 * weight1_node2 + x2 * weight2_node2 + bias_node2;
                float h2 = sigmoid(z2);

                float z3 = x1 * weight1_node3 + x2 * weight2_node3 + bias_node3;
                float h3 = sigmoid(z3);

                // Output layer with linear activation
                float y_pred = h1 * weight_out_node1 + h2 * weight_out_node2 + h3 * weight_out_node3 + bias_out;

                float error = target - y_pred;
                total_loss += error * error;

                // Backward pass (output layer - linear)
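                // The output node is linear, so its gradient is just the error itself;
                // the factor of 2 from the squared-error loss is absorbed into the learning rate.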
                float delta_out = error;
                float grad_out_w1 = delta_out * h1;
                float grad_out_w2 = delta_out * h2;
                float grad_out_w3 = delta_out * h3;
                float grad_out_b = delta_out;

                // Backward pass (hidden layer)
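                // Chain rule: each hidden node's delta is the output delta, times the weight
                // connecting it to the output, times the sigmoid derivative at that node's activation.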
                float delta_h1 = delta_out * weight_out_node1 * sigmoid_derivative(h1);
                float delta_h2 = delta_out * weight_out_node2 * sigmoid_derivative(h2);
                float delta_h3 = delta_out * weight_out_node3 * sigmoid_derivative(h3);

                float grad_w1_node1 = delta_h1 * x1;
                float grad_w2_node1 = delta_h1 * x2;
                float grad_b_node1 = delta_h1;

                float grad_w1_node2 = delta_h2 * x1;
                float grad_w2_node2 = delta_h2 * x2;
                float grad_b_node2 = delta_h2;

                float grad_w1_node3 = delta_h3 * x1;
                float grad_w2_node3 = delta_h3 * x2;
                float grad_b_node3 = delta_h3;

                // Update weights
                weight_out_node1 += learning_rate * grad_out_w1;
                weight_out_node2 += learning_rate * grad_out_w2;
                weight_out_node3 += learning_rate * grad_out_w3;
                bias_out += learning_rate * grad_out_b;

                weight1_node1 += learning_rate * grad_w1_node1;
                weight2_node1 += learning_rate * grad_w2_node1;
                bias_node1 += learning_rate * grad_b_node1;

                weight1_node2 += learning_rate * grad_w1_node2;
                weight2_node2 += learning_rate * grad_w2_node2;
                bias_node2 += learning_rate * grad_b_node2;

                weight1_node3 += learning_rate * grad_w1_node3;
                weight2_node3 += learning_rate * grad_w2_node3;
                bias_node3 += learning_rate * grad_b_node3;
            }
        }

        if ((epoch + 1) % 100 == 0 || epoch == 0)
            cout << "[Summary] Epoch " << epoch + 1 << ": Loss = " << total_loss << endl;
    }

    // Test the model
    float a, b;
    cout << "\nTest sum prediction (a + b)\nEnter a: ";
    cin >> a;
    cout << "Enter b: ";
    cin >> b;

    float x1 = a / 100.0f;
    float x2 = b / 100.0f;

    float h1 = sigmoid(x1 * weight1_node1 + x2 * weight2_node1 + bias_node1);
    float h2 = sigmoid(x1 * weight1_node2 + x2 * weight2_node2 + bias_node2);
    float h3 = sigmoid(x1 * weight1_node3 + x2 * weight2_node3 + bias_node3);
    float y_pred = h1 * weight_out_node1 + h2 * weight_out_node2 + h3 * weight_out_node3 + bias_out;
    float predicted_sum = y_pred * 200.0f;

    cout << "w1_node1: " << weight1_node1 << ", w2_node1: " << weight2_node1
                             << ", w1_node2: " << weight1_node2 << ", w2_node2: " << weight2_node2
                             << ", w1_node3: " << weight1_node3 << ", w2_node3: " << weight2_node3
                             << ", out_w1: " << weight_out_node1 << ", out_w2: " << weight_out_node2
                             << ", out_w3: " << weight_out_node3 << ", bias_out: " << bias_out
                             << ", bias_node1: " << bias_node1 << ", bias_node2: " << bias_node2 << ", bias_node3: " << bias_node3 << endl;

    cout << "Predicted sum: " << predicted_sum << "\nActual sum: " << (a + b) << endl;

    return 0;
}
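
To try it yourself, save the code to a file (for example add_nn.cpp, any name works), build it with a command like g++ -O2 add_nn.cpp -o add_nn, and run the resulting program with two numbers under 100. The printed weights should come out close to the values shown at the top of the article.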

So... How Did It Learn?

Here’s the cool part: we didn’t code the rules for addition. We just gave the network a bunch of number pairs and their sums, and it slowly adjusted those weights and biases until the predictions got better and better.

It’s kind of like trial and error but smarter—with a sprinkle of calculus.
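
That "sprinkle of calculus" is gradient descent. After every example, the backward pass works out an error signal for each weight, and the weight gets nudged a tiny step in the direction that reduces the squared error:

new_weight = old_weight + learning_rate × (error signal for that weight)

That's exactly what all the += update lines in the training loop are doing, with the learning rate set to 0.01 so each step stays small.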

Did It Really Learn to Add?

Eh, not exactly like we do. It doesn’t understand what “plus” means. But it does learn a super good approximation. It sees patterns in numbers and predicts sums that are often spot-on.

It’s more like learning by feeling out patterns rather than knowing rules. If you're curious, you can change the input numbers and test it. The more you play, the more you'll understand how these little brains work.

Wrapping Up

Teaching a neural network to add two numbers might sound silly at first, but it’s a great way to peek inside the mind of AI. Behind all the hype, it’s just weights, biases, and some smart math.

In the next part, we’ll dive deeper into how neural networks work: we’ll explore layers, activation functions, derivatives, and what really happens during training.