The Proof of Learning in Machine Learning/AI

Before any mathematical development, we must first understand the foundation of learning and how it is closely linked to the concept of error.

The Hypothetical Cook

Imagine that, on any given day, you decide to replicate a delicacy you ate at a renowned restaurant. You remember the taste of this delicacy perfectly. Based on this, you search for the recipe online and attempt to reproduce it at home.

Let’s denote the taste of the delicacy you ate at the restaurant as T, which will represent the expected taste, your target. Based on the recipe you found online, you hope to achieve this goal, i.e., the taste T.

To reproduce this recipe, you follow all the indicated steps, use all the ingredients, the necessary temperature, the cooking time, etc. Let’s denote all these methods and ingredients as X.

After completing the entire process, you taste the dish. At this moment, you judge whether it is similar to the expected taste T. You notice that it is saltier or sweeter than expected. The taste of the delicacy you reproduced at home will be represented by Y.

Therefore, upon realizing that the taste is different from the target T, you assign a quantitative measure of how different taste Y is from the target taste. In other words, you judge whether you could have added more or less salt, more or less seasoning.

The difference between T and Y can be defined as the error E. The distinction between T and Y is made by the memory of your palate. Therefore, your palate performs a specific function at this moment, which we can define as P(Y) = E. In other words, when experiencing taste Y, the palate assigns the error E based on the target taste T.

Having this quantitative measure of error E, we can reproduce this recipe every day so that with each passing day, the error E decreases. In other words, the distance between the target taste T and the taste Y decreases until T = Y.

Based on this hypothetical scenario, we can define error as the degree to which the observed result disagrees with what was expected, and there is always a function that performs this act of judging. In the case above, taste and memory together created that judging function.

The act of learning, in this specific case, is characterized by the ability to reduce error. In other words, it is the ability to interact in different ways with the reproduced object in order to decrease the output of the judging function.

The Cook’s Expertise

Returning to the hypothetical case, we have the ingredients and methods X as indicated by the recipe. All the ingredients and equipment are the same as those used by the restaurant; therefore, the outcome depends solely on your ability to manipulate them correctly to achieve the target taste T.

In other words, you manipulate X to obtain Y. Therefore, we can define that you are essentially a function that transforms X into Y, denoted as f(X) = Y.

The function f(X), which represents the act of manipulating the ingredients, also depends on how your brain functions. In other words, if you have had culinary experiences, you will find it easier to transform X into Y.

Let’s define W as the weights of your neurons or your neural capacity to manipulate X. If W is already pre-adjusted based on culinary experiences, it will be easier to transform X into Y. Otherwise, we will need to adjust W until we can transform X into Y.

Therefore, we know that f(X) = Y also depends on W; for simplicity, we can represent this dependence linearly as f(X) = WX.

Thus, our goal is to discover how we can modify W until the generated Y is very close or equal to T. In other words, how can we adjust W until the error E significantly decreases or becomes zero.
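To make this idea concrete, here is a minimal sketch in Python, assuming (purely for illustration) a small numeric X, a linear cook f(X) = WX, and a squared-difference palate P(Y) = (Y - T)^2; none of these specific values come from the scenario above.

```python
import numpy as np

# Purely illustrative values (assumptions, not part of the original scenario):
X = np.array([0.5, 1.0, 2.0])    # the ingredients and methods
T = 7.0                          # the target taste achieved by the restaurant

def cook(W, X):
    """The cook as a function f(X) = WX: weights W transform ingredients X into a taste Y."""
    return np.dot(W, X)

def palate(Y):
    """The judging function P(Y) = E: how far the reproduced taste Y is from the target T."""
    return (Y - T) ** 2

W_unadjusted = np.array([0.1, 0.1, 0.1])   # no culinary experience
W_adjusted   = np.array([2.0, 2.0, 2.0])   # weights tuned by practice

print(palate(cook(W_unadjusted, X)))   # large error E
print(palate(cook(W_adjusted, X)))     # error E close to zero: Y is close to T
```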

The Cost Function

The function that evaluates the difference between the result and the expected outcome is the cost function. The function that converts the ingredients and culinary methods into the delicacy is our model, which can be an artificial neural network or other machine learning models.

Eq. (1): E = E(w_1, w_2, \ldots, w_n)

Equation (1) defines the cost function E, which depends on the n weights w. In other words, it is a function that indicates the error based on the values of w. In the specific case where the n weights w are not adjusted, the value of the error E will be large. Conversely, when the weights are properly adjusted, the value of the error E will be small or zero.

Eq. (2): E(w_1, w_2, \ldots, w_n) \approx 0

Therefore, our objective is to find the values of the n weights w such that the condition above is true.
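As a concrete example (an assumption for illustration, since no particular form is fixed here), the cost could be the squared difference between the model output Y = f(X; w_1, \ldots, w_n) and the target T, so that minimizing E means driving Y toward T:

E(w_1, \ldots, w_n) = \bigl( T - f(X;\, w_1, \ldots, w_n) \bigr)^2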

The Gradient

To facilitate understanding of how we will do this, we will define the following function:

Img. (1): the definition of the example function f(x, y).

Therefore, we intuitively know that when x = 0 and y = 0, f(x, y) = 0. However, we want an algorithm that, given random x and y values, adjusts the values of x and y until the function f(x, y) equals zero.

To achieve this, we can use the gradient of the function. In vector calculus, the gradient is a vector that indicates the direction and magnitude of the greatest possible increase in the value of a quantity when moving away from a specified point.

Eq. (3): \nabla f(x, y) = \left( \dfrac{\partial f}{\partial x}, \dfrac{\partial f}{\partial y} \right)

That is, by applying the gradient to the function f(x, y), we obtain a vector, as shown in the equation (3), which informs how to increment the values of x and y so that the value of f(x, y) grows. However, our goal is to find the values of x and y necessary for the function f(x, y) = 0. Therefore, we can use the negative gradient.
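As a quick worked example, suppose (as an illustrative assumption) that the function in img. (1) is f(x, y) = x^2 + y^2, whose minimum value f = 0 lies at the origin. Then:

\nabla f(x, y) = (2x,\; 2y), \qquad -\nabla f(x, y) = (-2x,\; -2y)

At any point (x, y), the negative gradient points back toward the origin, that is, toward the minimum of the function.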

Below is a two-dimensional representation of the function f(x, y), where the coloring shows the value of f(x, y). Using the negative gradient, we see the vectors pointing toward the minimum of the function.

Img. (2): two-dimensional plot of f(x, y), colored by its value, with the negative-gradient vectors pointing toward the minimum.

Based on this, we can develop a method to update x and y using the gradient field of the function f(x, y) to find the necessary values for f(x, y) = 0.

The Proof of Learning

We will consider a simple function f(x) on which to test the algorithm. Our intention is to find the minimum of this function, and to do so we can apply the gradient of f(x).

Eq. (4): \nabla f(x) = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}

Above, we have the gradient of the function f(x), written using the definition of the derivative. We will not delve into the concept of the derivative in this article, but we recommend reading about its definition and why it can be represented in this way.

Knowing that h tends to zero, we can represent the gradient of f(x) as follows:

Eq. (5): \nabla f(x) \approx \dfrac{f(x + h) - f(x)}{h}

Based on this, we can replace h with the following term:

Eq. (6): h = -\alpha \nabla f(x)

We introduce the element alpha to preserve the role of the term h: like h, alpha must be strictly positive and always tend to zero. Substituting this new relationship into the definition of the derivative, we have:

Eq. (7): \nabla f(x) \approx \dfrac{f(x - \alpha \nabla f(x)) - f(x)}{-\alpha \nabla f(x)} \;\;\Rightarrow\;\; f(x - \alpha \nabla f(x)) \approx f(x) - \alpha \left( \nabla f(x) \right)^2

Now we have a valuable relationship for our proof. We know that the square of any real number is non-negative; it is precisely for this reason that we replaced h with minus alpha times the gradient of f(x).

So:

Eq. (8): \alpha \left( \nabla f(x) \right)^2 \geq 0

Therefore, we can judge that the condition in (8) is true as long as alpha is always a positive value.

Eq. (9): f(x) - \alpha \left( \nabla f(x) \right)^2 \leq f(x)

That is, the value of f(x) minus a strictly positive quantity will always be less than the original value of f(x). Therefore, combining eq. (7) and (9), we can write the following relationship:

Eq. (10): f(x - \alpha \nabla f(x)) \leq f(x)

Therefore, we have a proven relationship showing how to update the value of x so that the new value of f(x) is no larger than its previous value.

Eq. (11): f(x_{t+1}) \leq f(x_t)

So, we know how to update the current x to satisfy inequality (11):

Eq. (12): x_{t+1} = x_t - \alpha \nabla f(x_t)
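As a quick sanity check of eq. (12), take the illustrative one-dimensional function f(x) = x^2, so that \nabla f(x) = 2x. Then:

x_{t+1} = x_t - \alpha \cdot 2x_t = (1 - 2\alpha)\, x_t, \qquad f(x_{t+1}) = (1 - 2\alpha)^2\, x_t^2 \leq x_t^2 = f(x_t) \quad \text{for } 0 < \alpha < 1

So each application of the update rule brings f(x) closer to its minimum, exactly as the inequality requires.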

To confirm the validity of this relationship, we can apply this methodology to the function f(x, y) from img. (1), whose behavior we know. So:

Eq. (13): x_{t+1} = x_t - \alpha \dfrac{\partial f}{\partial x}, \qquad y_{t+1} = y_t - \alpha \dfrac{\partial f}{\partial y}

Applying this algorithm to the function f(x, y) numerous times, we expect the value of the function to decrease until it reaches the minimum. To verify this, we conducted a simulation in which, in addition, noise was added to the updated values of x and y in order to better visualize the decrease in f(x, y).

Img. (3): simulated trajectories of x, y and of the value of f(x, y) over the iterations, for different values of alpha.

Notice that when the value of alpha is small (tending to zero), the values of x and y tend toward the minimum of the function. When this is not the case, for example at alpha = 0.6, the algorithm has a certain difficulty in finding the minimum of f(x, y).
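Below is a minimal sketch of such a simulation in Python. It assumes the illustrative function f(x, y) = x^2 + y^2; the starting point, noise level, and learning rates are arbitrary choices, not the exact values used above.

```python
import numpy as np

def f(x, y):
    # Illustrative example function (an assumption; the article's f(x, y) is the one in img. (1)).
    return x**2 + y**2

def grad_f(x, y):
    # Gradient of the example function.
    return np.array([2 * x, 2 * y])

def simulate(alpha, steps=50, noise=0.01, seed=0):
    """Noisy gradient descent: apply eq. (13) and add small noise to the updated x and y."""
    rng = np.random.default_rng(seed)
    point = rng.uniform(-2.0, 2.0, size=2)            # random starting values of x and y
    for _ in range(steps):
        point = point - alpha * grad_f(*point)        # update x and y with the negative gradient
        point = point + rng.normal(0.0, noise, 2)     # noise applied to the updated values
    return f(*point)

for alpha in (0.05, 1.1):
    # A small alpha approaches the minimum; a too-large alpha overshoots and diverges for this f.
    print(f"alpha = {alpha}: final f(x, y) = {simulate(alpha):.4f}")
```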

Gradient Descent

This algorithm is known as “Gradient Descent” or the “Method of Steepest Descent.” It is an optimization method for finding the minimum of a function in which each step is taken in the direction of the negative gradient. This method does not guarantee that the global minimum of the function will be found, only a local minimum.

Discussions about finding the global minimum could be developed in another article, but here, we have mathematically demonstrated how the gradient can be used for this purpose.
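To illustrate this limitation, the short sketch below runs the same update rule on an arbitrarily chosen non-convex function (not one used in this article); started from different points, the algorithm settles into different minima, and only one of them is the global minimum.

```python
def f(x):
    # An arbitrary non-convex function with two minima (chosen for illustration only).
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of the illustrative function.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, alpha=0.01, steps=500):
    """Repeatedly apply the update x <- x - alpha * grad_f(x)."""
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x

# The same algorithm, started from different points, ends in different minima.
print(gradient_descent(x=+2.0))   # converges to the local minimum near x ≈ 1.1
print(gradient_descent(x=-2.0))   # converges to the global minimum near x ≈ -1.3
```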

Now, applying it to the cost function E that depends on the n weights w, we have:

Eq. (14): \nabla E = \left( \dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_n} \right)

To update all elements of W based on gradient descent, we have:

Eq. (15): W_{t+1} = W_t - \alpha \nabla E(W_t)

And for any nth element w of the vector W, we have:

Eq. (16): w_{n,\,t+1} = w_{n,\,t} - \alpha \dfrac{\partial E}{\partial w_n}

Therefore, we have our theoretical learning algorithm. Of course, it is not applied to the hypothetical cook, but rather to the numerous machine learning algorithms we know today.
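As a minimal sketch of eqs. (14) to (16) in Python, consider (as an illustrative assumption) a linear model Y = XW, echoing the cook's f(X) = WX, with a mean-squared-error cost; the data values are invented for the example.

```python
import numpy as np

# Invented toy data: each row of X is one set of "ingredients", T holds the target outputs.
X = np.array([[0.5, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.0, 0.5]])
T = np.array([7.0, 4.0, 5.0])

def E(W):
    """Cost function: mean squared error between the model output Y = XW and the target T."""
    Y = X @ W
    return np.mean((Y - T) ** 2)

def grad_E(W):
    """Gradient of E with respect to each weight w_n, as in eq. (14)."""
    Y = X @ W
    return 2.0 * X.T @ (Y - T) / len(T)

W = np.zeros(3)     # unadjusted weights: the error E starts out large
alpha = 0.1         # the strictly positive learning rate

print(E(W))                          # large initial error
for _ in range(200):
    W = W - alpha * grad_E(W)        # eq. (15): update every weight along the negative gradient
print(E(W))                          # after training, the error is close to zero
```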

Conclusion

Based on what we have seen, we can conclude the demonstration and the mathematical proof of the theoretical learning algorithm. Such a structure is applied to numerous learning methods such as AdaGrad, Adam, and Stochastic Gradient Descent (SGD).

This method does not guarantee finding the values of the n weights w for which the cost function yields a result of zero or very close to it. However, it does assure us that a local minimum of the cost function will be found.

To help address the issue of local minima, there are several more robust methods, such as SGD and Adam, which are commonly used in deep learning.

Nevertheless, understanding the structure and the mathematical proof of the theoretical learning algorithm based on gradient descent will facilitate the comprehension of more complex algorithms.
