Training and Loss

Training a model means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for the individual examples and then divide by the number of examples:

$MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2$

where:

• $(x, y)$ is an example in which
  • $x$ is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
  • $y$ is the example's label (for example, temperature).
• $prediction(x)$ is a function of the weights and bias in combination with the set of features $x$.
• $D$ is a data set containing many labeled examples, which are $(x, y)$ pairs.
• $N$ is the number of examples in $D$.

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
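The MSE definition above translates directly into a few lines of code. The following sketch uses a hypothetical linear model and made-up data purely for illustration:

```python
# Mean squared error computed directly from its definition:
# sum the squared losses over all (x, y) examples, divide by N.
def mse(examples, prediction):
    """examples: list of (x, y) pairs; prediction: a function of x."""
    return sum((y - prediction(x)) ** 2 for x, y in examples) / len(examples)

# Hypothetical linear model: prediction(x) = w * x + b.
w, b = 0.5, 1.0
data = [(1.0, 1.4), (2.0, 2.1), (3.0, 2.4)]

# Each prediction is off by 0.1, so MSE = (3 * 0.1^2) / 3 = 0.01.
print(mse(data, lambda x: w * x + b))
```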

Reducing Loss

An Iterative Approach

Figure 1. An iterative approach to training a model.

The gradient of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

$\nabla f$

For instance, if:

$f(x,y) = e^{2y}\sin(x)$

then:

$\nabla f(x,y) = \left(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y)\right) = (e^{2y}\cos(x), 2e^{2y}\sin(x))$
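The analytic gradient above can be sanity-checked numerically. This sketch compares the formula against central finite differences at an arbitrary point:

```python
import math

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

def grad_f(x, y):
    # Analytic gradient from the formula above.
    return (math.exp(2 * y) * math.cos(x),
            2 * math.exp(2 * y) * math.sin(x))

# Central finite differences approximate each partial derivative.
h = 1e-6
x, y = 0.7, 0.3
num = ((f(x + h, y) - f(x - h, y)) / (2 * h),
       (f(x, y + h) - f(x, y - h)) / (2 * h))
ana = grad_f(x, y)
print(num, ana)  # the two pairs agree to several decimal places
```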

Note the following:

• $\nabla f$ points in the direction of greatest increase of the function.
• $-\nabla f$ points in the direction of greatest decrease of the function.

The number of dimensions in the vector is equal to the number of variables in the formula for $f$; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function $f(x, y)$:

$f(x,y) = 4 + (x - 2)^2 + 2y^2$

when viewed in three dimensions with $z = f(x,y)$ looks like a valley with a minimum at $(2, 0, 4)$:
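The valley function above makes a convenient test case for gradient descent. Its gradient is $(2(x-2),\, 4y)$, so repeatedly stepping against the gradient from an arbitrary starting point converges to the minimum at $(2, 0)$. The starting point and learning rate below are arbitrary choices:

```python
# Gradient descent on f(x, y) = 4 + (x - 2)^2 + 2y^2.
def grad(x, y):
    return (2 * (x - 2), 4 * y)

x, y = -1.0, 3.0   # arbitrary starting point
lr = 0.1           # learning rate (fraction of the gradient per step)
for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy

print(round(x, 4), round(y, 4))  # → 2.0 0.0
```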

To determine the next point along the loss function curve, the gradient descent algorithm steps from the starting point in the direction of the negative gradient, moving by some fraction of the gradient's magnitude, as shown in the following figure:

Figure 5. A gradient step moves us to the next point on the loss curve.

Figure 6. Learning rate is too small.

Figure 7. Learning rate is too large.

Figure 8. Learning rate is just right.

The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).

The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives).
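The one-dimensional claim is easy to verify on a quadratic, where the second derivative is constant. For the hypothetical function $f(x) = (x - 3)^2$, $f''(x) = 2$, so the ideal learning rate $\frac{1}{f''(x)} = 0.5$ reaches the minimum in a single step:

```python
# f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2.
def f_prime(x):
    return 2 * (x - 3)

x = 10.0
lr = 0.5  # the ideal learning rate: 1 / f''(x)
x = x - lr * f_prime(x)
print(x)  # → 3.0, the minimum, reached in one step
```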

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
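A minimal sketch of mini-batch SGD for a one-variable linear model, using a hypothetical synthetic dataset: each step estimates the MSE gradient from a small random sample rather than the full dataset.

```python
import random

random.seed(0)
# Synthetic data with a known true weight of 2.0 (no bias term).
data = [(float(x), 2.0 * x) for x in range(1000)]

w = 0.0
lr = 1e-6
batch_size = 32  # within the typical 10-to-1,000 range
for step in range(2000):
    # Each mini-batch is a small random sample of the full dataset.
    batch = random.sample(data, batch_size)
    # Gradient of the mean squared error wrt w over the mini-batch.
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * g

print(round(w, 2))  # converges close to the true weight, 2.0
```

Each step touches only 32 of the 1,000 examples, which is why mini-batch SGD is far cheaper per step than full-batch gradient descent while being less noisy than single-example SGD.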

Introducing TensorFlow

The following figure shows the current hierarchy of TensorFlow toolkits:

| Toolkit(s) | Description |
| --- | --- |
| Estimator (tf.estimator) | High-level, OOP API. |
| tf.layers / tf.losses / tf.metrics | Libraries for common model components. |
| TensorFlow | Lower-level APIs. |