# Deep Learning, Heteroscedasticity, and TensorFlow Probability It is very often the case that Machine Learning models calculate point estimates of quantities of interest; a typical example would be: given the number of rooms, size, and location of a house, they predict that it costs $$x$$ Euros.

In order to make real-world decisions based on these models, however, it is almost mandatory to quantify the confidence we have in their predictions; this is especially true in business scenarios, where large deviations from a prediction could cost an enormous amount of money. In some cases we might even be interested in learning not just expectation values, but the entire probability distribution of a random variable.

In this short post, I will show how easy it is to combine Deep Learning and Probabilistic Modeling with an awesome new module for TensorFlow, called TensorFlow Probability.

# Learning the parameters of a distribution

Consider the following artificially generated dataset: and imagine your task is to predict $$Y$$ given the feature $$X$$. Probably the first approach one would try is linear regression, which would result in the following predictions (shown in red): From this plot, it is clear that the predictions for $$X \approx 0$$ are “closer to the truth” with respect to the ones for $$X$$ away from $$0$$; the uncertainty we observe in this case is intrinsic in the data and independent from our model. In particular, since the variance of $$Y$$ varies with $$X$$, our data exhibits heteroscedasticity, which can be quite a hassle if we want to quantify the errors we expect to make in our predictions.

Let’s start with the code I used to perform linear regression with TensorFlow:

1
2
3
4
5
6
7
8
9
10
11
12
13
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

model = tf.keras.Sequential([
tf.keras.layers.Dense(1)
])

loss='mean_squared_error')

model.fit(X, Y, epochs=40);


Although for now this code is unnecessarily complicated, it is equivalent to linear regression: I defined a model with only a single perceptron tf.keras.layers.Dense(1), which by default has a linear activation function and can take bias into account (that is, for $$X = 0$$ the model will not necessarily predict $$Y = 0$$).

Let’s try now to make the model output not just a number, but a distribution; for the sake of definiteness, let’s try to learn the “best” normal distribution which models the random variable $$Y$$ for each value of $$X$$. In order to do so, notice that we cannot use anymore the mean squared error as a loss function, since it is not a meaningful error metric in this case. We will, instead, use the negative logarithm of likelihood between the distributions given by our model vs the observed data; in this way, training our model becomes essentially a maximum likelihood estimation of parameters!

Notice also that for each value of $$X$$ the normal distribution we want our model to learn is parametrized by two numbers: mean and standard deviation. Therefore, we should increase the number of neurons in the dense layer from one to two, so that each perceptron will learn the functional dependency on $$X$$ of one of the two parameters, and then push its output into the distribution.

The way to implement these changes in TensorFlow Probability is very nice: we can use a tfp.layers.DistributionLambda layer which works in pretty much the same way as a “standard” Keras layer; in its argument, we can plug a lambda function which takes parameters from the previous layers of the network and returns a tfp.Distribution:

1
2
3
4
5
6
7
8
9
10
11
12
model = tf.keras.Sequential([
tf.keras.layers.Dense(2),
tfp.layers.DistributionLambda(lambda t:
tfd.Normal(loc   = t[...,:1],
scale = tf.math.softplus(0.005*t[...,1:])+0.001)
)
])

negloglik = lambda y, p_y: -p_y.log_prob(y)

loss=negloglik)


This code, a minimal alteration from the linear regression case, is able to learn both the mean value and the variance of $$Y$$ given $$X$$. After training, let’s plot what this model has learnt: Excellent! The green lines are set 2 standard deviations above and below the mean (here with “standard deviation” and “mean” I intend the values of these parameters which our model learnt). They seem to somewhat characterize our confidence in the values of $$Y$$; can we improve on this?

Since we used only a single dense layer with linear activations before the tfp.layers.DistributionLambda, both the estimated mean and estimated standard deviation can only depend linearly on $$X$$, and this leads to an underestimation of error in some regions and an overestimation in others. By using a deeper neural network and introducing nonlinear activation functions, however, we can learn more complicated functional dependencies!

We can achieve this with just the addition of another dense layer tf.keras.layers.Dense(20,activation="relu") before the layer with two perceptrons, i.e., by defining the model as:

1
2
3
4
5
6
7
8
model = tf.keras.Sequential([
tf.keras.layers.Dense(20,activation="relu"),
tf.keras.layers.Dense(2),
tfp.layers.DistributionLambda(lambda t:
tfd.Normal(loc   = t[...,:1],
scale=tf.math.softplus(0.005*t[...,1:])+0.001)
)
])


Finally, after fitting the data, we can obtain the following (much nicer) plot: This time, we can see how the confidence interval given by the two green lines accurately assesses the variability of $$Y$$ given $$X$$!

# Conclusions

This was just one of the many incredible things made possible by TensorFlow Probability in very few lines of code; I plan to show many more in the future where I will use it for the study of time series.

In any case, I think that TensorFlow Probability really helps in bridging the gap between probabilistic methods and neural networks, something I am extremely interested in. Thus, I have the feeling this will hardly be the last time I write about it. :-)

# Appendix

The dataset I analyzed was also generated by TensorFlow Probability via a pretty cool feature which is present only in its nightly build (as of the time of writing); the method tfp.distributions.JointDistributionSequential.

It allows to build joint probability distributions starting from elementary (and possibly interdependent) ones. The code I wrote is the following:

1
2
3
4
5
6
tfd = tfp.distributions

joint = tfd.JointDistributionSequential([
tfd.Uniform(low=-8, high=15),
lambda x : tfd.Normal(loc=x, scale=abs(x)+3)
])


In order to get samples from this distribution, one needs just to run the line X, Y = joint.sample(2000); in my view this is a very elegant way to work with distributions in Python. Moreover, as we have seen above, tfp.Distribution objects can be integrated seamlessly and in a modular way with deep learning models.