Verifying a Data-Driven System - Linear Regression

home writeups

Introduction

Previously, I described the challenges presented while verifying a data-driven system. Here, I briefly illustrate those challenges using the example of simple linear regression via ordinary least squares.

Regression

In the language used in the previous post, regression analysis is a "compiler" from an input language (e.g. data, here sets where the elements are in \(\mathbb{R}^2\)) to an output language (the regression function). The output language has obvious semantics, since we can just interpret it as a function \(\mathbb{R} \to \mathbb{R}\) mapping x-values to (estimates of) y-values. However, regression is not a "simple" compiler, as it is data-driven: there is no such semantics for the input language. Given just the data, there is no intrinsic/freely-generated total function that associates x-values with (estimates of) y-values. We do have such a function (just by interpreting the data as a mapping \(\mathbb{R} \to \mathbb{R}\)), but that function is not total, since there are x-values that do not occur in the data (necessarily, since there are continuum-many such values). This makes regression analysis remarkably similar to many more complex data-driven systems we would like to study, since many systems have partial semantics for the input language (e.g. the image recognition example in the previous post). Linear regression is therefore a good "test", in that approaches that work here will likely generalize to more sophisticated systems.

What can be done with a partial semantics?

Even though the semantics generated freely from the data is partial if the domain is \(\mathbb{R}\), there is nothing stopping us from considering things from another perspective. Particularly, as with all partial functions, we can restrict the domain to the subset of \(\mathbb{R}\) that occur as x-values in the data. By similarly restricting the domain of the semantics generated from the output language, we suddenly have two total functions with the same domain and codomain! With this, we could verify the system by proving that the semantics of the data matches the semantics of the regression function generated from the data via least squares, where "matches" denotes some equivalence relation (probably equality within some margin of error determined by the variance of y-values). However, the catch is that by restricting the domain in this way, we are also limiting the generality of our proof: we are only proving that the generated model "works" for the exact x-values that occur in the input data. This can still be useful in many cases, as we can carefully choose an interesting equivalence relation to operate under.

Generating total semantics

Rather than restricting the domain, another approach is to supplement the partial semantics for the input data with additionaly mappings in order to make them total. For example, given some input data, which generates function pointwise, we might "stretch" each y-value to associate with x-values surrounding the corresponding x-value, generating a piecewise. Of course, this is an illusory sort of verification: what's really happening is that we've constructed a second "compiler" from the input language to a second output language, and then proving that this new compiler is consistent with the original. This is also useful in many cases: consider that we might have one model that is accurate but computationally expensive, and another model that is less accurate but very fast.

Other definitions of correctness

A third way to "verify" a data driven model is to change our notion of correctness. Verifying correctness is only feasible if we have semantics for both the input and the output. However, there are many possible semantics! In the example of linear regression, let's consider a new "semantics" for input data that associates a set of ordered pairs with their center-of-mass point \((\bar{x}, \bar{y})\). Then, we can work under a relation that identifies lines with all points that occur on those lines This allows us to prove arbitrary properties (in this example that the center-of-mass point always occurs on the regression line) without leaving our framework for verification.

Author: Samuel Breese

Created: 2019-03-09 Sat 15:45

Validate