# Approaches to Data-Driven Verification

## Introduction

It is desirable to verify (that is, to prove the correctness of) many systems. Here, I explain some of the techniques and terminology involved in verification. No proof assistant is used, but all mathematics are done in a way that it could trivially be translated to any system based on constructive type theory.

## What does it mean to verify a compiler?

Let's treat a compiler like a mapping $$c : A \to B$$ from a source language $$A$$ to a target language $$B$$. Both the language $$A$$ and the language $$B$$ have some semantics, and we'd like to verify that the A-semantics of a given program $$a : A$$ match the B-semantics of the translated program $$c(a) : B$$. We can model those semantics denotationally by considering mappings $$d_A : A \to (I_A \to O_A)$$ and $$d_B : B \to (I_B \to O_B)$$ from programs in the respective languages to changes over some language-specific description of input and output state. We also need some way to demonstrate that the input and output states for different languages correspond with each other; for our purposes, bijections $$c_i : I_A \to I_B$$ and $$c_o : O_A \to O_B$$ will serve. Thus, we want to show that for any input, $$\forall a : A. \forall i : I_A. c_o(d_A(a)(i)) \equiv d_B(c(a))(c_i(i))$$ (under some domain-specific equivalence relation). More succinctly, we want to show that commutes (up to isomorphism).

The tactic above is sufficiently general to model the semantics for practically any language. For example, the semantics $$d_\lambda$$ of the untyped λ-calculus might associate a λ-expression $$E$$ with $$n$$ free variables with a function from $$n$$ λ-expressions to a single λ-expression representing the reduction of $$E$$ with the free variables replaced. The semantics $$d_C$$ of the C programming language would represent a mapping from a C program into a complex function between machine states (memory, filesystem, etc.). Notice that as the language gains more power to mutate arbitrary state in uncontrolled ways, the complexity of this mapping, and therefore the overall complexity of verifying the compiler's correctness, increases severely. On this note, it is sometimes fruitful to model only a subset of a language, and provide limited or no safety guarantees for parts of the language outside of this subset. This happens frequently in the context of practical verification. The most prevalent form of static program analysis for imperative programming languages like C is the type system, which guarantees that terms are used in the proper contexts. However, C allows one to leave the safety envelope of the type system by casting a term of any type to any other type, thereby leaving the subset of C "ensured" by the typechecker. When modeling systems that do not carefully manage state, we can expect this issue to arise again and again.

## A simple compiler

With this in mind, let's return to our model for compiler verification, and build a simple example. For our source language $$S$$, let's choose a simple language for doing arithmetic in postfix ("reverse Polish notation") over natural numbers and variables. For simplicity, we'll support only addition and multiplication as operations. A simple expression might look like $$1\ x + 2\ \times$$, which in English means "double the result of adding one to $$x$$".

We'd like to translate expressions in this postfix representation to more-readable infix notation. Continuing the example above, our compiler $$c : S \to T$$ from the source language to our target language $$T$$ would translate $$1\ x + 2\ \times$$ to $$((1 + x) \times 2)$$. There are a few interesting things we can notice about $$c$$ right away. First, $$c$$ doesn't need to be injective: since addition and multiplication are commutative (a property that will eventually manifest itself in our semantics $$d_S$$ and $$d_T$$), there are many possible inputs could yield a particular output under some choice of $$c$$. For example, we might design $$c$$ such that $$1\ x + 2\ \times$$ to and $$2\ x\ 1 + \times$$ both translate to $$((1 + x) \times 2)$$.

Next, $$c$$ doesn't need to be surjective, either. We don't care about translating programs in $$T$$ to $$S$$, so it's fine if there are programs $$t$$ in $$T$$ that do not have a corresponding $$c^{-1}(t) : S$$. This is related to the previous discussion about partially verifying a language - we can profitably define a compiler from a source language to a subset of a target language with easier-to-define semantics. Here, our target language can be anything that has predictable semantics for infix arithmetic expressions with variables and numeric constants. Of course, in many languages, the actual semantics might not be so predictable: in C and C++, integers that are too large experience overflow, introducing complexity and violating some algebraic properties that we would otherwise like to rely upon in our proofs. For simplicity, we will not consider such complexity here, but it is good to know that it is a viable choice to use a subset of a more complex system as a target language.

In this case, our function $$c$$ is quite simple (where $$n : \mathbb{N}$$, $$x : \mathrm{Var}$$ is an arbitrary variable name, and $$s_1, s_2 : S$$):

\begin{align*} n &\mapsto n \\ x &\mapsto x \\ s_1\ s_2\ + &\mapsto (c(s_1) + c(s_2)) \\ s_1\ s_2\ \times &\mapsto (c(s_1) \times c(s_2)) \end{align*}

Note that the above is not a semantics for $$S$$ but merely a syntactic mapping from terms in $$S$$ to terms in $$T$$.

## Semantics

Now that we've specified $$c$$, we can move on to specifying the semantics of terms in both $$S$$ and $$T$$. We start with $$d_S : S \to (I_S \to O_S)$$, the semantics of $$S$$. The domain $$I_S$$ will be associations of variable names with bindings, represented as sets of Cartesian products $$\mathrm{Var} \times \mathbb{N}$$. The codomain $$O_S$$ will be natural numbers, representing the evaluation of the arithmetic expression under those variable substitutions. The semantics $$d_S$$ is then specified fully by the following rules (where $$\Gamma : I_S$$): Similarly, we can define $$d_T : T \to (I_T \to O_T)$$, the semantics of $$T$$. $$I_T$$ and $$O_T$$ will be the same as $$I_S$$ and $$O_S$$, respectively: we're still just mapping variable bindings to results. Formally, the model of semantics is a little more complicated, as we need to include some sort of way to indicate that an input does not have a corresponding output. However, this is mostly just logisitics (by making $$O_S$$ and $$O_T$$ the coproduct of $$\mathbb{N}$$ and the unit type and properly managing this construction) and has therefore been elided here.

## Verification

Now that we have a model for the compiler ($$c$$) and semantics for expressions in the source ($$d_S$$) and target ($$d_T$$) languages, we can verify that the behavior of the compiler matches our expectations. This is almost always easiest to do using structural induction over the source language $$S$$. Recall that a term $$s : S$$ can be constructed in exactly four ways:

1. from a natural number $$n$$,
2. from a variable name $$x$$,
3. given two subterms $$s_1, s_2 : S$$, from those two subterms followed by the symbol $$+$$, and
4. given two subterms $$s_1, s_2 : S$$, from those two subterms followed by the symbol $$\times$$.

There are two "base cases" (construction from a natural number and construction from a variable name) and two inductive cases. It is easy to prove that $$d_S(x) = d_T(c(x))$$ and $$d_S(n) = d_T(c(n))$$ for all $$x : \mathrm{Var}$$ and $$n : \mathbb{N}$$, since $$c$$ leaves those terms unchanged and the related semantic rules are identical for both languages.

That leaves the inductive cases. Assuming that $$d_S(s_1) = d_T(c(s_1))$$ and $$d_S(s_2) = d_T(c(s_2))$$, we must show that $$d_S(s_1\ s_2\ +) = d_T(c(s_1\ s_2\ +))$$ (and similarly for $$\times$$). We know that $$d_S(s_1\ s_2\ +) = d_S(s_1) + d_S(s_2)$$, so this is equivalent to showing that $$d_S(s_1) + d_S(s_2) = d_T(c(s_1\ s_2\ +))$$. We also know that $$c(s_1\ s_2\ +) = (c(s_1) + c(s_2))$$, so this in turn is equivalent to showing that $$d_S(s_1) + d_S(s_2) = d_T(c(s_1) + c(s_2))$$. Since $$d_T(c(s_1) + c(s_2)) = d_T(c(s_1)) + d_T(c(s_2))$$, this is equivalent to $$d_S(s_1) + d_S(s_2) = d_T(c(s_1)) + d_T(c(s_2))$$, which follows by reflexivity from the induction hypothesis.

Applying a similar method for $$\times$$ finishes the proof. We are now sure that our compiler $$c$$ preserves the meaning of a program in the source language $$S$$ when translated to the target language $$T$$.

## Introducing data

A data-driven model is, in many ways, much like a compiler. Consider a machine learning system that, trained upon $$n$$ images $$d_i : D,\ i \in [0, n)$$, builds a predictive model $$m : D \to \mathbb{B}$$ determining whether or not a single image contains a house. We would like to verify that given good training data, this machine learning system produces good models: if the training data has some properties, then the resulting models match (with some margin of error that will be included in a domain-specific equivalence relation) a human observer (a "perfect" model) determining whether an image contains a house. This is obviously infeasible in the general case! If we had a perfect model that was sufficiently rigorous that we could prove properties about it, we likely wouldn't be using these machine-learning techniques to produce inferior models.

Of course, in practice, there is more to consider: perhaps the perfect model has terrible time or space complexity, while the approximate models are efficient. Or, even if there isn't a perfect model, we could prove that the approximate models have certain input-dependent properties (but this is moving away from the goal of verification: a proof of correctness, rather than formalization of what are probably theorems in the relevant domain). In the end, in order to prove that anything is correct, we need some way to determine what correctness is, at an appropriate level of formality. In compilers, we have an easy way to assess correctness, since we have semantics for the input and output languages (ideally). In data-driven systems, the waters are murkier: in essence, we lack semantics for the source language (in the machine-learning example, the training data).

## Moving forward

I think that the best approach to ensuring correctness when data-driven systems are involved is to try to maximize the size of the "safety envelope". I don't think verification of most data-driven components is feasible or indeed coherent (although we may be able to prove consistency with e.g. a statistical model, etc.). However, it is unlikely that any system involved is entirely data-driven. Any determistic components (anything we can formalize using the broad model of a compiler described above) should be verified. I expect that this can be done in many unexpected places by considering interesting definitions of the semantics (e.g. maybe associating probability distribution) and equivalence. From there, we use formalized properties of the data-driven components (distinct from verifying correctness) to show that they interact favorably with the other (verified) components. This method has the advantage of allowing a modular sort of proof development: I can proceed incrementally, verifying components and proving properties as I go. At each step, we having a working system, and new development progressively widens the safety envelope.

Created: 2019-03-09 Sat 15:45

Validate