Independently updatable variables

Let information be recorded in a set of variables, and let there be declared integrity constraints on these variables. We say a variable is independently updatable if the set of possible values of that variable (as determined by the constraints) is independent of the values of the other variables.
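One way to make this precise (a sketch in notation of our own choosing, not from the original): suppose the constraint has extension `C ⊆ T1 ⨯ … ⨯ Tn`, i.e. `C` is the set of valid combined states of variables `v1, …, vn`. Then `vi` is independently updatable exactly when `C` factorises as

`C = π_{vi}(C) ⨯ π_W(C)` where `W = {v1, …, vn} ∖ {vi}`

and `π` denotes projection onto the given variables.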

For the purpose of investigating independently updatable variables we are not interested in alternative ways of changing a variable (e.g. UPDATE versus INSERT versus DELETE on a relvar). Instead we go down to a finer-grained level so that we can talk about variables which can be assigned any value of their type.

In other words we tend (where we can) to uphold a principle:

To be an updatable variable of type T is to support assignment of any value of type T.

or

To be an updatable variable is to be an independently updatable variable.

If a variable is not independently updatable we tend to ignore its availability as a denoted target for update operations.

Example

For example suppose relation R represents the extension of the integrity constraint on variables var1, var2, var3. This is the set of possible (i.e. valid) states of variables var1, var2, var3.

It can be proven that every non-empty relation has a unique prime Cartesian factorisation. In this case R = R1 ⨯ R2, where R1 and R2 are the prime factors of R.

R:

| var1 | var2 | var3 |
|------|------|------|
| a    | 1    | x    |
| a    | 2    | x    |
| b    | 1    | x    |
| a    | 1    | y    |
| a    | 2    | y    |
| b    | 1    | y    |

R1:

| var1 | var2 |
|------|------|
| a    | 1    |
| a    | 2    |
| b    | 1    |

R2:

| var3 |
|------|
| x    |
| y    |

The unique prime Cartesian factorisation implies there is a unique maximal partition of the variables into groups which can be updated independently. This partition is uniquely determined by the constraints. Complicated constraints mean fewer groups and less opportunity for updating variables independently.

In this case {var1,var2,var3} is partitioned into two sets: {var1,var2} and {var3}.
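To make this concrete, here is a small Python sketch (our own illustration, not part of any DBMS) that computes the prime Cartesian factorisation of a finite relation by brute force. It uses the fact that a relation is always a subset of the Cartesian product of its projections onto complementary attribute sets, so `R = π_S(R) ⨯ π_{A∖S}(R)` holds exactly when the cardinalities multiply out. Recursing on any successful binary split is justified by the unique-PCF theorem, which guarantees that every split is a union of prime factors.

```python
from itertools import combinations

def project(rel, attrs, sub):
    """Project a relation (set of tuples over `attrs`) onto the attributes in `sub`."""
    idx = [attrs.index(a) for a in sub]
    return {tuple(t[i] for i in idx) for t in rel}

def is_split(rel, attrs, left, right):
    """R = π_left(R) ⨯ π_right(R) iff |R| = |π_left(R)| * |π_right(R)|,
    since R is always a subset of that product."""
    return len(rel) == len(project(rel, attrs, left)) * len(project(rel, attrs, right))

def prime_factors(rel, attrs):
    """Finest partition of `attrs` such that `rel` is the Cartesian product
    of its projections onto the parts (the prime Cartesian factorisation)."""
    if len(attrs) <= 1:
        return [attrs]
    first, rest = attrs[0], attrs[1:]
    for r in range(len(rest)):            # all binary splits with `first` on the left
        for combo in combinations(rest, r):
            left = [first] + list(combo)
            right = [a for a in rest if a not in combo]
            if is_split(rel, attrs, left, right):
                return (prime_factors(project(rel, attrs, left), left) +
                        prime_factors(project(rel, attrs, right), right))
    return [attrs]                        # no proper split exists: rel is prime

# The running example: R factors as R1 ⨯ R2.
R = {('a', 1, 'x'), ('a', 2, 'x'), ('b', 1, 'x'),
     ('a', 1, 'y'), ('a', 2, 'y'), ('b', 1, 'y')}
print(prime_factors(R, ['var1', 'var2', 'var3']))   # [['var1', 'var2'], ['var3']]
```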

The prime Cartesian factorisation of the extension of a constraint

There is a unique prime Cartesian factorisation (PCF) of any non-empty relation. In the following we are interested in the PCF of the extension of a possrep constraint.

There is a unique PCF of the extension of a given possrep constraint, the effect being to partition the possrep components so that the components within one factor can be updated independently of the other factors - i.e. there are no constraint dependencies between factors.
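Stated a little more formally (in notation of our own choosing): for any non-empty relation `R` with attribute set `A`, there is a unique finest partition `{A1, …, Ak}` of `A` such that

`R = π_A1(R) ⨯ π_A2(R) ⨯ … ⨯ π_Ak(R)`

and the projections `π_Ai(R)` are the prime factors of `R`.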

Note that a unique PCF exists for both finite and infinite non-empty relations, so it is applicable, for example, to constraints in computational geometry involving infinite sets in Euclidean space. Of course that involves intensional definitions of sets.

For example it is easy to see that we can factorise

`{ (x,y,z) ∈ ℝ^3 | x^2+y^2=1 ∧ z>0 }`

as

`{ (x,y) ∈ ℝ^2 | x^2+y^2=1 } ⨯ { (z) ∈ ℝ | z>0 }`

The way in which constraints can be expressed is open ended. In the context of mathematical computing, and data structures and algorithms in general (not just the trivial finite relations), it would be interesting to know whether, if one splits a representation into two parts with no mutual constraints, it can be assumed that this particular division is compatible with the maximal way of splitting the representation into many parts with no mutual constraints. Intuitively it seems plausible, but can it be proven? That is the purpose of the unique PCF theorem.

The unique PCF theorem (which is an existence & uniqueness proof) is what allows one to give a formal definition of the unique maximal partition of variables into groups which can be updated independently. It therefore provides an objective metric for selecting between alternative representations.

It shows that constraints on a tuple, no matter how complex, can in general be written uniquely in a normalised form as a conjunction in which the extension of each conjunct is a prime Cartesian factor of the extension of the entire constraint. Here we are only talking about a normal form modulo how the individual conjuncts are expressed. This normal form sounds vaguely like Conjunctive Normal Form (CNF) but it is not the same thing.
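For instance, in the earlier example the constraint on (var1, var2, var3) can be written in this normal form as the conjunction `C1(var1, var2) ∧ C2(var3)`, where the extension of `C1` is the prime factor R1 and the extension of `C2` is the prime factor R2. Likewise the geometric constraint above, `x^2+y^2=1 ∧ z>0`, is already in the normal form: the extensions of its two conjuncts are exactly the two prime factors.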

It is conjectured that the normal form is very important to high-performance DBMS implementation techniques (it relates to asynchronous updates, locking and concurrency, replication and synchronisation, branching and merging, constraint checking, ...).

We shall abuse terminology and talk about the information recorded in a prime factor, which really means the information recorded by the attributes in that factor. Some general statements can be made in this regard. For example, prime factors relate to units of cohesion: there is some sense in which information within a prime factor is mutually interdependent and indivisible, whereas between different prime factors the information is more independent, which provides a basis for supporting independent and concurrent data entry by different users. That makes prime factors relevant to data representation independence and updatable views. If a representation provides many prime factors then there are expected to be more opportunities for choosing where and what information is visible and/or updatable. This translates into better support for data representation independence.

Prime factors are also relevant to locking and concurrency, and to the granularity of atomic updates. For example, if we define atomic updates to be the most primitive updates that are compatible with integrity preservation, then clearly that idea is connected to prime factors. It follows that prime factors might be relevant to the scalability of databases. If a very large database doesn't factor into many small prime factors then atomic updates might be complex, take longer to execute, be more difficult to validate, and concurrency might be limited. This is particularly relevant to distributed databases - e.g. it might be preferable for prime factors not to be distributed over a network, to avoid the need for distributed atomic commit protocols.

If prime factors are indeed an important consideration, it's nice to know that they are defined uniquely and independently of the way integrity constraints happen to be expressed in the schema (e.g. as a uniqueness constraint, functional dependency, foreign key constraint, or some boolean-valued expression making use of the relational algebra). Prime factors provide some basis for comparing alternative data representations.

The prime factors on the database tuple of a relational database represent a fairly coarse partition of a large schema into independent parts, and therefore appear to have only limited utility.

However, later in this paper we discuss the idea that a database can typically be factored into a much larger set of conditionally existent prime factors.

If prime factors are small enough, then it is possible that a system with MVCC plus one write mutex per factor has advantages over conventional strict 2PL lock managers with deadlock detection etc. What is the point in burdening the lock manager with support for concurrent updates within a factor, given that the values within a factor have a bearing on each other as far as constraint checking is concerned?
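As a rough sketch of the idea (hypothetical names and API, not a real DBMS design): one write mutex per prime factor, where an update confined to a single factor takes only that factor's mutex, and readers are assumed to run against MVCC snapshots and take no locks at all.

```python
import threading

class FactorLockManager:
    """Hypothetical sketch: one write mutex per prime factor.

    `factors` is the partition of attributes/variables into prime factors;
    updates confined to one factor contend only on that factor's mutex.
    Readers are assumed to use MVCC snapshots and take no locks.
    """
    def __init__(self, factors):
        # factors: iterable of attribute groups, e.g. [('var1', 'var2'), ('var3',)]
        self._locks = {frozenset(f): threading.Lock() for f in factors}
        self._factor_of = {a: frozenset(f) for f in factors for a in f}

    def write(self, attrs, update):
        """Apply `update` atomically; only legal if all attrs lie in one factor."""
        touched = {self._factor_of[a] for a in attrs}
        if len(touched) != 1:
            raise ValueError("update spans prime factors; needs a coarser protocol")
        with self._locks[touched.pop()]:
            update()  # constraint checking can be confined to this one factor

# Example: the two prime factors from the running example.
mgr = FactorLockManager([('var1', 'var2'), ('var3',)])
mgr.write(['var1', 'var2'], lambda: print("update {var1,var2} under its own mutex"))
mgr.write(['var3'], lambda: print("update {var3} under a separate mutex"))
```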

todo

The effect is to partition the attributes of the relation into non-empty subsets; projecting the relation onto each subset gives the factors. Note that the motivation is not to factorise the relations recorded in a database! Instead, the intended application is the prime factorisation of the relation whose tuples are the valid states, i.e. the set of tuples satisfying the constraint imposed on a representation of a value of a given type using a tuple. The idea is to define mathematically what exactly it means for relvars to be "connected" given the database constraints, in the sense of partitioning them into groups that can be updated independently. The constraints on the dbvar determine the set of dbvalues which satisfy the constraints. This is a set of tuples; in fact it is a non-empty relation, which has a unique prime Cartesian factorisation. This gives the partition of the relvars into mutually "connected" groups.