Two schemas are information equivalent if they represent the same information.
Let `W` denote the set of possible world situations, and let `S` be a relational database schema.
Let `d_S(w)` denote the database value for schema `S` in world situation `w∈W`.
Let `D(S)` = `{ d_S(w) | w∈W }` denote the set of possible database values with schema `S`.
Let `C(S)` denote the set of database values permitted by the database constraint of `S`. It is assumed that `D(S)⊆C(S)` - i.e. the constraint doesn't prevent the recording of any possible world situation. We say the database constraint of `S` has been specified maximally if `C(S) = D(S)`.
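To make these definitions concrete, here is a minimal sketch in Python. The three world situations, the string-valued database values and the constraint set are illustrative assumptions, not part of the definitions above.

```python
# Hypothetical toy universe of world situations and a schema S whose
# database values are represented here as plain strings.
worlds = ["w1", "w2", "w3"]

def d_S(w):
    """Database value recorded under schema S in world situation w."""
    return {"w1": "db_a", "w2": "db_b", "w3": "db_b"}[w]

D_S = {d_S(w) for w in worlds}     # D(S): the possible database values
C_S = {"db_a", "db_b", "db_c"}     # C(S): values permitted by the constraint

assert D_S <= C_S                  # the assumption D(S) ⊆ C(S) holds here
print(C_S == D_S)                  # False: this constraint is not specified maximally
```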
Definition: We say the information in schema `S_1` contains the information in schema `S_2`
if
`∀w,w'∈W: (d_(S_1)(w)=d_(S_1)(w')) → (d_(S_2)(w)=d_(S_2)(w'))`
Definition: We say `S_1` and `S_2` are information equivalent if each contains the information in the other.
This relation is reflexive, symmetric and transitive, and is therefore an equivalence relation; it partitions database schemas into information equivalence classes.
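Over a finite set of world situations the containment test can be checked directly. The following sketch does so; the mappings `d_S1` and `d_S2` are hypothetical examples, not derived from any particular schema.

```python
from itertools import product

def contains_information(d_s1, d_s2, worlds):
    """True if S1 contains the information in S2: whenever two worlds are
    indistinguishable under S1, they are also indistinguishable under S2."""
    return all(
        d_s2[w] == d_s2[v]
        for w, v in product(worlds, repeat=2)
        if d_s1[w] == d_s1[v]
    )

def information_equivalent(d_s1, d_s2, worlds):
    return (contains_information(d_s1, d_s2, worlds) and
            contains_information(d_s2, d_s1, worlds))

# Hypothetical example: both schemas distinguish exactly the same worlds,
# so they are information equivalent even though their values differ.
worlds = ["w1", "w2", "w3"]
d_S1 = {"w1": "a", "w2": "b", "w3": "b"}
d_S2 = {"w1": "x", "w2": "y", "w3": "y"}
print(information_equivalent(d_S1, d_S2, worlds))  # True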
The business requirements determine the information to be recorded, which in turn determines the information equivalence class from which the schema must be chosen.
For a given information equivalence class and constraint, it can be shown that there always exists a schema which achieves the maximal partition of its relvars into groups which can be updated independently. See maximal decomposition of information.
Interestingly, this maximal decomposition corresponds to the unique prime factorisation of the number of possible database values `N = |C(S)|`.
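As a rough illustration of the correspondence (the value 360 below is an arbitrary assumption): if the relvars split into groups that can be updated independently, the groups' state counts multiply together to give `N`, so the finest conceivable split corresponds to the prime factorisation of `N`.

```python
def prime_factors(n):
    """Prime factorisation of n, with repetition, by trial division."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

# If N = |C(S)| = 360, independent components multiply their state counts,
# so a maximal decomposition can have at most 6 independent components,
# their sizes being the prime factors.
print(prime_factors(360))  # [2, 2, 2, 3, 3, 5]
```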
Do we expect that in real database systems `N` tends to be a large prime number?
It would be surprising if the numbers of states defined by real-life database schemas (with maximally specified constraints) tended to be prime. It's difficult to think of a reason to expect it, given that the Prime Number Theorem [] says large primes are much less common than large composites, and given that in mathematics and computing it is generally considered a hard problem to find large prime numbers, despite number theorists having thought about the problem for a long time.
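For a sense of scale (the magnitude `10^100` is just an assumed example), the Prime Number Theorem puts the density of primes near `N` at roughly `1/ln N`:

```python
import math

# By the Prime Number Theorem the density of primes near N is about 1/ln(N),
# so a "random" value of N of this size is very unlikely to be prime.
N = 10**100
print(1 / math.log(N))  # ~0.0043, i.e. roughly 1 in 230 numbers near N is prime
```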
It seems likely that in practice, for real databases with finite types, the number of prime factors is very large despite the constraints. Perhaps this isn't apparent because of the tendency to avoid designs having Database-Valued Attributes (DVAs), which go against Codd's idea of simple attribute types.
Normalisation is concerned with identifying a schema within an equivalence class which is more convenient to use in some sense. It doesn't affect what facts can be asserted or retracted independently of other facts.
It would be interesting to know how normalisation relates to maximal decomposition as defined above.