= Bi-Temporal PostgreSQL - A PostgreSQL Specific Bi-Temporal Database Emulator --- Author:: Ara Howard mailto:ahoward@fsl.noaa.gov or mailto:howardat@colorado.edu == Introduction --- The BtPgsql module contains a set of classes used to manage bitemporal relations stored in a PostgreSQL (http://www.postgresql.org) database. == Prerequisites --- * Running PostgreSQL installation (http://www.postgresql.org/) * Ruby >= 1.6.6 installation (http://www.ruby-lang.org) * Ruby PostgreSQL binding from the RAA (http://raa.ruby-lang.org/list.rhtml?name=postgres) * Basic understanding of Relational Database Management Systems (RDBMSs) (http://techdocs.postgresql.org/college/002_referentialintegrity/index.php) - Note: this link requires libflash.so configured in your browser. * Patience! == Topics --- * DISCLAIMER - Provisions for using this software * README - A general overiew of the problem domain (This file) * INSTALL - Installation instructions * DESIGN - Design overview * BtPgsql - Details of the API * TODO - It just keeps growing * BUGS - Hopefull not growing (*Please* report any to me) * DEPENDS - A short list of them * TUTORIAL - A simple tutorial == The Bi-Temporal Problem Domain --- It is tedious to explain the theory of bitemporal databases without some real world examples and in all of the following examples we will consider a data model which records the elevation and a description of meteorlogical weather stations. Each station is assumed to be designated by a unique non-null four letter code. In a traditional RDBMS the SQL used to construct a relation (table) used to store information about these stations might look like : create table stations ( sid text, elevation integer, description text, primary key (sid) ); The stations relation would contain (in a traditional RDBMS) tuples (rows) resembling : stations ------------------------------------------------ sid | elevation | description --- | | ------------------------------------------------ | | PANC | 40 | ANCHORAGE INTL ARPT KDEN | 1656 | DENVER INTL | | === Valid Time Relations --- Valid time relations have tuples, identified by a primary key, which change over time. In otherwords the relation's primary key is *not* unique within the table but *is* unique in the table when any given point in time is considered. For example, a valid time stations relation might contain the tuples : stations ----------------------------------------------------------- sid | elevation | description | vt_begin | vt_end --- | | | | ----------------------------------------------------------- | | | | PANC | 40 | ANCHORAGE INTL ARPT | -infinity | infinity KDEN | 1600 | DENVER INTL | -infinity | 1999 KDEN | 1656 | DENVER INTL | 1999 | infinity | | | | Where _vt_begin_ (valid time begining) is an *inclusive* timestamp demarking the instant the fact became true in the real world and _vt_end_ (valid time ending) is an *exclusive* timestamp demarking the instant the fact stopped being true in the the real world. The stations relation shows that station _PANC_ sits at 40 meters above sea level and that this has been true from the beginning of time (-_infinity_), and will remain, as is best known, true forever (_infinity_). It is often the case in valid time relations that the time a fact became true is not known and so it is simply entered as true since the beginning of time. The tuples describing _KDEN_ are more interesting; they show that _KDEN_ was (in the real world) situated at an elevation of 1600 meters above sea level starting from -_infinity_ (we don't really know when), but that in 1999 it moved to 1656 meters above sea level, and has remained there since. This is shown by the fact the the _vt_end_ time is _infinity_, implying that we do not know, at present, when this fact will stop being true. Graphically, one might envision this history as : KDEN -> -infinity | | 1600 meters | 1999 | | 1656 meters | V infinity Or more compactly as : KDEN -> [-i (1600) 1999)[1999 (1656) i) Where the symbols '[' and ')' designate the respective inclusive starting and exclusive ending points of a temporal period, and (xxxx) implies the datum indicated by the primary key for that period. This notation borrows from the mathmatical notation for a range where : [0 - 42) would indicate all those numbers 0,1,2,3,...,41 but not, unfortunately, including 42 (http://www.bbc.co.uk/cult/hitchhikers/). This table would allow such queries to be made as : What was the elevation of KDEN in 1998? select elevation from stations where sid = "KDEN" and vt_begin <= "1998" and -- inclusive "1998" < vt_end; -- exclusive or : What is the elevation of KDEN now? select elevation from stations where sid = "KDEN" and vt_begin <= now() and -- inclusive now() < vt_end; -- exclusive to be easily answered with simple sql queries. Notice that for these queries to be meaningfull it is absolutly *critical* that the following situation *never* arise : stations ----------------------------------------------------------- sid | elevation | description | vt_begin | vt_end --- | | | | ----------------------------------------------------------- | | | | KDEN | 1600 | DENVER INTL | -infinity | 1999 KDEN | 1656 | DENVER INTL | 1980 | infinity | | | | Notice that in the above situation _TWO_ tuples would be associated with the primary key _KDEN_ for the period [1990 - 1999), implying that it was simoultaneously at an elevation of 1600 and 1656 meters above sea level! This violates the semantic notion of a primary key - that it uniquely identifies a tuple and is non-null. For valid time relations is it necesary to ammend this definition to read thus : The primary key of a valid time relation is non-null and uniquely identifies a tuple given a specific moment in time. Later we will see how the BtPgsql::Relation class is able to interpret and enforce this meaning on temporal relations. We now begin to see why the normal RDBMS referential integrity constructs (like primary key) are insufficient to handle valid time relations Further complicating matters is the fact that valid time relations make possible the notion of updating certain rows over a range of time. For example, imagine that it was discoved today that the elevation of _KDEN_ had actually been 1597 meters above sea level during the period lasting from 1999 to 2000 while the station was operating at a temporary location before it's move the present location which sits at 1656 meters above sea level. Such information can be modeled in a valid time relation thus : stations ----------------------------------------------------------- sid | elevation | description | vt_begin | vt_end --- | | | | ----------------------------------------------------------- | | | | PANC | 40 | ANCHORAGE INTL ARPT | -infinity | infinity KDEN | 1600 | DENVER INTL | -infinity | 1999 KDEN | 1597 | DENVER INTL | 1999 | 2000 KDEN | 1656 | DENVER INTL | 2000 | infinity | | | | Affecting this update is another matter entirely! These types operation can certainly be done via standard SQL, but each update must be hand written - an error prone and tedious task. We will see later that a Ruby BtPgsql::Relation can handle the details of this update automatically. One last concept, and major source of complexity, with valid time relations is the concept of valid time foreign keys. In a non-temporal RDBMS the concept of a foreign key referes to a field in one relation which refers to a primary key field in another (possibly the same) relation. Thus, by extension a foreign key, like a primary key, uniquely identifies exactly one tuple, though in another table. Foreign key relationships between two valid time relations have a slightly different definition : A valid time foreign key must identify a set of tuples in the referenced relation whose valid time periods completely _cover_ that of the referencing tuple That's a somewhat confusing definition. This example should clarify. Assume a valid time stations relation : stations ----------------------------------------------------------- sid | elevation | description | vt_begin | vt_end --- | | | | ----------------------------------------------------------- | | | | KDEN | 1600 | DENVER INTL | -infinity | 2000 KDEN | 1656 | DENVER INTL | 2000 | infinity | | | | where _sid_ is the valid time primary key. Further, assume a station_managers relation : station_managers ----------------------------------------------------------- mid | sid | name | vt_begin | vt_end --- | ^^^ | | | ----------------------------------------------------------- | | | | 42 | KDEN | Ford Prefect | -infinity | infinity | | | | where _mid_ is the valid time primary key and _sid_ is a valid time foreign key referencing stations. This is a valid relationship because, at any given moment, the foreign key _sid_ refers to exactly one tuple in stations. Note that this foreign key identifies a *set* of tuples spanning the entire timeline of the referencing tuple. Personally, I like to visualize this as in : station_managers 42, KDEN -> [-infinity -- infinity) | | | IS COVERED BY | stations | | | | KDEN -> [-infinity -- 2000)[2000 -- infinity) But consider if stations was instead : stations ----------------------------------------------------------- sid | elevation | description | vt_begin | vt_end --- | | | | ----------------------------------------------------------- | | | | KDEN | 1600 | DENVER INTL | -infinity | 1990 KDEN | 1656 | DENVER INTL | 2000 | infinity | | | | And station_managers was, as before : station_managers ----------------------------------------------------------- mid | sid | name | vt_begin | vt_end --- | ^^^ | | | ----------------------------------------------------------- 42 | KDEN | Ford Prefect | -infinity | infinity Note the vt_end of 1990 in the first tuple of stations indicating that KDEN did not exist during the period [1990 - 2000)! One could visualize this as : station_managers 42, KDEN -> [-infinity -- infinity) | | | NOT COVERED BY | stations | ^^^ | | | KDEN -> [-infinity -- 1990) [2000 -- infinity) ^^^^ Notice the *gap*! This gap shows an error in the valid time referential model - namely that Ford Prefect was the manager of a non-existent station during the period [1990 - 2000). This is a type of referential itegrity not easily managable within the confines of ANSI-92 SQL. We will see later how a BtPgsql::Relation can prevent this sort of referential itegrity anomoly from occuring. === Transaction Time Relations --- Transaction time relations are similar to valid time relations except that they track the state of the actual relations, not that of the real world. These are sometimes known as audit logs as these types of relations can be used to track the history of a table's inserts, updates, and deletions and can, therefore, be queried as to the particular state of a table at some instant in the past. Their structure is also similar to valid time relations, for example : stations ----------------------------------------------------------- sid | elevation | description | tt_start | tt_stop --- | | | | ----------------------------------------------------------- | | | | PANC | 40 | ANCHORAGE INTL ARPT | 1970 | infinity KDEN | 1600 | DENVER INTL | 1970 | 1999 KDEN | 1656 | DENVER INTL | 2000 | infinity | | | | Shows that tuple identified by station _PANC_, at 40 meters above sea level, was *inserted* at the *inclusive* transaction start time (_tt_start_) of 1970, and has never been deleted, shown by an *exclusive* transaction stop time (_tt_stop_) of infinity. The transaction time relation _stations_ also shows that the tuple identified by station _KDEN_ was initially inserted into the stations relation in 1970, but was then virtually deleted from the relation in 1999 and then reinserted into stations 2000 where it has remained since. Note that ther are no actual deletions done from transaction time relations! It is this property which allows one to *_reconstruct_* a relation from a prior period using a simple SQL statement such as : select * from stations where tt_start <= 1975 and -- inclusive 1975 < tt_stop; -- exclusive which would yield the relation as it existed in 1975. Using the graphical notation from above we can show this transaction time relation with respect to _KDEN_ as : KDEN -> 1970 | | 1600 meters | 1999 !! NO DATA HERE !! 2000 | | 1656 meters | V or more compactly as : KDEN -> [1970 (1600) 1999) [2000 (1656) i) ^^^^ Important observations regarding transaction time relations are * they model the state of the database, *not* the external (modeled) reality * they have similar issues regarding primary keys, foreign keys, and referential integrity as valid time relations * tt_start is always greater then, or equal to, _NOW_ - this is because although we cannot travel back in time to enter a tuple (_tt_start_ < _NOW_) we *can* enter a tuple with a _tt_start_ time in the future, thereby commiting an insertion to a future date. This last feature can be quite usefull. For example, assume we know, at the present moment, that KDEN will be changing locations in 2004 and that the elevation will then be 1777 meters above sea level. We may then execute the following at any point prior to that : -- virtually delete the extant KDEN row in the future update stations set tt_stop = "2004" where sid = "KDEN" and tt_stop = "infinity"; -- insert a row which will not "appear" untill 2004 insert into stations values ("KDEN", "1777", "DENVER INTL", "2004", "infinity"); Resulting in stations becoming : stations ----------------------------------------------------------- sid | elevation | description | tt_start | tt_stop --- | | | | ----------------------------------------------------------- | | | | PANC | 40 | ANCHORAGE INTL ARPT | 1970 | infinity KDEN | 1600 | DENVER INTL | 1970 | 1999 KDEN | 1656 | DENVER INTL | 2000 | 2004 KDEN | 1777 | DENVER INTL | 2004 | infinity | | | | Which will commit the information and yet processes which may execute queries like : -- get the current elevation of KDEN select * from stations where sid = "KDEN" and tt_start <= now() and now() < tt_stop; will *continute* to see the elevation of 1656 untill 2004 has arrived, at which point the 1777 tuple will become visible using the same query as above. This feature can be used to maintain a single relation for both testing and production software versions since production processes may ask for the currently active tuple while testing processes may ask for the latest tuple even if that tuple technically will not exist untill a future date. If testing is successfull production processes will automatically begin using the new tuple when the specified time has elapsed. We will not cover transaction time relations further because the next section describes the focus of BtPgsql, bi-temporal Relations, which encompass both vaild time relations and transaction time relations in one incredibly difficult to understand bundle ;-). === Bi-Temporal Relations --- Bi-temporal relations model both the history of a relation and the history of the facts modeled by that relation - they track the history of the table and record a history of a reality which changes over time (is there any other?). In otherwords they at once exihibit all the traits of valid time and transaction time relations. For example, consider this sequence of events committed against a bi-temporal relation : Event 0) The year is 1970 and it is decided that, as is best known, the KDEN station is sitting at 1600 meters. It is not really known when this became true (the station began operating) but it is condsidered to have been in existence from the beginning of time. It is not known when the station will move and so a vt_end time of infinity is stored. This fact is entered with a tt_start time of *NOW* (1970) and the tt_stop field is left as infinity, since the tuple has not yet been virtually deleted. stations --------------------------------------------------------------------------------- sid | elevation | description | vt_begin | vt_end | tt_start | tt_stop --- | | | | | | --------------------------------------------------------------------------------- | | | | | | KDEN | 1600 | DENVER INTL | -infinity | infinity | 1970 | infinity | | | | | | Event 1) In 1980 it is discovered that the station actually moved in 1979, and that it is now located at an elevation of 1656 meters. The relation can model this as : stations --------------------------------------------------------------------------------- sid | elevation | description | vt_begin | vt_end | tt_start | tt_stop --- | | | | | | --------------------------------------------------------------------------------- | | | | | | KDEN | 1600 | DENVER INTL | -infinity | infinity | 1970 | 1980 KDEN | 1600 | DENVER INTL | -infinity | 1979 | 1980 | infinity KDEN | 1656 | DENVER INTL | 1979 | infinity | 1980 | infinity | | | | | | This set of tuples would allow the following queries to be answered : What was the elevation of KDEN in 1965 as was best known in 1973? select * from stations where sid = "KDEN" and vt_begin <= "1965" and "1965" < vt_end and tt_start <= "1973" and "1973" < tt_stop; Which would identify this tuple : stations --------------------------------------------------------------------------------- sid | elevation | description | vt_begin | vt_end | tt_start | tt_stop --- | | | | | | --------------------------------------------------------------------------------- | | | | | | KDEN | 1600 | DENVER INTL | -infinity | infinity | 1970 | 1980 | | | | | | Or this query : What was the elevation of KDEN in 1981 as is best known now? select * from stations where sid = "KDEN" and vt_begin <= "1981" and "1981" < vt_end and tt_start <= now() and now() < tt_stop; Which would identify this tuple : stations --------------------------------------------------------------------------------- sid | elevation | description | vt_begin | vt_end | tt_start | tt_stop --- | | | | | | --------------------------------------------------------------------------------- | | | | | | KDEN | 1656 | DENVER INTL | 1979 | infinity | 1980 | infinity | | | | | | Hopefully it is clear at this point that bi-temporal relations model information about the history of the real world and also when this information was inserted/deleted/updated in a relation. These two timelines, that of the real world (valid time) and that of the relation (transaction time) are completely orthogonal. It is easy to imagine all of this becoming very confusing, especially when relationships (foreign keys) between bi-temporal relations are considered. It would be beyond the scope of this doccument to completely described the semantics of all possible bi-temporal relationships but note that BtPgsql handles all of them automatically for you in most cases and that all that is required by users of the classes in this module is an understanding of *what* to do - BtPgsql takes care of *how*. == BtPgsql - Bi-Temporal Management Made Easy(ier) --- === The _How_ --- Before moving along to *what* BtPgsql can do, the reader would be advised to puruse the explanation provided of *how* it accomplishes it's function described in the DESIGN overview. === The _What_ --- At this point we are in a position to discuss some of the things BtPgsql can do for you. This part the doccumentation defers to that of the BtPgsql module and it's fundemental class - BtPgsql::Relation. We simply enumerate here, some of the more important methods of this class : * BtPgsql::Relation.seq_insert * BtPgsql::Relation.seq_update * BtPgsql::Relation.seq_delete * BtPgsql::Relation.seq_push == Summary --- BtPgsql can substantially reduce the headaches associated with managing bi-temporal relational database tables with hand written sql. == Suggested Reading --- The inspiration for BtPgsql was the excellent, Developing Time-Oriented Database Applications in SQL by Richard T. Snodgras. Several logic errors in that text have been correctly implemented in BtPgsql and surely several have been introduced - it is a difficult set of rules to correctly code.