Most Engineering data can be tabulated into an array of rows and columns, making them amenable to relational database engines. Use of the relational paradigm is attractive because it emphasizes data integrity and parsimonious storage.
The relational paradigm also forces explicit recognition of relationships between the various elements of the data, mitigating many types of cognitive bias. Although rarely taught as such in an academic System Engineering setting, I consider it to be a fundamental SE skill.
I find that the most useful way to summarize the approach is by way of stating criteria as they are applied to the final product. In that context, the following basic definitions will be used:
- “Row” is synonymous with “record”. Most database engines are designed to add, modify, or delete records.
- “Column” is synonymous with “field” and “attribute”. Most database engines are intended to add or delete fields only when the database is being designed. In fact, many of them allow the database designer to specify rules making it nearly impossible to add fields once a table has been populated.
- A “database engine” is a collection of software routines that perform highly standardized “relational” operations on data. These routines don’t care what the data mean; they’re just implementing “set theory” to construct complicated data for display or as input to another piece of software. The structure issue is left to the database design phase, in which tables and their structure (columns) are defined.
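To make the definitions concrete, here is a minimal SQL sketch (the table and column names are hypothetical): the CREATE TABLE statement fixes the fields at design time, while the subsequent statements add, modify, and delete records at run time.

```sql
-- Design time: the fields (columns) are fixed when the table is created.
CREATE TABLE Component (
    PartNumber  VARCHAR(20) NOT NULL PRIMARY KEY,  -- field / attribute
    MassKg      DOUBLE PRECISION                   -- field / attribute
);

-- Run time: the engine adds, modifies, and deletes records (rows).
INSERT INTO Component (PartNumber, MassKg) VALUES ('PN-1001', 2.75);
UPDATE Component SET MassKg = 2.80 WHERE PartNumber = 'PN-1001';
DELETE FROM Component WHERE PartNumber = 'PN-1001';
```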
From force of habit long established, I generally follow Shlaer and Mellor (S&M) [1] in the analysis of “uncontrolled input data”. Their development follows an ordered set of rules, written in a highly accessible manner. There are, however, a couple of un-enumerated rules embedded in their process, and I find it preferable to re-state them as criteria (below) to be met by the final set of tables in a database.
As a general rule, I prefer the specification of criteria to specification of rigid process or procedure. This permits innovation with respect to methodology, without loss of integrity in the result.
Other methodological sources exist, all of which eventually lead to the same database design, if followed with sufficient discipline. Some of those sources also define criteria beyond those given below. I don’t dispute the utility of the additional criteria, but I have rarely encountered a need for them when operating on Engineering data, so I’ll leave their investigation to the interested reader.
1) Single Value: Cell entries shall have exactly one value.
Most technical engineering values are numerical. Fortunately, this criterion also admits categorical data, treated as if selected from a list of valid entries (e.g., a “stoplight color” column can take values of “Red”, “Yellow”, or “Green”). The best way to deal with categorical data is to create a separate table for them, referencing into it from the primary data table(s). This will permit ancillary data to be recorded about the categories (which almost always turns out to be necessary).
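As a sketch of that structure (hypothetical names throughout), the “stoplight color” example becomes its own table, referenced from the primary data table; the Meaning column stands in for the ancillary data that almost always accumulates on categories.

```sql
-- Category table: one record per valid entry, with room for ancillary data.
CREATE TABLE StoplightColor (
    ColorName  VARCHAR(10) NOT NULL PRIMARY KEY,
    Meaning    VARCHAR(100)  -- ancillary data about the category
);
INSERT INTO StoplightColor VALUES ('Red',    'Action required');
INSERT INTO StoplightColor VALUES ('Yellow', 'Watch item');
INSERT INTO StoplightColor VALUES ('Green',  'Nominal');

-- The primary data table references the category table, so the engine
-- rejects any Status entry not on the list of valid values.
CREATE TABLE RiskItem (
    RiskID  VARCHAR(12) NOT NULL PRIMARY KEY,
    Status  VARCHAR(10) NOT NULL REFERENCES StoplightColor (ColorName)
);
```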
Aside
S&M use “value” in the colloquial sense. I considered using “entry” here, but that would simply have created two possible sources of confusion where there would otherwise have been just one. Therefore, when discussing relational database principles, I follow the colloquial usage.
Because we have to operate in the real Engineering world, this criterion will require use of the IEEE 754-1985 conventions for “NaN” (Not a Number, meaning “undefined”) and “INF” (infinity). Not all relational database engines can accept these values, and their use might require extensive work-arounds. As far as I know, this notion is a significant departure from standard database design practice, but I’ve found no good ways around it [2].
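Support varies by engine. As one data point, PostgreSQL accepts the IEEE 754 special values as quoted literals in floating-point columns; the sketch below (hypothetical table) assumes that engine and will fail on many others.

```sql
-- PostgreSQL accepts 'NaN' and 'Infinity' for DOUBLE PRECISION columns;
-- many other engines reject them, forcing a work-around.
CREATE TABLE Measurement (
    ChannelName  VARCHAR(20) NOT NULL PRIMARY KEY,
    ValueVolts   DOUBLE PRECISION
);
INSERT INTO Measurement VALUES ('Nominal', 3.3);
INSERT INTO Measurement VALUES ('Bias',    'NaN');       -- undefined
INSERT INTO Measurement VALUES ('Railed',  'Infinity');  -- saturated
```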
This criterion will also require temporary suspension as data are developed: even if we know the name of an attribute, we don’t always know the value right away, and we don’t want to be keeping lots of scrap paper while resolving the TBD’s (“To Be Determined”). Depending on the nature of the effort, I have successfully used three different practices to deal with values that are TBD:
- Use the NULL value defined for Structured Query Language (SQL) to indicate values that are not known yet. The database makes no attempt to serialize or manage the TBD’s other than to indicate where they are (they’re easy to find in this practice).
- Explicit enumeration of TBD’s, which fills the TBD cell with an integer pointing to the Pseudo-Primary Key [3] in a separate, project-global tabulation of related data (a sketch of this practice follows the list).
  - This method is not strictly relational until all of the TBD’s have been resolved. Until then, there is a finite risk of ambiguity in the data.
  - The ambiguity can be mitigated (but not eliminated) by shadowing each technically substantive table with an identically structured table of one-bit data indicating whether a cell of the primary table is a TBD [4].
  - Mitigation by means of secondary tables creates new possibilities for error: modification of the primary table (the one with the actual data) must be exactly matched in the secondary table.
  - Mitigation by means of secondary tables should always be managed by code. Manual management has typically proven inadequate.
  - Mitigation by means of secondary tables isn’t really feasible unless all of the attributes are known before we start entering data into a table.
- Creation of a structured cell array in Matlab, where each cell can accept either IEEE-compliant data or enumerated TBD’s in the form of a string (that is, cells that are not “strongly typed” at design time, but can be changed on the fly). Of course, this method requires dedicated coding of standard relational operations, but the basics are supported in standard m-code.
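Here is a sketch of the second practice in SQL (hypothetical names, PostgreSQL autonumber syntax assumed): TBD cells carry, by project convention, a negative integer pointing into the project-global TBD table. Nothing in the engine distinguishes that pointer from a real datum, which is exactly the residual ambiguity noted above.

```sql
-- Project-global tabulation of TBD's, keyed by an autonumbered
-- Pseudo-Primary Key.
CREATE TABLE TbdRegister (
    TbdID        SERIAL PRIMARY KEY,
    Description  VARCHAR(200),
    NeedDate     DATE
);

-- Primary data table: MassKg holds either a real value or (by this
-- project's convention) a negative integer pointing into TbdRegister.
CREATE TABLE Component (
    PartNumber  VARCHAR(20) NOT NULL PRIMARY KEY,
    MassKg      DOUBLE PRECISION
);

INSERT INTO TbdRegister (Description, NeedDate)
    VALUES ('Mass pending vendor quote', '2026-06-01');
INSERT INTO Component VALUES ('PN-1001', -1);    -- TBD: points at TbdID 1
INSERT INTO Component VALUES ('PN-1002', 2.75);  -- resolved value
```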
2) Atomic Data: Cell entries shall be atomic, without resorting to cypher or algorithm.
For example, “408731 seconds” is OK, but “4d:17hr:32min:11sec” is not.
Unit conversion is a different issue. Taken in concert with the non-transitive criterion (below), this means that a single set of units applies to all entries in a given column. Conversion to that (selected) set of units must be undertaken before data entry.
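A sketch of the distinction (hypothetical names): the stored value is one atomic number in the column’s agreed unit, and any composite display format is computed on output rather than stored.

```sql
-- Store the atomic value in the single agreed unit (seconds).
CREATE TABLE TestEvent (
    EventName    VARCHAR(30) NOT NULL PRIMARY KEY,
    DurationSec  INTEGER NOT NULL
);
INSERT INTO TestEvent VALUES ('ThermalVac', 408731);

-- Composite displays such as 4d:17hr:32min:11sec are computed on
-- output, never stored.
SELECT EventName,
       DurationSec / 86400           AS Days,
       (DurationSec % 86400) / 3600  AS Hours,
       (DurationSec % 3600) / 60     AS Minutes,
       DurationSec % 60              AS Seconds
FROM TestEvent;
```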
3) Keys: Each table shall have at least one identified combination of one or more fields capable of retrieving at most one record from the table.
It is possible for any given table to have more than one key. Standard practice selects one key as “Primary”, and places a uniqueness-assuring “index” on every other such combination. Certain relational operations work by intersecting two (or more) tables over part or all of a given key.
It is common for a database designer to assign an auto-numbered field as the Primary Key; so common, in fact, that some people simply call such a field “Primary” and don’t bother with analysis of this issue. I refer to this concept as a “Pseudo-Primary Key”, because it actually violates some of the criteria below. It is, however, extremely useful because the use of such a field can dramatically reduce storage requirements and improve the speed of subsequent operations.
A Pseudo-Primary Key should not be defined as an initial approach to table design, because doing so bypasses many of the integrity-enforcement tools built into the database engine. Good database design always restricts itself to the real data until all of the keys have been identified. Don’t add an auto-numbered Pseudo-Primary Key until after you’ve assured that all of the “real” data are correctly related to each other.
Note the use of “at most one record” in the criterion. It allows for the possibility that no record matches the selection criterion: the case where we asked for something that could exist, but doesn’t.
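A sketch of that sequence (hypothetical names, PostgreSQL syntax assumed): design with the real key first; only after the relationships are assured, add the autonumbered Pseudo-Primary Key, demoting the real key to a uniqueness-assuring index so the engine still enforces it.

```sql
-- First: restrict the design to the real data. The key here is the
-- combination of assembly name and reference designator.
CREATE TABLE InstalledPart (
    AssemblyName   VARCHAR(20) NOT NULL,
    RefDesignator  VARCHAR(10) NOT NULL,
    PartNumber     VARCHAR(20) NOT NULL,
    PRIMARY KEY (AssemblyName, RefDesignator)
);

-- Later, if storage or speed demands it: add the autonumbered
-- Pseudo-Primary Key and keep a unique index on the real key.
ALTER TABLE InstalledPart DROP CONSTRAINT installedpart_pkey;
ALTER TABLE InstalledPart ADD COLUMN InstalledPartID SERIAL PRIMARY KEY;
ALTER TABLE InstalledPart
    ADD CONSTRAINT uq_installedpart_realkey
    UNIQUE (AssemblyName, RefDesignator);
```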
4) Correct Attribution: Each field that is not part of the Primary Key shall be an attribute of the entire Primary Key.
This basically says that if there is relevant information about an entire class of things, then that is to be stored in a table about the class, separately from information about some specific subset of the class. So information about Mars goes in a table having a Primary Key of “Planet Name”, but information about “planet” as an abstract concept goes somewhere else.
Meeting this criterion will often require “splitting” of tables, where one table keeps the original (provisional) Primary Key, and each of the other tables gets a new key (usually by adding a new field to the original Primary Key in order to distinguish between rows).
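Continuing the planetary example (hypothetical columns): facts about individual planets key on the planet’s name, while facts about planet categories as classes live in a table of their own.

```sql
-- Attributes of planet categories as a class: stated exactly once.
CREATE TABLE PlanetCategory (
    CategoryName  VARCHAR(20) NOT NULL PRIMARY KEY,
    Definition    VARCHAR(200)
);

-- Attributes of individual planets: keyed by the planet's name and
-- referencing the class table rather than repeating class facts.
CREATE TABLE Planet (
    PlanetName  VARCHAR(20) NOT NULL PRIMARY KEY,
    Category    VARCHAR(20) NOT NULL REFERENCES PlanetCategory (CategoryName),
    MassKg      DOUBLE PRECISION
);
```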
5) Non-transitive Relationships: Each combination of one or more non-key fields shall be uncorrelated with any other combination of one or more non-key fields.
That is, the table cannot have relationships that are “transitive” (in set theory speak). The criterion prohibits a table in which any combination of one or more cells holds values that can be calculated (or looked up in a table) from some combination of the other cells in that same row. This is sort of like saying that the columns form a linearly independent set of equations: not exactly like it, but sort of.
Meeting this criterion will often require deletion of columns as the data are analyzed during database design. Such deletion is often accompanied by writing code that can easily reconstruct the deleted data as an output to be displayed or consumed by other software instead of being stored.
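In SQL, the reconstructing code is often just a view (hypothetical names; the received-level formula is a deliberately simplified illustration): the derived column is computed on demand instead of stored, so it can never contradict its inputs.

```sql
-- Stored table: only the independent quantities.
CREATE TABLE LinkBudget (
    LinkName    VARCHAR(20) NOT NULL PRIMARY KEY,
    TxPowerW    DOUBLE PRECISION NOT NULL,
    PathLossDb  DOUBLE PRECISION NOT NULL
);

-- The derived column lives in a view, computed on demand from the
-- stored data (PostgreSQL LOG() is base 10).
CREATE VIEW LinkBudgetReport AS
SELECT LinkName,
       TxPowerW,
       PathLossDb,
       10 * LOG(TxPowerW * 1000) - PathLossDb AS RxLevelDbm
FROM LinkBudget;
```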
This notion is particularly important in the context of requirements and specifications, where we don’t want to have accidental contradiction. The importance of that notion cannot be over-stated!
Footnotes

1. Shlaer, Sally and Stephen J. Mellor, “Object-Oriented Systems Analysis”, Yourdon Press (ISBN 0-13-629023-X), 1988.
2. That issue has often driven me straight to Matlab.
3. Often referred to as an “autonumber”, with uniqueness assured by the database engine.
4. This concept can be extended to additional bits in order to accommodate “flavors” of TBD.