Data modeling is the act of
exploring data-oriented structures. Like other modeling
artifacts, data models can be used for a variety of
purposes, from high-level conceptual models to physical
data models (PDMs). Physical data modeling is
conceptually similar to design class modeling: the goal
is to design the internal schema of a database,
depicting the data tables, the data columns of those
tables, and the relationships between the tables.
Figure 1 presents a partial PDM for the university. You
know that it isn't complete because the Seminar table
includes foreign keys to tables that aren't shown, and
because many domain concepts, such as course and
professor, are clearly not modeled.
All but one of the boxes represent tables; the
exception is UniversityDB, which lists the stored
procedures implemented within the database. Because the
diagram is given the stereotype Physical Data Model,
you know that the class boxes represent tables; without
the diagram stereotype I would have needed to apply the
stereotype Table to each table.
Relationships between tables are modeled using standard
UML notation; although not shown in the example, it
would be reasonable to model composition and
inheritance relationships between tables.
Relationships are implemented via the use of keys (more
on this below).
When you are physical data modeling, the following
tasks are performed in an iterative manner:
Identify tables. Tables are the database
equivalent of classes; data is stored in physical
tables. As you can see in Figure 1,
the university has a Student table to store
student data, a Course table to store course
data, and so on. Figure 1 uses
UML-based notation (this is a publicly defined
profile which anyone can provide input into). If you
have a class model in place a good start is to do a
one-to-one mapping of your classes to data tables, an
approach that works well in "greenfield" environments
where you have the luxury of designing your database
schema from scratch. Because this rarely happens in
practice you need to be prepared to be constrained by
one or more legacy database schemas which you will
then need to map your classes to. In these situations
it is unlikely that you will need to do much data
modeling; you will simply need to learn to live with
the existing data sources, but you will need to be
able to read and understand existing models. In some
cases you may need to perform
legacy data analysis and model the existing schema
before you can start working with it.
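To make the one-to-one class-to-table mapping concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in "greenfield" database. The Student table and its column names are assumptions drawn from the figure, not a definitive schema.

```python
import sqlite3

# In-memory database standing in for a greenfield schema.
conn = sqlite3.connect(":memory:")

# One-to-one mapping of the Student class to a Student table.
# Column names are illustrative assumptions based on Figure 1.
conn.execute("""
    CREATE TABLE Student (
        StudentNumber INTEGER PRIMARY KEY,
        FirstName     TEXT,
        Surname       TEXT
    )
""")
conn.execute("INSERT INTO Student VALUES (1701, 'Sheila', 'Jones')")
row = conn.execute(
    "SELECT FirstName FROM Student WHERE StudentNumber = 1701"
).fetchone()
```

In a legacy environment you would instead read the existing schema and map your classes onto whatever tables you find there.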
Data normalization is a process in which data
attributes within a data model are organized to
increase the cohesion of tables and to reduce the
coupling between tables. The fundamental goal is to
ensure that data is stored in one and only one place.
This is an important consideration for application
developers because it is incredibly difficult to
store objects in a relational database if a data
attribute is stored in several places. The tables in
Figure 1 are in third normal form.
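A small sketch of what "one and only one place" buys you, again using sqlite3; the Course/Seminar schema here is a hypothetical illustration, not the exact schema from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized: the course name lives in exactly one place (Course),
# and each Seminar row refers to it by key instead of repeating it.
conn.executescript("""
    CREATE TABLE Course  (CourseID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Seminar (SeminarOID INTEGER PRIMARY KEY,
                          CourseID INTEGER REFERENCES Course(CourseID));
    INSERT INTO Course  VALUES (100, 'Data Modeling 101');
    INSERT INTO Seminar VALUES (1, 100);
    INSERT INTO Seminar VALUES (2, 100);
""")
# Renaming the course touches one row, not every seminar row.
conn.execute(
    "UPDATE Course SET Name = 'Data Modeling 201' WHERE CourseID = 100"
)
names = conn.execute("""
    SELECT DISTINCT c.Name
    FROM Seminar s JOIN Course c USING (CourseID)
""").fetchall()
```

Had the course name been copied into every Seminar row, the update would have had to hunt down every copy, and a missed copy would leave the data inconsistent.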
Identify columns. A column is the database
equivalent of an attribute, and each table will have
one or more columns. For example, the Student
table has attributes such as FirstName and
StudentNumber. Unlike attributes in classes,
which can either be primitive types or other objects,
a column may only be a primitive type such as a char
(a character string), an int (an integer), or a float.
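The flattening of class attributes into primitive columns can be sketched as DDL; the column names and chosen types below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns hold only primitive types; a non-primitive attribute from
# the class model would be flattened into char/int/float columns.
conn.execute("""
    CREATE TABLE Student (
        StudentNumber INTEGER,      -- int
        FirstName     VARCHAR(30),  -- char (character string)
        AverageMark   FLOAT         -- float
    )
""")
# Inspect the resulting columns via the catalog.
cols = [c[1] for c in conn.execute("PRAGMA table_info(Student)")]
```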
Identify stored procedures. A stored procedure is
conceptually similar to a global method implemented by
the database. In Figure 1 you
see that stored procedures such as averageMark()
and studentsEnrolled() are modeled as
operations of the class UniversityDB. These
stored procedures implement code that works with data
stored in the database; in this case they calculate
the average mark of a student and count the number of
students enrolled in a given seminar, respectively.
Although some of these stored procedures clearly act
on data contained in a single table they are not
modeled as part of the table (along the lines of
methods being part of classes). Instead, because
stored procedures are a part of the overall database
and not a single table, they are modeled as part of a
class with the name of the database.
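SQLite (used for the sketches here) has no stored procedures, so the following stands in for what an averageMark() procedure would compute inside the database: a query over student marks. The Mark table is a hypothetical schema, and in a server DBMS this logic would live in a CREATE PROCEDURE instead of a Python function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Mark (StudentNumber INTEGER,
                       SeminarOID    INTEGER,
                       Mark          FLOAT);
    INSERT INTO Mark VALUES (1701, 1, 80.0);
    INSERT INTO Mark VALUES (1701, 2, 90.0);
""")

def average_mark(conn, student_number):
    # The query an averageMark() stored procedure would wrap:
    # aggregate over the student's marks inside the database.
    (avg,) = conn.execute(
        "SELECT AVG(Mark) FROM Mark WHERE StudentNumber = ?",
        (student_number,),
    ).fetchone()
    return avg

result = average_mark(conn, 1701)
```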
Apply naming conventions. Your organization should
have standards and guidelines applicable to data
modeling, and if not you should lobby to have some put
in place. As always, you should follow AM's practice
Apply Modeling Standards.
Identify relationships. There are relationships
between tables just like there are relationships
between classes. The advice presented for relationships
on UML class diagrams applies here as well.
Apply data model patterns. Some data modelers will
apply common data model patterns; David Hay's (1996)
Data Model Patterns is the best reference on
the subject. Data model patterns are conceptually
closest to analysis patterns because they describe
solutions to common domain issues. Hay's book is a
very good reference for anyone involved in
analysis-level modeling, even when you're taking an
object approach instead of a data approach because his
patterns model business structures from a wide variety
of business domains.
Assign keys. A key is one or more data attributes
that uniquely identify a row in a table. A key made up
of two or more attributes is called a composite key.
A primary key is the preferred key for an entity type
whereas an alternate key (also known as a secondary
key) is an alternative way to access rows within a
table. In a physical database a key would be formed
of one or more table columns whose value(s) uniquely
identifies a row within a relational table. Primary
keys are indicated using the <<PK>> stereotype and
foreign keys via <<FK>>.
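The <<PK>> and <<FK>> stereotypes on the diagram translate directly into PRIMARY KEY and REFERENCES clauses in DDL. A sketch, with an assumed Enrollment table whose composite primary key is built from two columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE Student (StudentNumber INTEGER PRIMARY KEY,  -- <<PK>>
                          FirstName     TEXT);
    CREATE TABLE Enrollment (
        StudentNumber INTEGER REFERENCES Student(StudentNumber),  -- <<FK>>
        SeminarOID    INTEGER,
        PRIMARY KEY (StudentNumber, SeminarOID)  -- a composite key
    );
""")
conn.execute("INSERT INTO Student VALUES (1701, 'Sheila')")
conn.execute("INSERT INTO Enrollment VALUES (1701, 2)")

# The foreign key rejects rows pointing at a nonexistent student.
try:
    conn.execute("INSERT INTO Enrollment VALUES (9999, 2)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```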
Although similar notation is used, it is interesting
to note the differences between the
PDM of Figure 1 and the
UML class diagram on which it is based:
Keys. Where it is common
practice to not model scaffolding properties on class
models, it is common to model keys on physical data
models.
No visibility. Visibility
isn't modeled for columns because they're all public.
However, because most databases support access control
rights, you may want to model those rights using UML
constraints, UML notes, or as
business rules. Similarly, stored procedures are
also public, so their visibility isn't modeled either.
No many-to-many associations.
Relational databases are unable to natively support
many-to-many associations, unlike objects, and as a
result you need to resolve them via the addition of an
associative table. The closest thing to an
associative table in is WaitList which was
introduced to resolve the on waiting list
many-to-many association depicted in the
class diagram. A pure associative table is
comprised of the primary key columns of the two tables
which it maintains the relationship between, in this
case StudentNumber from Student and
SeminarOID from Seminar. Notice how in
WaitList these columns have both a PK and
an FK stereotype because they make up the
primary key of WaitList while at the same time
are foreign keys to the other two tables. WaitList
isn't truly an associative table because it contains
non-key columns, in this case the Added column
which is used to ensure that the first people on the
waiting list are the ones that are given the
opportunity to enroll if a seat becomes available.
Had WaitList been a pure associative table I
would have applied the associative table
stereotype to it.
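The WaitList pattern can be sketched in DDL: a composite primary key built from the two foreign keys, plus the non-key Added column for first-come-first-served ordering. The exact column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WaitList resolves the many-to-many "on waiting list" association:
# its composite primary key is made of the two foreign keys, plus the
# non-key Added column used to order the waiting list.
conn.executescript("""
    CREATE TABLE Student  (StudentNumber INTEGER PRIMARY KEY);
    CREATE TABLE Seminar  (SeminarOID INTEGER PRIMARY KEY);
    CREATE TABLE WaitList (
        StudentNumber INTEGER REFERENCES Student(StudentNumber),
        SeminarOID    INTEGER REFERENCES Seminar(SeminarOID),
        Added         TEXT,
        PRIMARY KEY (StudentNumber, SeminarOID)
    );
    INSERT INTO Student VALUES (1701), (1702);
    INSERT INTO Seminar VALUES (5);
    INSERT INTO WaitList VALUES (1702, 5, '2024-01-03');
    INSERT INTO WaitList VALUES (1701, 5, '2024-01-01');
""")
# Added lets us offer a freed seat to whoever has waited longest.
first = conn.execute("""
    SELECT StudentNumber FROM WaitList
    WHERE SeminarOID = 5 ORDER BY Added LIMIT 1
""").fetchone()
```

Without the Added column this would be a pure associative table: nothing but the two borrowed primary keys.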
I will often use a CASE tool to
create physical data models. The two features I require
of a data modeling tool are the ability to generate data
definition language (DDL) code required to create the
database schema and the ability to reverse engineer a
data model from an existing database schema. Virtually
all data modeling tools still on the market today
support these features.
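A crude stand-in for a tool's reverse-engineering feature, under the assumption of an SQLite database: the DDL of an existing schema can be recovered from the system catalog (sqlite_master here; server databases expose similar catalogs such as information_schema).

```python
import sqlite3

# Pretend this is a legacy database we were handed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student (StudentNumber INTEGER PRIMARY KEY, FirstName TEXT)"
)

# Reverse engineering in miniature: read the stored DDL back out of
# the catalog instead of writing the model from scratch.
ddl = [sql for (sql,) in conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table'"
)]
```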