How to develop a conceptual data model

An(other) example: creating an ERD for the dance school


The title is a hybrid of a question and guideline. A similar hybrid appears in Chapters 2-6 for each type of model presented in “The what and how of modelling information and knowledge” where I illustrate the development of a ‘typical’ model of that type. Chapter 4 is about conceptual data models where I made the choice to demonstrate the procedure with a topic that aligned with Chapter 3’s example. But I promised to use another example on the book’s website, and here we are.

The example—domain and task—is to develop a conceptual data model for a dance school following my adaptation of the Conceptual Schema Design Procedure for ORM that was originally proposed by Terry Halpin. While ORM is arguably better, EER is more popular. (This page assumes you know what EER diagrams are.)

The reason I gave for relegating it to the online web page was that the dance school model may be more useful to more people and therefore it was to be left to free online material. This is true. There are comparatively many more dance schools that face the same tasks of membership and class administration than there are ornithologists and ecologists needing databases. What is also true is that there are many more distracting options than the comparatively demarcated topic of the scientific database. This is fine for a web page but not so much for a book. And so, without further ado, here are those seven steps, which are preceded by a ‘bonus’ step for context at the start and considerations on EER notations and looking beyond the CSDP at the end.
   

The exploration phase (“Step 0”)

Not all dance schools are the same, yet a model for the database depends on the dance school. Also in this case, there is no readily available neat, clear, short, contained description to base the model off, and thus the situation is again not like the typical textbook case of EER diagram development. Let’s explore the domain before we’ll talk to our client. For this illustration: which sort of dance school are we going to select to assume to be our client?

I could take Maroesjka where I used to dance folk dance for 12 years. Members (or their parents, or the municipality if their income was below the threshold) paid upfront for the whole school year that ran from September to July. Most members were a member of one group, but if of two, then the other was the demonstration group. The teachers taught more than one class. Several housewives sewed the national and international costumes. For the adult groups, you could buy a tape with the songs of the dances we learned that school year. This succinct description is along the line of the stories with crisp sentences that we as lecturers conjure up as exercises for students to devise an EER diagram.

A conceptual data model for a database to store this information would be quite different from the Latin & Ballroom school, even though it operated in the same village at the same time. You (/parents/municipality) paid per semester, there were rigid levels, exams to take (bronze, silver, gold, gold star), contests for the very dedicated, and summer dance crazes (the lambada when I danced there—it’s been a while). One level covered several dance styles, such as cha-cha-cha, quickstep, or waltz. Classes were taught by a couple who taught multiple classes.

In contrast, the “SBK scene”, to learn Salsa, Bachata and/or Kizomba (and/or Semba and/or Cha-Cha-Cha and/or ladies styling thereof, and/or a particular variant of each type) here in Cape Town, is a yet different story, irrespective of the school. There may be ‘walk-ins’ (pay-per-lesson) or monthly payments, sign-ups for one dance style or for multiple ones with a reduced fee per month, and group lessons or private lessons. Each lesson does one dance style only. This may go on for the whole calendar year (paid monthly), for a performance group for a few months, or for a flashmob, or a workshop day or weekend. Songs may be shared by a teacher by means of a link to a Spotify list. There may be a WhatsApp group for communication. Costumes for performances or flashmobs are a story unto their own.

Now what? These examples would be part of the sort of preparatory work to do before meeting a client. It assists with getting an idea of the ‘universe of discourse’ and can feed into devising sample questions to ask the client in order to extract data needs. Like “do members pay per month?”, “can a teacher teach more than one class?”, “is each class for one dance style only?” and so on. Each dance school will have their unique set of answers to these questions and then some more data needs for their own prospective database for membership management. Not even two dance schools of the same type will have the exact same business rules and data needs. That’s also precisely the point with conceptual data models as compared to, say, ontologies: it is allowed to be different, and we’ll make it work somehow.   

What they all have in common is that there are members and teachers, classes, dance genres and styles, levels, and music for the classes. Perhaps we can build an enterprise model for that, but not today. For now, we’ll demarcate it to be for a system for a Latin & Ballroom dance school where payment is made per semester.
     

“Step 1” – The Facts 

Besides questions to dig deeper, we’ll need to get plausible sample data and construct elementary facts from them, which is part of that “step 1” (out of 7) of the conceptual schema design procedure (CSDP). A selection of such facts that cannot, or ought not to, be decomposed further without losing information may be as follows:

Member with name Lauren takes class #sem1in2023B1
Member with name Lauren takes class #sem1in2023L2
Member with name Josephine takes class #sem1in2023B3
Member with name Jake takes class # sem2in2023B3
Member with name Jake takes class #sem1in2023L1

class #sem1in2023B1 involves Ballroom genre at level Beginner in semester 1 2023
class #sem1in2023L2 involves Latin genre at level Intermediate in semester 2 2023
class #sem1in2023B3 involves Ballroom genre level Advanced in semester 1 2023
class #sem2in2023B3 involves Ballroom genre level Advanced in semester 2 2023
class #sem2in2023L3 involves Latin genre level Advanced in semester 2 2023

Member Josephine paid for semester 1 2023 
Member Jake paid for semester 2 2023

Jake as leader partners with Josephine as follower in the 2023 Cape Contest
Luke as leader partners with Lauren as follower in the 2023 Cape Contest
Luke as leader partners with Shireen as follower in the 2023 Cape Contest

Employee with name Shireen teaches Latin genre
Employee with name Josephine teaches Ballroom genre

You’re supposed to write down samples for all required data. For the sake of brevity, I have omitted the Bronze/Silver etc certification exams.
It’s a bit of a tedious task already, but if it’s done well here, it can be reused later in the procedure when drawing the model, interrogating the client further, and testing whether we did it right. And, with a bit of luck, the client already has gory data all over the place, in an unwieldy spreadsheet and in notes and folders, so it wouldn’t be a case of coming up with it from scratch entirely. Let’s go with it for now and see how it fares in the next steps.
     

“Step 2” - Start Drawing the Entity Types, Relationships, and Attributes

We need to generalise from the example facts to draw the key elements of the EER diagram. Some of it was already indicated in the sample sentences, like “Member with name Lauren...” and “... level Beginner …”. But what are we going to do with the “Employee Shireen teaches Latin genre”? Is it to be understood as teaching any one Latin class at any one level or all Latin classes regardless the level, or shouldn’t it matter in that any of those combinations should be allowed? And should we identify members and employees only by first name? Also, may a member play more than one role in a partner dance at different times?

Trying to convert the sentences into elements in the EER diagram—whether it should become an entity type, a relationship, or an attribute—shows that it turns out we should have been even more precise with those sample sentences than we already were. For instance, maybe “Employee with name Shireen teaches Latin genre taught in class #sem1in2023L1”. If we know the class they teach, however, we would be able to derive the genre they teach, provided the class has the genre specified. Thus, the revised sentence is not an improvement. “Employee with name Shireen teaches Latin genre at level Beginner” faces a similar issue, unless we make a distinction of what they teach currently versus what they are permitted to teach regardless whether they are actually teaching in that semester versus keeping a record also of what they have taught. This raises the question of the scope of the database, on whether it should store only current data or also past and possible future data.

Making decisions about these matters, a resultant partial EER diagram is shown in Figure 1, permitting any-time data.

Figure 1. The model at the end of step 2 of the amended CSDP.

I made several decisions, like that each of Genre, Level, and Semester are separate entity types. At least level could just as well have been made an attributes of the Class entity type, but this gives more flexibility in case we do need to record the teacher’s level or the contest’s level. Genre and Semester as entity types favours reuse, as shown for the member payment and the teacher’s genre specialty.
 

“Step 3” – Cleaning Up

Cleaning up any duplicates does not yield an improvement because there are no duplicates or synonyms among the entity types to merge them into one. Effectively, I already did so in step 2, with the reused Genre and Semester. There is, however, some repetition in the information about the people: members can be teachers as well, so we’d like to share the name across the two entity types. We can do this by introducing a supertype, say, Person, add the name identifier there and have it inherited. This also will make it easier to add further information, such as their phone numbers or email addresses, later on. There are no specific calculations needed, so we obtain the following minor refinement.

Figure 2. The model at the end of step 3 of the amended CSDP.

   

“Step 4” – Cardinality Constraints

How many times can an entity play that role in a relationship? Looking back at the sample data, we can glean at least part of the answers already. Partnering for the contest showed that a leader can partner with more than one follower—presumably for a different genre or level. It would be safer to ask the client whether a follower can partner with more than one lead, unless they already said that this is representative data. For payment, evidently, that’s exactly once per member per semester, but they may pay for multiple semesters and for each semester there are expected to be multiple members paying for it. And so on for the other constraints.

The other component step is about the arity of the relationships that we had already: can they be split up further without losing information? If we crafted our facts well, we shouldn’t have that problem, but things might have happened between step 1 and step 4. There does not seem to be an issue.

Overall, this gets us to an improved, but possibly still partial, EER diagram, as shown in Figure 3.   

Figure 3. The model at the end of step 4 of the amended CSDP. (Issues regarding notation are discussed near the end of the page)

 

“Step 5” – Mandatory Constraints

For the entities that participate, we need to determine whether each one in the possible population—those whose representation will be stored in the database, to be precise—must participate in the relationship or only some of them. For instance, all members must have paid their fee. The mandatory constraints on employees with genres is not clear from the sample data: does each employee have a genre for teaching or are there employees who don’t do any of it, like an administrator or a cleaner or an events manager? And could it be that we might be missing teachers teaching a genre altogether, as if, say, ballroom is a critical skill and no teacher may be available for some time and, hence, an optional participation? This also needs to be checked for the participating entity types in all the other relationships.

The second sub-task is something called ‘derivations’ in ORM, on whether some relationship can be derived, which does not exist in EER and so we arrive at and updated EER as in Figure 4. We’re nearly there.   

Figure 4. The model at the end of step 5 of the amended CSDP. (Issues regarding notation are discussed near the end of the page)

 

“Step 6” – Some Advanced Constraints

The ‘basic advanced’ constraints are value constraints and subtyping constraints. That the latter only pops up near the end is, perhaps, a reflection that the feature was added only later in conceptual data modelling and there tend to be few hierarchies for databases anyway, and for sure less than in UML Class diagrams. We have members as learners and as teachers, which we already had as subtypes of Person. This is then also the time to add any disjointness or completeness constraints over the subtyping, if applicable. Since members can be teachers, disjointness is not appropriate. One might argue that they’re the only two relevant roles and then completeness would apply.

Value constraints help to ensure only clean data will be added to the prospective database. For instance, we could constrain an attribute to have only two possible values, Latin and Ballroom, rather than a subtyping or leave the ‘name’ attribute open to other dance genres, and likewise for levels as {Beginner, Intermediate, Advanced} and roles to be only a choice of {leader, follower}. All this gets us to the EER diagram shown in Figure 5.

Figure 5. The model at the end of step 6 of the amended CSDP.

 

“Step 7” – More Advanced Constraints

Those even more advanced constraints from ORM aren’t used in EER, but if they were, we would be able to add that a person cannot partner with themselves in a contest, or: partnering is irreflexive.

We can run through some final checks, notably looking back at those sample facts from step 1: can they all be accommodated for? If so, then we’re sorted. We are with Figure 5 already, in fact.

Other considerations: notation and beyond the basics

And that’s about all there is—for this example, that is. What it glosses over are whether there are any weak entity types, which could be slotted in either in step 4 on step 6, and there are variants of EER that also consider cardinality on attributes, composite attributes and multivalued attributes, neither of which were necessary for this particular example (though I easily could have made up the need for it by adding Address of Person as a composite attribute).
And, honestly, this particular notation is suboptimal. This particular version of Crow's foot notation, also called Information Engineering notation (IE notation for short), is pushed by the drawing tool Omnigraffle that I used for creating the figures, but it’s not visually consistent. It uses a short line perpendicular to the relationship line both for cardinality of 1 and for mandatory participation, which amounts to icon overloading, and circle vs line for optional/mandatory participation rather than the same type of icon for the same type of constraint. Preferably, it would be, e.g., only lines for cardinality and only blobs for mandatory/optional, not alternating. With icon consistency -- lines for cardinality, blobs for participation -- and therewith one I'd rather see, the diagram looks as shown in Figure 6.

Figure 6. The final model at the end of step 6, but then with a visually coherent notation.

There are several notations for ER and EER and it's anyone's guess which one of all of them is the best. One that resembles Peter Chen's original proposal uses diamonds for relationships and ovals for attributes and either double/single solid (or solid/dashed) lines for mandatory/optional. It appears in a number of textbooks. That notation applied to our dance school model is shown in Figure 7. Some tools use colour for the elements. Either variant as a 'package' is acceptable, but their notations should not be mixed.

Figure 7. The final model at the end of step 6, roughly in Chen’s proposed notation, which is popular in some textbooks and modelling tools.

Finally, we can also go beyond the CSDP to model more domain information more precisely. To do so, we need to either bring in temporal conceptual data modelling, which was alluded to in step 2 with the past/present/future data, or reach out to content of Chapters 5 and 6 of the book, on ontologies and Ontology. Perhaps another time. For now, in the database development process, a suitable software application will have a button or menu option with “generate database schema”, which it does upon clicking it and selecting the SQL flavour, so that you can start using your tailor-made database in a mere few minutes after finalising the model.