Relational Databases/Introduction
Introduction
editRelational databases are a common and powerful approach to storing information. They make efficient use of computer storage by connecting sets of data together. In this course we are going to engage in a quick introduction to relational databases, starting from the basics of how they work and finishing at a point where the reader should be able to build their own simple database in an SQL-based relational database management system (RDMS), such as MySQL, SQL Server or PostgreSQL.
Some background: how data is stored in a flat file database
editI would like to propose a simple database: a small database for a personal library, where I hope to keep track of who has borrowed my books. To do so, I want to keep track of:
- Borrower's name
- Borrower's email
- Borrower's phone number
- When the book was borrowed
- Title
- Author
- Year published
Of course, a more complex database would store more data, but for a personal library that might be enough. This will let me know what the book is, (title, author and year of publication), who has it, when they took it, and how to contact them in order to ask for it back.
One method of tracking of these records is a spreadsheet. In a spreadsheet, each column could represent one of the fields above, while each row would cover one of the records. For example, I could use:
Name | Phone | Date | Title | Author | Published | |
---|---|---|---|---|---|---|
Sam | sam@internet | 1234-5678 | 1 January 1971 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 |
Alex | alex@internet | 2345-6789 | 1 March 1971 | The War of the Worlds | H. G. Wells | 1898 |
Kim | kim@internet | 3456-7890 | 1 May 1971 | A Study in Scarlet | Sir Arthur Conan Doyle | 1887 |
This is what is commonly referred to as a flat file database: a database in which all of the data is kept in a single file. It has the advantage that it is relatively easy to use. If I wished to know what book Alex had borrowed, I could quickly scan through the names column until I found Alex, then read across to find the title. Similarly, a computer could check the first column of each row until it found the desired name, and then quickly output the title of the book. There is a bit of a speed hit if there are a lot of records, as it could take a while to scan a file consisting of tens of thousands of entries, but I would get there in the end. And if needed, I could always optimise the data: for example, I could order the records by the name of the borrower or the title of the book, which would make searching even faster if I was only looking for one of those two fields.[1] Thus flat file databases are useful where either speed (given certain constraints) or simplicity is a concern, especially when the data set can be well defined and is small, or where searches are limited to just one field in an ordered set.
However, the limitations of this approach quickly become apparent. For example, what if Alex borrows two books at once? There are two options:
- Create a second record, repeating all of the information about Alex
- Add more columns to allow for more books
The first option creates:
Name | Phone | Date | Title | Author | Published | |
---|---|---|---|---|---|---|
Sam | sam@internet | 1234-5678 | 1 January 1971 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 |
Alex | alex@internet | 2345-6789 | 1 March 1971 | The War of the Worlds | H. G. Wells | 1898 |
Kim | kim@internet | 3456-7890 | 1 May 1971 | A Study in Scarlet | Sir Arthur Conan Doyle | 1887 |
Alex | alex@internet | 2345-6789 | 1 March 1971 | The Time Machine | H. G. Wells | 1895 |
Unfortunately, this introduces a number of concerns, although for now the focus on but two. First, by repeating data, the possibility of errors has increased. Every time the same information has to be entered, there is a chance that something will go wrong. And this problem becomes worse when Alex's details need to be updated, as those details must be updated in every one of the rows related to Alex. Databases need to be reliable, and anything which can increase the possibility of errors is a bad thing. Second, it wastes memory. The system already has Alex's details on record – now it has the same data twice, using up twice the memory.
Alternatively, more columns be added per record:
Name | Phone | Date 1 | Title 1 | Author 1 | Published 1 | Date 2 | Title 2 | Author 2 | Published 2 | |
---|---|---|---|---|---|---|---|---|---|---|
Sam | sam@internet | 1234-5678 | 1 January 1971 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 | ||||
Alex | alex@internet | 2345-6789 | 1 March 1971 | The War of the Worlds | H. G. Wells | 1898 | 1 March 1971 | The Time Machine | H. G. Wells | 1895 |
Kim | kim@internet | 3456-7890 | 1 May 1971 | A Study in Scarlet | Sir Arthur Conan Doyle | 1887 |
This makes the risk of errors smaller, but it is still wasting memory: all of those empty fields take up memory in the database, and most of them are unused. (Many databases will reserve a certain amount of memory per field, irrespective of what is actually in it – variable width fields do exist, but many fields are not variable, and even those that are have to commit some memory to the task). Plus, what happens if Alex now decides to borrow a third book? I could add a third set of columns, but then the memory usage is even worse - and, of course, that places a limit of three books per user. What if someone wants to borrow four?
Thus it is time for an alternative approach.
An alternative approach
editIf we stop thinking about computers, and instead think about how this would be handled this in real life, it seems that the solution is simple: create two sets of records. One as a list of "borrowers", containing their names, email addresses and phone numbers, and a second containing a list of books with the names of those who borrowed them. Implementing this using the above data, we get:
Name | Phone | |
---|---|---|
Sam | sam@internet | 1234-5678 |
Alex | alex@internet | 2345-6789 |
Kim | kim@internet | 3456-7890 |
Title | Author | Published | Name | Date |
---|---|---|---|---|
Alice's Adventures in Wonderland | Lewis Carroll | 1865 | Sam | 1 January 1971 |
The War of the Worlds | H. G. Wells | 1898 | Alex | 1 March 1971 |
The Time Machine | H. G. Wells | 1895 | Alex | 1 March 1971 |
A Study in Scarlet | Sir Arthur Conan Doyle | 1887 | Kim | 1 May 1971 |
This is much more effective that the flat file database: it has the potential to use less memory (although there is a need to include the borrower's name in two places); there is less information being repeated; and there is no limit as to how many books someone can borrow. There is a cost, of course: using it has become harder. For example, to find out who to ring to chase up a copy of The Time Machine, I would need to look in the list of books, find it, read the name of the borrower, then scan the list of borrowers until I find their name and look up their phone number. Not impossible, but certainly more difficult than with the flat file. (Doing this, in the language of relational databases, is called "joining" the two tables, and results in a speed hit).
There another problem, though. What would happen if there were two people called Alex? In such a situation, the system would collapse: there would be no means of telling if the Alex referred to in the books table is the first Alex in the borrowers table or a different Alex. The name is not sufficiently unique to identify all of the potential people who could borrow books. What is needed is something unique - a code to represent each borrower that would be different for each one. This, in computing terms, is called a unique identifier, and in databases a unique identifier can be a primary key (PK): a unique code that is the primary means of identifying a record in the database.[2] A common example is the registration number for a car, or an ID given to each student at a school.
The simplest solution to creating a unique identifier is to just number each record, using a numeric sequence starting at 1. Doing so will provide the unique identifier that the database needs, but it also has a neat side-effect: in computing numbers are generally much smaller than names, and thus the result is a reduction in how much memory is required to refer to the different records.
ID | Name | Phone | |
---|---|---|---|
1 | Sam | sam@internet | 1234-5678 |
2 | Alex | alex@internet | 2345-6789 |
3 | Kim | kim@internet | 3456-7890 |
4 | Alex | otheralex@internet | 4567-8901 |
Title | Author | Published | Borrower ID | Date |
---|---|---|---|---|
Alice's Adventures in Wonderland | Lewis Carroll | 1865 | 1 | 1 January 1971 |
The War of the Worlds | H. G. Wells | 1898 | 2 | 1 March 1971 |
The Time Machine | H. G. Wells | 1895 | 2 | 1 March 1971 |
A Study in Scarlet | Sir Arthur Conan Doyle | 1887 | 3 | 1 May 1971 |
This does make it a bit harder for people to read, perhaps, but computers won't suffer from the same problem. They can handle the numbers as easily as they can handle words.
Expanding the model
editThis is good, but it is still not perfect. One problem is that there is no way of knowing who borrowed the book last – the system doesn't retain a borrowing history. It has the date the book was borrowed on (which will presumably be blank if it has been returned), but as there is only one date box, we need to overwrite that date each time it is borrowed.
On solution to this problem could be to modify the "Books" list. For example, we could add a "returned" column so we know when it came back, and just add a new record when it is borrowed again.
Title | Author | Published | Borrower ID | Date | Returned |
---|---|---|---|---|---|
Alice's Adventures in Wonderland | Lewis Carroll | 1865 | 1 | 1 January 1971 | |
The War of the Worlds | H. G. Wells | 1898 | 2 | 1 March 1971 | 12 March 1971 |
The Time Machine | H. G. Wells | 1895 | 2 | 1 March 1971 | |
A Study in Scarlet | Sir Arthur Conan Doyle | 1887 | 3 | 1 May 1971 | |
The War of the Worlds | H. G. Wells | 1898 | 3 | 14 March 1971 |
However, now there is the same problems as before: data is being repeated, which is both risky (because of the possiblity of errors) and wastes memory.
So, how about an alternative? The solution last time was to create a new list. How about separating the dates, borrower IDs and dates returned from the books list, and placing it in one of its own?
ID | Title | Author | Published |
---|---|---|---|
1 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 |
2 | The War of the Worlds | H. G. Wells | 1898 |
3 | The Time Machine | H. G. Wells | 1895 |
4 | A Study in Scarlet | Sir Arthur Conan Doyle | 1887 |
Book ID | Borrower ID | Date | Returned |
---|---|---|---|
1 | 1 | 1 January 1971 | |
2 | 2 | 1 March 1971 | 12 March 1971 |
3 | 2 | 1 March 1971 | |
4 | 3 | 1 May 1971 | |
2 | 3 | 14 March 1971 |
It is true that this does not make it any easier for people to work with – now they have to scan through three lists to work out who has borrowed a book, and they need to check to see if it has been returned as they go – but this has reduced the incidence of repeated data, improving the reliability and potential size of the system. Reliability is one of the primary priorities for database design.
Bringing it together
editThis final design is a relational database. Indeed, the design could be readily implemented in a relational database such as MySQL, Oracle, Microsoft Access or SQL Server with no significant modifications at all. The "relations" part of the "relational database" name comes from the IDs: the Books list is related to the Loans list by the Book ID, and the Loans list is related to the Borrowers by the Borrower ID. There is still room for some refinement, but if you can understand how these lists (or "tables" as they should be called in a database) relate to each other then you've passed the biggest hurdle.
Tasks to try
editAfter the above discussion, we now have a basic database consisting of three tables and a number of records:
ID | Name | Phone | |
---|---|---|---|
1 | Sam | sam@internet | 1234-5678 |
2 | Alex | alex@internet | 2345-6789 |
3 | Kim | kim@internet | 3456-7890 |
4 | Alex | otheralex@internet | 4567-8901 |
ID | Title | Author | Published |
---|---|---|---|
1 | Alice's Adventures in Wonderland | Lewis Carroll | 1865 |
2 | The War of the Worlds | H. G. Wells | 1898 |
3 | The Time Machine | H. G. Wells | 1895 |
4 | A Study in Scarlet | Sir Arthur Conan Doyle | 1887 |
Book ID | Borrower ID | Date | Returned |
---|---|---|---|
1 | 1 | 1 January 1971 | |
2 | 2 | 1 March 1971 | 12 March 1971 |
3 | 2 | 1 March 1971 | |
4 | 3 | 1 May 1971 | |
2 | 3 | 14 March 1971 |
Using these tables, there are a number of things you can try.
Question 1
editWhen was "The Time Machine" borrowed?
Solution
|
---|
Looking in the Books table, the ID of "The Time Machine" is "3". Looking up 3 in the Loans table under "Book ID" reveals one record, which confirms that the book was borrowed on 1 March 1971. |
Question 2
editHow many books has Alex with the email address otheralex@internet borrowed?
Solution
|
---|
There are two people with the same name, Alex, so relying on the name wouldn't be enough. However, using the email address, the Borrowers table reveals that the Alex with that email has the ID of 4. Looking at the Borrower ID column in the Loans tables reveals that there are no loans for a person with the ID of 4. Therefore, Alex with the email address otheralex@internet has not borrowed any books. |
Question 3
editWho has borrowed the copy of H. G. Wells' "The War of the Worlds".
Solution
|
---|
Looking in the Books table, the ID of "The War of the Worlds" is "2". Looking up 2 in the Loans table under "Book ID" reveals two records. However, only the second record shows that the book hasn't been returned, and that is by a borrower with the ID of "3". Checking the Borrowers table reveals that the borrower with that ID is Kim. Therefore Kim currently has a copy of the book. Given that Kim borrowed it in 1971, it is probably overdue. |
Question 4
editHow many books Kim has currently borrowed.
Solution
|
---|
Kim's ID in the Borrowers table is "3". There are two records in the Loans table for someone with that ID, neither of which has been returned. Therefore Kim has borrowed two books. (Note that there was no need to look into the Books table). |
Question 5
editWhat would you need to change in the database in order for Sam to return her book, and for Kim to borrow it after her.
Solution
|
---|
You would need to change the record in Loans representing Sam's loan (Borrower ID 1, Book ID 1) to include a return date, and then add a new record to the Loans table, with the book ID (1), Kim's ID (3), and the date the book was borrowed. Technically you could do this without accessing the Books table, but you would need to find out both Kim and Sam's IDs from the Borrowers table. |
Question 6
editAdd a new borrower (Jim, jim@internet, ph 9876-5432) and a new book (Dracula).
Solution
| ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Add to Borrowers:
Technically you could use any ID not currently being used, but it is easier just to go with a sequential number, as that way it won't repeat. To the Books table, you need to add:
|
Question 7
editMake a new table ("Authors") to contain a list of authors, and relate it back the "Books" table. (This one is more challenging, but would be a good step towards understanding relational databases).
Solution
| ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
You will need to make two changes. First, create a new Authors table consisting of Name and ID:
Then you need to adjust the Books table to use an Author ID rather than an author's name:
This will,once again, make it more difficult to find data, as you need to join more tables together. But it means that you only need to spell an author's name correctly once, and it uses less memory (so long as enough authors have written more than one book in order to make it worthwhile). |
Next steps
editOnce this makes sense, as I hope that it will, it is time to move on to looking at how to design a relational database.
References
edit- ↑ Search optimisation works well if records can be placed into order first. If I wanted to know if the letter "C" was in the sequence "G,H,I,R,T,V,A,W,Z", as an example, I would need to read each of the values to check. But if they were in alphabetical order, "A,G,H,I,R,T,W,Z", I would know it wasn't there the instant I reached "G", as I would have already found it if it was present.
- ↑ Sometimes it is not possible to find an existing unique value, or creating one would just waste more memory. Instead, you can use what is referred to as a "composite key", where a combination of two or more values provides a unique identifier. But don't worry about that now, as it is something that won't be relevant until later.