Relational Databases/Introduction

Introduction

edit

Relational databases are a common and powerful approach to storing information. They make efficient use of computer storage by connecting sets of data together. In this course we are going to engage in a quick introduction to relational databases, starting from the basics of how they work and finishing at a point where the reader should be able to build their own simple database in an SQL-based relational database management system (RDMS), such as MySQL, SQL Server or PostgreSQL.

Some background: how data is stored in a flat file database

edit

I would like to propose a simple database: a small database for a personal library, where I hope to keep track of who has borrowed my books. To do so, I want to keep track of:

  • Borrower's name
  • Borrower's email
  • Borrower's phone number
  • When the book was borrowed
  • Title
  • Author
  • Year published

Of course, a more complex database would store more data, but for a personal library that might be enough. This will let me know what the book is, (title, author and year of publication), who has it, when they took it, and how to contact them in order to ask for it back.

One method of tracking of these records is a spreadsheet. In a spreadsheet, each column could represent one of the fields above, while each row would cover one of the records. For example, I could use:

Loans
Name Email Phone Date Title Author Published
Sam sam@internet 1234-5678 1 January 1971 Alice's Adventures in Wonderland Lewis Carroll 1865
Alex alex@internet 2345-6789 1 March 1971 The War of the Worlds H. G. Wells 1898
Kim kim@internet 3456-7890 1 May 1971 A Study in Scarlet Sir Arthur Conan Doyle 1887

This is what is commonly referred to as a flat file database: a database in which all of the data is kept in a single file. It has the advantage that it is relatively easy to use. If I wished to know what book Alex had borrowed, I could quickly scan through the names column until I found Alex, then read across to find the title. Similarly, a computer could check the first column of each row until it found the desired name, and then quickly output the title of the book. There is a bit of a speed hit if there are a lot of records, as it could take a while to scan a file consisting of tens of thousands of entries, but I would get there in the end. And if needed, I could always optimise the data: for example, I could order the records by the name of the borrower or the title of the book, which would make searching even faster if I was only looking for one of those two fields.[1] Thus flat file databases are useful where either speed (given certain constraints) or simplicity is a concern, especially when the data set can be well defined and is small, or where searches are limited to just one field in an ordered set.

However, the limitations of this approach quickly become apparent. For example, what if Alex borrows two books at once? There are two options:

  1. Create a second record, repeating all of the information about Alex
  2. Add more columns to allow for more books

The first option creates:

Loans (Version 2)
Name Email Phone Date Title Author Published
Sam sam@internet 1234-5678 1 January 1971 Alice's Adventures in Wonderland Lewis Carroll 1865
Alex alex@internet 2345-6789 1 March 1971 The War of the Worlds H. G. Wells 1898
Kim kim@internet 3456-7890 1 May 1971 A Study in Scarlet Sir Arthur Conan Doyle 1887
Alex alex@internet 2345-6789 1 March 1971 The Time Machine H. G. Wells 1895

Unfortunately, this introduces a number of concerns, although for now the focus on but two. First, by repeating data, the possibility of errors has increased. Every time the same information has to be entered, there is a chance that something will go wrong. And this problem becomes worse when Alex's details need to be updated, as those details must be updated in every one of the rows related to Alex. Databases need to be reliable, and anything which can increase the possibility of errors is a bad thing. Second, it wastes memory. The system already has Alex's details on record – now it has the same data twice, using up twice the memory.

Alternatively, more columns be added per record:

Loans (Version 3)
Name Email Phone Date 1 Title 1 Author 1 Published 1 Date 2 Title 2 Author 2 Published 2
Sam sam@internet 1234-5678 1 January 1971 Alice's Adventures in Wonderland Lewis Carroll 1865
Alex alex@internet 2345-6789 1 March 1971 The War of the Worlds H. G. Wells 1898 1 March 1971 The Time Machine H. G. Wells 1895
Kim kim@internet 3456-7890 1 May 1971 A Study in Scarlet Sir Arthur Conan Doyle 1887

This makes the risk of errors smaller, but it is still wasting memory: all of those empty fields take up memory in the database, and most of them are unused. (Many databases will reserve a certain amount of memory per field, irrespective of what is actually in it – variable width fields do exist, but many fields are not variable, and even those that are have to commit some memory to the task). Plus, what happens if Alex now decides to borrow a third book? I could add a third set of columns, but then the memory usage is even worse - and, of course, that places a limit of three books per user. What if someone wants to borrow four?

Thus it is time for an alternative approach.

An alternative approach

edit

If we stop thinking about computers, and instead think about how this would be handled this in real life, it seems that the solution is simple: create two sets of records. One as a list of "borrowers", containing their names, email addresses and phone numbers, and a second containing a list of books with the names of those who borrowed them. Implementing this using the above data, we get:

Borrowers
Name Email Phone
Sam sam@internet 1234-5678
Alex alex@internet 2345-6789
Kim kim@internet 3456-7890
Books
Title Author Published Name Date
Alice's Adventures in Wonderland Lewis Carroll 1865 Sam 1 January 1971
The War of the Worlds H. G. Wells 1898 Alex 1 March 1971
The Time Machine H. G. Wells 1895 Alex 1 March 1971
A Study in Scarlet Sir Arthur Conan Doyle 1887 Kim 1 May 1971

This is much more effective that the flat file database: it has the potential to use less memory (although there is a need to include the borrower's name in two places); there is less information being repeated; and there is no limit as to how many books someone can borrow. There is a cost, of course: using it has become harder. For example, to find out who to ring to chase up a copy of The Time Machine, I would need to look in the list of books, find it, read the name of the borrower, then scan the list of borrowers until I find their name and look up their phone number. Not impossible, but certainly more difficult than with the flat file. (Doing this, in the language of relational databases, is called "joining" the two tables, and results in a speed hit).

There another problem, though. What would happen if there were two people called Alex? In such a situation, the system would collapse: there would be no means of telling if the Alex referred to in the books table is the first Alex in the borrowers table or a different Alex. The name is not sufficiently unique to identify all of the potential people who could borrow books. What is needed is something unique - a code to represent each borrower that would be different for each one. This, in computing terms, is called a unique identifier, and in databases a unique identifier can be a primary key (PK): a unique code that is the primary means of identifying a record in the database.[2] A common example is the registration number for a car, or an ID given to each student at a school.

The simplest solution to creating a unique identifier is to just number each record, using a numeric sequence starting at 1. Doing so will provide the unique identifier that the database needs, but it also has a neat side-effect: in computing numbers are generally much smaller than names, and thus the result is a reduction in how much memory is required to refer to the different records.

Borrowers (with IDs)
ID Name Email Phone
1 Sam sam@internet 1234-5678
2 Alex alex@internet 2345-6789
3 Kim kim@internet 3456-7890
4 Alex otheralex@internet 4567-8901
Books
Title Author Published Borrower ID Date
Alice's Adventures in Wonderland Lewis Carroll 1865 1 1 January 1971
The War of the Worlds H. G. Wells 1898 2 1 March 1971
The Time Machine H. G. Wells 1895 2 1 March 1971
A Study in Scarlet Sir Arthur Conan Doyle 1887 3 1 May 1971

This does make it a bit harder for people to read, perhaps, but computers won't suffer from the same problem. They can handle the numbers as easily as they can handle words.

Expanding the model

edit

This is good, but it is still not perfect. One problem is that there is no way of knowing who borrowed the book last – the system doesn't retain a borrowing history. It has the date the book was borrowed on (which will presumably be blank if it has been returned), but as there is only one date box, we need to overwrite that date each time it is borrowed.

On solution to this problem could be to modify the "Books" list. For example, we could add a "returned" column so we know when it came back, and just add a new record when it is borrowed again.

Books
Title Author Published Borrower ID Date Returned
Alice's Adventures in Wonderland Lewis Carroll 1865 1 1 January 1971
The War of the Worlds H. G. Wells 1898 2 1 March 1971 12 March 1971
The Time Machine H. G. Wells 1895 2 1 March 1971
A Study in Scarlet Sir Arthur Conan Doyle 1887 3 1 May 1971
The War of the Worlds H. G. Wells 1898 3 14 March 1971

However, now there is the same problems as before: data is being repeated, which is both risky (because of the possiblity of errors) and wastes memory.

So, how about an alternative? The solution last time was to create a new list. How about separating the dates, borrower IDs and dates returned from the books list, and placing it in one of its own?

Books
ID Title Author Published
1 Alice's Adventures in Wonderland Lewis Carroll 1865
2 The War of the Worlds H. G. Wells 1898
3 The Time Machine H. G. Wells 1895
4 A Study in Scarlet Sir Arthur Conan Doyle 1887
Loans
Book ID Borrower ID Date Returned
1 1 1 January 1971
2 2 1 March 1971 12 March 1971
3 2 1 March 1971
4 3 1 May 1971
2 3 14 March 1971

It is true that this does not make it any easier for people to work with – now they have to scan through three lists to work out who has borrowed a book, and they need to check to see if it has been returned as they go – but this has reduced the incidence of repeated data, improving the reliability and potential size of the system. Reliability is one of the primary priorities for database design.

Bringing it together

edit

This final design is a relational database. Indeed, the design could be readily implemented in a relational database such as MySQL, Oracle, Microsoft Access or SQL Server with no significant modifications at all. The "relations" part of the "relational database" name comes from the IDs: the Books list is related to the Loans list by the Book ID, and the Loans list is related to the Borrowers by the Borrower ID. There is still room for some refinement, but if you can understand how these lists (or "tables" as they should be called in a database) relate to each other then you've passed the biggest hurdle.

Tasks to try

edit

After the above discussion, we now have a basic database consisting of three tables and a number of records:

Borrowers
ID Name Email Phone
1 Sam sam@internet 1234-5678
2 Alex alex@internet 2345-6789
3 Kim kim@internet 3456-7890
4 Alex otheralex@internet 4567-8901
Books
ID Title Author Published
1 Alice's Adventures in Wonderland Lewis Carroll 1865
2 The War of the Worlds H. G. Wells 1898
3 The Time Machine H. G. Wells 1895
4 A Study in Scarlet Sir Arthur Conan Doyle 1887
Loans
Book ID Borrower ID Date Returned
1 1 1 January 1971
2 2 1 March 1971 12 March 1971
3 2 1 March 1971
4 3 1 May 1971
2 3 14 March 1971

Using these tables, there are a number of things you can try.

Question 1

edit

When was "The Time Machine" borrowed?

Solution

Looking in the Books table, the ID of "The Time Machine" is "3". Looking up 3 in the Loans table under "Book ID" reveals one record, which confirms that the book was borrowed on 1 March 1971.

Question 2

edit

How many books has Alex with the email address otheralex@internet borrowed?

Solution

There are two people with the same name, Alex, so relying on the name wouldn't be enough. However, using the email address, the Borrowers table reveals that the Alex with that email has the ID of 4. Looking at the Borrower ID column in the Loans tables reveals that there are no loans for a person with the ID of 4. Therefore, Alex with the email address otheralex@internet has not borrowed any books.

Question 3

edit

Who has borrowed the copy of H. G. Wells' "The War of the Worlds".

Solution

Looking in the Books table, the ID of "The War of the Worlds" is "2". Looking up 2 in the Loans table under "Book ID" reveals two records. However, only the second record shows that the book hasn't been returned, and that is by a borrower with the ID of "3". Checking the Borrowers table reveals that the borrower with that ID is Kim.

Therefore Kim currently has a copy of the book. Given that Kim borrowed it in 1971, it is probably overdue.

Question 4

edit

How many books Kim has currently borrowed.

Solution

Kim's ID in the Borrowers table is "3". There are two records in the Loans table for someone with that ID, neither of which has been returned. Therefore Kim has borrowed two books. (Note that there was no need to look into the Books table).

Question 5

edit

What would you need to change in the database in order for Sam to return her book, and for Kim to borrow it after her.

Solution

You would need to change the record in Loans representing Sam's loan (Borrower ID 1, Book ID 1) to include a return date, and then add a new record to the Loans table, with the book ID (1), Kim's ID (3), and the date the book was borrowed.

Technically you could do this without accessing the Books table, but you would need to find out both Kim and Sam's IDs from the Borrowers table.

Question 6

edit

Add a new borrower (Jim, jim@internet, ph 9876-5432) and a new book (Dracula).

Solution

Add to Borrowers:

Borrowers
ID Name Email Phone
5 Jim jim@internet 9876-5432

Technically you could use any ID not currently being used, but it is easier just to go with a sequential number, as that way it won't repeat. To the Books table, you need to add:

Books
ID Title Author Published
5 Dracula Bram Stoker 1897

Question 7

edit

Make a new table ("Authors") to contain a list of authors, and relate it back the "Books" table. (This one is more challenging, but would be a good step towards understanding relational databases).

Solution

You will need to make two changes. First, create a new Authors table consisting of Name and ID:

Authors
ID Name
1 Lewis Carroll
2 H. G. Wells
3 Sir Arthur Conan Doyle

Then you need to adjust the Books table to use an Author ID rather than an author's name:

Books
ID Title Author ID Published
1 Alice's Adventures in Wonderland 1 1865
2 The War of the Worlds 2 1898
3 The Time Machine 2 1895
4 A Study in Scarlet 3 1887

This will,once again, make it more difficult to find data, as you need to join more tables together. But it means that you only need to spell an author's name correctly once, and it uses less memory (so long as enough authors have written more than one book in order to make it worthwhile).

Next steps

edit

Once this makes sense, as I hope that it will, it is time to move on to looking at how to design a relational database.

References

edit
  1. Search optimisation works well if records can be placed into order first. If I wanted to know if the letter "C" was in the sequence "G,H,I,R,T,V,A,W,Z", as an example, I would need to read each of the values to check. But if they were in alphabetical order, "A,G,H,I,R,T,W,Z", I would know it wasn't there the instant I reached "G", as I would have already found it if it was present.
  2. Sometimes it is not possible to find an existing unique value, or creating one would just waste more memory. Instead, you can use what is referred to as a "composite key", where a combination of two or more values provides a unique identifier. But don't worry about that now, as it is something that won't be relevant until later.