Database Modeling/Management

Wednesday, October 28, 2009

Wrapping Up...

Questions I answered:

What is a database?
Why do we use databases?
Why are relational databases better than spreadsheets?
What are database relationships?
What is database modeling?

Resources:

Scholarly: Modern Database Management [9th edition] by Jeffrey A. Hoffer, Mary B. Prescott, and Heikki Topi
- The textbook I used for the class "Database Management" at the University of Maine at Presque Isle. Pearson Education was responsible for the content, which is a very good education program, so the information is reliable. This book alone could answer all of the questions I had. It deals with everything database-related.
Newspaper: Q/A: Youth Runaways: http://thelede.blogs.nytimes.com/2009/10/26/q-and-a-youth-runaways/?scp=2&sq=database&st=cse
- A New York Times article about how there is a database available publishing numbers that people concerned about teenage prostitution can call. I located it by searching for the word "database" in the... online database of newspaper articles. A reporter in the New York Times was responsible for the information. It has less to do with their knowledge of databases, but rather the fact that databases are used in almost everything. It helps answer "Why use databases," considering it is a very convenient way to organize lots of information and publish it.
Popular: Tech FAQ - http://www.tech-faq.com/database.shtml
- Tech FAQ is, I believe, a blog of some sort answering questions about technology. There is really no credibility to this, but I've used them before, and they have pretty extensive answers to common and uncommon technology questions. They are very good at explaining things in easy-to-understand terminology.
FAQ: Access Web - http://www.mvps.org/access/tencommandments.htm
- This is an amazing site! It is devoted to FAQ's about Microsoft Access, and seems to have many good instructions on how to do specific things inside Access. While this doesn't really help answer my questions, it was very helpful in my "General Database Rules..." post.

Terms:

Data - any collections of information that we find relevant or valuable
Database - an organized collection of logically related data
Relational Database - the type of database that establishes the relationships between tables by means of common fields
Data Normalization - eliminating all repeating data
Data Type - the classification of what type of data appears in the cell
Database Management System (DBMS) - a program that is used to create, manage, and search databases
Query - the way a DBMS searches a database for information
Overhead - when you basically say "I don't want to assign the data type of this piece of data, so basically just make sure it has enough room to be whatever it wants." (The space you take up by not defining space that isn't actually being used)
Normalization - the process of breaking data into smaller pieces without repeating data
Entity - a fancy way of saying table
Entity-Relationship (ER) Diagram - a diagram relating the tables of a database to each other with symbols expressing whether the relationship is one-to-many, one-to-one, or many-to-many
Child Table - a table that only exists to connect two many-to-many relationships
Keys - fields that link to other tables
Primary Keys - the number or code that MUST be unique with every row of data

Professional Blogs:

Database Management and Analytic Technologies in a Changing World - http://www.dbms2.com/
- I chose this blog because it is all about Database Management. However, he really took it more into the "Analytic Technologies" direction. I wouldn't recommend this blog for someone who is trying to learn more about database management. It all is going over my head. It is very focused on extremely new technologies.
  I would consider it a good blog by the criteria listed, however. The blog has existed since August 2005, and has over 80 pages of posts. He never seemed to lose his zeal for it. He posts very frequently. His posts are relevant, they are just a tad... over the top. It does, however, provide information on where database management is headed, which is always useful in technology. It's also easy to find information because he's categorized it well.
DataGeek - http://dbageek.blogspot.com/
- This blog I chose also because it's very close to the topic... data and data management. A few of the posts he had earlier were actually very relevant to the topic, and very helpful. I would recommend following... maybe that would encourage him to start posting. According to the blog review, this blog isn't very good at all. He almost never posts. It existed from 2005 to 2006, so it is definitely old, but... short-lived. The posts are pretty relevant, but... they stopped coming in 2006. It DID provide practical information about Microsoft Access, which is always nice. It's easy to locate information, only because it is so sparse.

General Database Rules...

Here are a few rules that should help clear up any confusion when taken in context with the rest of the blog.

Databases are supposed to have unique data. There is no point in having the same data twice. This doesn't mean it shouldn't be very similar. Occasionally you'll have two sets of the same data but with a different time stamp. Like a patient could come in with the same symptoms and the same diagnosis and the same treatment at two different times.
While typically tables can have many keys (fields that link to other tables), every table should have one primary key (with the exception of child tables). A primary key is the number or code that MUST be unique with every row of data. It is also typically the field used to reference the other data from other tables, and to link the tables together. Child tables are different only because they have two fields that, in combination, make the primary key. While a code can appear more than once in a child table (that is the point of a child table), the combination of the components must be unique.
Other "rules" that I don't feel are as important but are considered convention can be found at The Access Web, in Ten Commandment form!

Database Modeling is the act of planning a database. Obviously, so far the relationships seem really confusing. This is because quite a bit of planning usually goes into database construction. The diagram below is an example of an old-fashioned drawn ER-Diagram (or an Entity-Relationship Diagram, which is a diagram relating the tables of a database to each other with symbols expressing whether the relationship is one-to-many, one-to-one, or many-to-many).

Each circle here is an "Entity," which is just a fancy way of saying table in this case. Each line represents a relationship, and each ratio represents the type of relationship. This appears to be a database of road law violators.

If you see the line between "Citation Notice" and "Location," you will see "1:1." This means that each citation notice will only have one location. This makes sense... right? You aren't going to have the same violation instance in two places, or else it would be two different instances. However, it seems to me that this relationship should be "M:1" as per the notation here, because you can have many different violation instances in the same location. Relationships that are 1:1 typically are pointless in database construction, because this says that each time you have a violation you are going to have a different location, and if that were the case you would just put that as a field in "Citation Notice" rather than as a separate table. However, this is just my opinion, and it wasn't my database. That's what I get for googling "ER diagram."

Now look at the line between "Citation Notice" and "Violation." That says "M:M." This makes sense, because the "Violation" table would be just the types of violations that are possible, and the "Citation Notice" table is all of the actual occurrences of the violations. At one instance a person could be committing several violations, and obviously violations are going to be committed in more than one instance by more than one person. Many-to-many relationships in databases pose a funny problem. This may seem confusing, but you are basically saying that we should put several violations connecting with several citation notices. This is impossible. In each table you only have one field to make the connection. As a result, with many-to-many relationships one has to create a child table, or a table that only exists to connect two many-to-many relationships. This table would typically be called "Violation to Citation Notice" or vise versa, and would simply have one field for violationID and one field for citationNoticeID. The uniqueness wouldn't occur with a primary key (mentioned "General Database Rules..."), but rather with both of the fields in combination. So the data would still be unique (which is the point of a database), it just wouldn't be in one column.

Below there is a more advanced ER Diagram.

The interesting things about this diagram is that (1.) this diagram is created automatically from access when you create relationships via linking keys, and (2.) this includes the field values of each table and illustrates which fields are linking with other tables, and which fields are receiving the link.

In the recent past, database modeling and diagramming has lost emphasis, because with more complex data structures, like hospital data needs, planning is only going to get you so far. It is also usually a tedious process. More emphasis has been placed on experimenting with data structures to see what works. Also, data structures are available for common data needs, like hospitals, banks, and small businesses.

The following is an excellent video on how to take a business' data needs and convert it into a database diagram:

Tuesday, October 27, 2009

Explaining Database Relationships...

Yeah, database relationships are kind of confusing. To illustrate these relationships, we are going to use Microsoft Access, a very powerful DBMS (or Database Management System, which is a program that is used to create, manage, and search databases).

So consider the following scenario (which I've wanted to start, so this is a perfect opportunity): I'm trying create a database of all of the scripture I read that helps with certain things we experience in life. Below is one table with some data that will be included.

This looks like a spreadsheet, no? Well, it's actually a table that is part of a database in Access. If you look at it, there is repeated data. Three out of four books are Psalms, and two of the four topics are about Joy. This data can be normalized, or broken into smaller pieces without repeating data. Take a look at the diagrams below.

I created two more tables. One that references each book, and another that references each topic. Finally, I replaced the text in both of the columns in my first table (Reference) to the number that corresponds to the number in the tables. Because I named the fields in "Reference" BookID and TopicID, I also named them BookID and TopicID in their respective tables. This implies the relationship between the numbers and the text that is in the tables that contain these same fields.

What's so Great About a Relational Database?

Take a look at the spreadsheet below. This is for an actual business selling fashion items like shoes and purses. This spreadsheet is every item in this client's inventory. While it might be a little overwhelming at first, it makes sense. There is a column for the item title, inventory number, price, etc. It also has columns (if you scroll way to the right) like classification, supplier, warehouse, ad template name, posting name, etc. that has a lot of repeated data.

Let's go over the problems with this method of keeping a database.

There is an incredible amount of data that is repeated. The Classification, Warehouse, Supplier, Ad Template Name, Posting Template Name, and many other fields all have data that is repeated over and over again. This is bad for the following reasons:
1. Repeated information is hard to change easily. What if this client had to change all of the products that said "custom sandals" to "custom summer wear?" She would have to go through a very messy find-and-replace, which would be much harder than it sounds, considering this particular spreadsheet contains over 2000 rows (which I had to cut down to upload). This problem has occurred, and because it is an inconvenience to change everything in the spreadsheet, the client resorts to changing all of the items "from here on." Hopefully we can all see why this is a bad idea. We then have to write programs in order to interpret whether the cell says one thing OR the other, in every situation interpretation occurs. Obviously not a fun time. (This can be changed with a relational database because you only have to change the text in one table, because all of the occurrences in the large table would be just numbers referring to whatever value is in the other table. Change the value in that one place, it changes throughout.)
2. Mistakes are easily made when data input is all text rather than, say, a dropdown. While a dropdown can be created when a program has values to choose from (like a table of values), if it's just one big spreadsheet all input has to be straight text. This is very conducive to errors. Considering this spreadsheet is on her website and is subject to searching, this causes many problems. Let's say she accidentally puts "Custom Booots" instead of "Custom Boots." That means when someone searches for "boots" that won't come up. Also considering that copy and paste is commonly used in data entry, this could make for all of the entries in a session to be left out of a very important search. (This doesn't occur in a relational database because rather than everything being text, you have the options that appear in the table that connects with the integers in your main table.)
There is very little focus on data type with spreadsheets. For example, if you scroll over to column AM (Attribute13Value), you will see the apparent height of the shoe. Let's say I'm a female that is looking for some shoes, but I absolutely cannot do shoes over 2". I can search for shoes that are 0 to 1/2", I can search for shoes that are 1 3/4" to 2 3/4", but I can't compare values. I can't "sort by shoe height." While in this case it's not that big of a deal considering there are only three categories, comparisons can't be made when all of the data type is text. Data types are simply the classification of what type of data appears in the cell. If all of the information is of the data type text, which is often the case with spreadsheets, when you "order by" a column it puts the numbers in alphabetical order. So, essentially, higher-level queries (a query being essentially the way a DBMS searches a database for information) are much more difficult and often impossible. (This isn't the case in Relational Databases because data types are much more easily assigned in databases and you can essentially do any query you want, like I want shoes that are between the height of 2" and 3.5" and are between the sizes of 8.5 and 9.5, which would never happen if you are just dealing with text.)
While creating a database can initially make the size much larger, once you get to databases of this size you would save a lot more space with a database rather than a spreadsheet type of database which uses a lot of overhead. Overhead is when you basically say "I don't want to assign the data type of this piece of data, so basically just make sure it has enough room to be whatever it wants." This isn't a good idea. When you know that you are just going to be putting shoe size in a column, there is no need to leave it the data type of "text." This data type takes up a lot of space, because it has to make sure that "just in case the user puts in a lot of data, the room is available," even if the user knows that more than a certain amount will never be used. The increased overhead and ultimately size of the database results in slower search queries, as well as inserting or deleting from the database. (This isn't the case with Relational Databases because typically when you create the database you are very specific with which data type is going to be in that field, and how much space it will need.)

Overall, while it might take a little bit of time to go through the normalization procedure, especially with a project like this, it would be well worth it for the advantages of having multiple tables as a Relational Database rather than just one table in a database, which is essentially a spreadsheet.

(While I only took the time to thoroughly explain Relational Databases, there are other types of databases, including flat, hierarchical, and network databases. You can find out more about these types of databases at techFAQ.com.)

What is a Database? Why Do We Use Them?

Before we get into what a database is, we should probably clarify what data are. Data are stored representations of objects and events that have meaning and importance in the user’s environment. Yeah, it’s hard to define concepts like data, and when you try, it ends up sounding so general that it could include anything. Basically, data are any collections of information that we find relevant or valuable. Your name, address, and social security number are data.

Now that we know what data are, how do we organize and collect data? This is the purpose of a database. A database is an organized collection of logically related data. This definition is much more easily graspable. It’s a collection of data that relate to each other. This isn’t as complicated as some people may think. A teacher’s hard copy gradebook is technically a database. It holds the data of the students’ grades and how they calculate into the final grade. A wedding guest book is a database. Folders of receipts could be considered a database. Anything that you use as a reference to look at data collected is considered a database.

So… what makes this type of database different from the database we always hear about? "Our database crashed, but thankfully we had a couple backups." Articles citing "databases of phone numbers" used in tracking crime. Most of the databases we hear about today are computer-based. These databases can be accessed automatically with programs and websites to show information to the user. So while with a hard copy gradebook the teacher would have to open up the book and consult the hard data, typically with computer-based databases another program accesses it and displays it in a nice fashion. For example, facebook has a (pretty extensive) database with all of the users, their information, and the relationship between users. Obviously if we looked at the database we would just see lines of information, but facebook’s programs access the database and serve it up nice and pretty.

What are your options with creating a database, you ask (or even if you didn’t)? Well, one way you can create a database easily is with Microsoft Excel. Many people create databases with excel, putting information like bookkeeping figures, class information, hour logs, inventory, etc. into spreadsheets. This is obviously data, correct? And it’s being collected in one spot that relates it, correct? Therefore, we consider it one type of database. However, there is another type of database that is most commonly thought of when the term is used. A Relational Database, the type of database created in Microsoft Access, establishes the relationships between entities by means of common fields included in a file, called a “relation.” WOAH! We just skipped over a lot of information. Basically the difference between a relational database and the databases we’ve been talking about are the following…

While “traditional” databases simply have columns of information and look more like spreadsheets, relational databases have what we call tables. These tables are basically like mini-spreadsheets. Instead of having one huge spreadsheet with all of the information in it, much of it being repeated, you have many that all relate to each other. So a personal finance table instead of saying "food" three dozen times in the same column and "clothing" two dozen times, it will say "1" and "2" in the respective columns. This table then relates to another table, called... say... "Expenditure Type" that has the code "1" in one column and what it means, "food" in the second column. This same concept is used for every part of data in a table that might be repeated, to eliminate repeated data (this process of eliminating all repeating data is called data normalization). So a different table would be created called "Restaurants," and a code would be given to each, and perhaps a different table would be created for each member of the family who might spend money. While this process may seem to cause more trouble for the creator, a relational database is MUCH more useful than a spreadsheet-like database, and will be much less of a hassle in the long run. (To see why, read the later post, "What's so Great About a Relational Database?")

To conclude, a database is basically a collection of information that you want to store. The kind of databases that are used on computers are typically relational database. A relational database is a database which includes several tables of data and links them together rather than repeating the same data over and over.