Installing and using Apache Cassandra With Java Part 2 (Data model)
Welcome to the second part of Installing and Using Apache Cassandra With Java. The first part got quite a lot of attention, and I’ve done my best to update it with all the new information that you readers sent me.
The second part is less practical and more theoretical. Cassandra is such a different concept from the well-known RDBMS databases that we first need to focus on some of the concepts it uses. Since I also have a history with RDBMS databases, I had to learn some new concepts myself before I could continue. One of the key differences between Cassandra and a regular RDBMS database is the data model; in a standard RDBMS database we have concepts like tables, columns, constraints, indexes, a query language, etc.
Well, it just feels like learning it all over again. It took me quite a few hours before I could grasp the data model used in Cassandra, and although I think I now understand what the key components are, I still have to make a mind switch in how I organize my data structures. So I am not there yet, but I can explain what the various data containers are. I will start with the simplest one and keep going until we have covered them all. Since this is Apache Cassandra with Java, I explain each container using a diagram and an example of how it would look if you translated it directly to a Java class structure. I hope this speeds up getting a better understanding.
Column:
A column is also referred to as a tuple (triplet) that contains a name, value and a timestamp. This is the smallest data container there is. The following is a Java representation of the column. This becomes convenient to explain the more complex structures.
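A minimal sketch of such a container could look like this. It assumes raw byte arrays for the name and value and a long for the timestamp; the of helper and the two AsString methods are hypothetical conveniences for the examples, not part of Cassandra.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: this mirrors the data model, not Cassandra's internal classes.
class Column {
    final byte[] name;     // column name, stored as raw bytes
    final byte[] value;    // column value, stored as raw bytes
    final long timestamp;  // time of the write

    Column(byte[] name, byte[] value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    // Hypothetical convenience factory so examples can use plain strings.
    static Column of(String name, String value) {
        return new Column(name.getBytes(StandardCharsets.UTF_8),
                          value.getBytes(StandardCharsets.UTF_8),
                          System.currentTimeMillis());
    }

    String nameAsString()  { return new String(name, StandardCharsets.UTF_8); }
    String valueAsString() { return new String(value, StandardCharsets.UTF_8); }
}
```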
As you can see, it is a very simple container, a name, a value and a timestamp. The name and value technically have no limit in size.
SuperColumn:
A SuperColumn is a tuple with a name and a value; it doesn’t have a timestamp like the Column tuple. Notice that the value is in this case not a binary value but more of a Map-style container. The map contains key / Column combinations. What is important here is that the key has the same value as the name of the Column it refers to. So to put it simply, a SuperColumn is a container for one or more Columns. You will see that this also makes a big difference later on when we discuss the ColumnFamily and SuperColumnFamily.
If we express a SuperColumn in Java, we get something like the following:
public class SuperColumn {
    Byte[] name;
    // The key is equal to the name of the Column
    Map<Byte[] /* key */, Column> value = null;
}
Although it is still not that complex, it might be nice to see an example of it using Java:
// Simplified: plain strings stand in for the Byte[] names and values.
SuperColumn sc = new SuperColumn();
sc.name = "person1";
sc.put("firstname", new Column("firstname", "Ronald"));
sc.put("familyname", new Column("familyname", "Mathies"));
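The snippet above is shorthand, so here is a version that should compile and run as a rough sketch. It assumes String names and values instead of Byte[] (Java arrays compare by identity, so they make poor HashMap keys), and the put helper and SuperColumnDemo class are hypothetical additions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only, not Cassandra's internal classes.
class Column {
    final String name;
    final String value;
    final long timestamp = System.currentTimeMillis();

    Column(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

class SuperColumn {
    final String name;
    // column name -> Column; note there is no timestamp on the SuperColumn itself
    final Map<String, Column> value = new LinkedHashMap<>();

    SuperColumn(String name) { this.name = name; }

    // Keying by column.name guarantees the key equals the Column's name.
    void put(Column column) { value.put(column.name, column); }
}

class SuperColumnDemo {
    static SuperColumn person1() {
        SuperColumn sc = new SuperColumn("person1");
        sc.put(new Column("firstname", "Ronald"));
        sc.put(new Column("familyname", "Mathies"));
        return sc;
    }

    public static void main(String[] args) {
        SuperColumn sc = person1();
        // prints "person1 -> [firstname, familyname]"
        System.out.println(sc.name + " -> " + sc.value.keySet());
    }
}
```

Letting put derive the key from the Column is one way to make the "key equals column name" rule impossible to break by accident.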
ColumnFamily:
A ColumnFamily is a structure that can hold a practically unlimited number of rows. For most people with an RDBMS background, this is the structure that most resembles a table. When you look at the diagram you can see that a ColumnFamily has a name (comparable to the name of a table) and a map with a key (comparable to a row identifier) and a value (which is a map containing Columns). The map with the Columns follows the same rule as in the SuperColumn: the key has the same value as the name of the Column it refers to.
This part was very confusing for me, so I tried to create a resemblance of it in Java to make it a bit clearer:
public class ColumnFamily {
    Byte[] name;
    // The key is a user generated key
    Map<Byte[] /* key */,
        // The key is equal to the name of the Column.
        Map<Byte[] /* key */, Column>> value = null;
}
As you can see, it becomes a bit hard to read now, but think of it in the following way: suppose I have an address book, and the name of the ColumnFamily is AddressBook.
ColumnFamily cf = new ColumnFamily();
cf.name = "AddressBook";
Now we need to add some data, so I want to store a person. We first create the row, then we add the various columns to the row, and finally we add the row to the ColumnFamily:
Map<Byte[], Column> row = new HashMap<Byte[], Column>();
// Make sure the key "firstname" is the same as the name of the column.
row.put("firstname", new Column("firstname", "Ronald"));
row.put("familyname", new Column("familyname", "Mathies"));
row.put("city", new Column("city", "Leiden"));
cf.put("person1", row);
// The second person gets a fresh row map.
row = new HashMap<Byte[], Column>();
// Make sure the key "firstname" is the same as the name of the column.
row.put("firstname", new Column("firstname", "Maomao"));
row.put("familyname", new Column("familyname", "Chen"));
row.put("city", new Column("city", "Leiden"));
cf.put("person2", row);
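To see the address book work end to end, here is a rough runnable sketch of the same idea. It assumes String keys and values instead of Byte[] so that lookups compare by content; the put and get helpers and the AddressBookDemo class are hypothetical and only illustrate the nesting, not how Cassandra stores or serves data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only, not Cassandra's internal classes.
class Column {
    final String name;
    final String value;
    final long timestamp = System.currentTimeMillis();

    Column(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

class ColumnFamily {
    final String name;
    // row key -> (column name -> Column)
    final Map<String, Map<String, Column>> rows = new LinkedHashMap<>();

    ColumnFamily(String name) { this.name = name; }

    // Creates the row on first use and keys the column by its own name.
    void put(String rowKey, Column column) {
        rows.computeIfAbsent(rowKey, k -> new LinkedHashMap<>())
            .put(column.name, column);
    }

    // Returns the column value, or null if the row or column is absent.
    String get(String rowKey, String columnName) {
        Map<String, Column> row = rows.get(rowKey);
        if (row == null) return null;
        Column column = row.get(columnName);
        return column == null ? null : column.value;
    }
}

class AddressBookDemo {
    static ColumnFamily build() {
        ColumnFamily cf = new ColumnFamily("AddressBook");
        cf.put("person1", new Column("firstname", "Ronald"));
        cf.put("person1", new Column("familyname", "Mathies"));
        cf.put("person1", new Column("city", "Leiden"));
        cf.put("person2", new Column("firstname", "Maomao"));
        cf.put("person2", new Column("familyname", "Chen"));
        cf.put("person2", new Column("city", "Leiden"));
        return cf;
    }

    public static void main(String[] args) {
        ColumnFamily cf = build();
        System.out.println(cf.get("person2", "firstname")); // prints "Maomao"
        System.out.println(cf.get("person3", "city"));      // prints "null"
    }
}
```

Reading a value is always two lookups: first the row key, then the column name.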
SuperColumnFamily:
Finally we have the largest container, the SuperColumnFamily. If you understand the ColumnFamily then this construction isn’t much harder: instead of having Columns in the innermost Map we have SuperColumns. So it just adds an extra dimension. As displayed in the image, the key of the Map that contains the SuperColumns must be the same as the name of the SuperColumn (just like with the ColumnFamily).
Here is another Java representation for this construction:
public class SuperColumnFamily {
    Byte[] name;
    // The key is a user generated key
    Map<Byte[] /* key */,
        // The key is equal to the name of
        // the SuperColumn.
        Map<Byte[] /* key */,
            SuperColumn>> value = null;
}
So basically it is the same as a ColumnFamily, but instead of a map that contains Columns we have a map that contains SuperColumns.
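The extra dimension is easiest to see in a small runnable sketch. Everything here is an assumption made for illustration: String keys instead of Byte[], the "home" and "work" SuperColumns, the city values, and the with and put helpers are all hypothetical and not taken from Cassandra itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only, not Cassandra's internal classes.
class Column {
    final String name;
    final String value;
    final long timestamp = System.currentTimeMillis();

    Column(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

class SuperColumn {
    final String name;
    final Map<String, Column> columns = new LinkedHashMap<>();

    SuperColumn(String name) { this.name = name; }

    // Hypothetical fluent helper: adds a Column keyed by its own name.
    SuperColumn with(String columnName, String value) {
        columns.put(columnName, new Column(columnName, value));
        return this;
    }
}

class SuperColumnFamily {
    final String name;
    // row key -> (super column name -> SuperColumn)
    final Map<String, Map<String, SuperColumn>> rows = new LinkedHashMap<>();

    SuperColumnFamily(String name) { this.name = name; }

    // Keys the SuperColumn by its own name inside the row.
    void put(String rowKey, SuperColumn sc) {
        rows.computeIfAbsent(rowKey, k -> new LinkedHashMap<>()).put(sc.name, sc);
    }
}

class SuperColumnFamilyDemo {
    static SuperColumnFamily build() {
        SuperColumnFamily scf = new SuperColumnFamily("AddressBook");
        // Made-up data: one person with two addresses.
        scf.put("person1", new SuperColumn("home").with("city", "Leiden"));
        scf.put("person1", new SuperColumn("work").with("city", "Amsterdam"));
        return scf;
    }

    public static void main(String[] args) {
        SuperColumnFamily scf = build();
        // Three lookups: row key, SuperColumn name, Column name.
        String city = scf.rows.get("person1").get("work").columns.get("city").value;
        System.out.println(city); // prints "Amsterdam"
    }
}
```

Compared with the ColumnFamily, a read now takes three lookups instead of two.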
Keyspaces:
Keyspaces are quite simple again. From an RDBMS point of view you can compare a keyspace to your schema; normally you have one per application. A keyspace contains the ColumnFamilies. Note, however, that there is no relationship between the ColumnFamilies; they are just separate containers.
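As a rough sketch, a keyspace can be pictured as one more map, from ColumnFamily name to ColumnFamily. The keyspace name "MyApplication", the "Appointments" ColumnFamily, the sample values and all helper methods are hypothetical, and the ColumnFamily is simplified here to plain column name / value pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified ColumnFamily: row key -> (column name -> value).
class ColumnFamily {
    final String name;
    final Map<String, Map<String, String>> rows = new LinkedHashMap<>();

    ColumnFamily(String name) { this.name = name; }

    void put(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new LinkedHashMap<>()).put(columnName, value);
    }
}

class Keyspace {
    final String name;
    // column family name -> ColumnFamily
    final Map<String, ColumnFamily> columnFamilies = new LinkedHashMap<>();

    Keyspace(String name) { this.name = name; }

    // Hypothetical helper: returns the named ColumnFamily, creating it on first use.
    ColumnFamily columnFamily(String cfName) {
        return columnFamilies.computeIfAbsent(cfName, ColumnFamily::new);
    }
}

class KeyspaceDemo {
    static Keyspace build() {
        Keyspace ks = new Keyspace("MyApplication");
        ks.columnFamily("AddressBook").put("person1", "city", "Leiden");
        ks.columnFamily("Appointments").put("person1", "next", "dentist");
        return ks;
    }

    public static void main(String[] args) {
        Keyspace ks = build();
        // Removing a row from one ColumnFamily does not touch the other,
        // even though both happen to use the row key "person1".
        ks.columnFamily("AddressBook").rows.remove("person1");
        System.out.println(ks.columnFamily("Appointments").rows.containsKey("person1")); // prints "true"
    }
}
```

In this picture, reusing the same row key in two ColumnFamilies is only a naming convention; nothing in the model links the two rows.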
So that’s about everything I can think of to explain about the data model. There is still one part I haven’t talked about, and that is the sorting mechanism of the various containers. We will take a look at that in part 3 of Installing and using Apache Cassandra With Java.
I hope this explanation makes it a bit clearer how the data model works in Cassandra. If there are parts that are unclear or confusing, please let me know so I can adjust them. If you have any ideas on how to make things clearer, I would like to hear them.
There’s a great article about the Super Columns and Columns stuff, which finally let me wrap my mind around the concepts. It’s at http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model (it’s a bit long).
Maybe I wouldn’t have needed to read it, had I seen your Java-ish examples.
That is a great article indeed; part of my knowledge about the data model came from that site (along with the explanation on the Cassandra wiki). However, I got confused by the JSON notation in the more complex cases. Putting it in a graphical representation and a Java example makes it a lot easier, I think, especially when you have Java knowledge.
Can you explain how to store data in the database?
I mean, my thinking is relational; I’ve been a normal database developer for over ten years. What I don’t get is how I should think about storing my data in the database for the common cases of:
A) Normal table.
B) Table with 1 to N mapping.
C) Table with N to N mapping.
….
@philip andrew
Yes, part 3 will cover that; there I will describe how to store data in the database and how to retrieve it. This part is to explain how the data model within Cassandra works. But one thing I have learned so far: let go of the whole RDBMS principle, because a lot of things that you can do in an RDBMS database don’t exist in Cassandra. The example for part three will cover a blogging database (which should be familiar to everybody). I will also add a sample Eclipse project. But it will take some time before I have put everything together. Like you, I have a relational database background, so it’s also confusing for me, but I’m getting the hang of it.
Congrats on doing such a fine job in explaining the Cassandra data model.
So you say that two ColumnFamilies in one keyspace are completely independent. This does not make much sense, since there is a “keyspace” term. So maybe there is some relation, at least inside Cassandra, which may impact storage efficiency or provide later support for foreign keys?
@andrewb
What I mean by that is that there is no relation between the data inside two different ColumnFamilies. By contrast, there is a relation between the data within a ColumnFamily (Columns and SuperColumns all relate to each other, since they have a parent / child relation).
It is, however, not possible (within Cassandra) to create a relationship between two different ColumnFamilies.
To make an analogy with an RDBMS database: there I can create two tables and a relationship between them, so that changing the data in one can have an effect on the data in the other.
This is something that is not possible in Cassandra (between ColumnFamilies).
Talking about support for foreign keys within Cassandra, I have a feeling that this will not be added between ColumnFamilies. You also see this in Google’s BigTable implementation: entity groups (similar to ColumnFamilies) don’t have a relation to other entity groups.
I got confused by the data model, for example the SuperColumn. As you say:
public class SuperColumn {
Byte[] name;
// The key is equal to the name of the Column
Map<Byte[] /* key */, Column> value = null;
}
But in the Cassandra source code there is no Map-type field in the class SuperColumn; instead it has a List field.
@ Michael Qi
The code examples are indeed not the way that Cassandra works internally; the reason I use these examples is to get the mindset straight. Looking at the Cassandra data model for the first time can be very confusing, and one method to make it clearer is presenting it in a way that is more recognizable for people.
So don’t think of the code samples as the reality of how it is implemented, but more as a different representation that makes it easier to understand (just like we use images to make things clearer, I use two different methods: images and code).