Installing and using Apache Cassandra With Java Part 5 (Thrift Client 2)
Let's continue where we left off in part four. Last time we handled storing, mutating, deleting and searching data for a standard ColumnFamily. This time we will do the same for a SuperColumnFamily, and we will also have a look at how to create a sort of index. Last time everything revolved around an Author; this time we take a look at the Posts SuperColumnFamily and the Tags ColumnFamily. A post can have one or more tags and a tag can have one or more posts, so it is important to keep the tags up to date: if we create a post with a tag that has already been used, the row for that tag should get a reference to the new post.
This posting also includes a download (Sample Code) with all the source code in an Eclipse project. The project also contains some extra examples that you can have a look at. If you downloaded the sample code from the previous post, you can replace that project with this one, since it contains both the new and the old examples. When you load the project into Eclipse you will probably see some error messages; these have to do with a variable in your project libraries. Change the target folder of the variable in the following way:
Right-click on the project and choose Properties. Click on Java Build Path on the left and then on the Libraries tab. Click on one of the lines that mention CASSANDRA_HOME and choose the Edit… button. Then click on the Variable… button to create a new variable named CASSANDRA_HOME and make sure it points to your Cassandra home folder (for example C:/dev/apache-cassandra-0.6.0-beta2).
Let's get started. In the previous post I already outlined what the Posts table looks like: it is a SuperColumnFamily containing two SuperColumns per post, one for the post information (author, title, subject, etc.) and one for storing the tags associated with the post. Just to refresh our memory, let's have another look at the table representation:
The Posts SuperColumnFamily organizes the data using SuperColumns which contain Columns:
| SuperColumnFamily: Posts | |
| Key | Value |
| “cats-are-funny-animals” | SuperColumn “post” (post details), SuperColumn “tags” (associated tags) |
| “dogs-are-great-companions” | SuperColumn “post” (post details), SuperColumn “tags” (associated tags) |
Let's start with some code that creates a new posting:
...
long timestamp = System.currentTimeMillis();
List<ColumnOrSuperColumn> postSuperColumns = new ArrayList<ColumnOrSuperColumn>();

// Build up the post SuperColumn
List<Column> columns = new ArrayList<Column>();
columns.add(new Column("title".getBytes(ENCODING), title.getBytes(ENCODING), timestamp));
columns.add(new Column("body".getBytes(ENCODING), body.getBytes(ENCODING), timestamp));
columns.add(new Column("author".getBytes(ENCODING), author.getBytes(ENCODING), timestamp));
columns.add(new Column("created".getBytes(ENCODING), created.getBytes(ENCODING), timestamp));
SuperColumn superColumn = new SuperColumn("post".getBytes(ENCODING), columns);
ColumnOrSuperColumn columnOrSuperColumn = new ColumnOrSuperColumn();
columnOrSuperColumn.setSuper_column(superColumn);
postSuperColumns.add(columnOrSuperColumn);

// Build up the tags SuperColumn, using the array index as column name
columns = new ArrayList<Column>();
for (int index = 0; index < tags.length; index++) {
    columns.add(new Column(("" + index).getBytes(ENCODING), tags[index].getBytes(ENCODING), timestamp));
}
superColumn = new SuperColumn("tags".getBytes(ENCODING), columns);
columnOrSuperColumn = new ColumnOrSuperColumn();
columnOrSuperColumn.setSuper_column(superColumn);
postSuperColumns.add(columnOrSuperColumn);

// Map the SuperColumns to the Posts SuperColumnFamily and store them under the post key
Map<String, List<ColumnOrSuperColumn>> job = new HashMap<String, List<ColumnOrSuperColumn>>();
job.put("Posts", postSuperColumns);
client.batch_insert("Blog", postKey, job, ConsistencyLevel.ALL);
So what do we do here? First we create a timestamp so that all the columns we create share the same timestamp. Then we create a list named postSuperColumns for holding the SuperColumn objects; a single post needs two SuperColumn objects, one for the post details and one for the list of tags.
We then create a list for holding Column objects. Unlike Column objects in a normal ColumnFamily, which need to be wrapped in a ColumnOrSuperColumn object, here we can fill the list directly with Columns. Only the highest level (directly under the ColumnFamily or SuperColumnFamily) needs to use the ColumnOrSuperColumn aggregate.
We fill this list with the details of the posting, so for the author, title, body etc. we create a Column and add it. Then we create the SuperColumn with the key post and give it the list of columns. Since we are now at the highest level, we need to wrap the SuperColumn in the ColumnOrSuperColumn aggregate and add it to the postSuperColumns list.
Next we do the same for the tags. In this case we have a String array of tags, and for each tag we create a new Column (named after its index in the array) which we add to a new, empty columns list. We then create the SuperColumn with the key tags and give it this list.
Finally we create the Map which contains the data per ColumnFamily and add postSuperColumns under the Posts ColumnFamily name. Then we call batch_insert with the appropriate keyspace and the key of the posting.
All in all it is much the same as what we did for the Author, except that there is an extra layer in between.
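To make that extra layer easier to picture, here is a minimal plain-Java sketch of what one Posts row looks like after the insert. It is only a model built on nested maps, not the Thrift API: the class PostRowModel and the method buildPostRow are hypothetical names, and the values passed in are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of ONE row in the Posts SuperColumnFamily.
// Hypothetical helper for illustration, not part of the Cassandra Thrift API.
class PostRowModel {

    // Returns: SuperColumn name -> (Column name -> Column value)
    static Map<String, Map<String, String>> buildPostRow(
            String title, String body, String author, String created, String[] tags) {
        Map<String, Map<String, String>> row = new LinkedHashMap<>();

        // The "post" SuperColumn holds the details of the posting.
        Map<String, String> post = new LinkedHashMap<>();
        post.put("title", title);
        post.put("body", body);
        post.put("author", author);
        post.put("created", created);
        row.put("post", post);

        // The "tags" SuperColumn holds one Column per tag, named after its index.
        Map<String, String> tagColumns = new LinkedHashMap<>();
        for (int index = 0; index < tags.length; index++) {
            tagColumns.put(String.valueOf(index), tags[index]);
        }
        row.put("tags", tagColumns);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> row = buildPostRow(
                "Cats are funny animals", "Some body text", "some-author", "01/02/2010",
                new String[] {"cats", "animals"});
        System.out.println(row);
    }
}
```

The outer map plays the role of the row, its two entries are the SuperColumns, and the inner maps are the Columns. The real client works with lists of Column and SuperColumn objects and carries a timestamp per Column, which this sketch leaves out.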
In the previous post we talked about retrieving data, and for the most part it works exactly the same for SuperColumns as it does for Columns. So for an example of retrieving all the posts, or retrieving a single post with all the Columns within the SuperColumn, just have a look at the examples in the previous post or check out the sample project (Sample Code) that comes along with this post. There are of course also situations that are different. Suppose we want to retrieve a post with only the title and body columns, which are present in the post SuperColumn.
// Select only the columns we are interested in
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.addToColumn_names("title".getBytes(ENCODING));
slicePredicate.addToColumn_names("body".getBytes(ENCODING));
// Point at the post SuperColumn inside the Posts SuperColumnFamily
ColumnParent columnParent = new ColumnParent("Posts");
columnParent.setSuper_column("post".getBytes(ENCODING));
List<ColumnOrSuperColumn> result =
    client.get_slice("Blog", postKey, columnParent, slicePredicate, ConsistencyLevel.ONE);
Still, the code looks a lot like the code we have seen before. The SlicePredicate is used to specify the columns that we want to retrieve (just like in the examples for the normal ColumnFamily); the one difference is that we now need to specify from which SuperColumn we want to retrieve the columns. I want to point one thing out so that it doesn’t go unnoticed: you can only use the SlicePredicate in combination with one SuperColumn, so it is not possible to retrieve a few columns from different SuperColumns at the same time. Using the SlicePredicate this way limits you to retrieving columns from a single SuperColumn.
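The restriction to a single SuperColumn can be illustrated with a small plain-Java sketch. This is a model of the behaviour described above and not the Thrift client: SliceModel and sliceByNames are hypothetical names, and the row is represented as nested maps (SuperColumn name to Columns) purely as an assumption for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of slicing named columns out of ONE SuperColumn.
// Hypothetical helper for illustration, not part of the Cassandra Thrift API.
class SliceModel {

    // row: SuperColumn name -> (Column name -> Column value)
    static Map<String, String> sliceByNames(
            Map<String, Map<String, String>> row, String superColumn, List<String> columnNames) {
        Map<String, String> result = new LinkedHashMap<>();
        Map<String, String> columns = row.get(superColumn);
        if (columns == null) {
            return result; // unknown SuperColumn: nothing to return
        }
        for (String name : columnNames) {
            String value = columns.get(name);
            if (value != null) {
                result.put(name, value); // requested columns that do not exist are skipped
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> post = new LinkedHashMap<>();
        post.put("title", "Cats are funny animals");
        post.put("body", "Some body text");
        post.put("author", "some-author");
        Map<String, Map<String, String>> row = new LinkedHashMap<>();
        row.put("post", post);

        System.out.println(sliceByNames(row, "post", Arrays.asList("title", "body")));
    }
}
```

Because the method takes exactly one SuperColumn name, columns from "post" and "tags" cannot be mixed in a single call; in this model, fetching both would take two calls.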
So how do we update information? This also requires quite a lot of code (I mention this a lot, but when you update more than one field at a time the amount of ‘overhead’ code becomes smaller in comparison to the code needed to specify the actual changes). Suppose we want to change the title of a post and, since we are making a change, we also want to introduce a new field containing the updateDate of the post.
long timestamp = System.currentTimeMillis();
// Row key -> (ColumnFamily name -> list of mutations)
Map<String, Map<String, List<Mutation>>> job = new HashMap<String, Map<String, List<Mutation>>>();
List<Mutation> mutations = new ArrayList<Mutation>();

// Update the title and add the new updateDate column
List<Column> columns = new ArrayList<Column>();
columns.add(new Column("title".getBytes(ENCODING), "Update: cats are funny animals".getBytes(ENCODING), timestamp));
columns.add(new Column("updateDate".getBytes(ENCODING), "03/02/2010".getBytes(ENCODING), timestamp));
SuperColumn superColumn = new SuperColumn("post".getBytes(ENCODING), columns);
ColumnOrSuperColumn columnOrSuperColumn = new ColumnOrSuperColumn();
columnOrSuperColumn.setSuper_column(superColumn);
Mutation mutation = new Mutation();
mutation.setColumn_or_supercolumn(columnOrSuperColumn);
mutations.add(mutation);

Map<String, List<Mutation>> mutationsForColumnFamily = new HashMap<String, List<Mutation>>();
mutationsForColumnFamily.put("Posts", mutations);
job.put("cats-are-funny-animals", mutationsForColumnFamily);
client.batch_mutate("Blog", job, ConsistencyLevel.ALL);
In general it is much the same as what we have seen before, but since we now have that extra layer we need to specify that the Mutation contains a SuperColumn instead of a Column. And since the SuperColumn contains a list of Columns, we can specify all the changes for that SuperColumn within that list. Again, Cassandra will find out whether a Column has to be modified or created. If you want to delete a Column you need to specify the Deletion separately in a different Mutation, like we did in the previous post.
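The "modify or create" behaviour can be sketched in plain Java. This is a simplified model and not Cassandra's implementation: UpsertModel, Cell and write are hypothetical names, and the rule that a write only replaces an existing Column when its timestamp is strictly newer is an assumption of this sketch (how equal timestamps are resolved is not modelled).

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of writing Columns into one SuperColumn.
// Hypothetical helper for illustration, not part of the Cassandra Thrift API.
class UpsertModel {

    // A Column value together with the timestamp it was written with.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Creates the Column if it is missing, replaces it if the write is newer,
    // and ignores the write if it is not newer than what is already stored.
    static void write(Map<String, Cell> superColumn, String name, String value, long timestamp) {
        Cell existing = superColumn.get(name);
        if (existing == null || timestamp > existing.timestamp) {
            superColumn.put(name, new Cell(value, timestamp));
        }
    }

    public static void main(String[] args) {
        Map<String, Cell> post = new HashMap<>();
        write(post, "title", "Cats are funny animals", 1L);

        // The same kind of write both modifies "title" and creates "updateDate".
        write(post, "title", "Update: cats are funny animals", 2L);
        write(post, "updateDate", "03/02/2010", 2L);

        System.out.println(post.get("title").value);
        System.out.println(post.get("updateDate").value);
    }
}
```

The caller never says whether a Column is new or existing; the store decides per Column, which is why the update code above can put the changed title and the brand-new updateDate into the same list.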
Removing a post is exactly the same as removing an Author, so we will not discuss that here again (check out the sample project (Sample Code) that comes with this post for examples of deleting posts). There is still one thing we haven’t talked about, and that is how to use the Tags ColumnFamily. First, let's refresh our memory again with the table representation:
The Tags ColumnFamily organizes the data using Columns:
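| ColumnFamily: Tags | |
| Key | Value |
| “cats” | Columns referencing the posts tagged “cats” |
| “dogs” | Columns referencing the posts tagged “dogs” |
| “allergy” | Columns referencing the posts tagged “allergy” |
| “animals” | Columns referencing the posts tagged “animals” |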
This ColumnFamily is what we call an Inverted Index. Why inverted? Well, if you look at the data you can see that the entry point is the tag itself, and the information we get with the tag is all the posts that are related to it. So once we know the post keys we can retrieve the posts from the Posts ColumnFamily. The trick, however, is in how we maintain it. Suppose we have a blogging website: we could write a post and select a number of tags that relate to the contents of the post. When we edit the post at a later moment we can still change the tags, selecting a few more or removing a few. Finally, we need to be able to remove a post and, with that, update the tags so that they don’t contain any references to posts that no longer exist.
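When a post is edited, the index only needs to change for the tags that were actually added or removed. A minimal plain-Java sketch of that bookkeeping, assuming the old and the new tags of the post are available as sets (TagDiff, added and removed are hypothetical names, not something from the sample project):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Works out which tag rows need a new reference and which need one removed
// when the tags of a post change. Hypothetical helper for illustration.
class TagDiff {

    // Tags present after the edit but not before: add the post key to these tag rows.
    static Set<String> added(Set<String> oldTags, Set<String> newTags) {
        Set<String> result = new LinkedHashSet<>(newTags);
        result.removeAll(oldTags);
        return result;
    }

    // Tags present before the edit but not after: remove the post key from these tag rows.
    static Set<String> removed(Set<String> oldTags, Set<String> newTags) {
        Set<String> result = new LinkedHashSet<>(oldTags);
        result.removeAll(newTags);
        return result;
    }

    public static void main(String[] args) {
        Set<String> oldTags = new LinkedHashSet<>(Arrays.asList("cats", "animals"));
        Set<String> newTags = new LinkedHashSet<>(Arrays.asList("cats", "allergy"));

        System.out.println("add reference to: " + added(oldTags, newTags));
        System.out.println("remove reference from: " + removed(oldTags, newTags));
    }
}
```

Removing a post is then the special case where the new set of tags is empty, so every old tag ends up in the removed set.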
Since a post and its tags live in different ColumnFamilies under different keys, and we first have to look at what is already stored, we split the work into separate calls to Cassandra. We have already seen most of the calls needed to create, modify and delete a tag. But now we also need to check whether a tag already exists, because that tells us whether we need to update it with a new post key or create the tag along with the post key. The Cassandra client has a get_count method which can tell us whether a tag already exists:
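// Count the columns stored under the tag key in the Tags ColumnFamily
ColumnParent columnParent = new ColumnParent("Tags");
int count = client.get_count("Blog", tagKey, columnParent, ConsistencyLevel.ONE);
if (count > 0) {
    // Update the tag with a Post key
} else {
    // Create the tag along with a Post key.
}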
As you can see, this is really straightforward: we just specify the ColumnFamily and the tag. get_count returns the number of columns stored under the tag, so the count is 0 when the tag does not exist yet and greater than 0 when it does. The sample project included with this post contains all the examples, along with the ability to create a post and change the Tags ColumnFamily accordingly.
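The check-then-write flow can be sketched with an in-memory stand-in for the Tags ColumnFamily. This is only a model under stated assumptions, not the sample project's code and not the Thrift API: TagIndexModel, count, addPost and removePost are hypothetical names, and each tag row is assumed to hold one entry per referenced post key.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model of the Tags inverted index: tag -> keys of the posts using it.
// Hypothetical helper for illustration, not part of the Cassandra Thrift API.
class TagIndexModel {

    private final Map<String, Set<String>> tags = new TreeMap<>();

    // Model of the count check: number of post references stored under the tag.
    int count(String tag) {
        Set<String> posts = tags.get(tag);
        return posts == null ? 0 : posts.size();
    }

    void addPost(String tag, String postKey) {
        if (count(tag) > 0) {
            // Update the tag with a post key
            tags.get(tag).add(postKey);
        } else {
            // Create the tag along with a post key
            Set<String> posts = new TreeSet<>();
            posts.add(postKey);
            tags.put(tag, posts);
        }
    }

    // Drops the reference; a tag without any references left is removed entirely.
    void removePost(String tag, String postKey) {
        Set<String> posts = tags.get(tag);
        if (posts == null) {
            return;
        }
        posts.remove(postKey);
        if (posts.isEmpty()) {
            tags.remove(tag);
        }
    }

    public static void main(String[] args) {
        TagIndexModel index = new TagIndexModel();
        index.addPost("animals", "cats-are-funny-animals");
        index.addPost("animals", "dogs-are-great-companions");
        index.addPost("cats", "cats-are-funny-animals");
        System.out.println("animals: " + index.count("animals"));
        System.out.println("cats: " + index.count("cats"));
    }
}
```

Against a real cluster the check and the write are two separate calls, so another client could create the same tag in between; the sketch ignores that and only shows the branching.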
There are of course many other Inverted Indexes possible in a blogging environment, for example one from authors to the posts they wrote, or one from creation dates to the posts created on them, which gives you a sort of timeline.
So this was the second post about storing, modifying, removing and selecting data from the Cassandra database. We have covered all the basics. This is the last post for the time being (well, at least for a week, maybe two) since I am going to work on the DataModel explanation on the Apache Cassandra Wiki. As I mentioned in a comment on one of the previous posts, Jonathan Ellis (Project Chair at Cassandra) asked me whether I was interested in rewriting that page, and I would really like to do that. So I will be spending my evenings on that for a while.
However, I still have so much to tell about Cassandra, and still so much to learn about it, that I have enough material for another few posts. For example, how to set up a number of nodes and how you can experiment with the capabilities of Cassandra when a node fails. I would also like to write a post about Hector, the Cassandra client ‘wrapper’ which adds the ability to connect to a different node when your main node (the node you were communicating with) fails. So if you want to know when I publish new posts, subscribe to my RSS feed or Twitter, or keep track of me on this website. I hope you have all enjoyed the postings I made about Cassandra. If you have any comments, remarks or questions, please feel free to fill out the comment section below.
Really good article!
Thx a lot!
Awesome! Just what we were looking for vis-a-vis working with Cassandra and Java – thanks a ton!
I am not sure if your hashmap DataModel is really consistent with Cassandra’s internal implementation. For example, you mentioned in your post that a supercolumn contains a hashmap of subcolumns with the keys being the subcolumns’ names. However, from the sample code you posted, the supercolumn apparently contains a list of subcolumn objects. The same goes for a columnfamily, which contains a list of columns, and for a supercolumnfamily, which contains a list of supercolumns.
Please clarify if you will.
If you are referring to parts two and three then yes, the Java representations are not one-to-one comparable to the implementations within Cassandra. The reason for representing it like that is to give a better understanding; people with a background in RDBMS databases often have difficulty understanding how the model is constructed, and a different representation might make it clearer.
It might have been better if I had mentioned this in part two to avoid any confusion. You are not the first to ask this question; somebody also asked it in part two, so I added a comment there saying that it doesn’t explain the internal representation. It is just meant to create a better and easier understanding of how to think about the data model.