In a previous article, Friso explained how to monotonically increase row IDs with MapReduce. If you read the article (for which the markdown version sports some 1700 words) you may have noticed that the process is not exactly straightforward. Thankfully pig, from version 0.11, introduced the
RANK function which allows to do the same in pig with only a handful of lines of code.
The basic usage of
RANK is as simple as typing:
score = LOAD '/path/to/my/file' AS (object_id:int, score:float); new_score = RANK score;
RANK function adds to
new_score a new column with respect to
score, containing a unique integer per row.
The usefulness of the
RANK function doesn’t end here though. Let’s imagine we need to use the
itemsimilarity Mahout algorithm, and all we have is a csv file where every line is in the form
user_id, object_id, score
user_id is a
chararray. Mahout, unfortunately, doesn’t accept hashes in this case, but only integers.
But blindly using the
RANK function wouldn’t cut it. As you may know,
itemsimilarity computes similarity between objects based on interactions of users with multiple objects and in this case
RANK would assign a different unique id to every
Let’s see how we can solve this problem. The desired output is in the form
integer_user_id, object_id, score
as this is what Mahout loves. The code to accomplish this is
score = LOAD '/path/to/my/file' AS (user_id:chararray, object_id:int, score:float); user = FOREACH score GENERATE user_id; unique_users = DISTINCT user; new_users = RANK unique_users; new_score = JOIN score BY user_id, new_users BY user_id; new_score = FOREACH new_score GENERATE rank_unique_users, object_id, score; STORE new_score INTO '/path/to/new/file' USING PigStorage(',');
The first line loads the file, putting it into
score. Then using
user = FOREACH score GENERATE user_id; unique_users = DISTINCT user; new_users = RANK unique_users;
we put into
user_id column, and in
unique_users all distinct
user_id‘s. This is a crucial step, as we want equal users to have equal integers id’s. The last line of the block adds the
rank_unique_user column (the name choice is a pig convention) to
new_users. After that we create
new_score = JOIN score BY user_id, new_users BY user_id; new_score = FOREACH new_score GENERATE rank_unique_users, object_id, score;
JOIN. The last line basically discards the
user_id column, as it is not needed anymore for Mahout purposes. At last we save our file
STORE new_score INTO '/path/to/new/file' USING PigStorage(',');
which will now have the desired format.
Subscribe to our newsletter
Stay up to date on the latest insights and best-practices by registering for the GoDataDriven newsletter.