MapReduce VS Spark - Inverted Index example

An inverted index is a mapping from content, such as words, to the documents in which that content can be found. Mainly used in search engines, it provides faster lookups for text searches, i.e., finding the documents in which the search term occurs.

Problem Statement:

The dataset is available at this link.

Solution:
MapReduce:
The map method reads the input files and emits (word, filename) key-value pairs. The reducer then uses a hashmap of (filename, count) to count the occurrences of each filename for a given word key.

InvertedIndexMapper.java:
The map method reads the input directory containing Shakespeare's works. However, all stopwords like "is", "are", "where", etc. should be removed during processing. The stopwords are kept in a file in HDFS, which can be cached in memory using the distributed cache. This file is read in the mapper's setup method, and all stopwords are stored in a HashSet. Words from the input files are then compared against the stopwords in the HashSet.
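The setup-phase parsing can be sketched in plain Java as follows. This is an illustrative helper, not the actual mapper code from the repo: it simply turns the contents of a stopwords file (one word per line) into a HashSet, which is what the setup method builds from the cached file.

```java
import java.util.*;

// Sketch of the setup-phase logic: parse the cached stopwords file
// (one word per line) into a HashSet. In the real mapper, the file
// contents would come from the file located via the distributed cache.
public class StopwordLoaderSketch {
    static Set<String> parseStopwords(String contents) {
        Set<String> stopwords = new HashSet<>();
        for (String line : contents.split("\\R")) {
            String w = line.trim().toLowerCase();
            if (!w.isEmpty()) {
                stopwords.add(w);
            }
        }
        return stopwords;
    }
}
```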

As getLocalCacheFiles() is now deprecated, we use getCacheFiles() to read the cached file. Stopwords read from the cached file are added to the stopwords HashSet. The mapper then filters out the stopwords by comparing each word against this HashSet, and writes the word together with the filename in which it occurs to the output.
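The per-line filtering logic of the mapper can be sketched like this. The class and method names are illustrative (this is not the Hadoop Mapper API); it shows only the core step of tokenizing a line, dropping stopwords, and emitting (word, filename) pairs.

```java
import java.util.*;

// Sketch of the mapper-side logic: tokenize a line, drop stopwords,
// and emit (word, filename) pairs. Illustrative names, not the
// actual Hadoop Mapper API.
public class InvertedIndexMapperSketch {
    static List<String[]> mapLine(String line, String filename, Set<String> stopwords) {
        List<String[]> out = new ArrayList<>();
        // Lowercase and split on non-word characters, mirroring simple tokenization
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                out.add(new String[] { token, filename });
            }
        }
        return out;
    }
}
```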

InvertedIndexReducer.java:
The reduce method receives (word, <list of files in which it occurs>) from the mapper output. We use a hashmap to maintain each filename and its frequency of occurrence. Each key is then written to the output along with the list of filenames and counts.
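The counting step inside the reducer can be sketched as below. Again, this is an illustrative helper rather than the Hadoop Reducer API: given the list of filenames emitted for one word, it builds the (filename, count) hashmap described above.

```java
import java.util.*;

// Sketch of the reducer-side logic: given all filenames emitted for
// one word, count occurrences per file. Illustrative, not the actual
// Hadoop Reducer API.
public class InvertedIndexReducerSketch {
    static Map<String, Integer> countFiles(Iterable<String> filenames) {
        // LinkedHashMap keeps first-seen order for readable output
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String f : filenames) {
            counts.merge(f, 1, Integer::sum);
        }
        return counts;
    }
}
```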


InvertedIndexDriver.java:
We use TextInputFormat, and we add the stopwords file to the distributed cache using the job.addCacheFile() method, as the DistributedCache API is now deprecated.

Execute:
yarn jar MRInvertedIndex-0.0.1-SNAPSHOT.jar com.stdatalabs.MRInvertedIndex.InvertedIndexDriver shakespeare MRInvertedIndex_op -skip stopwords.txt

Output:


Spark:
As discussed in previous posts, Spark provides a more flexible API with a rich set of RDD transformations and actions, so the same objective can be achieved in far fewer lines of code than in MapReduce. A look at the code shows how:

Stopwords can be read as a textFile and broadcast across all nodes.
Read (file_path, file_content) pairs from the input using wholeTextFiles.
Apply a flatMap transformation to filter out stopwords and create (path, word) tuples.
Apply a map transformation to create ((word, filename), 1) tuples.
Apply reduceByKey to group all the (word, filename) pairs and sum the counts.
Transform the tuples into the required format with subsequent groupBy and map operations.
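The shape of this pipeline can be sketched in plain Java, with an in-memory map standing in for wholeTextFiles. This is a local illustration of the transformations described above, not the actual Spark code from the repo; on a cluster, the flatMap/reduceByKey/groupBy stages would run as distributed RDD operations.

```java
import java.util.*;

// Plain-Java sketch of the Spark pipeline, with an in-memory Map
// standing in for wholeTextFiles. The stages mirror the RDD
// transformations: flatMap (tokenize + filter stopwords),
// reduceByKey (count per (word, file)), groupBy (word -> files).
public class InvertedIndexPipelineSketch {
    static Map<String, Map<String, Integer>> invertedIndex(
            Map<String, String> files, Set<String> stopwords) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        files.forEach((path, content) -> {
            for (String token : content.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || stopwords.contains(token)) {
                    continue; // flatMap stage: drop stopwords
                }
                // reduceByKey + groupBy stages: word -> {file -> count}
                index.computeIfAbsent(token, w -> new TreeMap<>())
                     .merge(path, 1, Integer::sum);
            }
        });
        return index;
    }
}
```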

Execute:
spark-submit --class com.stdatalabs.SparkInvertedIndex.Driver \
--master yarn-client SparkInvertedIndex-0.0.1-SNAPSHOT.jar \
/user/cloudera/stopwords.txt \
/user/cloudera/shakespeare \
/user/cloudera/SparkInvertedIndex_op

Output:

The dataset and the complete source code can be found at this github repo.

You may also like:
MapReduce VS Spark - Aadhaar dataset analysis

MapReduce VS Spark - Wordcount Example

Spark Streaming part 1: Real time twitter sentiment analysis
