Word Count demo
What does the sample do?
This sample performs a word count on a full or partial Wikipedia database. It illustrates how JPPF can be used to process large datasets in a very efficient way.
The processing is broadly similar to a map/reduce process, which is not surprising given what we are trying to accomplish.
The actual processing is as follows:
- The client application reads the Wikipedia file line by line and generates articles. Each article is delimited by the XML tags <page> and </page>.
Furthermore, we only look at the text of each article (metadata is ignored), so within each <page> we keep only the part delimited by <text ...> and </text>.
- Full articles are then grouped into JPPF tasks, and then tasks into JPPF jobs. This results in a constant stream of jobs until all articles are read.
- Each generated job is offered to a submit queue. This queue has a limited capacity, to avoid an explosion of the memory footprint in case jobs are created faster than they are processed:
when the capacity is reached, the next job submission - and reading of articles - is blocked until a slot becomes available.
- Once a job is submitted, it will be processed by one or more nodes, depending on its number of tasks and the load-balancing settings of the server.
Each task in the set processed by the node will perform the word count of all the articles it contains and store the counts into a simple map of words to corresponding count:
basically a Map<String, Long>. This is the equivalent of a 'map' step in a map/reduce strategy.
- To produce results that make sense, there are constraints on what is considered a word: its characters must belong to a predefined set (in this demo ['a'...'z', 'A'...'Z', '-']; numbers are not counted as words),
and each word must be part of a predefined dictionary. In this demo we use a dictionary based on Hunspell en_US v7.1, which can be found on this page.
Additionally, the redirect articles are excluded from the word counts, but counted nonetheless, for statistics purposes.
- Once a node has processed a set of tasks, it will perform a first 'reduce' step by simply aggregating their results. The first task in the set will hold the aggregated results, while the other tasks will have none.
- When the client application receives results from a node, it will aggregate them into its own global word count map: this is the second 'reduce' step.
- Once all results are received, the application sorts them by descending count value, then by ascending alphabetical order of the words within each count grouping, and finally formats and prints the sorted results to a file.
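The bounded submit queue described above can be sketched with a standard java.util.concurrent.ArrayBlockingQueue. The SubmitQueue class and its method names below are illustrative assumptions, not the sample's actual API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the bounded submit queue: when the queue is full,
// put() blocks the article reader until a slot becomes available, which
// caps the number of in-flight jobs and keeps the memory footprint bounded.
public class SubmitQueue<T> {
    private final BlockingQueue<T> queue;

    public SubmitQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called by the article reader: blocks when capacity is reached.
    public void offerJob(T job) throws InterruptedException {
        queue.put(job);
    }

    // Called by the submission thread once a job can be sent to the server.
    public T nextJob() throws InterruptedException {
        return queue.take();
    }

    public int size() {
        return queue.size();
    }
}
```

Because put() and take() are blocking, the reader thread and the submission thread need no additional synchronization.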
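The 'map' step, the 'reduce' steps and the final sort can be sketched in plain Java as follows. The WordCounter class, its method names and the lower-casing of words are illustrative assumptions, not the sample's actual code:

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of the word-count steps described above.
public class WordCounter {

    // The predefined character set: ['a'...'z', 'A'...'Z', '-'].
    private static final Set<Character> ALLOWED = new HashSet<>();
    static {
        for (char c = 'a'; c <= 'z'; c++) ALLOWED.add(c);
        for (char c = 'A'; c <= 'Z'; c++) ALLOWED.add(c);
        ALLOWED.add('-');
    }

    // The 'map' step: count the dictionary words in one article.
    public static Map<String, Long> count(String article, Set<String> dictionary) {
        Map<String, Long> counts = new HashMap<>();
        StringBuilder word = new StringBuilder();
        // iterate one past the end so the last word is flushed
        for (int i = 0; i <= article.length(); i++) {
            char c = (i < article.length()) ? article.charAt(i) : ' ';
            if (ALLOWED.contains(c)) {
                word.append(c);
            } else if (word.length() > 0) {
                String w = word.toString().toLowerCase();
                if (dictionary.contains(w)) counts.merge(w, 1L, Long::sum);
                word.setLength(0);
            }
        }
        return counts;
    }

    // A 'reduce' step: aggregate a partial result into a global map.
    public static void reduce(Map<String, Long> global, Map<String, Long> partial) {
        partial.forEach((w, n) -> global.merge(w, n, Long::sum));
    }

    // Final sort: descending count, then alphabetical within equal counts.
    public static List<Map.Entry<String, Long>> sorted(Map<String, Long> counts) {
        return counts.entrySet().stream()
            .sorted(Comparator.<Map.Entry<String, Long>>comparingLong(Map.Entry::getValue)
                .reversed()
                .thenComparing(Map.Entry::getKey))
            .collect(Collectors.toList());
    }
}
```

The same reduce() method serves for both reduce steps: aggregating task results within a node, and aggregating node results into the client's global map.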
How do I run it?
Before running this sample, you need to install a JPPF server and at least one node.
For information on how to set up a node and server, please refer to the JPPF documentation.
Once you have installed a server and node, perform the following steps:
- Open a command prompt in JPPF-x.y-samples-pack/WordCount
- Build the sample's node add-on: type "ant jar". This will create a file named WordCountNodeListener.jar.
This add-on is a node life cycle listener which accomplishes 2 goals:
loading the dictionary when the node starts (in the nodeStarting() notification) and aggregating the results of the tasks that have just been processed (in the jobEnding() notification)
- Copy WordCountNodeListener.jar into the "lib" folder of the JPPF driver installation, to add it to the driver's classpath. The nodes will download the node listener code from the server.
- Start the driver
- Start one or more node(s). Each node should output a "loaded dictionary: 46986 entries" message, indicating that the node add-on is working properly
- Once the server and nodes are started, type "run.bat" on Windows or "./run.sh" on Linux/Unix to start the word count demo. The results will be written to a file named "WordCountResult.txt"
Tuning and other considerations
I have additional questions and comments, where can I go?
There are 2 privileged places you can go to: