JPPF, java, parallel computing, distributed computing, grid computing, parallel, distributed, cluster, grid, cloud, open source, android, .net

The open source
grid computing

 Home   About   Features   Download   Documentation   On Github   Forums 

Word Count demo

What does the sample do?

This sample performs a word count on a full or partial Wikipedia database. It illustrates how JPPF can be used to process large datasets in a very efficient way. The actual processing is vaguely similar to a mpa/reduce process, which is not surprising given what we are trying to accomplish.

The actual processing is as follows:

  1. The client application reads the wikipedia file line by line and generates articles. Each article is delimited by the XML tags <page> and </page>. Furthermore, we only look for the text of each article (we ignore metadata), so within each <page> we only keep the part delimited by <text ...> and </text>.
  2. Full articles are then grouped into JPPF tasks, and then tasks into JPPF jobs. This results in a constant stream of jobs until all articles are read.
  3. Each generated job is offered to a submit queue. This queue has a limited capacity, to avoid an explosion of the memory footprint in case jobs are created faster than they are processed: when the capacity is reached, the next job submission - and reading of articles - is blocked until a slot becomes available.
  4. Once a job is submitted, it will be processed by one or more nodes, depending on its number of tasks and the load-balancing settings of the server. Each task in the set processed by the node will perform the word count of all the articles it contains and store the counts into a simple map of words to corresponding count: basically a Map<String, Long>. This is the equivalent of a 'map' step in a map/reduce strategy.
  5. To produce results that make sense, there are constraints on what is considered a word: the characters must belong to a predefined set (in this demo ['a'...'z', 'A'...'Z', '-'], we don't count numbers as words), and each word must be part of a predefined dictionary. In this demo we use a dictionary based on Hunspell en_US v7.1 , which can be found from this page. Additionally, the redirect articles are excluded from the word counts, but counted nonetheless, for statistics purposes.
  6. Once a node has processed a set of tasks, it will perform a first 'reduce' step by simply aggregating therir results. The first task in the set will hold the aggregated results, while ther other tasks will have none.
  7. When the client application receives results from a node, it will aggregate them into its own global word count map: this is the second 'reduce' step.
  8. Once all results are received, the application sorts them by descending count value, then ascending alphabetical order of the words within each count grouping, and finally format and print these sorted results to a file.

How do I run it?

Before running this sample, you need to install a JPPF server and at least one node.
For information on how to set up a node and server, please refer to the JPPF documentation.
Once you have installed a server and node, perform the following steps:
  1. Open a command prompt in JPPF-x.y-samples-pack/WordCount
  2. Build the sample's node add-on: type "ant jar". This will create a file named WordCountNodeListener.jar. This add-on is a node life cycle listener which accomplishes 2 goals: load the dictionary when the node starts (in the nodeStarting() notification) and aggregate the results of tasks that have just been processed (in the jobEnding() notification)
  3. Copy WordCountNodeListener.jar in the "lib" folder of the JPPF driver installation, to add it to the driver's classpath. The nodes will download the node listener code from the server.
  4. Start the driver
  5. Start one or more node(s). Each node should output a "loaded dictionary: 46986 entries" message, indicating that the node add-on is working properly
  6. When the server and nodes are started, type "run.bat" on Windows or "./" on Linux/Unix to start the word count demo. The results will be written to a file Named "WordCountResult.txt"

Tuning and other considerations

  • Configuration of the demo. The client configuration file config/ allows you to change a number of parameters:
    # path to the wikipedia file
    wordcount.file = data/wikipedia_en_small.xml
    # how many articles per JPPF task
    wordcount.articles.per.task = 50
    # how many task in each JPPF job
    wordcount.tasks.per.job = 100
    # how many server connections can each job be distributed over (parallel I/O)
    wordcount.channels = 4
    # how many concurrent job can be executing at the same time before reading of articles blocks
    wordcount.job.capacity = 2
    These parameters allow you to tune the demo's behavior, and optimize the memory footprint vs. throughput tradeoff. Feel free to experiment!. The initial values provided in this sample are fit for a client application with 1 GB or heap (-Xmx1024m). If you were to increase the job capacity, you might have to increase the application's heap size as well, in order to avoid out of memory conditions
  • You should also consider carefully how the job capacity, number of channels per job, number of connections to the server, and load-balancing settings interact with each other. The default settings in this sample are a best attempt at maximizing the throughput and finding a balance between how many jobs can be submitted concurrently, and how fast each job can be sent to to or received from the server by using multiple parallel I/O channels
  • In the same way, it may be needed to increase the memory available to each node. This mostly depends on how many article per task and tasks per job you configured, and the load balancing settings in the server. For reference, this demo also provides configuration file for the server and nodes, which you can find in the config/driver and config/node folders
  • The provided Wikipedia file is an extract from the full English Wikipedia dump from May 2012 with only the latest revision and no talk pages. The full dump is pretty large: around 8 GB compressed (bzip2) and 36 GB uncompressed. The extract provided in this demo contains the articles that fit into the first 10,000 lines.
  • The grid topology can also play an important role. In particular, if you have very few nodes, you might consider running multiple JPPF nodes on each physical or virtual machine, each with less processing threads than for a single node. Because of the concurrent I/O involved, you should notice a significant throughput increase.

I have additional questions and comments, where can I go?

There are 2 privileged places you can go to:

JPPF Copyright © 2005-2020 Get JPPF at Fast, secure and Free Open Source software downloads