2015-10-10

Developing an Apache Nutch 1.10 plugin in IntelliJ IDEA using a Maven project

Apache Nutch is quite an old project that uses Apache Ant to build itself. I wanted to develop a plugin for filtering the content of parsed pages, and because I am an IntelliJ IDEA user I wanted to do it in this IDE. This how-to helps you set up an IDEA project so you can develop and debug a Nutch (parse) plugin. I must mention the great article "Precise data extraction with Apache Nutch" by Emir Dizdarevic, which helped me a lot.

The article comes with a template project: https://github.com/vkuzel/Nutch-Plugin-Development-Template

Overview

When started from IDEA, the project is first built by Maven and then deployed to a Nutch binary installation. A Nutch task (in this case parse) is then started and Nutch itself runs the plugin. IDEA attaches a debugger to the Nutch process so you can debug the code.

Maven project

Apache Nutch is available in the Maven repository, but unfortunately it pulls in some weird dependencies. To compile the plugin you need to add a dependency on Nutch (groupId org.apache.nutch, artifactId nutch, version 1.10) and on Hadoop (groupId org.apache.hadoop, artifactId hadoop-core, version 1.2.0), because Hadoop is a core library of Nutch. You also need to exclude the org.apache.cxf dependency from the Nutch library, because Maven cannot resolve it. By the way, I found that different versions of Nutch require different exclusions.
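
The dependency setup described above could look roughly like the following pom.xml fragment. This is only a sketch: the coordinates and versions come from the text, but the wildcard exclusion is my assumption about how to drop the whole org.apache.cxf group, and the template project may list the excluded artifact explicitly instead.

```xml
<!-- Fragment of pom.xml: goes inside the <project> element. -->
<dependencies>
  <dependency>
    <groupId>org.apache.nutch</groupId>
    <artifactId>nutch</artifactId>
    <version>1.10</version>
    <exclusions>
      <!-- Assumption: exclude every artifact in the org.apache.cxf group.
           The "*" wildcard needs Maven 3.2.1 or newer; with an older Maven,
           name the unresolvable artifactId explicitly. -->
      <exclusion>
        <groupId>org.apache.cxf</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- Hadoop classes are needed to compile against the Nutch API. -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.0</version>
  </dependency>
</dependencies>
```

If you target another Nutch version, expect the exclusion list to differ. Running mvn dependency:tree is a reasonable way to find which transitive dependency fails to resolve.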

To let Nutch run the plugin, it has to be built and deployed to the Nutch installation directory in every debug session. This is handled by an external shell script, deploy_plugin_to_nutch_for_debug.sh, which Maven executes in the install phase of the build. For this task I added the exec-maven-plugin to the pom.xml file.
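
A minimal sketch of how the exec-maven-plugin can be bound to the install phase is shown below. The script name and the phase come from the text. The plugin version, the execution id and the script location (next to pom.xml) are assumptions for illustration and may not match the template project.

```xml
<!-- Fragment of pom.xml: goes inside <build><plugins>. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <!-- Assumption: any reasonably recent release should work. -->
  <version>1.4.0</version>
  <executions>
    <execution>
      <!-- "deploy-plugin-to-nutch" is a hypothetical execution id. -->
      <id>deploy-plugin-to-nutch</id>
      <!-- Run the script once the artifact has been built and installed. -->
      <phase>install</phase>
      <goals>
        <goal>exec</goal>
      </goals>
      <configuration>
        <!-- Assumption: the script sits next to pom.xml and is executable. -->
        <executable>${basedir}/deploy_plugin_to_nutch_for_debug.sh</executable>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Binding the script to install rather than package means the jar already exists in the target directory when the script copies it into the Nutch installation.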

Nutch installation

Since this article covers the development of a Nutch plugin, there's no need to have the complete Nutch source code. To run the plugin you need a properly configured Nutch binary installation, which can be downloaded from its official site. The template project contains an empty directory, nutch-1.10, into which Nutch should be copied. The only necessary change to the Nutch configuration (apart from the default installation process) is to add the plugin to the plugin.includes property so Nutch can recognize it.
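
As an illustration, the plugin.includes override could be placed in nutch-1.10/conf/nutch-site.xml as sketched below. The plugin id my-parse-filter is hypothetical, and the rest of the value is only my recollection of the default list, so copy the actual value from your own conf/nutch-default.xml and append your plugin id to it.

```xml
<?xml version="1.0"?>
<!-- Sketch of nutch-1.10/conf/nutch-site.xml; values here override nutch-default.xml. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: everything before the last "|" should be taken from the
         plugin.includes value in your conf/nutch-default.xml.
         "my-parse-filter" is a hypothetical id; use the id of your own plugin. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|my-parse-filter</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so a plugin whose id does not match it is ignored by Nutch.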

There's also an archive, test_data.zip, with a pre-downloaded page that can be used to test the parser plugin. This archive is extracted to the nutch-1.10/test_data directory so the plugin always works with the same data.

Project (debug) configuration

Because the project uses an external application (Nutch), a custom Application run/debug configuration is needed in IDEA. Add a new debug configuration with the following parameters:

  • Main class: org.apache.nutch.parse.ParseSegment. Nutch is started directly by calling its class, not through the usual shell script located in the bin directory of its installation.
  • Program arguments: test_data/crawl/segments/20151010172800. This is the path to the test data extracted from the test_data.zip archive. The test data contains a single HTML page, just enough to test one pass through the plugin.
  • Working directory: nutch-1.10. Besides setting the working directory, you also need to add Nutch's conf and lib directories to the classpath. Add them under Project Settings -> Modules -> Dependencies.
  • Before launch, run Maven Goal: clean install. The module has to be built and deployed to the Nutch installation before every execution, which is why the Maven build must run through the install phase.
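
If you prefer to share the configuration with the project, the settings listed above can be sketched as a file under .idea/runConfigurations. Treat this as an approximation: the exact element and attribute names depend on the IDEA version, the configuration name and module name are hypothetical, and creating the configuration through the Run/Debug Configurations dialog remains the reliable route.

```xml
<!-- Sketch of .idea/runConfigurations/Nutch_parse.xml (hypothetical file name). -->
<component name="ProjectRunConfigurationManager">
  <!-- "Nutch parse" is a hypothetical configuration name. -->
  <configuration default="false" name="Nutch parse" type="Application" factoryName="Application">
    <!-- Start Nutch through its class instead of the bin script. -->
    <option name="MAIN_CLASS_NAME" value="org.apache.nutch.parse.ParseSegment" />
    <!-- Segment extracted from test_data.zip, relative to the working directory. -->
    <option name="PROGRAM_PARAMETERS" value="test_data/crawl/segments/20151010172800" />
    <option name="WORKING_DIRECTORY" value="$PROJECT_DIR$/nutch-1.10" />
    <!-- Assumption: replace with the name of your own IDEA module. -->
    <module name="nutch-plugin-development-template" />
    <method>
      <!-- Build and deploy the plugin before every launch. -->
      <option name="Maven.BeforeRunTask" enabled="true" file="$PROJECT_DIR$/pom.xml" goal="clean install" />
    </method>
  </configuration>
</component>
```

The conf and lib classpath entries are module dependencies rather than part of the run configuration, so they still have to be added in the module settings.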

Now by running the debug configuration you should be able to debug the plugin.