2015-11-07

Generating Java classes with Java 8 date and time API types using JAXB

JAXB's XJC is a very popular tool for generating Java classes from specification files (XSD or WSDL). Java 8 introduced a new date and time API. Unfortunately, this API does not come with an (un)marshaller, so XJC still generates code using the XMLGregorianCalendar type by default.

Because I want to work with LocalDateTime, I had to find an (un)marshaller for the new API and link it to my project. I decided to use Mikhail Sokolov's adapters.

Then I had to tell XJC to generate code with LocalDateTime for the specification's date-time types. This can be achieved with a binding file. So create a new binding.xjb file and map xsd:dateTime to the LocalDateTime type using the LocalDateTimeXmlAdapter adapter/(un)marshaller.
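
A binding file along these lines should do the trick. This is a sketch: the adapter's package name is taken from the jaxb-java-time-adapters project and may differ in your version, so check it against the library you actually link.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<bindings xmlns="http://java.sun.com/xml/ns/jaxb"
          xmlns:xs="http://www.w3.org/2001/XMLSchema"
          xmlns:xjc="http://java.sun.com/xml/ns/jaxb/xjc"
          extensionBindingPrefixes="xjc"
          version="2.1">
    <globalBindings>
        <!-- map every xsd:dateTime to java.time.LocalDateTime,
             (un)marshalled by the adapter class -->
        <xjc:javaType name="java.time.LocalDateTime"
                      xmlType="xs:dateTime"
                      adapter="com.migesok.jaxb.adapter.javatime.LocalDateTimeXmlAdapter"/>
    </globalBindings>
</bindings>
```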

This binding file uses a nonstandard adapter parameter, so you need to add the -extension switch when executing XJC. The whole command should look like this:

xjc -d <target_directory> -b binding.xjb -extension <specification_file.xsd>

The generated code is pretty much fine except for one small thing: for initialisation of generic types, it uses the verbose form instead of the diamond operator. IntelliJ IDEA's static code analysis complains about this. To fix it, I decided to post-process the code with a regular expression and replace all verbose initialisations with the shorter diamond form.

The whole process of generating and post-processing can be put into a shell script, generate.sh, for easy use.
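
Such a script might look like this. The schema file name and the target directory are assumptions; adjust them to your project layout.

```shell
#!/bin/sh
# generate.sh -- a minimal sketch; schema name and target directory are assumed
TARGET_DIR=src/main/java

# Generate classes, applying the binding file with the -extension switch
xjc -d "$TARGET_DIR" -b binding.xjb -extension specification.xsd

# Post-process: replace verbose generic initialisations such as
# "new ArrayList<String>()" with the diamond form "new ArrayList<>()"
find "$TARGET_DIR" -name '*.java' -exec \
    sed -i.bak -E 's/new ([A-Za-z]+)<[A-Za-z0-9_, ?.]+>\(\)/new \1<>()/g' {} \;
```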

2015-10-10

Development of the Apache Nutch 1.10 plugin in IntelliJ IDEA using the Maven project

Apache Nutch is a quite old project that uses Apache Ant to build itself. I wanted to develop a plugin for filtering the content of parsed pages, and because I am an IntelliJ IDEA user, I wanted to do it in this IDE. This how-to helps you set up an IDEA project so you can develop and debug a Nutch (parse) plugin. I must mention Emir Dizdarevic's great article Precise data extraction with Apache Nutch, which helped me a lot.

Part of the article is a template project: https://github.com/vkuzel/Nutch-Plugin-Development-Template

Overview

When started from IDEA, the project is first built by Maven and then deployed to a Nutch binary installation. A Nutch task (in this case parse) is then started, and Nutch itself runs the plugin. IDEA attaches a debugger to the Nutch process so you can debug the code.

Maven project

Apache Nutch is present in the Maven repository but unfortunately contains some weird dependencies. To compile the plugin, you need to add a dependency on Nutch (org.apache.nutch:nutch, version 1.10) and on Hadoop (org.apache.hadoop:hadoop-core, version 1.2.0; Hadoop is a core library of Nutch). To compile the project, you need to exclude the org.apache.cxf dependency from the Nutch library because Maven cannot resolve it. By the way, I found that different versions of Nutch need different libraries excluded.
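
The dependency section of pom.xml might look like this (a sketch; the wildcard exclusion form requires Maven 3):

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.nutch</groupId>
        <artifactId>nutch</artifactId>
        <version>1.10</version>
        <exclusions>
            <!-- Maven cannot resolve this transitive dependency -->
            <exclusion>
                <groupId>org.apache.cxf</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.0</version>
    </dependency>
</dependencies>
```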

To allow Nutch to run the plugin, it has to be built and deployed to the Nutch installation directory during every debug session. This is managed by an external shell script, deploy_plugin_to_nutch_for_debug.sh, which is executed by Maven in the install phase of the build process. For this task I incorporated the exec-maven-plugin plugin into the pom.xml file.
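
The plugin configuration could be sketched like this (the version number and the script location are assumptions):

```xml
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <version>1.4.0</version>
    <executions>
        <execution>
            <!-- run the deploy script whenever "mvn install" is executed -->
            <phase>install</phase>
            <goals>
                <goal>exec</goal>
            </goals>
            <configuration>
                <executable>${basedir}/deploy_plugin_to_nutch_for_debug.sh</executable>
            </configuration>
        </execution>
    </executions>
</plugin>
```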

Nutch installation

Since this article covers development of a Nutch plugin, there's no need to have the complete Nutch source code. To run the plugin, you need a properly configured Nutch binary installation, which can be downloaded from its official site. In the template project there's an empty directory nutch-1.10 where Nutch should be copied. The only necessary change to the Nutch configuration (apart from the default installation process) is to add the plugin to the plugin.includes directive so Nutch can recognize it.

There's also an archive test_data.zip with a pre-downloaded page that can be used to test the parser plugin. This archive is extracted to the nutch-1.10/test_data directory so the plugin always works with the same data.

Project (debug) configuration

Because the project uses an external application (Nutch), a custom application run/debug configuration is needed in IDEA. Add a new debug configuration with the following parameters:

  • Main class: org.apache.nutch.parse.ParseSegment. Nutch is started directly by calling its class, not by the usual shell script located in the bin directory of its installation.
  • Program arguments: test_data/crawl/segments/20151010172800. This is the path to the test data extracted from the test_data.zip archive. The test data contains one HTML page, just enough to test one pass through the plugin.
  • Working directory: nutch-1.10. Besides this working directory, it is also necessary to add Nutch's conf and lib directories to the classpath. Do it by adding them in the Project Settings -> Modules -> Dependencies menu.
  • Before launch, run the Maven goal clean install. Before every execution, the module has to be built and deployed to the Nutch installation, which is why it's necessary to execute the install goal of Maven's build process.

Now, by running the debug configuration, you should be able to debug the plugin.

2015-08-27

Creating an offline fallback page for an AngularJS web application using ApplicationCache

I needed to create a page that is displayed to a user who has opened my application in the past but currently has no internet connection, or who is working with the application when the connection is suddenly lost.

The client opens the application but has no connection available

For this purpose I decided to use ApplicationCache. I wanted the user to be redirected to an offline.html page if the index.html page cannot be loaded because the connection is lost. This means I couldn't put the manifest attribute on the index.html page, because a page with a manifest is automatically cached, and I needed only the offline.html page to be cached. So I placed the manifest on offline.html instead and invoked it by placing an object element in index.html.
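
The trick in index.html can be sketched like this (file names follow the text; offline.html itself carries the manifest attribute on its html element):

```html
<!-- index.html: no manifest attribute here, so index.html is not cached.
     The invisible object element loads offline.html, whose manifest
     (declared as <html manifest="..."> in offline.html) is then processed
     and offline.html gets cached. -->
<object data="offline.html" type="text/html" width="0" height="0"></object>
```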

Then I defined the fallback page in the ApplicationCache manifest file.
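
A minimal manifest could look like this; the manifest file name is an assumption, only the FALLBACK section is essential here:

```
CACHE MANIFEST
# offline.appcache -- cache the offline page, serve it as a fallback
# for any URL that cannot be fetched

CACHE:
offline.html

NETWORK:
*

FALLBACK:
/ offline.html
```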

From now on, if the user tries to open the application without an internet connection, or while the server is down, the offline.html page will be displayed instead.

The client is working with the application and the connection is suddenly lost

In this state, all requests to the server are done via XHR, so we just need to intercept responses and, if an error occurs because the connection is lost, redirect the client to the offline page. An interceptor can be added via $httpProvider.
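
A sketch of such an interceptor follows; the module name "app" and the offline page location are assumptions. XHR reports status 0 when the request never got a response, which is the usual symptom of a lost connection.

```javascript
// Interceptor factory; kept framework-free so it is easy to test.
function offlineInterceptor($q, $window) {
    return {
        responseError: function (rejection) {
            // status <= 0 means the XHR never received a response,
            // i.e. the connection to the server was lost
            if (rejection.status <= 0) {
                $window.location.href = 'offline.html';
            }
            return $q.reject(rejection);
        }
    };
}

// Registration in the AngularJS application:
if (typeof angular !== 'undefined') {
    angular.module('app').config(['$httpProvider', function ($httpProvider) {
        $httpProvider.interceptors.push(['$q', '$window', offlineInterceptor]);
    }]);
}
```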

2015-08-15

How to connect Apache Solr 5.2.1 with Apache Nutch 1.10 and build concordancer on top of this stack

Apache Solr is a search server based on the Apache Lucene search library that allows you to index and search text content. This is a great base for a concordancer service. Solr itself cannot gather data from the internet; Apache Nutch was created to handle this job. This article describes the basic steps of connecting Nutch to Solr and configuring the concordancer service.

This text covers the configuration of Solr 5.2.1 and Nutch 1.10 on a Linux/Unix/OS X operating system. It's possible that the configuration will change in future releases.

First, download the binary versions of Solr and Nutch from the projects' websites. Then unpack both projects into one directory, let's say concordancer.

Go to the Solr directory (solr-5.2.1) and start it by calling

bin/solr start

Then create a search core named "concordancer".

bin/solr create -c concordancer

A core is a database of indexed data plus the configuration of how to perform searches on that data. For example, you can have several cores: one for searching intranet sites, another for searching internet websites.

The newly created core concordancer uses a default configuration that needs to be changed. Without these changes, Nutch can't interoperate with Solr. Open the configuration file server/solr/concordancer/conf/solrconfig.xml.

Almost at the end of the file you can see the directive <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">. This tells Solr that when indexing data, it should define the index structure (de facto the database structure for indexed data) dynamically. This means that when a new data structure is indexed (like a website with all its metadata), new fields are added to the index structure described in the generated managed-schema file. We don't need this feature, so remove the <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> element and its content. Also remove the <schemaFactory class="ManagedIndexSchemaFactory"> element and its content. Then find <initParams path="/update/**">; it should contain the add-unknown-fields-to-the-schema parameter, so remove this directive too.

Search for all occurrences of the string _text_ in the config file and change them to text. The new schema file introduced below does not contain a field named _text_, which is why we have to change the name.

Now remove the generated managed-schema file from the configuration directory and replace it by the schema.xml file prepared by the Nutch developers. This means copying apache-nutch-1.10/conf/schema.xml to the solr-5.2.1/server/solr/concordancer/conf directory.

Open the copied schema.xml and find the directive <field name="content" type="text" stored="false" indexed="true"/>. Change the stored parameter from false to true. This means you don't want to just index the page's content but also store its text for concordancer purposes.

Find the directive <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> and delete it. Then close the file and restart Solr by calling:

bin/solr restart

Let's turn our attention to crawling websites with Nutch. Go to the apache-nutch-1.10 directory and try to run the program without parameters:

bin/nutch

You should see output like this:

If there's a problem, please see the Nutch tutorial for more details.

Open the Nutch configuration file conf/nutch-site.xml and add the configuration of the Nutch User-agent HTTP header:
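
The http.agent.name property is the one Nutch requires before it will crawl; the value below is just an example, choose your own crawler name:

```xml
<property>
    <name>http.agent.name</name>
    <!-- the crawler name is an example; pick any identifier you like -->
    <value>MyConcordancerSpider</value>
</property>
```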

Then create a directory that will contain files with URLs to be crawled. In this directory, create a file seed.txt with one URL per line. You should end up with a structure like this:

apache-nutch-1.10/urls/seed.txt
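
From the apache-nutch-1.10 directory, this amounts to (the URL is just an example):

```shell
# create the seed directory and a seed file with one URL per line
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
```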

Then execute crawling and indexing of the pages by calling the crawl command.

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/concordancer urls/ crawl/ 1

Troubleshooting: If an error appears while crawling, check the log files solr-5.2.1/server/logs/solr.log and apache-nutch-1.10/logs/hadoop.log. Solr's schema.xml file actually contains some problems, like missing field types, etc. Solving these problems is up to you.

Open Solr's query page and try to search the newly indexed data.

Let's configure Solr's concordancer ability. Open the configuration file server/solr/concordancer/conf/solrconfig.xml and locate the configuration directives of the highlighter (the tag <searchComponent class="solr.HighlightComponent" name="highlight">). Change the default boundary scanner from the simple boundary scanner to the break iterator, and configure the break iterator's bs.type to SENTENCE. The boundary scanner finds the boundaries of sentences, so from now on the concordancer can display whole sentences.
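
Inside the highlight searchComponent, the change might look like this (a sketch; the stock solrconfig.xml already declares both boundary scanners, so the edit is moving default="true" to the break iterator and setting the type):

```xml
<!-- make the break iterator the default boundary scanner and let it
     split highlighted snippets on sentence boundaries -->
<boundaryScanner name="breakIterator" default="true"
                 class="solr.highlight.BreakIteratorBoundaryScanner">
    <lst name="defaults">
        <str name="hl.bs.type">SENTENCE</str>
    </lst>
</boundaryScanner>
```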

Finally, we need to modify a request handler to display the found sentences. Go to the desired request handler and add the highlighter configuration. I chose to modify the /query handler.
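
The resulting handler could be sketched as follows; the non-highlighting defaults mirror the stock /query handler, and the highlighted field name content comes from the Nutch schema:

```xml
<requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="wt">json</str>
        <str name="indent">true</str>
        <str name="df">text</str>
        <!-- enable highlighting on the stored page content so whole
             sentences around the match are returned -->
        <str name="hl">true</str>
        <str name="hl.fl">content</str>
    </lst>
</requestHandler>
```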

Open the Solr admin at http://localhost:8983/solr/#/concordancer/query and fire some queries at the /query request handler. You should see a result similar to this: