Václav Kužel's tech blog: August 2015

Creating offline fallback page for the AngularJS web application using ApplicationCache

I needed to create a page that is displayed to a user who has opened my application before but currently has no internet connection, or whose connection is suddenly lost while working with the application.

The client opens the application but has no connection available

For this purpose I decided to use ApplicationCache. I wanted the user to be redirected to the offline.html page if index.html cannot be loaded because the connection is lost. This meant I couldn't put the manifest attribute on the index.html page, because a page that references a manifest is cached automatically, and I needed only offline.html to be cached. So I referenced the manifest from offline.html instead and loaded that page by placing an object element in index.html.

Then I defined the fallback page in the ApplicationCache manifest file.
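
A minimal sketch of what such a manifest could look like, written from the shell. The file name offline.appcache and the fallback namespace / are assumptions for illustration, not details of the original setup.

```shell
# Hypothetical file name; offline.html would reference it as
# <html manifest="offline.appcache">.
MANIFEST="${MANIFEST:-offline.appcache}"

cat > "$MANIFEST" <<'EOF'
CACHE MANIFEST
# v1 (changing this comment makes browsers re-download the cache)

# Everything not covered by a fallback is fetched from the network.
NETWORK:
*

# Any URL under / that cannot be loaded falls back to the cached offline page.
FALLBACK:
/ /offline.html
EOF
```

The FALLBACK section maps a URL prefix to the page served when a request under that prefix fails, which is the behaviour described above.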

From now on, if the user tries to open the application without an internet connection, or while the server is down, the offline.html page is displayed instead.

The client is working with the application and the connection is suddenly lost

In this state the only requests to the server are made via XHR, so we just need to intercept the responses and, if an error occurs because the connection is lost, redirect the client to the offline page. An interceptor can be registered through $httpProvider.

How to connect Apache Solr 5.2.1 with Apache Nutch 1.10 and build concordancer on top of this stack

Apache Solr is a search server based on the Apache Lucene search library that lets you index and search text content. This is a great base for a concordancer service. Solr itself cannot gather data from the internet; Apache Nutch was created to handle this job. This article describes the basic steps of connecting Nutch to Solr and configuring a concordancer service.

This text covers the configuration for Solr 5.2.1 and Nutch 1.10 on Linux/Unix/OS X. The configuration may change in future releases.

First, download the binary versions of Solr and Nutch from the projects' websites. Then unpack both into one directory, let's say concordancer.

Go to the Solr directory (solr-5.2.1) and start Solr by calling

bin/solr start

Then create a search core named "concordancer".

bin/solr create -c concordancer

A core is the database of indexed data together with the configuration of how to search that data. For example, you can have several cores: one for searching intranet sites, another for searching internet websites.

The newly created core concordancer uses a default configuration that needs to be changed. Without these changes Nutch can't interoperate with Solr. Open the configuration file server/solr/concordancer/conf/solrconfig.xml.

Almost at the end of the file you can see the directive <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">. It tells Solr to define the index structure (de facto the database structure for indexed data) dynamically while indexing: if data with a new structure is indexed (like a website with all its metadata), new fields are added to the index structure described in the generated managed-schema file. We don't need this feature, so remove <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> and its content. Also remove <schemaFactory class="ManagedIndexSchemaFactory"> and its content. Then find <initParams path="/update/**">; it should contain the add-unknown-fields-to-the-schema parameter, so remove this directive too.

Search for all occurrences of the string _text_ in the config file and change them to text. The schema file installed in the next step does not contain a field named _text_, which is why we have to change the name.
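
The replacement can also be done with sed instead of an editor. A minimal sketch, assuming _text_ appears in the file only as the field name; rename_text_field is a hypothetical helper name, and a .bak copy of the original file is kept next to it.

```shell
# Replace every occurrence of the string _text_ with text in a config file.
# The unmodified file is preserved with a .bak suffix.
rename_text_field() {
  conf="$1"
  sed -i.bak 's/_text_/text/g' "$conf"
}

# Usage, from the solr-5.2.1 directory:
# rename_text_field server/solr/concordancer/conf/solrconfig.xml
```

Comparing the .bak file with the result (for example with diff) shows exactly which lines were touched.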

Now remove the generated file managed-schema from the configuration directory and replace it with the schema.xml file prepared by the Nutch developers. This means copying apache-nutch-1.10/conf/schema.xml to the solr-5.2.1/server/solr/concordancer/conf directory.

Open the copied schema.xml and find the directive <field name="content" type="text" stored="false" indexed="true"/>. Change the stored parameter from false to true. This means that you want not only to index the page's content but also to store its text for concordancer purposes.

Find the directive <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> and delete it. Then close the file and restart Solr by calling:

bin/solr restart
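
The two schema.xml edits described above can also be scripted and run before the restart. A minimal sketch, assuming each directive sits on a single line in the copied file; patch_nutch_schema is a hypothetical helper name.

```shell
# Apply both schema.xml changes in one pass:
#   1. store the text of the content field (stored="false" -> stored="true")
#   2. delete the line declaring solr.EnglishPorterFilterFactory
# The unmodified file is preserved with a .bak suffix.
patch_nutch_schema() {
  schema="$1"
  sed -i.bak \
    -e '/<field name="content"/s/stored="false"/stored="true"/' \
    -e '/solr\.EnglishPorterFilterFactory/d' \
    "$schema"
}

# Usage, from the solr-5.2.1 directory:
# patch_nutch_schema server/solr/concordancer/conf/schema.xml
```

The first expression is restricted to the line that declares the content field, so the stored attribute of other fields is left alone.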

Let's turn our attention to crawling websites with Nutch. Go to the directory apache-nutch-1.10 and try to run the program without parameters:

bin/nutch

You should see Nutch's usage message listing the available commands.

If there is a problem, see the Nutch tutorial for more details.

Open the Nutch configuration file conf/nutch-site.xml and configure the name Nutch sends in its User-Agent HTTP header (the http.agent.name property).
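
A minimal sketch of such a configuration, written from the shell inside the apache-nutch-1.10 directory. The agent name ConcordancerCrawler is a made-up example value. Note that this overwrites the target file, so merge the property by hand if your nutch-site.xml already contains other settings.

```shell
# Target file; defaults to the path used in this article.
NUTCH_SITE="${NUTCH_SITE:-conf/nutch-site.xml}"
mkdir -p "$(dirname "$NUTCH_SITE")"

# Write a nutch-site.xml that only sets the crawler's agent name.
# "ConcordancerCrawler" is a placeholder; pick a name identifying your crawler.
cat > "$NUTCH_SITE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ConcordancerCrawler</value>
  </property>
</configuration>
EOF
```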

Then create a directory that will contain files with the URLs to be crawled. In this directory create the file seed.txt with one URL per line. You should end up with a structure like this:

apache-nutch-1.10/urls/seed.txt
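
From inside the apache-nutch-1.10 directory, the seed file can be created as follows. The seed URL is only an example; replace it with the sites you want to crawl.

```shell
# Create the directory that holds the seed lists.
mkdir -p urls

# One URL per line; this URL is just an example seed.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the result.
cat urls/seed.txt
```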

Then crawl and index the pages by calling the crawl command.

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/concordancer urls/ crawl/ 1

Troubleshooting: If an error appears while crawling, check the log files solr-5.2.1/server/logs/solr.log and apache-nutch-1.10/logs/hadoop.log. The schema.xml file actually contains some problems, such as missing field types. Solving these problems is up to you.

Open Solr's query page and try searching the newly indexed data.
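
The same check can be run from the command line. A minimal sketch, assuming Solr listens on localhost:8983 and that the core's /select handler is available; solr_select_url and solr_search are hypothetical helper names, and the query term apache is only an example.

```shell
# Base URL of the core created earlier.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/concordancer}"

# Build the URL for a query against the core's /select handler.
# The query must already be URL-safe (no spaces or special characters).
solr_select_url() {
  printf '%s/select?q=%s&wt=json&indent=true\n' "$SOLR_CORE_URL" "$1"
}

# Run the query with curl and print the JSON response.
solr_search() {
  curl -s "$(solr_select_url "$1")"
}

# Usage: search the content field for the word "apache".
# solr_search 'content:apache'
```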

Let's configure Solr's concordancer ability. Open the configuration file server/solr/concordancer/conf/solrconfig.xml and locate the highlighter configuration (the tag <searchComponent class="solr.HighlightComponent" name="highlight">). Change the default boundary scanner from the simple boundary scanner to the break iterator and set the break iterator's hl.bs.type to SENTENCE. The boundary scanner finds sentence boundaries, so from now on the concordancer can display whole sentences.

Finally, we need to modify a request handler to display the found sentences. Go to the required request handler and add the highlighter configuration. I chose to modify the /query handler so that its configuration looks like the following:

Open the Solr admin at http://localhost:8983/solr/#/concordancer/query and fire some queries at the /query request handler. You should see a result similar to this:

2015-08-27

Creating offline fallback page for the AngularJs web application using ApplicationCache

2015-08-15

How to connect Apache Solr 5.2.1 with Apache Nutch 1.10 and build concordancer on top of this stack