Apache Solr is a search server built on the Apache Lucene search library that lets you index and search text content, which makes it a good base for a concordancer service. Solr itself cannot gather data from the internet; Apache Nutch was created to handle that job. This article describes the basic steps of connecting Nutch to Solr and configuring a concordancer service.
This text covers the configuration for Solr 5.2.1 and Nutch 1.10 on Linux, Unix, and OS X. The configuration may change in future releases.
First, download the binary distributions of Solr and Nutch from the projects' websites. Then unpack both into one directory, let's say concordancer.
Go to the Solr directory (solr-5.2.1) and start Solr by calling:
bin/solr start
Then create a search core named "concordancer":
bin/solr create -c concordancer
A core is a database of indexed data together with the configuration of how to search it. For example, you can have several cores: one for searching intranet sites and another for searching internet websites.
The newly created concordancer core uses a default configuration that needs to be changed. Without these changes Nutch can't interoperate with Solr. Open the configuration file server/solr/concordancer/conf/solrconfig.xml.
Almost at the end of the file you can see the directive <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">. It tells Solr to define the index structure (de facto the database structure for indexed data) dynamically while indexing. In other words, when data with a new structure is indexed (such as a website with all its metadata), new fields are added to the index structure described in the generated managed-schema file. We don't need this feature, so remove <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> and its content. Also remove <schemaFactory class="ManagedIndexSchemaFactory"> and its content. Then find <initParams path="/update/**">; it should contain the add-unknown-fields-to-the-schema parameter, so remove this directive too.
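After editing, it is easy to miss a leftover reference. A minimal sketch of a check, assuming a POSIX shell with grep; the function name find_schemaless_leftovers is made up for this example and is not part of Solr:

```shell
# Print (with line numbers) every line of a solrconfig.xml that still mentions
# the schemaless setup. find_schemaless_leftovers is a hypothetical helper name.
# grep exits with a non-zero status when nothing is found, which here means
# the file is clean.
find_schemaless_leftovers() {
  grep -n -e 'add-unknown-fields-to-the-schema' -e 'ManagedIndexSchemaFactory' "$1"
}
```

Called from the Solr directory as find_schemaless_leftovers server/solr/concordancer/conf/solrconfig.xml, it should print nothing once all three directives are gone.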
Search for all occurrences of the string _text_ in the config file and change them to text. The schema file installed in the next step does not contain a field named _text_, which is why the name has to change.
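The rename can also be done without an editor. A minimal sketch, assuming a POSIX shell with sed; the function name rename_text_field is made up for this example, and the helper keeps a .bak copy of the original file:

```shell
# Replace every occurrence of the string _text_ with text in a config file.
# rename_text_field is a hypothetical helper name, not part of Solr.
rename_text_field() {
  config_file="$1"
  cp "$config_file" "$config_file.bak" || return 1           # keep a backup
  sed 's/_text_/text/g' "$config_file.bak" > "$config_file"   # write the edited copy back
}
```

From the Solr directory it would be called as rename_text_field server/solr/concordancer/conf/solrconfig.xml.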
Now remove the generated managed-schema file from the configuration directory and replace it with the schema.xml file prepared by the Nutch developers. This means copying apache-nutch-1.10/conf/schema.xml to the solr-5.2.1/server/solr/concordancer/conf directory.
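The swap comes down to one rm and one cp. A minimal sketch, assuming a POSIX shell; install_nutch_schema is a made-up helper name that takes the Nutch conf directory and the core's conf directory as arguments:

```shell
# Swap the generated managed-schema for the schema.xml shipped with Nutch.
# install_nutch_schema is a hypothetical helper name.
#   $1 = Nutch conf directory, $2 = conf directory of the Solr core
install_nutch_schema() {
  nutch_conf="$1"
  core_conf="$2"
  rm -f "$core_conf/managed-schema"           # drop the generated schema
  cp "$nutch_conf/schema.xml" "$core_conf/"   # use the schema prepared by Nutch
}
```

From the concordancer directory: install_nutch_schema apache-nutch-1.10/conf solr-5.2.1/server/solr/concordancer/conf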
Open the copied schema.xml and find the directive <field name="content" type="text" stored="false" indexed="true"/>. Change the stored parameter from false to true. This means you want not only to index the page's content but also to store its text for concordancer purposes.
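The same change as a one-line sed edit. This is only a sketch: it assumes the content field is declared on a single line with stored="false" written in straight double quotes, and store_content_field is a made-up helper name:

```shell
# Flip stored="false" to stored="true" on the content field only.
# store_content_field is a hypothetical helper name; other fields are untouched.
store_content_field() {
  schema_file="$1"
  cp "$schema_file" "$schema_file.bak" || return 1   # keep a backup
  sed '/<field name="content"/s/stored="false"/stored="true"/' \
    "$schema_file.bak" > "$schema_file"
}
```

From the Solr directory: store_content_field server/solr/concordancer/conf/schema.xml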
Find the directive <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> and delete it. Then close the file and restart Solr by calling:
bin/solr restart
Now let's turn to crawling websites with Nutch. Go to the apache-nutch-1.10 directory and try running the program without parameters:
bin/nutch
You should see a usage message that lists the available Nutch commands. If something goes wrong, see the Nutch tutorial for more details.
Open the Nutch configuration file conf/nutch-site.xml and configure the name Nutch sends in its User-Agent HTTP header:
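A minimal sketch of such a configuration, assuming the http.agent.name property described in the Nutch tutorial. The agent name ConcordancerSpider is a placeholder to replace with your own. It is written as a shell heredoc to be run from the apache-nutch-1.10 directory; note that it overwrites conf/nutch-site.xml after saving a .bak copy:

```shell
# Write a minimal nutch-site.xml that sets the crawler's User-Agent name.
# ConcordancerSpider is a placeholder value, not a name taken from the article.
mkdir -p conf                                      # no-op inside a Nutch install
if [ -f conf/nutch-site.xml ]; then
  cp conf/nutch-site.xml conf/nutch-site.xml.bak   # keep the previous version
fi
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ConcordancerSpider</value>
  </property>
</configuration>
EOF
```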
Then create a directory to hold the files with URLs to be crawled. In this directory, create a file seed.txt with one URL per line. You should end up with a structure like this:
apache-nutch-1.10/urls/seed.txt
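Created from the shell, inside the apache-nutch-1.10 directory, this could look as follows. The URL is only a placeholder; put the sites you actually want to crawl there, one per line:

```shell
# Create the seed directory and a seed list with a single placeholder URL.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt
```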
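Then crawl and index the pages by calling the crawl command. The -i option sends the crawled pages to the Solr core given by solr.server.url, urls/ is the seed directory, crawl/ is where Nutch keeps its crawl data, and 1 is the number of crawl rounds: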
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/concordancer urls/ crawl/ 1
Troubleshooting: If errors appear while crawling, check the log files solr-5.2.1/server/logs/solr.log and apache-nutch-1.10/logs/hadoop.log. The schema.xml file copied from Nutch does contain some problems, such as missing field types; solving them is up to you.
Open Solr's query page and try searching the newly indexed data.
Now let's configure Solr's concordancer ability. Open the configuration file server/solr/concordancer/conf/solrconfig.xml and locate the highlighter configuration (the tag <searchComponent class="solr.HighlightComponent" name="highlight">). Change the default boundary scanner from the simple boundary scanner to the break iterator, and set the break iterator's hl.bs.type to SENTENCE. The boundary scanner finds sentence boundaries, so from now on the concordancer can display whole sentences.
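The same highlighting options can also be passed per request, which is handy for experimenting. This is a different technique from editing solrconfig.xml, and only a sketch: highlight_url is a made-up helper name, it targets the stock /select handler, and it assumes the default solrconfig.xml still defines a boundary scanner named breakIterator. Boundary scanners belong to the FastVectorHighlighter, so the sketch requests it explicitly; whether it actually takes effect depends on how the content field is indexed, which is not checked here.

```shell
# Build a query URL that asks for sentence-sized highlights at request time.
# highlight_url is a hypothetical helper name; hl.fl=content assumes the
# content field from the Nutch schema. The search term must be URL-safe.
highlight_url() {
  base='http://localhost:8983/solr/concordancer/select'
  params="q=content:$1&wt=json&indent=true"
  params="$params&hl=true&hl.fl=content"
  params="$params&hl.useFastVectorHighlighter=true"
  params="$params&hl.boundaryScanner=breakIterator&hl.bs.type=SENTENCE"
  printf '%s?%s\n' "$base" "$params"
}
```

With Solr running, curl "$(highlight_url search)" sends the request; the highlighting section of the response holds the snippets.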
Finally, we need to modify a request handler to display the sentences found. Go to the request handler you want to use and add the highlighter configuration. I chose to modify the /query handler, so its configuration looks like the following:
Open the Solr admin at http://localhost:8983/solr/#/concordancer/query and run some queries against the /query request handler. You should see a result similar to this: