Tech stuff: Devoxx 2009: Solr Power with Lucene

18/11/2009, Erik Hatcher

Lucene core

Apache Solr properties:

search server
based on Apache Lucene
- → Lucene exposed over http
- spell checking
- highlighting
extensible
scalable
- caching
- replication (master/slave)
- distributed search → sharding
multiple inputs
version 1.4

Using Solr

setup:
- solrconfig.xml:
  - cache settings
  - Lucene indexing parameters
API:
- RequestHandlers:
  - mini-servlets,
  - flexible responses:
    - http GET/POST
    - JSON
    - SolrJ
    - ruby, php, …
    - content streams (must be shielded)
- indexing / deleting a document
  - through api: xml document with commands
  - POST or GET with request parameters
- other actions:
  - commit / rollback: batching document indexing
  - optimize
- search request: simple GET, with optional parameters
  - debug
  - lucene explanation
  - pagination: start / rows
  - score: lucene score
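The update and search calls above can be sketched in a few lines of Python. This is only an illustration: the base URL and the field names (`id`, `title`) are assumptions, and nothing is sent over the network. The `<add>` message and a `<commit/>` would normally be POSTed to `/update`, while the search is a plain GET on `/select`.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape, quoteattr

# Assumed base URL of a local Solr instance; adjust for a real deployment.
SOLR_URL = "http://localhost:8983/solr"


def add_command(doc):
    """Build an <add> update message for one document given as a dict."""
    fields = "".join(
        "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        for name, value in doc.items()
    )
    return "<add><doc>%s</doc></add>" % fields


def select_url(query, start=0, rows=10, debug=False):
    """Build a search URL with pagination, score output and optional debug info."""
    params = [("q", query), ("start", start), ("rows", rows), ("fl", "*,score")]
    if debug:
        # debugQuery adds the Lucene score explanation to the response.
        params.append(("debugQuery", "true"))
    return SOLR_URL + "/select?" + urlencode(params)


# Hypothetical document and query, for illustration only.
print(add_command({"id": "1", "title": "Solr & Lucene"}))
print("<commit/>")
print(select_url("title:solr", start=20, rows=10, debug=True))
```

Batching several `<add>` messages before a single `<commit/>` is what makes commit / rollback useful for bulk indexing.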
DataImportHandler
- import from RDBMS, xml and e-mail
- incremental indexing
- extensible
- debug console
Solr Cell: uses Lucene Tika:
- index Word, pdf, html ...
- ExtractingRequestHandler
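A rough sketch of how a rich document might be sent to the ExtractingRequestHandler. The `/update/extract` path is the mapping commonly used in the example configuration and is an assumption here, as are the base URL and the document id. The file itself would be POSTed as the request body, which this sketch leaves out.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance.
SOLR_URL = "http://localhost:8983/solr"


def extract_url(doc_id, commit=True):
    """Build the URL for posting a Word/PDF/HTML file to Solr Cell."""
    # literal.<field> sets a field value that cannot be extracted from the file.
    params = [("literal.id", doc_id)]
    if commit:
        params.append(("commit", "true"))
    return SOLR_URL + "/update/extract?" + urlencode(params)


print(extract_url("doc1"))
```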
Query parser framework with pluggable parsers:
- Lucene syntax:
  - powerful
  - but user-unfriendly syntax
  - exceptions visible to end-users
- Dismax query parser
  - simplified syntax
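The two parser options above suggest two ways of handling raw user input, sketched below. Either escape the characters that are special in the Lucene syntax before querying, or hand the input to the dismax parser. The field names in `qf` are hypothetical, and the escaping mirrors the idea of Lucene's `QueryParser.escape` rather than reproducing it exactly.

```python
import re
from urllib.parse import urlencode

# Characters with a special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\|&])')


def escape_lucene(text):
    """Backslash-escape special characters so user input parses as plain terms."""
    return _SPECIAL.sub(r"\\\1", text)


def dismax_params(user_input, fields):
    """Build request parameters that route the query through the dismax parser."""
    return urlencode([
        ("defType", "dismax"),
        ("q", user_input),
        ("qf", " ".join(fields)),  # fields to search, optionally boosted
    ])


# An unbalanced parenthesis would normally raise a parse exception.
print(escape_lucene("title:(solr"))
print(dismax_params("solr power", ["title^2", "body"]))
```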

Advanced Solr: Search Components

standard: query, facet, mlt, highlight, stats, debug
others: elevation, clustering, term, term vector
faceting
- counts documents per subset within the results
- group 'facets' of a document (like a category field)
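To make the idea of facet counts concrete, here is a small in-memory sketch. It is not how Solr computes facets internally; it only shows the result shape, namely a count per field value over the matched documents. The documents and the `category` field are made up. In Solr itself faceting is requested with parameters such as `facet=true` and `facet.field`.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many result documents carry each value of `field`."""
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        # A field may be single-valued or multi-valued.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is not None:
                counts[v] += 1
    return counts.most_common()


# Hypothetical search results.
results = [
    {"id": "1", "category": "search"},
    {"id": "2", "category": ["search", "java"]},
    {"id": "3", "category": "java"},
    {"id": "4"},
]
print(facet_counts(results, "category"))
```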
spell checking
- pluggable distance algorithms: Levenshtein or Jaro-Winkler
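As an illustration of what such a distance algorithm does, below is a minimal Levenshtein edit distance and a naive suggestion function built on it. This is a textbook sketch, not the Lucene or Solr implementation, and the word list is made up.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(word, dictionary):
    """Return the dictionary entry closest to the (possibly misspelled) word."""
    return min(dictionary, key=lambda candidate: levenshtein(word, candidate))


print(suggest("lucen", ["lucene", "solr", "tika"]))
```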
highlighting: custom prefix and suffix → response is highlighted
query elevation → elevate.xml: boost or exclude a document
clustering: grouping of documents into labeled sets
enumerate terms for a field
term vectors: term frequency, document frequency, position, offset
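The four pieces of term vector information can be illustrated with a toy tokenizer. The sketch below assumes simple lowercase word tokens, whereas real analysis in Solr depends on the field type configured in the schema.

```python
import re
from collections import defaultdict


def term_vector(text):
    """Per-term frequency, token positions and character offsets for one document."""
    vector = defaultdict(lambda: {"tf": 0, "positions": [], "offsets": []})
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        entry = vector[match.group()]
        entry["tf"] += 1
        entry["positions"].append(position)
        entry["offsets"].append((match.start(), match.end()))
    return dict(vector)


def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for text in docs if term in term_vector(text))


# Hypothetical documents.
docs = ["Solr wraps Lucene; Solr scales", "Lucene core"]
print(term_vector(docs[0])["solr"])
print(document_frequency("lucene", docs))
```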
statistics: stats.jsp (in RAM); returns xml
scaling:
- replication:
  - master is polled
  - replicant pulls Lucene index / config files
  - replicate + load balance
- distributed search: single index is too large → sharding
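A sketch of the two halves of sharding: deciding which shard a document goes to at index time, and merging the per-shard hits at query time. The hash-based routing is one possible scheme and an assumption of this example, not something prescribed by the talk. Solr does the merging itself when a request carries the `shards` parameter, so `merge_results` only illustrates the principle.

```python
import heapq
import itertools
import zlib


def shard_for(doc_id, shard_count):
    """Pick a shard for a document with a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def merge_results(per_shard_hits, rows):
    """Merge per-shard hit lists into the overall top `rows` hits.

    Each hit is a (score, doc_id) tuple and every input list must already
    be sorted by descending score.
    """
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(itertools.islice(merged, rows))


# Hypothetical hits from two shards.
shard_a = [(0.9, "a"), (0.2, "c")]
shard_b = [(0.5, "b")]
print(merge_results([shard_a, shard_b], rows=2))
```

A stable hash matters here: the same document id must always land on the same shard, otherwise updates and deletes would leave duplicates behind.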

Starting with Solr

agile, iterative process works best:
- basic schema
- bring in data
- check requirement gaps
- adjust solr
