2009/11/18

Devoxx 2009: Solr Power with Lucene

18/11/2009, Erik Hatcher

Lucene core
  • full-text search library
  • concepts
    • inverted index:
      • maps each term to its documents, with positions (for proximity queries)
    • documents
    • fields
      • field-ids: e.g. category, title, name...
      • types: number, date, text...
      • unique keys: unique id per document
    • terms (aka tokens):
      • processed through filters
        • synonyms
        • stop words (ignored words)
        • stemming
    • scoring relevancy
      • term frequency
      • inverse document frequency
      • field length normalization → controls how field length and number of occurrences affect scoring
      • boost factors: favor or boost some fields (e.g. titles)
  • core
    • standalone jar
    • core index
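To make the concepts above concrete, here is a minimal sketch of an inverted index with positions, a small analysis chain (stop words, synonyms, stemming) and a relevancy score built from term frequency, inverse document frequency, field length normalization and per-field boosts. It is a toy in plain Python, not Lucene's code or its exact scoring formula; `ToyIndex`, `analyze`, the stop word list and the synonym map are all made-up names for illustration.

```python
import math
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "with"}  # illustrative stop words
SYNONYMS = {"lookup": "search"}                # illustrative synonym map


def stem(token):
    """Very crude stemmer: strips a trailing 's'. Real stemmers do far more."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text):
    """Turn raw text into terms: lowercase, drop stop words, map synonyms, stem."""
    terms = []
    for raw in text.lower().split():
        if raw in STOP_WORDS:
            continue
        terms.append(stem(SYNONYMS.get(raw, raw)))
    return terms


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # (field, term) -> {doc_id: [positions]}
        self.field_lengths = {}            # (doc_id, field) -> number of terms
        self.doc_ids = set()

    def add(self, doc_id, fields):
        """Index one document given as a dict of field name -> text."""
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            terms = analyze(text)
            self.field_lengths[(doc_id, field)] = len(terms)
            for position, term in enumerate(terms):
                self.postings[(field, term)].setdefault(doc_id, []).append(position)

    def search(self, query, boosts):
        """Score documents for the query; boosts maps field name -> boost factor."""
        scores = defaultdict(float)
        for term in analyze(query):
            for field, boost in boosts.items():
                docs = self.postings.get((field, term), {})
                if not docs:
                    continue
                # Rare terms weigh more (inverse document frequency).
                idf = 1.0 + math.log(len(self.doc_ids) / len(docs))
                for doc_id, positions in docs.items():
                    tf = math.sqrt(len(positions))  # term frequency
                    # Matches in short fields count for more (length normalization).
                    norm = 1.0 / math.sqrt(self.field_lengths[(doc_id, field)])
                    scores[doc_id] += tf * idf * norm * boost
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = ToyIndex()
index.add("1", {"title": "Solr search server", "body": "Solr exposes Lucene over http"})
index.add("2", {"title": "Lucene in action", "body": "the search library behind Solr servers"})
results = index.search("search", boosts={"title": 2.0, "body": 1.0})
```

With the title boosted, document "1" (match in a short title) ranks ahead of document "2" (match in a longer body field).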
Apache Solr properties:
  • search server
  • based on Apache Lucene
    • → Lucene exposed over http
    • spell checking
    • highlighting
  • extensible
  • scalable
    • caching
    • replication
    • master/slave distributed search → sharding
  • multiple inputs
  • version 1.4
Using Solr
  • setup:
    • solrconfig.xml:
      • cache settings
      • Lucene indexing parameters
  • API:
    • RequestHandlers:
      • mini-servlets
      • flexible responses:
        • http GET/POST
        • JSON
        • SolrJ
        • ruby, php, …
        • content streams (must be shielded)
    • indexing / deleting a document
      • through api: xml document with commands
      • POST or GET with request parameters
    • other actions:
      • commit / rollback: batching document indexing
      • optimize
    • search request: simple GET, with optional parameters
      • debug
      • lucene explanation
      • pagination: start / rows
      • score: lucene score
  • DataImportHandler
    • import from RDBMS, xml and e-mail
    • incremental indexing
    • extensible
    • debug console
  • Solr Cell: uses Lucene Tika:
    • index Word, pdf, html ...
    • ExtractingRequestHandler
  • Query parser framework with pluggable parsers:
    • Lucene syntax:
      • powerful
      • but user-unfriendly syntax
      • exceptions visible to end-users
    • Dismax query parser
      • simplified syntax
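As a rough illustration of the API notes above, the sketch below builds the XML update messages (add, delete, commit, rollback, optimize) and a search URL with pagination, score, debug and dismax parameters. It only builds strings and sends nothing. The base URL assumes a default local example install, and the helper names (`add_message`, `delete_message`, `select_url`) are made up for this example; the update messages would be POSTed to the `/update` handler.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SOLR_URL = "http://localhost:8983/solr"  # assumption: default local example install

# Commands without a payload; POST them to SOLR_URL + "/update" like the others.
COMMIT = "<commit/>"
ROLLBACK = "<rollback/>"
OPTIMIZE = "<optimize/>"


def add_message(doc):
    """Build an <add> update message for one document (dict of field -> value)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(doc_id):
    """Build a <delete> update message that removes a document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def select_url(query, start=0, rows=10, debug=False, dismax=False):
    """Build a search URL: a simple GET with optional parameters."""
    params = {
        "q": query,
        "start": start,    # pagination: offset of the first result
        "rows": rows,      # pagination: page size
        "fl": "*,score",   # return all stored fields plus the Lucene score
        "wt": "json",      # response format
    }
    if debug:
        params["debugQuery"] = "true"  # adds the Lucene score explanation
    if dismax:
        params["defType"] = "dismax"   # use the dismax parser instead of Lucene syntax
    return SOLR_URL + "/select?" + urlencode(params)


add_xml = add_message({"id": 42, "title": "Solr Power"})
url = select_url("title:solr", start=20, rows=10, debug=True, dismax=True)
```

Batching then means sending several add messages and one `COMMIT` at the end, rather than committing after every document.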
Advanced Solr: Search Components
  • standard: query, facet, mlt, highlight, stats, debug
  • others: elevation, clustering, term, term vector
  • faceting
    • counts per subset within the results
    • groups documents by a 'facet' (e.g. a category field)
  • spell checking
    • pluggable distance algorithms: Levenshtein or Jaro-Winkler
  • highlighting: custom prefix and suffix → response is highlighted
  • query elevation → elevate.xml: boost or exclude a document
  • clustering: grouping of documents into labeled sets
  • enumerate terms for a field
  • term vectors: term frequency, document frequency, position, offset
  • statistics: stats.jsp (in RAM); returns xml
  • scaling:
    • replication:
      • master is polled
      • replicant pulls Lucene index / config files
      • replicate + load balance
    • distributed search: single index is too large → sharding
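Two of the components listed above can be sketched in a few lines: faceting as counting field values within a result set, and spell checking as picking the dictionary term with the smallest Levenshtein distance. This is only a plain-Python illustration of the ideas, not how Solr implements them; the function names and sample data are invented for the example.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many documents in the result set carry each value of `field`."""
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        # A field may hold a single value or a list of values.
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary term within max_distance, or None."""
    candidates = [(levenshtein(word, term), term) for term in dictionary]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[1] if candidates else None


hits = [
    {"id": 1, "category": "book"},
    {"id": 2, "category": "book"},
    {"id": 3, "category": "dvd"},
]
facets = facet_counts(hits, "category")
suggestion = suggest("lucine", ["lucene", "solr", "tika"])
```

Here `facets` holds a count of 2 for "book" and 1 for "dvd", and `suggestion` is "lucene". A pluggable distance algorithm amounts to swapping `levenshtein` for another function with the same signature.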
Starting with Solr
  • agile, iterative process works best:
    • basic schema
    • bring in data
    • check requirement gaps
    • adjust solr
