Presentation Solr Power with Lucene (2/2)

Solr powers the search of many enterprise applications, and long before Solr shone in enterprise search, Lucene, Solr's core library, was and still is used to drive search in applications around the world. In this university lecture we'll cover the typical workflow of getting data into Solr, configuring Solr's indexing and search processes, and building a search client user interface. Common indexing options will be explored: database (JDBC) indexing, RSS/Atom and other XML data sources, CSV-formatted data, rich text (Word, PDF, HTML, etc.) documents, and indexing through Solr's various native language APIs. The discussion of Solr's configurability and usage options will include query parsing, faceting, relevancy ranking, spell checking, highlighting, clustering and more-like-this.


PDF: slides.pdf


Solr Power with Lucene

by Erik Hatcher, Lucid Imagination



Query Parser

• Controlled by defType parameter
  • &defType=lucene (actually a Solr extension of Lucene’s QueryParser)
  • &defType=dismax
• Local {!...} override syntax

Solr Query Parser

• queryparsersyntax.html + Solr extensions
• Kitchen sink parser, includes advanced user-unfriendly syntax
• Syntax errors throw parse exceptions back to client
• Example: title:ipod* AND price:[0 TO 100]

Dismax Query Parser

• Simplified syntax: loose text “quote phrases” -prohibited +required
• Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)
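The dismax parameters above travel as ordinary request parameters, so a request URL can be assembled with nothing but the JDK. The following is a minimal sketch, not SolrJ's API: the class name, the field names (name, features) and the boost value are assumptions chosen for illustration only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class DismaxUrlSketch {

    // URL-encodes one key or value (spaces become '+', '^' becomes %5E, and so on).
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds the query string for a dismax request.
    // The qf/pf field names and the ^2 boost are hypothetical examples.
    static String buildQuery(String userInput) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "dismax");      // select the dismax parser
        params.put("q", userInput);           // loose user text, passed through as typed
        params.put("qf", "name^2 features");  // query fields, with a per-field boost
        params.put("pf", "name");             // phrase field

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("http://localhost:8983/solr/select?" + buildQuery("ipod -case"));
    }
}
```

Keeping qf and pf on the server side (for example as request handler defaults) would leave only q for the client to supply; the sketch sends everything explicitly so the moving parts stay visible.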

Raw Query Parser

• q={!raw f=id}some:id[value
• new TermQuery(new Term("id", "some:id[value"))
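The raw parser sidesteps query syntax entirely. When a value such as some:id[value has to pass through the standard parser instead, the usual alternative is to backslash-escape its special characters. The sketch below is a stand-alone approximation of that idea; the class name and the exact character list are assumptions, and a real client would more likely rely on its library's own escaping helper.

```java
public class QueryEscapeSketch {

    // Characters treated as query syntax in this sketch (an assumed, conservative list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;";

    // Prefixes every special or whitespace character with a backslash.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field-qualified term query with the value escaped.
        System.out.println("id:" + escape("some:id[value"));
    }
}
```

Escaping keeps the standard parser available for the rest of the query, whereas {!raw} is the simpler choice when the whole query is a single exact term.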

Searching with SolrJ

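    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // Connect to a local Solr instance over HTTP
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Ask for the top 3 matches, returning all stored fields plus the score
    SolrQuery params = new SolrQuery("author:John");
    params.setFields("*,score");
    params.setRows(3);

    QueryResponse response = server.query(params);
    for (SolrDocument document : response.getResults()) {
      System.out.println("Doc: " + document);
    }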

Searching with Ruby

    # assumes the solr-ruby library
    conn = Solr::Connection.new('http://localhost:8983/solr')
    conn.query('my query') do |hit|
      puts hit.inspect
    end

Searching Performance

• Optimized index
• Non-compound index format
• Cache tuning (solrconfig.xml)
• Searcher warming
• Query complexity: many terms, wildcard terms
• Faceting concerns
• Searcher replication

Search Components


Built-in search components

• Standard: query, facet, mlt, highlight, stats, debug
• Others: elevation, clustering, term, term vector


Faceting

• Counts per subset within results
• Facet on: field terms, queries, date ranges
• &facet=on &facet.field=cat &facet.query=price:[0 TO 100]
• SimpleFacetParameters
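To make "counts per subset within results" concrete, here is a small in-memory illustration of field faceting. It is only a conceptual sketch and not how Solr computes facets internally; the class name, the sample documents and the cat values are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetCountSketch {

    // Counts how many result documents carry each value of the given field.
    static Map<String, Integer> facetCounts(List<Map<String, String>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted by facet value
        for (Map<String, String> doc : results) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Hypothetical result set: three documents, each with a single "cat" field.
    static List<Map<String, String>> sampleResults() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(Collections.singletonMap("cat", "electronics"));
        docs.add(Collections.singletonMap("cat", "music"));
        docs.add(Collections.singletonMap("cat", "electronics"));
        return docs;
    }

    public static void main(String[] args) {
        System.out.println(facetCounts(sampleResults(), "cat"));
    }
}
```

The point of the sketch is that facet counts are computed over the documents matching the query, so they change as the query or its filters change.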

Spell checking

• http://localhost:8983/solr/spell?q=epod&spellcheck=on&
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenshtein and JaroWinkler
• SpellCheckComponent
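As a rough illustration of what a distance algorithm measures, the sketch below implements the classic Levenshtein edit distance plus one simple way of normalizing it into a 0..1 similarity. It is a stand-alone example with an assumed class name, not the spell checker's own implementation, and the normalization shown is only one reasonable choice.

```java
public class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance, keeping two rows of the table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;  // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;  // the row just computed becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalizes the distance to a similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("epod", "ipod"));
        System.out.println(similarity("epod", "ipod"));
    }
}
```

With the query from the slide, "epod" is a single substitution away from "ipod", which is why it makes a plausible suggestion.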


Highlighting

• http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*
• HighlightingParameters

More Like This

• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• MoreLikeThis

Query Elevation

• http://localhost:8983/solr/elevate?q=ipod&debugQuery=true&enableElevation=true
• Configure an “elevate.xml” to boost/exclude specific documents
• QueryElevationComponent


Clustering

• Dynamic grouping of documents into labeled sets
• http://localhost:8983/solr/clustering?q=*:*&rows=10
• ClusteringComponent
• Requires additional steps to install (see documentation) with Apache Solr distro


Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

Term Vectors

• Details term vector information: term frequency, document frequency, position and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true

Additional Request Handlers



stats.jsp

• Not technically a “request handler”, outputs only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index stats such as number of documents, searcher open time
• Request handler details, number of requests and errors, average request time, average requests per second, number of pending docs, etc.


Dump

• http://localhost:8983/solr/debug/dump
• Echoes parameters, content streams, and Solr web context
• Careful: with content streams enabled, a client could retrieve the contents of any file on the server or accessible network! [Solution: disable dump request handler]


Ping

• http://localhost:8983/solr/admin/ping
• If healthcheck configured and file not available, error is reported
• Executes single configured request and reports failure or OK


Luke

• http://localhost:8983/solr/admin/luke
• Introspects Lucene index structure and schema relationships
• See an individual document: ?doc= or ?docId=
• Schema details: ?show=schema
• Admin schema browser uses Luke request handler
• See also: original Luke tool - http://


System

• http://localhost:8983/solr/admin/system
• Core info, Lucene version, JVM details, uptime, operating system info


Plugins

• http://localhost:8983/solr/admin/plugins
• Configuration details of Solr core, available query and update handlers, cache settings


Threads

• http://localhost:8983/solr/admin/threads
• JVM thread details


Properties

• http://localhost:8983/solr/admin/properties
• All JVM system properties, or single property value (?name=os.arch)


File

• http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain
• http://localhost:8983/solr/admin/file?file=/ (see fetchable directory tree)


DEMO What’s this demo about


Administration


Production

• Solr provides tools/support for:
  • caching, warming, replication
• Systems concerns:
  • CPU, Memory, Disk Space, Servlet Containers, Security, Backups, Failover

Deployment Environment

• Usual answer for CPU/Memory/Disk Space: Bigger/more is better
• Servlet Containers
  • Use whatever you know how to maintain and make perform
  • Solr has been used in a wide variety of containers: Jetty, Tomcat, Resin, WebLogic, JBoss, WebSphere, ...


Security

• Solr doesn’t secure either the documents or the communication protocols
• Use standard security practices:
  • Firewall
  • Secure ports
• Solr specifics:
  • Invariant parameters for RequestHandlers
  • Disable remoteStreaming (disabled by default)
  • Remove update handlers for read-only Solr instances?
  • Turn off “file” request handler, and perhaps other admin handlers


Replication

• Master is polled
• Replicant pulls Lucene index and optionally also Solr configuration files
• Query throughput scaling: replicate and load balance
• SolrReplication

Distributed Search

• Distribute documents to same-schema shards
• Scaling for when single index becomes too large, or a single query becomes too slow
• DistributedSearch
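The two halves of distributed search, spreading documents over shards at index time and naming those shards at query time, can be sketched as follows. This is a hedged illustration: the class name, the hash-based routing rule and the host names are assumptions, and any stable assignment of documents to shards would do equally well.

```java
import java.util.Arrays;
import java.util.List;

public class ShardSketch {

    // Picks a shard for a document by hashing its unique id (hypothetical routing rule).
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);  // always in [0, shardCount)
    }

    // Builds the comma-separated value for the shards request parameter.
    static String shardsParam(List<String> shardAddresses) {
        return String.join(",", shardAddresses);
    }

    public static void main(String[] args) {
        // Hypothetical shard addresses.
        List<String> shards = Arrays.asList("host1:8983/solr", "host2:8983/solr");
        System.out.println("doc 'a' goes to shard " + shardFor("a", shards.size()));
        System.out.println("http://host1:8983/solr/select?q=ipod&shards=" + shardsParam(shards));
    }
}
```

Routing by a hash of the unique id keeps a given document on the same shard across re-indexing, which avoids duplicates appearing in merged results.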

What’s new in Solr 1.4?

• Java-based replication
• VelocityResponseWriter (Solritas)
• AJAX-Solr
• Logging switched to SLF4J
• Rollback, since last commit
• StatsComponent
• TermVectorComponent
• Configurable Directory provider
• CharFilter
• TermsComponent
• Rich document indexing, via Tika (Solr Cell)
• Greatly improved faceting performance
• Exact/near duplicate document handling
• Support added for Lucene's omitTf
• "trie" range query support
• And a new logo!

Lucene 2.9

• IndexReader#reopen()
• Faster filter performance, by 300% in some cases
• Per-segment FieldCache
• Reusable token streams
• Faster numeric/date range queries, thanks to trie
• and tons more, see Lucene 2.9's CHANGES.txt

Performance Improvements

• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ

Feature Improvements

• Rich document indexing
• DataImportHandler enhancements
• Smoother replication
• More choices for logging
• Multi-select faceting
• Speedier range queries
• Duplicate detection
• New request handler components

Get Started!

• Where's the content? And metadata?
• Iterate (Hoss Workflow™)
  • basic schema
  • bring in data
  • requirements gap analysis
  • adjust until users are smiling


Resources

• Lucid Imagination
  • Articles, webinars, blogs, and...
  • SEARCH THE LUCENE ECOSYSTEM at:

Lucid Articles

• Grant Ingersoll
  • Getting Started with Lucene
  • Debugging Relevance Issues in Search
  • Optimizing Findability in Lucene and Solr
• Yonik Seeley
  • Faceted Search with Solr
• Erik Hatcher
  • Getting Started with Solr (includes screencast), co-authored with Jonathan Knudsen
• Sami Siren
  • Content Extraction with Tika
• Mark Miller
  • Scaling Lucene and Solr
• And more...

Lucid Blogging

• Mark Miller: "Exploring Query Parsers", "Highlighting Highlighter Thoughts", "Investigating OOM and other JVM issues", "Looking forward to new features in Solr 1.4", Lucene internals, etc.
• Erik Hatcher: "acts_as_solr with rich document indexing"
• Grant Ingersoll: "Sorting, Faceting, and Schema Design in Solr"
• And many more...

Lucid Podcasts

Interviews with:
• Doug Cutting (creator of Lucene)
• Ryan McKinley (Solr committer)
• Chris Hostetter (Solr committer)
• Andrzej Bialecki (Lucene committer, Luke creator)
• and many more...

e-book now available!


Upcoming Solr Training


Thank You!

Contact me: –