Presentation Solr Power with Lucene (2/2)

Solr powers search in many enterprise applications, and long before Solr rose to prominence in enterprise search, Lucene, Solr's core library, was (and still is) used to drive search in applications around the world. In this university lecture we'll cover the typical workflow of getting data into Solr, configuring Solr's indexing and search processes, and building a search client user interface. Common indexing options will be explored: database (JDBC) indexing, RSS/Atom and other XML data sources, CSV-formatted data, rich-text documents (Word, PDF, HTML, etc.), and indexing through Solr's client APIs for various programming languages. The discussion of Solr's configuration and usage options will include query parsing, faceting, relevancy ranking, spell checking, highlighting, clustering, and more-like-this.

Slides

Solr Power with Lucene

by Erik Hatcher, Lucid Imagination
http://www.lucidimagination.com

DEMO

What’s this demo about

Query Parser

• Controlled by defType parameter
  • &defType=lucene (actually a Solr extension of Lucene’s QueryParser)
  • &defType=dismax
• Local {!...} override syntax

Solr Query Parser

• http://lucene.apache.org/java/2_9_1/queryparsersyntax.html + Solr extensions
• Kitchen sink parser, includes advanced user-unfriendly syntax
• Syntax errors throw parse exceptions back to client
• Example: title:ipod* AND price:[0 TO 100]
• http://wiki.apache.org/solr/SolrQuerySyntax

Dismax Query Parser

• Simplified syntax: loose text “quote phrases” -prohibited +required
• Spreads query terms across query fields (qf) with dynamic boosting per field, phrase construction (pf), and boosting query and function capabilities (bq and bf)
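To make the dismax parameters concrete, here is a minimal sketch that assembles a dismax request URL using only the Java standard library. The field names (name, features, inStock), the boost values, and the class and method names are assumptions for illustration, not taken from the slides; only the parameter names (defType, q, qf, pf, bq) come from the slide above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class DismaxRequest {
    /** Joins parameters into a URL-encoded query string, preserving insertion order. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    /** Builds a /select URL that hands the user's text to the dismax parser. */
    static String build(String baseUrl, String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "dismax");
        params.put("q", userQuery);               // loose user text, no field syntax needed
        params.put("qf", "name^2 features");      // hypothetical query fields with a per-field boost
        params.put("pf", "name");                 // hypothetical phrase field
        params.put("bq", "inStock:true");         // hypothetical boosting query
        return baseUrl + "/select?" + toQueryString(params);
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "ipod +video"));
    }
}
```

Keeping qf, pf, and bq on the server side (as request handler defaults) is a common alternative; passing them per request, as here, simply makes the moving parts visible.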

Raw Query Parser

• q={!raw f=id}some:id[value
• new TermQuery(new Term("id", "some:id[value"))
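The raw parser matters because a value such as some:id[value contains characters that the Lucene query syntax treats as operators. The sketch below contrasts the two options: backslash-escaping the value for the lucene parser, or wrapping it in the {!raw} local-params prefix so no parsing happens at all. The helper names are hypothetical, and the escaping list is a conservative assumption (it escapes single & and | even though only && and || are operators); it is not a copy of any library routine.

```java
final class QueryEscaping {
    // Characters escaped before handing a literal value to the lucene parser (conservative list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Prefixes every special character with a backslash. Whitespace is left untouched. */
    static String escapeForLuceneParser(String value) {
        StringBuilder escaped = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    /** Builds a q value that bypasses query parsing via the raw parser's local params. */
    static String rawTermQuery(String field, String value) {
        return "{!raw f=" + field + "}" + value;
    }

    public static void main(String[] args) {
        System.out.println("id:" + escapeForLuceneParser("some:id[value"));
        System.out.println(rawTermQuery("id", "some:id[value"));
    }
}
```

Either string still needs URL encoding before it is placed in a request URL.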

Searching with SolrJ

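    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    // Connect to a Solr instance over HTTP
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Ask for all stored fields plus the score of the top 3 matches
    SolrQuery params = new SolrQuery("author:John");
    params.setFields("*,score");
    params.setRows(3);

    QueryResponse response = server.query(params);
    for (SolrDocument document : response.getResults()) {
      System.out.println("Doc: " + document);
    }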

Searching with Ruby

    conn = Connection.new('http://localhost:8983/solr')
    conn.query('my query') do |hit|
      puts hit.inspect
    end

Searching Performance

• Optimized index
• Non-compound index format
• Cache tuning (solrconfig.xml)
• Searcher warming
• Query complexity: many terms, wildcard terms
• Faceting concerns
• Searcher replication

Search Components

Image: http://flickr.com/photos/steffe/98639094/sizes/l/

Built-in search components

• Standard: query, facet, mlt, highlight, stats, debug
• Others: elevation, clustering, term, term vector

Faceting

• Counts per subset within results
• Facet on: field terms, queries, date ranges
• &facet=on&facet.field=cat&facet.query=price:[0 TO 100]
• http://wiki.apache.org/solr/SimpleFacetParameters
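"Counts per subset within results" can be illustrated without Solr at all. The sketch below computes, over an in-memory result list, what a field facet and a query facet conceptually report. It is an illustration of the idea only, not how Solr computes facets internally; the document representation (a map of field name to string value) and the sample field names are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCounts {
    /** Field facet: how many result documents carry each value of the given field. */
    static Map<String, Integer> facetField(List<Map<String, String>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : results) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Query facet: how many result documents have a numeric field value in [low, high]. */
    static int facetRange(List<Map<String, String>> results, String field, double low, double high) {
        int count = 0;
        for (Map<String, String> doc : results) {
            String value = doc.get(field);
            if (value != null) {
                double number = Double.parseDouble(value);
                if (number >= low && number <= high) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = List.of(
                Map.of("cat", "electronics", "price", "399.0"),
                Map.of("cat", "electronics", "price", "19.95"),
                Map.of("cat", "music", "price", "11.5"));
        System.out.println(facetField(results, "cat"));            // analogous to facet.field=cat
        System.out.println(facetRange(results, "price", 0, 100));  // analogous to facet.query=price:[0 TO 100]
    }
}
```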

Spell checking

• http://localhost:8983/solr/spell?q=epod&spellcheck=on&spellcheck.build=true
• File or index-based dictionaries
• Supports pluggable distance algorithms: Levenshtein and JaroWinkler
• http://wiki.apache.org/solr/SpellCheckComponent
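For readers unfamiliar with the distance measures named above, the following is a minimal sketch of the classic Levenshtein edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into another). It is a textbook dynamic-programming version written for illustration, not the spellchecker's own implementation, and the class name is hypothetical.

```java
final class EditDistance {
    /** Returns the Levenshtein distance between a and b using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is the prefix length.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The misspelling from the example URL is one substitution away from "ipod".
        System.out.println(levenshtein("epod", "ipod"));
    }
}
```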

Highlighting

• http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*
• http://wiki.apache.org/solr/HighlightingParameters

More Like This

• http://localhost:8983/solr/select?q=ipod&mlt=true&mlt.fl=manu,cat&mlt.mindf=1&mlt.mintf=1&fl=id,score,name
• http://wiki.apache.org/solr/MoreLikeThis

Query Elevation

• http://localhost:8983/solr/elevate?q=ipod&debugQuery=true&enableElevation=true
• Configure an “elevate.xml” to boost/exclude specific documents
• http://wiki.apache.org/solr/QueryElevationComponent

Clustering

• Dynamic grouping of documents into labeled sets
• http://localhost:8983/solr/clustering?q=*:*&rows=10
• http://wiki.apache.org/solr/ClusteringComponent
• Requires additional steps to install (see documentation) with Apache Solr distro

Terms

• Enumerates terms from specified fields
• http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

Term Vectors

• Details term vector information: term frequency, document frequency, position and offset information
• http://localhost:8983/solr/select/?q=*%3A*&qt=tvrh&tv=true&tv.all=true

Additional Request Handlers

Image: http://flickr.com/photos/martinlabar/241423789/sizes/l/

stats.jsp

• Not technically a “request handler”, outputs only XML
• http://localhost:8983/solr/admin/stats.jsp
• Index stats such as number of documents, searcher open time
• Request handler details, number of requests and errors, average request time, average requests per second, number of pending docs, etc.

Dump

• http://localhost:8983/solr/debug/dump
• Echoes parameters, content streams, and Solr web context
• Careful: with content streams enabled, a client could retrieve the contents of any file on the server or on an accessible network! [Solution: disable the dump request handler]

Ping

• http://localhost:8983/solr/admin/ping
• If healthcheck configured and file not available, error is reported
• Executes single configured request and reports failure or OK

Luke

• http://localhost:8983/solr/admin/luke
• Introspects Lucene index structure and schema relationships
• See an individual document: ?doc= or ?docId=
• Schema details: ?show=schema
• Admin schema browser uses Luke request handler
• See also: original Luke tool - http://www.getopt.org/luke/

System

• http://localhost:8983/solr/admin/system
• Core info, Lucene version, JVM details, uptime, operating system info

Plugins

• http://localhost:8983/solr/admin/plugins
• Configuration details of Solr core, available query and update handlers, cache settings

Threads

• http://localhost:8983/solr/admin/threads
• JVM thread details

Properties

• http://localhost:8983/solr/admin/properties
• All JVM system properties, or single property value (?name=os.arch)

File

• http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain
• http://localhost:8983/solr/admin/file?file=/
• See fetchable directory tree

DEMO

What’s this demo about

Administration

Image: http://flickr.com/photos/pandiyan/106707806/sizes/o/

Production

• Solr provides tools/support for:
  • caching, warming, replication
• Systems concerns:
  • CPU, Memory, Disk Space, Servlet Containers, Security, Backups, Fail over

Deployment Environment

• Usual answer for CPU/Memory/Disk Space: Bigger/more is better
• Servlet Containers
  • Use whatever you know how to maintain and make perform
  • Solr has been used in a wide variety of containers: Jetty, Tomcat, Resin, WebLogic, JBoss, WebSphere, ...
  • See http://wiki.apache.org/solr/SolrInstall

Security

• Solr doesn’t secure either the documents or the communication protocols
• Use standard security practices:
  • Firewall
  • Secure ports
• Solr specifics:
  • Invariant parameters for RequestHandlers
  • Disable remoteStreaming (disabled by default)
  • Remove update handlers for read-only Solr instances?
  • Turn off “file” request handler, and perhaps other admin handlers
• http://wiki.apache.org/solr/SolrSecurity

Replication

• Master is polled
• Replicant pulls Lucene index and optionally also Solr configuration files
• Query throughput scaling: replicate and load balance
• http://wiki.apache.org/solr/SolrReplication

Distributed Search

• Distribute documents to same-schema shards
• Scaling for when single index becomes too large, or a single query becomes too slow
• http://wiki.apache.org/solr/DistributedSearch
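As a rough sketch of what "distribute documents to same-schema shards" can look like on the client side, the code below routes each document to a shard by hashing its id and builds the shards request parameter that lists the shards to query. Hash-modulo routing is only one possible scheme and is an assumption here, not something the slides prescribe; the class name, method names, and the second host:port are hypothetical.

```java
import java.util.List;

final class Sharding {
    /** Picks a shard index in [0, shardCount) from the document id; stable for a given id. */
    static int shardFor(String docId, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Builds the shards parameter: a comma-separated list of host:port/path entries. */
    static String shardsParam(List<String> shards) {
        return "shards=" + String.join(",", shards);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("localhost:8983/solr", "localhost:7574/solr");
        System.out.println("index doc-42 on shard " + shardFor("doc-42", shards.size()));
        System.out.println("http://localhost:8983/solr/select?q=ipod&" + shardsParam(shards));
    }
}
```

Note that changing the number of shards changes where existing ids map, so a scheme like this implies reindexing when shards are added.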

What’s new in Solr 1.4?

• Java-based replication
• VelocityResponseWriter (Solritas)
• AJAX-Solr
• Logging switched to SLF4J
• Rollback, since last commit
• StatsComponent
• TermVectorComponent
• Configurable Directory provider
• CharFilter
• TermsComponent
• Rich document indexing, via Tika (Solr Cell)
• Greatly improved faceting performance
• Exact/near duplicate document handling
• Support added for Lucene's omitTf
• "trie" range query support
• And a new logo!

Lucene 2.9

• IndexReader#reopen()
• Faster filter performance, by 300% in some cases
• Per-segment FieldCache
• Reusable token streams
• Faster numeric/date range queries, thanks to trie
• and tons more, see Lucene 2.9's CHANGES.txt

Performance Improvements

• Caching
• Concurrent file access
• Per-segment index updates
• Faceting
• DocSet generation, avoids scoring
• Streaming updates for SolrJ

Feature Improvements

• Rich document indexing
• DataImportHandler enhancements
• Smoother replication
• More choices for logging
• Multi-select faceting
• Speedier range queries
• Duplicate detection
• New request handler components

Get Started!

• Where's the content? And metadata?
• Iterate (Hoss Workflow™)
  • basic schema
  • bring in data
  • requirements gap analysis
  • adjust until users are smiling

Resources

• http://wiki.apache.org/solr
• solr-user@lucene.apache.org
• Lucid Imagination
  • http://www.lucidimagination.com
  • Articles, webinars, blogs, and...
  • SEARCH THE LUCENE ECOSYSTEM at: http://search.lucidimagination.com

Lucid Articles

• Grant Ingersoll
  • Getting Started with Lucene
  • Debugging Relevance Issues in Search
  • Optimizing Findability in Lucene and Solr
• Yonik Seeley
  • Faceted Search with Solr
• Erik Hatcher
  • Getting Started with Solr (includes screencast), co-authored with Jonathan Knudsen
• Sami Siren
  • Content Extraction with Tika
• Mark Miller
  • Scaling Lucene and Solr
• And more...

Lucid Blogging

• Mark Miller: "Exploring Query Parsers", "Highlighting Highlighter Thoughts", "Investigating OOM and other JVM issues", "Looking forward to new features in Solr 1.4", Lucene internals, etc.
• Erik Hatcher: "acts_as_solr with rich document indexing"
• Grant Ingersoll: "Sorting, Faceting, and Schema Design in Solr"
• And many more...

Lucid Podcasts

Interviews with:
• Doug Cutting (creator of Lucene)
• Ryan McKinley (Solr committer)
• Chris Hostetter (Solr committer)
• Andrzej Bialecki (Lucene committer, Luke creator)
• Monster.com, Digg.com and many more...

e-book now available! http://www.manning.com/lucene


Upcoming Solr Training


Thank You!

Contact me: erik.hatcher@lucidimagination.com