Apache Lucene architecture

3/18/2023

Lucene began development in 1997 as Doug Cutting's side project during his time at the web search engine Excite. Anticipating the burst of the internet bubble, Cutting reduced his working hours to teach himself Java and begin working on a set of search tools. These would become the popular search engine library Lucene, named after his wife's middle name.

The operations performed by Lucene can essentially be simplified to two key steps. Firstly, it creates indexes of the content to be made searchable. Secondly, it queries the created indexes to search for content. Within a Lucene index, Lucene documents are constructed. These Lucene documents contain fields with associated values, which are essentially key and value pairs. To create an index, whatever document is being indexed must be parsed and its fields extracted.

As the Lucene document stored in the index is independent of file type, any type of document that can be parsed into fields and values can be made searchable. This advantage is indispensable, giving Lucene the ability to index structured database objects as well as unstructured or semi-structured documents, such as Word documents or PDFs. When indexes are queried, Lucene looks for matches between the indexed values and the query terms. Deciding how fields are extracted from documents is therefore an important stage in setting up the search. To ensure that the search returns good results in response to queries, the fields must be extracted appropriately for the target data type.

For some documents, in particular structured documents, field and value pairs will be fairly trivial to associate. When searching for an author name, or contact details such as an address and phone number, the desired result is typically a complete match to the query. Such a field can simply be indexed with its value and return a hit to a matching query.

For more generic content, it is likely that matches and partial matches within the text will be desired, as opposed to a strictly identical match to the query term. To allow for this, generic text is analysed by Lucene and broken down accordingly before indexing. In the analysis stage, most unimportant filler words are discarded. The analyser also uses a stemming algorithm to reduce each word to its root form.

When Lucene was developed, a key difference between it and other search engine libraries was the way in which it handled index storage. Index storage was one of the hurdles preventing search engines from scaling in size while maintaining speed. In Lucene, when the number of indexes reaches three, these indexes are consolidated. Similarly, the resulting larger indexes are also grouped and consolidated in threes, in a continuing fashion. This system keeps the majority of indexed material consolidated in larger collections, in a progressive merge sort. As the larger portion of the data is grouped and sorted, searching stays efficient, whilst indexes can still be added “on the fly”.

Lucene is behind some of the largest searches, including LinkedIn, Twitter and Wikipedia. As evidenced by its “powered by” page, Lucene has had a far-reaching impact on search. The simplicity of the Lucene library and its ease of implementation have contributed to its overwhelming success. Cutting notes that what he believes was a key point in Lucene's unexpected widespread adoption was his decision to make it available as open source. As such, Lucene has grown into a fast and robust search tool that remains competitive and will undoubtedly hold its place in the history of search development. Watch Lucene creator Doug Cutting give a talk about its creation and development, courtesy of Twitter University.

Any application using Apache Lucene must first of all transform its original data into Lucene Documents. For this purpose a Document Handler interface is needed; this is provided by the Lucene contribution library. The Document Handler interface allows the extraction of information such as textual content, numbers and metadata from original documents and provides it as Lucene Documents. These are used for further processing during indexing and search.

Each common document type, such as HTML, PDF or XML, needs a specific document parser to extract its contents. Document parsers are not part of the Apache Lucene core. Some of them are JTidy, an HTML parser; PDFBox, a PDF document parser; and SAX, an XML parser.
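
The two steps described in this article, indexing documents as field and value pairs and then querying the index, can be sketched in plain Java. This is a toy illustration under stated assumptions, not Lucene's API: the class name ToyIndex, its methods, the stop-word list and the crude suffix-stripping stemmer are all hypothetical stand-ins chosen to show exact-match fields next to analysed text fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration only: none of these names are Lucene classes. */
class ToyIndex {
    // Assumed filler words; a real analyser uses a much fuller list.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "in", "is", "of", "the", "to");

    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, TreeSet<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Lower-cases the text, splits it into words, drops stop words and stems the rest. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    /** Naive suffix stripping, standing in for a real stemming algorithm. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /**
     * Adds one document (field -> value pairs) and returns its id.
     * Fields named in exactFields are indexed as a single unanalysed term.
     */
    int addDocument(Map<String, String> fields, Set<String> exactFields) {
        int docId = nextDocId++;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            List<String> terms = exactFields.contains(field.getKey())
                    ? List.of(field.getValue())
                    : analyze(field.getValue());
            Map<String, TreeSet<Integer>> fieldPostings =
                    postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String term : terms) {
                fieldPostings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns ids of documents whose analysed field contains every query term. */
    List<Integer> search(String field, String query) {
        Map<String, TreeSet<Integer>> fieldPostings = postings.getOrDefault(field, Map.of());
        TreeSet<Integer> hits = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = fieldPostings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? List.of() : new ArrayList<>(hits);
    }

    /** Returns ids of documents whose field value is identical to the given value. */
    List<Integer> searchExact(String field, String value) {
        return new ArrayList<>(
                postings.getOrDefault(field, Map.of()).getOrDefault(value, new TreeSet<>()));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Made-up sample documents; "author" is treated as an exact-match field.
        index.addDocument(
                Map.of("author", "Doug Cutting", "body", "Searching the indexes"),
                Set.of("author"));
        index.addDocument(
                Map.of("author", "A. N. Other", "body", "Parsing PDF documents"),
                Set.of("author"));
        System.out.println(index.search("body", "index searches"));      // analysed match
        System.out.println(index.searchExact("author", "Doug Cutting")); // complete match
        System.out.println(index.searchExact("author", "Doug"));         // partial value, no hit
    }
}
```

In this sketch the author field only matches a complete value, while the body field matches after stop-word removal and stemming, so the query "index searches" finds text containing "Searching the indexes". A real Lucene application would use Lucene's own document, analyser and searcher classes rather than these hand-rolled maps.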
