How faceted search engines work

Posted on 12, Mar 2014

The key feature of many implementations of faceted search is that complex queries can be performed with a simple search term followed by a few mouse clicks. Queries themselves are hidden from the users; search for "shoes" on ebay and the results are automatically faceted on categories such as gender, brand, and style for example. After a few clicks you have performed a complex query on the database without knowing it; more importantly however, you have probably found what you have been looking for using somewhat natural language.

Input data

Faceted search requires documents with metadata attached to them; nuggets of information that orient the document within the collection of documents.. The Reuters 21578 test collection is a good example to start off with (a JSON formatted version can be found here ). If we look at a sample record (a document with metadata):

{
  "19001" : {
    "title" : "BEVERLY HILLS COP II TOPS 100 MLN AT BOX OFFICE",
    "body" : "Gulf and Western Inc's <GW> Paramount\nPictures Corp division said its movie \"Beverly Hills Cop II\"\nhas topped the 100 mln dlr mark at the box office on the 29th\nday of its North American release.\nParamount said the the movie has become the fastest movie\nto hit the 100 mln dlr mark with an \"R\" rating, which means\nthat anyone under 17 must be accompanied by their parents.\nReuter\n\u0003",
  "date":"18-JUN-1987 18:46:13.14",
  "places":["usa"]
}

We can see that the actual document of interest is the "body" , everything else is metadata. If we search for "paramount", we would expect results facted on places = [usa, ...] and title = [beverly hills cop, box office, ...] for example. Of course, extracting "beverly hills cop" and "box office" from the title as related strings will be non-trivial for a toy facted search engine; more likely we would end up with the results facted on title = [beverly, hills, cop, box, office, ...] .

Indexes

For a basic facted search we need three indexes; a forward index, a reverse index, and a facet index. The forward index is a collection of key-value pairs, where the key is unique, and the value is a record. For the reverse index, the key is a word or term such as "paramount", and the value is a list of all unique record identifiers (keys in the forward index) that contain the term. We can limit this to just the document (our "body" in this case) if we want to. Finally, the facet index is an index of indexes where the key is our facet, for example "places", and the value is a reverse index itself In this index the key is a facet focus, for example "usa", and the value is a list of record identifiers that have a facet containing our focus. In Python, we will have something like this:

from collections import defaultdict

forward_index = {}
reverse_index = defaultdict(list)
facet_index = defaultdict(lambda:defaultdict(list))

To populate our indexes we limit the entries in the reverse index to data contained in the targets metadata facets, and we limit the entries in the facet index to facets . Remember, the forward index is comprised of records, and the terms inside a facet we refer to as foci (many focus), which in effect is any term in "body", or "title", etc. Given a populated forward index:

for i, record in forward_index.iteritems():
    for facet, foci in record.iteritems():
        for focus in foci:
            if facet in targets:
                reverse_index[focus].append(i)
            if facet in facets:
                facet_index[facet][focus].append(i)

Obviously for many records these indexes will become large. For this toy example this is a problem that we shall ignore.

Intersections of sets

Building and understanding the indexes is the hard part; querying them given some search terms and target facets is simply calculating the intersections of some sets. To get the primary hits given a search term and a facet-foci pair:

term_hits = set(reverse_index[term])
facet_hits = set(facet_index[facet][foci])

hits = term_hits.intersection(facet_hits)

One can get fancy with multiple search terms and facets, however it is simply more searching of the indexes, and intersecting of sets.

Showing the facets

The user will want to see how these hits relate to each other given their membership to various facets and foci. We can count these memberships like this:

facets = defaultdict(lambda:defaultdict(lambda:0))

for hit in hits:
    for facet, foci in index[hit].iteritems():
        for focus in foci:
            facets[facet][focus] += 1

That is all there is to a very basic faceted search.


Copyright, Christopher Poole 2008-2016.