Drupal 7: Partial word search
Drupal’s default search periodically processes all new content and stores keywords and keyword details in separate database tables (search_*). This allows Drupal to run searches by looking up exact matches in the search_index table alone and reading the pre-calculated scores stored there.
Searching for exact word matches in a dedicated table (search_index) is much faster in MySQL (and other database servers) than running exact word or ‘LIKE’ queries against the full text. However, this limits search results to exact word matches, and often filters out results that would benefit the user. For example, if a user searches for the keyword “performance”, and there are nodes whose content includes “high-performance” (but not the standalone word “performance”), no results will be displayed.
Fortunately, there are several search enhancing/replacing modules available to remedy this behavior. To investigate these options and the resulting search behavior, I created the following Basic Pages and Articles in an otherwise empty Drupal 7 installation:
Basic Page One: One blue, red, green, yellow.
Basic Page Two: Two red, blue-green, yellowish.
Article One: One yellow, blue-green, reddish.
Article Two: Two yellow-green, blue, red.
Note that a search for “blue” returns only Basic Page One and Article Two, whereas every node clearly includes the term “blue”.
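The difference between an exact-word index lookup and a substring scan can be sketched in a few lines of Python. This is only an illustration of the idea, not Drupal's implementation: the tokenize, build_index, exact_search and substring_search helpers are hypothetical names, and the tokenizer assumes that a hyphenated term such as “blue-green” is indexed as one combined token.

```python
import re
from collections import defaultdict

# Sample nodes from this article (title -> body).
NODES = {
    "Basic Page One": "One blue, red, green, yellow.",
    "Basic Page Two": "Two red, blue-green, yellowish.",
    "Article One": "One yellow, blue-green, reddish.",
    "Article Two": "Two yellow-green, blue, red.",
}


def tokenize(text):
    # Assumption: hyphenated terms are indexed as a single token,
    # so "blue-green" becomes "bluegreen" rather than "blue" + "green".
    joined = text.lower().replace("-", "")
    return re.findall(r"[a-z0-9]+", joined)


def build_index(nodes):
    # Map each whole word to the set of node titles that contain it.
    index = defaultdict(set)
    for title, body in nodes.items():
        for word in tokenize(body):
            index[word].add(title)
    return index


def exact_search(index, keyword):
    # Fast dictionary lookup, but only whole-word matches are found.
    return sorted(index.get(keyword.lower(), set()))


def substring_search(nodes, keyword):
    # Slow scan of every body (comparable to a LIKE '%keyword%' query).
    return sorted(t for t, body in nodes.items() if keyword.lower() in body.lower())


INDEX = build_index(NODES)
print(exact_search(INDEX, "blue"))      # whole-word matches only
print(substring_search(NODES, "blue"))  # every node containing "blue"
```

Under these assumptions the exact lookup finds only Basic Page One and Article Two, while the substring scan finds all four nodes at the cost of reading every body.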
The Fuzzy Search module feels a little heavy to install, because it requires the Search API module, which in turn requires the Entity module. Entity is useful for many other purposes and may already be installed as a dependency, so in practice the only real addition is the Search API layer that Fuzzy Search builds on, which is not a major concern.
Fuzzy Search takes indexed terms and breaks them down into partial terms of a configurable length. The default length is 3, so “high-performance” becomes a long list of fragments including “hig”, “igh”, “gh-”, …, “anc”, “nce”. Within the index workflow settings, you can change the default partial word length and also set a minimum completeness (a sort of match percentage that a fuzzy word result must reach before it is returned). The minimum completeness defaults to 40. This allows partial word search and also accommodates or corrects most spelling errors in your search terms. Be careful, though: leaving the configuration too wide open can produce many false positives that will seem odd to end users.
With the default settings, Fuzzy Search does not change our “blue” search results, which still lack the nodes with “blue-green” in the content. It does correct searches like “yellowgreen”, which returns Article Two, containing the differently spelled “yellow-green”. Interestingly, Fuzzy Search (with the default configuration) also seems to break some search results that were previously correct. For example, a search for “green”, which returns Basic Page One using the default search implementation, yields no results with Fuzzy Search enabled.
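To make the partial-term idea concrete, here is a minimal Python sketch of splitting a term into fragments of length 3 and scoring a query against an indexed term. The ngrams, completeness and fuzzy_match functions are hypothetical, and the completeness formula (the share of the query's fragments found in the indexed term) is an assumption made for illustration; the module's actual scoring may differ.

```python
def ngrams(term, n=3):
    """Return the overlapping n-character fragments of a term."""
    term = term.lower()
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def completeness(query, indexed_term, n=3):
    """Percentage of the query's fragments that also occur in indexed_term.

    Assumption: this is an illustrative formula, not the module's own.
    """
    query_grams = ngrams(query, n)
    term_grams = set(ngrams(indexed_term, n))
    hits = sum(1 for gram in query_grams if gram in term_grams)
    return 100.0 * hits / len(query_grams)


def fuzzy_match(query, indexed_term, n=3, min_completeness=40):
    """True when the completeness score reaches the configured minimum."""
    return completeness(query, indexed_term, n) >= min_completeness


print(ngrams("high-performance"))
print(completeness("performence", "performance"))  # misspelled query
print(fuzzy_match("performence", "performance"))
```

The misspelled query still shares most of its fragments with the indexed term, which is why a fragment-based match tolerates typos. Lowering min_completeness in this sketch admits looser matches, which mirrors the false-positive risk described above.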
Fuzzy Search definitely has the potential to enhance the default search behavior, but pay close attention to your configuration and keep an eye on how it impacts your search results.
The Porter-Stemmer module is more of a default search enhancement than a search replacement. It is much lighter weight than Fuzzy Search, and does not include any configuration options. It should definitely be considered an English-only (and predominantly en_US) enhancement. The technique used in Porter-Stemmer is to break words down by removing common modifiers like ‘ing’, ‘er’, or ‘ed’, and allow the search to process all such combinations of the word. This results in a search for “test” returning nodes with content including “tests”, “test”, or “testing”.
Porter-Stemmer is very convenient for many use cases within the English language, and it is easy to set up. Nonetheless, for our example, its apparent lack of handling for the ‘ish’ suffix, and its lack of hyphen handling (which is appropriate given the module’s purpose), leave something to be desired in our searches. With our example content, Porter-Stemmer does not improve on the default search behavior. It is important to note that Porter-Stemmer IS effective in a lot of scenarios, just not with this particular example set.
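The suffix-stripping idea can be sketched as follows. This is a deliberately naive illustration, not the real Porter algorithm (which applies a much larger, ordered rule set): the SUFFIXES list, naive_stem and stem_match are hypothetical and cover only a handful of endings.

```python
# Illustrative subset of endings; the real Porter rules are far more extensive.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_match(query, term):
    """Two words match when they reduce to the same stem."""
    return naive_stem(query) == naive_stem(term)


print(stem_match("test", "testing"))      # both reduce to "test"
print(stem_match("test", "tests"))        # both reduce to "test"
print(stem_match("yellow", "yellowish"))  # "ish" is not in the suffix list
print(stem_match("blue", "bluegreen"))    # stemming does not split compounds
```

In this sketch, “test”, “tests” and “testing” all reduce to the same stem, while “yellowish” and a joined “bluegreen” are left untouched, which is the same gap seen with the example content.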
Search API Solr or Apache Solr Search Integration
Apache Solr is a popular and capable Java search implementation. There are many options and configurations available for using Solr as a search processor for Drupal, and the specifics of the various configurations are beyond the scope of this article. I will note that there are two popular Drupal 7 modules that support Solr, the primary distinction being whether or not they rely on the Search API module.
The Solr search engine requires a JDK and must be installed separately within a Java servlet container (Jetty, Tomcat, or some equivalent).
I will use the Apache Solr Search Integration module. Adequate instructions for a basic installation are available at Drupal.org, though I recommend reviewing some of the links on that page when considering Solr for larger production sites.
Note that with the default schema.xml provided in the Apache Solr Search Integration module, partial search strings are still not well supported. By default Apache Solr does include most of the functionality found in Porter-Stemmer (though with much more added administration and configuration). Adding an NGramTokenizerFactory tokenizer to the main “text” field type corrects this behavior, and adds partial search functionality. To highlight the behavior using our example content, with Apache Solr and an appropriate NGramTokenizerFactory tokenizer, a search for “ell” returns all content (matching “yellow” or “yellowish”).
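The effect of n-gram tokenization at index time can be approximated in Python. This sketch only mimics the concept and is not how Solr is implemented or configured: char_ngrams, build_ngram_index and search are hypothetical helpers, and the min_gram and max_gram values are arbitrary stand-ins for the tokenizer's gram-size settings.

```python
# Sample nodes from this article (title -> body).
NODES = {
    "Basic Page One": "One blue, red, green, yellow.",
    "Basic Page Two": "Two red, blue-green, yellowish.",
    "Article One": "One yellow, blue-green, reddish.",
    "Article Two": "Two yellow-green, blue, red.",
}


def char_ngrams(text, min_gram=3, max_gram=5):
    """Return every character fragment of length min_gram..max_gram."""
    text = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


def build_ngram_index(nodes, min_gram=3, max_gram=5):
    """Map each fragment to the set of node titles whose body contains it."""
    index = {}
    for title, body in nodes.items():
        for gram in char_ngrams(body, min_gram, max_gram):
            index.setdefault(gram, set()).add(title)
    return index


def search(index, query):
    """Look the query up as a single fragment (hypothetical simplification)."""
    return sorted(index.get(query.lower(), set()))


NGRAM_INDEX = build_ngram_index(NODES)
print(search(NGRAM_INDEX, "ell"))   # fragment of "yellow" / "yellowish"
print(search(NGRAM_INDEX, "blue"))  # also found inside "blue-green"
```

Because every fragment is stored, the index grows quickly with the gram-size range, and in this simplified version a query longer than max_gram finds nothing. Both are worth keeping in mind when choosing gram sizes.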
At a high level, I prefer the configuration options and extensibility that Apache Solr offers. However, it is important to remember the goals of your site and your users when selecting a search tool. Keep in mind that in many cases Porter-Stemmer covers the bulk of your needs for automated search correction, and it does so with very little additional server overhead. The Fuzzy Search module is also a great next step if Porter-Stemmer does not quite handle your desired scenarios, but, as I mentioned, read the docs and examples, and be sure that you understand what you are telling Fuzzy Search to do. It is a powerful module, and it is easy to alter the configuration in ways that cause undesirable behavior. With the right configuration, Fuzzy Search can do a lot to improve search results for your users.
If you find yourself wanting more control or more advanced search features (like faceted search results, a topic for another article), you would do well to investigate Apache Solr as a replacement for your Drupal 7 search functionality. Keep in mind that very few shared hosting services support Java servlet containers. This means that Apache Solr search is limited to sites on a VPS or dedicated server, or to those interested in a paid Solr search solution (like Acquia Search).