Back to articles list
- 11 minutes read

Looking for a Simple Full-Text Search?<br /> Try MySQL InnoDB + CakePHP with Word Stemming

Implementing a user-friendly search can be tricky, but it can also be done very efficiently. How do I know this? Not long ago, I needed to implement a search engine on a mobile app. The app was built on the Ionic framework and would connect to a CakePHP 2 backend. The idea was to display results as the user was typing. There were several options for this, but not all of them met my project’s requirements.

To illustrate what this kind of task involves, let’s imagine searching for songs and their possible relations (like artists, albums, etc).

Sample Song App

The records would have to be sorted by relevance, which would depend on if the search word matched fields from the record itself or from other columns in related tables. Also, the search should implement at least some basic word stemming. (Stemming is used to get the root form for a word. "Stems", "stemmer", "stemming", and "stemmed" all have the same root: “stem”.)

The approach presented here was tested with several hundred thousand records and was able to retrieve useful results as the user was typing.

Full-Text Search Products to Consider

There are several ways we could implement this kind of search. Our project had some constraints in relation to time and server resources, so we had to keep the solution as simple as possible. A couple of contenders eventually emerged:

Elasticsearch

Elasticsearch provides full-text searches in a document-oriented service. It’s designed to manage huge amounts of load in a distributed way: it can rank results by relevance, perform aggregations, and work with word stemming and synonyms. This tool is meant for real time searches. From their website:

Elasticsearch builds distributed capabilities on top of Apache Lucene to provide the most powerful full-text search capabilities available. Powerful, developer-friendly query API supports multilingual search, geolocation, contextual did-you-mean suggestions, autocomplete, and result snippets.

Elasticsearch can work as a REST service, responding to http requests, and it can be set up very quickly. However, starting the engine as a service requires you to have some server access privileges. And if your hosting provider doesn’t support Elasticsearch out of the box, you’ll need to install some packages.

The bottom line is that this product is a great option if you want a rock-solid search solution. (Note: You may need a VPS or dedicated server as the hardware requirements are pretty demanding.)

Sphinx

Like Elasticsearch, Sphinx also provides a very solid full-text search product: Craigslist serves more than 300,000,000 queries per day with it. Sphinx doesn’t provide a native RESTful interface. It is implemented in C, with a smaller hardware footprint than Elasticsearch (which is implemented in Java and can run on any OS with a jvm). You will also need root access to the server with some dedicated RAM/CPU to run Sphinx properly.

MySQL Full-Text Search

Historically, full-text searches were supported in MyISAM engines. After version 5.6, MySQL also supported full-text searches in InnoDB storage engines. This has been great news, since it enables developers to benefit from InnoDB’s referential integrity, ability to perform transactions, and row level locks.

There are basically two approaches to full-text searches in MySQL: natural language and boolean mode. (A third option augments the natural language search with a second expansion query.)

The main difference between the natural and boolean modes is that the boolean allows certain operators as part of the search. For instance, boolean operators can be used if a word has greater relevance than others in the query or if a specific word should be present in the results, etc. It’s worth noticing that in both cases, results can be sorted by the relevance computed by MySQL during the search.

Database Selection Screen in Vertabelo

Making the Decisions

The best fit for our problem was to use InnoDb full-text searches in boolean mode. Why?

  • We had little time to implement the search function.
  • At this point, we had no big data to crunch nor a massive load to require something like Elasticsearch or Sphinx.
  • We used shared hosting that doesn’t support Elasticsearch or Sphinx and the hardware was pretty limited at this stage.
  • While we wanted word stemming in our search function, it wasn’t a deal breaker: we could implement it (within constraints) by way of some simple PHP coding and data denormalization
  • Full-text searches in boolean mode can search words with wildcards (for the word stemming) and sort the results based on relevance.

Full-Text Searches in Boolean Mode

As mentioned before, natural language search is the simplest approach: just search for a phrase or a word in the columns where you have set a full-text index and you will get results sorted by relevance.

In the Normalized Vertabelo Model

Let’s see how a simple search would work. We’ll create a sample table first:

-- Created by Vertabelo (http://vertabelo.com)
-- Last modification date: 2016-04-25 15:01:22.153
-- tables
-- Table: artists
CREATE TABLE artists (
   id int(11) NOT NULL AUTO_INCREMENT,
   name varchar(255) NOT NULL,
   bio text NOT NULL,
   CONSTRAINT artists_pk PRIMARY KEY (id)
) ENGINE InnoDB;
CREATE  FULLTEXT INDEX artists_idx_1 ON artists (name);
-- End of file.

In natural language mode

You can insert some sample data and start testing. (It would be good to add it to your sample dataset.) For instance, we’ll try searching for Michael Jackson:

SELECT 
    *
FROM
    artists
WHERE
    MATCH (artists.name) AGAINST ('Michael Jackson' IN NATURAL LANGUAGE MODE)

This query will find records that match the search terms and will sort matching records by relevance; the better the match, the more relevant it is and the higher the result will appear in the list.

In boolean mode

We can perform the same search in boolean mode. If we don’t apply any operators to our query, the only difference will be that results are not sorted by relevance:

SELECT 
    *
FROM
    artists
WHERE
    MATCH (artists.name) AGAINST ('Michael Jackson' IN BOOLEAN MODE)

The wildcard operator in boolean mode

Since we want to search stemmed and partial words, we will need the wildcard operator (*). This operator can be used in boolean mode searches, which is why we chose that mode.

So, let’s unleash the power of boolean search and try searching for part of the artist’s name. We’ll use the wildcard operator to match any artist whose name starts with ‘Mich’:

SELECT 
    *
FROM
    artists
WHERE
    MATCH (name) AGAINST ('Mich*' IN BOOLEAN MODE)

Sorting by relevance in boolean mode

Now let’s see the calculated relevance for the search. This will help us understand the sorting we will be doing later with Cake:

SELECT 
    *, MATCH (name) AGAINST ('mich*' IN BOOLEAN MODE) AS rank
FROM
    artists
WHERE
    MATCH (name) AGAINST ('mich*' IN BOOLEAN MODE)
ORDER BY rank DESC

This query retrieves search matches and the relevance value that MySQL computes for each record. The engine optimizer will detect that we are selecting the relevance, so it won’t bother recalculating the rank.

Word Stemming in Full-Text Search

When we incorporate word stemming in a search, the search becomes more user-friendly. Even if the result is not a word in itself, algorithms try to generate the same root for derived words. For example, the stem “argu” is not an English word, but it can be used as the stem for "argue", "argued", "argues", "arguing","Argus" and other words.

Stemming improves results, as the user may enter a word that has no exact match but its “stem” does. Although PHP stemmer or Snowball’s Python stemmer could be an option (if you have root SSH access to your server), we will use the PorterStemmer.php class.

This class implements the algorithm proposed by Martin Porter to stem words in English. As stated by the author in his website, it’s free to use for any purpose. Just drop the file inside your Vendors directory within CakePHP, include the library in your model, and call the static method to stem a word:

//include the library (should be called PorterStemmer.php) within CakePHP’s Vendors folder
App::import('Vendor', 'PorterStemmer'); 

//stem a word (words must be stemmed one by one)
echo PorterStemmer::Stem(‘stemming’); 
//output will be ‘stem’

Our goal is to make searching fast and efficient and to be able to sort results by their full-text relevance. To do this, we will need to employ word stemming in two ways:

  1. The words entered by the user
  2. Song-related data (which we will store in columns and sort for results based on relevance)

The first type of word stemming can be accomplished like this:

App::import('Vendor', 'PorterStemmer');
$search = trim(preg_replace('/[^A-Za-z0-9_\s]/', '', $search));//remove undesired characters
$words = explode(" ", trim($search));
$stemmedSearch = "";
$unstemmedSearch = "";
foreach ($words as $word) {
   $stemmedSearch .= PorterStemmer::Stem($word) . "* ";//we add the wildcard after each word
   $unstemmedSearch = $word . "* " ;//to search the artist column which is not stemmed
}
$stemmedSearch = trim($stemmedSearch);
$unstemmedSearch = trim($unstemmedSearch);

if ($stemmedSearch == "*" || $unstemmedSearch=="*") {
   //otherwise mySql will complain, as you cannot use the wildcard alone
   $stemmedSearch = "";
   $unstemmedSearch = "";
}

We’ve created two strings: one to search for the artist name (without stemming), and one to search in the other stemmed columns. This will help us later build our ‘against’ part of the full-text query. Now let’s see how we can stem and sort the song’s data.

Denormalizing Song Data

Our sorting criteria will be based on matching the song’s artist (without stemming) first. Next will come the song’s name, album, and related categories. Stemming will be used on all the secondary search criteria.

To illustrate this, suppose I search for ‘nirvana’ and there is a song called ‘Nirvana Games’ by ‘XYZ’, and another song called ‘Polly’ by the artist ‘Nirvana’. The results should list ‘Polly’ first, as the match on the artist name is more important than a match on the song name (based my the criteria).

In order to do this, I added 4 fields in the songs table, one for each of the searching/sorting criteria we want:

ALTER TABLE `songs` 
ADD `denorm_artist` VARCHAR(255) NOT NULL AFTER`trackname`, 
ADD `denorm_trackname` VARCHAR(500) NOT NULL AFTER`denorm_artist`, 
ADD `denorm_album` VARCHAR(255) NOT NULL AFTER`denorm_trackname`,
ADD `denorm_categories` VARCHAR(500) NOT NULL AFTER`denorm_album`, 
ADD FULLTEXT (`denorm_artist`), 
ADD FULLTEXT(`denorm_trackname`), 
ADD FULLTEXT (`denorm_album`), 
ADD FULLTEXT(`denorm_categories`);

Our complete database model would look like this:




Whenever you save a song using add/edit in CakePHP, you just need to store the artist name in the column denorm_artist without stemming it. Next, add the stemmed track name in the denorm_trackname field (similar to what we did in the searched text) and save the stemmed album’s name in the denorm_album column. Finally, store the stemmed category set for the song in the denorm_categories field, concatenating the words and adding one space between each stemmed category name.

Full-Text Search and Relevance Sorting in CakePHP

Continuing with the example of searching for ‘Nirvana’, let’s see what a query similar to this can achieve:

SELECT 
 trackname, 
 MATCH(denorm_artist) AGAINST ('Nirvana*' IN BOOLEAN MODE) as rank1,   
 MATCH(denorm_trackname) AGAINST ('Nirvana*' IN BOOLEAN MODE) as rank2, 
 MATCH(denorm_album) AGAINST ('Nirvana*' IN BOOLEAN MODE) as rank3, 
 MATCH(denorm_categories) AGAINST ('Nirvana*' IN BOOLEAN MODE) as rank4 
FROM songs 
WHERE 
 MATCH(denorm_artist) AGAINST ('Nirvana*' IN BOOLEAN MODE) OR 
 MATCH(denorm_trackname) AGAINST ('Nirvana*' IN BOOLEAN MODE) OR 
 MATCH(denorm_album) AGAINST ('Nirvana*' IN BOOLEAN MODE) OR 
 MATCH(denorm_categories) AGAINST ('Nirvana*' IN BOOLEAN MODE) 
ORDER BY rank1 DESC, rank2 DESC, rank3 DESC, rank4 DESC

We would get the following output:

tracknamerank1rank2rank3rank4
Polly0.0906190574169159000
nirvana games00.090619057416915900

To do this in CakePHP, the find method must be called using a combination of ‘fields’, ‘conditions’ and ‘order’ parameters. Continuing with the former PHP example code:

//within Song.php model file
 $fields = array(
            "Song.trackname",
            "MATCH(Song.denorm_artist) AGAINST ({$unstemmedSearch} IN BOOLEAN MODE) as `rank1`",
            "MATCH(Song.denorm_trackname) AGAINST ({$stemmedSearch} IN BOOLEAN MODE) as `rank2`",
            "MATCH(Song.denorm_album) AGAINST ({$stemmedSearch} IN BOOLEAN MODE) as `rank3`",
            "MATCH(Song.denorm_categories) AGAINST ({$stemmedSearch} IN BOOLEAN MODE) as `rank4`"
        );

$order = "`rank1` DESC,`rank2` DESC,`rank3` DESC,`rank4` DESC,Song.trackname ASC";

$conditions = array(
            "OR" => array(
                "MATCH(Song.denorm_artist) AGAINST ({$unstemmedSearch} IN BOOLEAN MODE)",
                "MATCH(Song.denorm_trackname) AGAINST ({$stemmedSearch} IN BOOLEAN MODE)",
                "MATCH(Song.denorm_album) AGAINST ({$stemmedSearch} IN BOOLEAN MODE)",
                "MATCH(Song.denorm_categories) AGAINST ({$stemmedSearch} IN BOOLEAN MODE)"
            )
        );

$results = $this->find(‘all’,array(‘conditions’=>$conditions,’fields’=>$fields,’order’=>$order);

$results will be the array of songs sorted with the criteria we defined earlier.

This solution can be used to generate searches that are meaningful to the user – without requiring too much time from the developers or adding major complexity to the code.

Making CakePHP Searches Even Better

It’s worth mentioning that “spicing” the denormalized columns with more data can lead to better results.

By “spicing” I mean that you could include, into the denormalized columns, more data from additional columns you consider useful with the goal of making results more relevant, for instance if you knew that an artist’s country could figure in the search terms, you could add the country along with the artist name in the denorm_artist column. This would improve the quality of the search results.

From my experience (depending on the actual data that you use and the columns you denormalize) the topmost results tend to be really accurate. This is great for mobile apps, since scrolling down a long list can be frustrating for the user.

Finally, if you need to get more data from the tables that the song relates to, you always can make a join and get the artist, categories, albums, song comments, etc. If you are using CakePHP’s containable behavior filter, I’d suggest adding the EagerLoader plugin to accomplish the joins efficiently.

If you have your own approach to implementing full-text searching, please share it in the comments below. We can all learn from each other’s experience.

go to top