Solr in AWS: Shards & Distributed Search

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, database integration, and geospatial search. It is highly scalable, providing distributed search and index replication.

Documents are indexed into Solr thereby making them searchable. The problem is 2-fold as the size of the Solr index increases: (a) index becomes too large to fit on a single system and (b) a single query takes too long to execute. One solution is sharding.

Sharding is the process of breaking a single logical index in a horizontal fashion across records. It is a common database scaling strategy when you have too much data for a single database. In Solr terms, sharding is breaking up a single Solr core across multiple Solr servers. Solr has the ability to take a single query and break it up to run over multiple Solr shards, and then aggregate the results together into a single result set.

Fig 1: Solr Shards Architecture

This isn’t a completely transparent operation though. The key constraint is when indexing the documents you need to decide which Solr shard gets which documents. Solr doesn’t have any logic for distributing indexed data over shards. Then when querying for data, you supply a shards parameter that lists which Solr shards to aggregate results from. Lastly, every document needs a unique key (ID), because you are breaking up the index based on rows, and these rows are distinguished from each other by their document ID.

Setting up Solr in the Cloud

  • Launch an Amazon EC2 Linux instance with Java installed
  • Download Solr latest stable release
  • Unzip the Solr release and change the working directory to be the “example” directory
user@solr:/opt$ unzip apache_solr_3.3.0.zip
user@solr:/opt$ cd apache_solr_3.3.0/example

Copy the (multi-core supported) SolrHome directory, which contains the schema and config files, into this instance under /user/home

Start the Solr server by specifying the solr-home directory.

user@solr:/opt/apache_solr_3.3.0/example$ java –Dsolr.solr.home=/user/home/SolrHome –jar start.jar

Your solr server is up and running. Let’s consider that the Public URL for the solr server is:  http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities

Repeat the above steps again to launch a similar instance with a Public URL say http://ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities

We now have 2 Solr shards running in the Cloud!

Initializing the Solr shards

Let’s use Solrj, a java client to access Solr. It offers a java interface to add, update, and query the index.

public class SolrShards {
   private final int NO_OF_SHARDS = 2;
   private CommonsHttpSolrServer[] solrServerShard = null;

   public SolrShards () {
      String[] solrShardLocations = {" http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities ", " http://ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities "};
      try {
         solrServerShard = new CommonsHttpSolrServer[NO_OF_SHARDS];
         for (int i = 0; i < solrShardLocations.length; i++) {
            solrServerShard[i] = new CommonsHttpSolrServer(solrShardLocations[i]);
            ((CommonsHttpSolrServer)solrServerShard[i]).setParser(new XMLResponseParser());
         }
      }
      catch (MalformedURLException e) {
      }
   }
   ...
}

Indexing documents into shards

As mentioned above, Solr doesn’t have any logic for distributing indexed data over shards. uniqueId.hashCode() % numServers determines which server a document should be indexed at.

solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].addBean(businessEntity);
solrServerShard[businessEntity.getId().hashCode() % NO_OF_SHARDS].commit();

Distributed Search: Searching across shards

The ability to search across shards is built into the query request handlers. You do not need to do any special configuration to activate it. In order to search across two shards, you would issue a search request to Solr, and specify in a shards URL parameter a comma delimited list of all of the shards to distribute the search across. You can issue the search request to any Solr instance, and the server will in turn delegate the same request to each of the Solr servers identified in the shards parameter. The server will aggregate the results and return the standard response format.

public class SolrShards {
   …
   private CommonsHttpSolrServer getSolrServer () {
      String solrLocation = "http://ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities";
      CommonsHttpSolrServer solrServer = null;
      try {
         solrServer = new CommonsHttpSolrServer(solrLocation);
         ((CommonsHttpSolrServer)solrServer).setParser(new XMLResponseParser());
      }
      catch (MalformedURLException e) {
      }
      return solrServer;
   }
}

public void QueryBusinessEntities() {
   SolrServer solrServer = getSolrServer ();
   SolrQuery query = new SolrQuery();
   query.setParam("shards", "ec2-174-129-98-167.compute-1.amazonaws.com:8983/solr/businessentities, ec2-175-130-99-168.compute-1.amazonaws.com:8983/solr/businessentities");
   query.setQuery("address:India");
   query.addSortField("profit", SolrQuery.ORDER.asc);
   QueryResponse rsp = solrServer.query(query);
   …
}

That’s it! You have set up 2 Solr shards on 2 different machines and performed distributed search successfully.

Caveats:

  • Documents must have a unique key and the unique key must be stored (stored=”true” in schema.xml)
  • The unique key field must be unique across all shards.
  • You typically only need sharding when you have millions of records of data to be searched.
  • If running a single query is fast enough, and if you are looking for capacity increase to handle more users, then use the index replication approach instead!

Solr Standalone v/s Solr Shards – Performance Comparison

In the first test that I carried out, the performance of indexing was measured. From Table 1, it is observed that on an average it needed 1 hour to index 1 million records. But the point to notice is that when shards were used, twice the amount of data was indexed in the same time. (Standalone = 3.5 million, Shards = 8 million)

Table 1: Indexing Time

The second test pertains to measuring the performance of query response times. As you can see from Table 2, there is no observable difference in query response times for stand alone as well as sharded servers though the shards contained twice the amount of data.

Table 2: Query Response Times

References –

1. http://wiki.apache.org/solr/DistributedSearch
2. http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881

Advertisements
This entry was posted in Technical and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s