On the roadmap for future releases of Sitecore 9 is true SolrCloud support. This has been a sticking point for many scaled Sitecore implementations since SolrCloud was first introduced with Solr 4.0 in 2012. For the most part, Sitecore implementations have relied on either the Solr master-slave model to ensure high availability and load balancing (at least for queries) or have muddled through with SolrCloud approximations.
A history lesson
To understand where Sitecore stands with regard to SolrCloud support, you have to know a little history behind how search was implemented in Sitecore. There are two mechanisms for finding items in Sitecore. The first is Sitecore’s database query mechanism, which relies on an XPath-like syntax for traversing the content tree. This method is slow, but it works for simple queries that traverse parent-child relationships and is often used with Sitecore’s field types that display hierarchical relations: the treelist, droplist, and droptree, for example.
The second method is to use a full-text search provider. Back in the Sitecore 4 and 5 days, this was dtSearch, but Lucene quickly became the default full-text search provider, thanks to the extensive documentation and examples available. Like dtSearch, Lucene was a full-text indexing engine meant to be embedded in applications. These tools worked great on a developer machine or in a single-server “all-in-one” demo configuration, but caused issues when deployed into a multi-server environment. Since each server maintained its own index, out-of-sync indexes were extremely common.
Solr, a search server built on the Apache Lucene project, was introduced to solve this scalability issue. It offered an enterprise platform that used the Lucene API, supported many of the extensions and plugins developed by that community, and allowed indexing and search to run on a separate instance. Since its API was largely compatible with Lucene’s, it was easy for Sitecore and other CMS platforms to transition to Solr.
Solr scalability models
The early approach to Solr scalability was the master-slave model. One server in a Solr cluster — the master — performs all the write operations, but any server in the cluster can respond to queries. This approach is one of eventual consistency for read operations. If a slave fails, the load can be redistributed to the other members of the cluster. If the master fails, however, no new write operations can be performed until one of the slaves can be reconfigured to act as the new master.
Master-slave was a challenge for Sitecore for two reasons.
- Sitecore assumes that its indexing and search server has a single URL. In a master-slave cluster it really needs two: the load-balanced URL used for reads, and the URL of the master Solr server for writes.
- To recover from a master failure, manual intervention was required to promote a slave to master, including a restart of that slave’s Solr server. This is why a true high-availability Solr master-slave arrangement should really be called a “master-slave-slave” configuration: it needs a minimum of three instances, so that one slave can keep serving queries while another is promoted.
To address the first problem, you need to configure your load balancer to spread all GET requests across all of your master and slave instances, but to route POST requests only to the master server. Or you need to modify the Sitecore code and configuration to separate writes from reads, and send each to a different URL. (The second approach may be required if your query parameters become so complex that you run into the GET request character limit for the query string.)
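As a minimal sketch of the first approach, here is what method-based routing could look like in nginx. The hostnames, ports, and paths are hypothetical, and this is an illustration rather than a hardened production config:

```nginx
# Hypothetical hosts: solr-master, solr-slave1, solr-slave2.
upstream solr_read {
    server solr-master:8983;
    server solr-slave1:8983;
    server solr-slave2:8983;
}

upstream solr_write {
    server solr-master:8983;   # only the master accepts index updates
}

# Pick a backend based on the HTTP method:
# GETs (queries) fan out across the cluster, POSTs (updates) hit the master.
map $request_method $solr_backend {
    default solr_read;
    POST    solr_write;
}

server {
    listen 8080;

    location /solr/ {
        proxy_pass http://$solr_backend;
    }
}
```

Note the caveat from above: if your queries grow past the GET query-string limit and have to be sent as POSTs, method-based routing breaks down, and you are pushed toward the second approach of separating read and write URLs in Sitecore itself.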
To address the second problem — which was an issue for all systems using Solr, not just Sitecore — the Solr project introduced SolrCloud.
Despite its name, SolrCloud has nothing to do with cloud computing; you can deploy SolrCloud entirely in an on-premises datacenter. It is simply the name of the newer high-availability and scalability model that replaces the old master-slave-slave approach.
In a SolrCloud cluster, ZooKeeper nodes determine which Solr instance acts as the primary responsible for write operations. If the primary fails, one of the secondaries is automatically promoted to be the new primary. There may be a few write errors while the failure is detected, a leader election runs, and the new primary takes over write responsibilities, but this window is generally short. No manual intervention is required.
This is a great approach, except for one problem: Sitecore (and the Solr libraries it depends on) doesn’t understand ZooKeeper.
Who’s in charge of this zoo?
In the Java world, the SolrJ library has had ZooKeeper support “baked in,” so clients don’t have to figure out which Solr node is the primary. SolrJ’s CloudSolrClient communicates with the ZooKeeper nodes to discover Solr endpoints. For write operations, it determines which Solr instance will handle the request, and all index updates are transparently sent to that instance. For read operations, CloudSolrClient uses the LBHttpSolrClient class as a software load balancer. There’s no additional work required!
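To make the contrast with the .NET side concrete, here is a sketch of what that looks like in Java with SolrJ (the 7.x-era builder API). The ZooKeeper addresses and collection name are hypothetical, and the snippet assumes a running SolrCloud cluster, so treat it as an illustration of the API shape rather than something to paste into production:

```java
import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudClientSketch {
    public static void main(String[] args) throws Exception {
        // The client is handed the ZooKeeper ensemble, not a Solr URL:
        // it asks ZooKeeper which node currently leads each shard.
        CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                Optional.empty())
            .build();
        client.setDefaultCollection("sitecore_master_index");

        // Writes are routed transparently to the current leader...
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "item-1");
        client.add(doc);
        client.commit();

        // ...while reads are load-balanced across healthy replicas.
        client.query(new SolrQuery("*:*"));
        client.close();
    }
}
```

Notice that nothing in the client code names a master or a slave; leader discovery and failover are entirely the library’s job. This is exactly the piece that has been missing on the .NET side.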
But Sitecore is a Microsoft .NET application, and its ContentSearch API ultimately relies on a library called SolrNet. SolrNet is not maintained by the Apache Solr project itself, but by a group of open-source developers. As a result, SolrNet’s support for new Solr versions often lags behind that of the Java client libraries. This is why Sitecore only supports up to the Solr 6 series, even though Solr 8 will be released very soon.
And for a long time, SolrNet didn’t fully support Solr 6: it had no equivalent of CloudSolrClient and no knowledge of how to communicate with ZooKeeper. But as of January 2018, developers can finally get a version of SolrNet with SolrCloud support and compatibility with versions up to Solr 7. There’s even a SolrNet.Cloud NuGet package.
Now all we have to do is wait for Sitecore to incorporate it into the platform. What about it, Sitecore? Can we have it in time for Symposium this year? Or at least by Christmas?