Building a Resilient Search Architecture for GitHub Enterprise Server: A Step-by-Step Guide

Overview

Search is the silent workhorse of GitHub Enterprise Server. It powers not only the obvious search bars and filtering experiences—like the Issues page—but also the Releases page, Projects page, and even the counts for issues and pull requests. When search goes down, the entire platform feels sluggish or breaks. Over the past year, our engineering team has overhauled the search architecture to make it far more durable, reducing the administrative burden on your team and letting you focus on what matters: shipping great software.

Source: github.blog

This guide walks through the problem, the solution, and the exact steps to implement a high-availability (HA) search setup that minimizes downtime and avoids common pitfalls. Whether you're an experienced GitHub Enterprise Server administrator or new to HA, you'll find actionable advice and code examples.

Prerequisites

Before you begin, ensure you have the following:

GitHub Enterprise Server 3.6 or later (the version where the new search architecture was introduced).
A High Availability (HA) configuration with at least one primary node and one replica node. If you don’t have HA set up, see the official guide.
Root or sudo access to all nodes.
Sufficient disk space for the Elasticsearch indexes (recommended at least 10% free space on the volumes used by search).
Network connectivity between nodes on ports 9200-9300 (Elasticsearch) and 443 (management).
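The free-space prerequisite above can be checked with a small helper. This is a minimal sketch, not an official tool: the function names free_percent and has_enough_free are hypothetical, and it assumes the POSIX `df -P` output layout, where the fifth column is the used-capacity percentage.

```shell
# free_percent: read `df -P <path>` output on stdin and print the free space
# as an integer percentage. With -P, df prints one header line followed by
# one line per filesystem; column 5 is the used capacity (for example "75%").
free_percent() {
  awk 'NR == 2 { print 100 - ($5 + 0) }'
}

# has_enough_free: succeed when the free percentage read from stdin is at
# least the minimum given as $1 (hypothetical helper for illustration).
has_enough_free() {
  free=$(free_percent)
  [ -n "$free" ] && [ "$free" -ge "$1" ]
}
```

A possible invocation is `df -P <search-volume> | has_enough_free 10`, where `<search-volume>` stands for whichever mount point holds the search indexes on your instance.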

Step-by-Step Instructions

1. Understand the Old Architecture (and Why It Failed)

In previous versions, GitHub Enterprise Server used a clustered Elasticsearch setup that stretched across the primary and replica nodes. This meant that primary Elasticsearch shards (which handle writes) could move to a replica node. If that replica was taken down for maintenance, the entire search system could deadlock: the replica would wait for Elasticsearch to become healthy, but Elasticsearch couldn’t become healthy until the replica rejoined. This brittle behavior forced administrators to follow exact upgrade sequences or risk index corruption.

The new architecture replaces clustering with a simpler search mirroring approach. Instead of a shared cluster, each node maintains its own independent Elasticsearch instance. The primary node acts as the authoritative source, and replicas asynchronously copy the index data. This eliminates the deadlock scenario and makes maintenance safe.

2. Prepare the Environment

Before making any changes, take a full snapshot of your GitHub Enterprise Server instance (both primary and replica). Then, put the replica into maintenance mode to prevent traffic during the transition:

ghe-ssh -r <replica-host> -- 'ghe-replica-start-maintenance'

Verify the replica is in maintenance mode:

ghe-ssh -r <replica-host> -- 'ghe-replica-status'

You should see Maintenance mode: active.

3. Migrate the Primary Node

The primary node must be upgraded first. SSH into the primary node and run the upgrade script that ships with GitHub Enterprise Server:

ghe-upgrade <path-to-new-package>.pkg

After the upgrade completes, the new search architecture will be enabled automatically. The primary node will now run its own Elasticsearch instance in standalone mode—no longer participating in a cross-node cluster.

Wait for the services to restart completely:

ghe-status

All services should show running or up. Check Elasticsearch specifically:

curl 'http://localhost:9200/_cluster/health?pretty'

You should see "status" : "green" and "number_of_nodes" : 1.
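If you prefer to script this check rather than read the output by eye, something like the following may help. It is a rough sketch under stated assumptions: json_field and search_is_standalone_and_green are hypothetical names, and the grep/sed extraction only handles simple scalar fields such as status and number_of_nodes. A real JSON parser is more robust where one is available.

```shell
# json_field: print the value of a simple scalar field ($1) from JSON on stdin.
# Only the first occurrence is returned; nested or escaped values are not handled.
json_field() {
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*\"\{0,1\}[^\",}]*" \
    | head -n 1 \
    | sed 's/.*:[[:space:]]*"\{0,1\}//'
}

# search_is_standalone_and_green: succeed only when the health document on
# stdin reports status "green" and exactly one node, as the text expects.
search_is_standalone_and_green() {
  health=$(cat)
  status=$(printf '%s\n' "$health" | json_field status)
  nodes=$(printf '%s\n' "$health" | json_field number_of_nodes)
  [ "$status" = "green" ] && [ "$nodes" = "1" ]
}
```

One way to use it is `curl -s 'http://localhost:9200/_cluster/health' | search_is_standalone_and_green && echo "search is ready"`.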

4. Migrate the Replica Node

With the primary stable, now upgrade the replica node. SSH into the replica and run:

ghe-upgrade <path-to-new-package>.pkg

After the upgrade, the replica will automatically configure its own search instance and start mirroring the primary’s index. You can monitor the replication status:

ghe-search-mirror-status

Look for Replication lag: 0 or a very low value (under 1,000 documents). If the lag is high, wait a few minutes and check again.
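The lag check can also be automated. The sketch below assumes the status output contains a line of the form "Replication lag: N", as described above; the exact output format may differ on your version, and the function names replication_lag and lag_is_low are hypothetical.

```shell
# replication_lag: print the number that follows "Replication lag:" in the
# text read from stdin. Prints nothing when no such line is present.
replication_lag() {
  sed -n 's/.*Replication lag:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# lag_is_low: succeed when a lag value was found and it is below the
# limit given as $1 (for example 1000 documents).
lag_is_low() {
  lag=$(replication_lag)
  [ -n "$lag" ] && [ "$lag" -lt "$1" ]
}
```

For example, `ghe-search-mirror-status | lag_is_low 1000 || echo "lag still high, check again later"`. A missing lag line counts as a failure so that an unexpected output format is not mistaken for a healthy mirror.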

5. Test Failover

To ensure high availability actually works, simulate a primary failure. On the replica node (now upgraded), trigger a manual failover:

ghe-repl-promote

The replica will become the new primary. Verify that the search interface works by running a search on the new primary:

curl 'http://localhost:9200/<index>/_search?q=test'

You should get results. If not, check the logs at /var/log/elasticsearch/.

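After testing, you can fail back to the original primary if desired. The original primary must first be set up as a replica of the new primary and fully synchronized; once it is, run the promotion command there.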

Common Mistakes

1. Not Putting the Replica in Maintenance Mode

Some administrators skip this step to save time. Without maintenance mode, the replica continues to serve traffic, and the upgrade may cause partial outages or data inconsistency. Always run ghe-replica-start-maintenance first.

2. Upgrading Both Nodes Simultaneously

Even though the new architecture is independent, you must upgrade the primary first. Upgrading both at once can cause the replica to try to connect to a primary that hasn’t fully restarted, leading to replication failures.
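One way to make the ordering hard to get wrong is to wrap the steps in a small runner that stops at the first failure. This is only an illustrative sketch: upgrade_in_order is a hypothetical helper, and the step functions you pass to it (for instance wrappers around the upgrade and health-check commands from the steps above) are yours to define.

```shell
# upgrade_in_order: run each named step in sequence and stop at the first
# one that fails, so a later step (the replica upgrade) never starts
# before an earlier one (the primary upgrade and its health gate) succeeds.
upgrade_in_order() {
  for step in "$@"; do
    echo "running: $step"
    if ! "$step"; then
      echo "stopped at: $step" >&2
      return 1
    fi
  done
}
```

With three hypothetical functions defined, the call would look like `upgrade_in_order upgrade_primary primary_is_healthy upgrade_replica`. Because each step is an ordinary shell function, the health gate can be as strict as you like before the replica is touched.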

3. Ignoring Index Health

After the upgrade, always check Elasticsearch health on both nodes by running curl localhost:9200/_cluster/health on each. A yellow status means every primary shard is assigned but at least one replica shard is not; a red status means at least one primary shard is unassigned. If the status is not green, the index may be incomplete. Repair by reindexing or contacting support.
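For monitoring systems that expect an exit code, the health status can be mapped to one. The following is a sketch with a hypothetical function name, health_exit_code; the 0/1/2/3 convention mirrors common monitoring plugins and is an assumption, not something the product prescribes.

```shell
# health_exit_code: translate an Elasticsearch cluster status string ($1)
# into a message and a monitoring-style exit code:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
health_exit_code() {
  case "$1" in
    green)  echo "OK: all shards assigned"; return 0 ;;
    yellow) echo "WARNING: some replica shards unassigned"; return 1 ;;
    red)    echo "CRITICAL: at least one primary shard unassigned"; return 2 ;;
    *)      echo "UNKNOWN: unexpected status '$1'"; return 3 ;;
  esac
}
```

Treating any unrecognized value as UNKNOWN, rather than OK, keeps an empty or malformed health response from passing silently.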

4. Forgetting to Verify Replication

Replication lag can grow unnoticed during high traffic. Schedule periodic checks with ghe-search-mirror-status and set up monitoring alerts if lag exceeds 10,000 documents. The new architecture handles this better, but vigilance is still required.
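A periodic check can look at both the latest lag value and its direction. The sketch below assumes you already collect lag samples as plain integers, oldest first; lag_trend and lag_needs_alert are hypothetical names, and the 10,000-document threshold is the one suggested above.

```shell
# lag_trend: given two or more lag samples (oldest first) as arguments,
# print "growing" if every sample is larger than the one before it,
# otherwise print "stable".
lag_trend() {
  if [ "$#" -lt 2 ]; then
    echo "stable"
    return 0
  fi
  prev=""
  for sample in "$@"; do
    if [ -n "$prev" ] && [ "$sample" -le "$prev" ]; then
      echo "stable"
      return 0
    fi
    prev=$sample
  done
  echo "growing"
}

# lag_needs_alert: $1 is the threshold, the remaining arguments are samples.
# Succeeds when the most recent sample exceeds the threshold.
lag_needs_alert() {
  threshold=$1
  shift
  latest=""
  for sample in "$@"; do
    latest=$sample
  done
  [ -n "$latest" ] && [ "$latest" -gt "$threshold" ]
}
```

For instance, `lag_needs_alert 10000 100 2000 12000` succeeds, and `lag_trend 100 2000 12000` prints "growing", which together suggest the mirror is falling behind rather than recovering from a short burst.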

Summary

The search architecture overhaul removes the brittle cluster mode that previously plagued GitHub Enterprise Server HA setups. By migrating to independent Elasticsearch instances per node with async mirroring, you eliminate deadlock scenarios and reduce maintenance complexity. Follow the upgrade order—primary first, then replica—and always test failover. With these steps, your search will remain highly available, even during planned maintenance or unexpected failures.
