Le Vacataire -- a simple skeletal failover system

License

This project is copyright (2010) by TD Meyer, and is available to you under the terms of the GNU Affero General Public License.

See http://www.gnu.org/licenses/agpl-3.0.html

Project Motivation

This started out as a need for a quick and dirty "heartbeat" program. A basic use case would involve creating two identical machines (say, two webservers with the same applications/content) and placing them behind a load balancer. The specific need was to manage an Apache HTTP server acting as a proxy to an Apache Tomcat application server, but the system was designed to work with any kind of service.

A note on terminology: To avoid ambiguity, I'm trying to be consistent:

  • Load Balancer - the internet-facing component of your deployment, something that responds to HTTP requests on port 80, dispatching those requests to one or more active webservers behind it (on a private LAN)
  • Web Server, Web Service - The program running on your physical machine that responds to those requests from the load balancer (above)
  • Application Server - Something like Apache Tomcat, that responds to proxied requests from the web server

    Load Balancing Basics

    In a situation where you have a reasonably effective load balancer, you can keep both web servers running, and the load balancer will distribute requests between them while managing state so that the association between a given web user and a particular server is maintained across requests. The easiest distribution scheme is a "round-robin" mechanism, which iterates through the list of available web servers (from 1..n) and dispatches requests equally among them. More sophisticated mechanisms take additional factors into account, like server load and response time.
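
    As a rough illustration (not part of Le Vacataire itself), round-robin dispatch amounts to little more than cycling over the backend list; the host names below are made up:

        # Minimal sketch of round-robin dispatch (illustrative only).
        from itertools import cycle

        # Hypothetical backend hostnames -- substitute your own servers.
        servers = cycle(["server1.internal", "server2.internal"])

        def next_backend():
            """Return the next backend in strict rotation, ignoring load and session state."""
            return next(servers)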

    A stateful load balancer will keep connections persistent between a specific client and a specific server, which is helpful, because if I'm relying on server-side cookie session management, I don't want to log in again and again for each request.

    Less capable load balancers (I won't mention any names) will simply check for an active service ("Is port 80 open on this server? Good.") and direct requests in a round-robin (or other) fashion, ignoring session state.

    This is a Bad Thing for session management.
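
    For reference, that naive "is port 80 open?" style of liveness test amounts to little more than a TCP connect, roughly like this (a sketch, not code from this project):

        # Rough sketch of a bare TCP liveness check (illustrative only).
        import socket

        def port_is_open(host, port=80, timeout=2.0):
            """Return True if a TCP connection to host:port succeeds within the timeout."""
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False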

    This program works specifically in these instances:

  • Your load balancer can detect an outage (e.g. "No service is running on Server 2, port 80, so desist in sending requests to it")
  • Your web application requires stateful load balancing but your load balancer won't do it (such load balancers do exist)

    Architecture

    A typical deployment looks like this:

    
                          +---------------+
                          | Load Balancer |
                          +---------------+
                                 |                        
                    -------------+------------
                    |                        |
               +----------+            +----------+
               | Server 1 |            | Server 2 |
               +----------+            +----------+
    
                      
    

    In this case, the "Load Balancer" functions less as a load balancer, and more as a 'failover detector.' Both servers are powered on, but only Server 1 is running the web service.

    Le Vacataire should be configured to run on both servers. One (Server 1) is identified as the "Master," and the other (Server 2) is the "Standby." The software running on Server 1 will poll the desired web services periodically, and when a failure event (see below) is discovered, it will notify its peer (Server 2) that it's having trouble. Server 2 then starts up the failover/standby services, while Server 1 shuts its own down as a precaution.
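
    Conceptually, the master's side of that exchange is a simple poll/notify loop. The sketch below uses hypothetical helper names (notify_standby, stop_local_services); it isn't Le Vacataire's actual API, just the shape of the cycle:

        # Conceptual sketch of the master's poll/notify cycle; helper names are
        # hypothetical placeholders, not Le Vacataire's real interface.
        import time

        def run_master(checks, notify_standby, stop_local_services, interval=30):
            """Poll each failure-event check; on the first failure, hand off to the standby."""
            while True:
                if any(check() for check in checks):   # a check returns True on failure
                    notify_standby()                   # tell Server 2 to start its services
                    stop_local_services()              # shut ours down as a precaution
                    break
                time.sleep(interval)                   # wait before the next poll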

    Communication between the servers can take place over a public network (given the right ports are allowed through any firewalls) or through a private, nonroutable network or VPN (depending on your situation). One admin was able to deploy with Amazon AWS--using their load balancer in one location, and two web servers in different locations.

    At this point the load balancer magic takes place: it notices Server 1 is offline (e.g. not responding on port 80, since we shut down all its web services), but Lo! It detects that Server 2 is now active and re-routes requests there. In the next release, we'll add notification functionality so Server 1's outage can be signaled (by email or SNMP) to a human operator for intervention.

    Once Server 1 is manually recovered, Server 2 is asked to quiesce and the master/standby relationship resumes--handled automagically, at least in part, by the load balancer.

    Failure Events

    You can write plugin code to detect any kind of 'event' you want. I've provided a few examples in the plugins/ directory:

  • pcheck.py - Ensures that the specified process is running. If something critical has died, we can detect it by its absence from the process table.
  • memcheck.py - Checks to see if we have enough available memory (buffers in core, swap space) to continue operation. You can configure these values as a percentage.
  • content.py - Looks for incomplete content. An example use case would be using Apache HTTP to proxy a local or remote Apache Tomcat server (though, as you can imagine, the same logic would let you do nginx and wsgi/python or whatever else). If the response to the request you specify isn't exactly the number of bytes you expect, we assume something went wrong on the backend--an uncaught exception, a broken app server, or some other condition causing the server to behave erratically--and treat it as a failure (see the sketch after this list).
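
    To give a flavor of the plugin logic, here is a minimal version of the content check described above (a sketch only; the bundled content.py's interface and configuration format may differ):

        # Sketch of a content-completeness check in the spirit of content.py
        # (illustrative only; the real plugin's interface may differ).
        import urllib.request

        def content_check_failed(url, expected_bytes, timeout=5.0):
            """Return True (failure) unless the response body is exactly expected_bytes long."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    body = resp.read()
            except OSError:
                return True                    # connection errors count as failures too
            return len(body) != expected_bytes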

    TODO

  • Fancier load balancers can be configured to poll a web server (machine) to query server health. The plugins one would write for this service could be configured to respond to those polls, enhancing the load balancer's ability to dispatch requests intelligently (depending on its balancing algorithm). A rough sketch of such a responder follows this list.
  • I'm not totally sure this works yet. I've actually used it in production for a period of time, but that's no guarantee of correctness.
  • It could also benefit from a cleaner configuration/interface/install.
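
    For the health-poll item above, an entirely hypothetical responder might look like the following--nothing like this ships with Le Vacataire today; it only illustrates the idea:

        # Entirely hypothetical sketch of a health-poll responder (not part of the
        # current code base); answers 200 when healthy, 503 when a check reports failure.
        from http.server import BaseHTTPRequestHandler, HTTPServer

        def serve_health(check_failed, port=8081):
            """Serve load-balancer health polls based on a failure-check callable."""
            class Handler(BaseHTTPRequestHandler):
                def do_GET(self):
                    self.send_response(503 if check_failed() else 200)
                    self.end_headers()

            HTTPServer(("", port), Handler).serve_forever()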