
Clustering Technologies

I have been asked to give a very brief overview of the clustering technologies that we can utilise for high availability. We are, therefore, going to ignore high-performance computational clustering, as that is about more power rather than redundancy. The two main techniques that we use are a shared-resource cluster (usually built around some kind of shared disk array) and a network load balancing cluster, which does exactly what it says on the tin! We'll deal with each of these in turn here, but they can be used together to provide a complete solution.

The goals of server clusters are to share computing load over several systems whilst maintaining transparency for system administrators. There is also transparency for users, in that a user has no idea which node in the cluster they have connected to, or indeed that they are connected to a cluster at all, as it will appear as one machine. If a component fails, users may suffer degraded performance, but they do not lose connectivity to the service. If more power is required at a later date, it is easy to add more components to the cluster, thereby sharing the load across more systems.

Three principal features of clusters:
  • Availability - continued service in the event of the failure of a component
  • Scalability - new components can be added to compensate for increased load
  • Simplified management - administer the group of systems & applications as one unit

Shared-Resource Clustering

The Shared Resource Cluster is used for Line-of-business applications, e.g. database, messaging, file/print servers. Behind this technology is a physical shared resource, such as a direct attached storage device. Services then fail over from one node to another in the cluster, ensuring high availability of core, critical services. A rolling upgrade process enables the addition or upgrading of resources, which, when coupled with the high availability, ensures that line-of-business applications are online when needed. This technology removes the physical server as the single point of failure.

This type of cluster doesn't give us the scalability of the Network Load Balancing cluster, as only one node 'owns' the shared storage at any one time. If you are running a Database Management System (DBMS) on the cluster, then only one node can run the DBMS at any one time and all access to the DB is via that node. However, if it fails, then another node will 'seamlessly' take over. In order for this to happen quickly and smoothly, all nodes in the cluster run the DBMS 'minimized', i.e. they load it into memory but do not service any requests. Then, in the event of a failure, they can start responding very quickly after detecting the failure. The important thing to remember is that they are not all running the DBMS and servicing requests at the same time. Also, there is no replication to worry about, as the actual data is saved on the shared storage array that all nodes have access to. It is like ripping the hard disk out of one machine and putting it in the other. Of course, after a failover, we can fail back the service when the node has been fixed.
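The active/standby behaviour described above can be sketched in a few lines. This is a minimal illustration, not any real cluster API: every node has the DBMS loaded, but only the node that currently owns the shared storage will service requests, and failover is simply a transfer of that ownership.

```python
# A minimal sketch (not any real cluster product's API) of
# shared-resource failover: every node has the DBMS in memory,
# but only the storage owner services requests.
class Node:
    def __init__(self, name):
        self.name = name
        self.owns_storage = False  # only one node owns the shared array

    def serve(self, query):
        if not self.owns_storage:
            raise RuntimeError(f"{self.name} is standby; not servicing requests")
        return f"{self.name} answered: {query}"

active, standby = Node("CNode1"), Node("CNode2")
active.owns_storage = True

# CNode1 fails: ownership of the shared storage moves to CNode2,
# which can respond quickly because the DBMS is already loaded.
active.owns_storage = False
standby.owns_storage = True
print(standby.serve("SELECT 1"))
```

Note that no data is copied anywhere during the failover; only the ownership flag moves, which is why the takeover can be so fast.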

The logical view of a shared resource cluster is shown here for a 2-node cluster running 4 services (actually, these are a bit made up as you wouldn't run a web server on this, you'd run that on an NLB cluster). Physically, there are two nodes (server boxes) in the cluster and we are running 2 services or virtual servers on each (the blue balls). This is not to be confused with virtualization with VMWare or Hyper-V, we aren't virtualizing whole servers, just services. However, these services are exposed as separate machines to the network.

The diagram shows the client view of the cluster, which has two physical nodes (CNode1 and CNode2). These aren't directly accessed by clients, only by IT support for configuration and cluster services. However, each 'Virtual Server' is advertised in the DNS with its own name and IP address. Therefore, clients wanting to connect to Exchange will connect to C_VS3, which is usually running on CNode2. How does this happen? Well, a machine can have multiple IP addresses assigned to each network card. So CNode2 has a dedicated IP address of its own, as well as the IP addresses of the 'Virtual Servers' it is hosting. If this node fails, then CNode1 will take over and these IP addresses will be assigned to it. In this way, the name and IP address of a service don't change even in the event of the failure of the node you were connected to.
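The mapping of virtual servers to physical nodes can be sketched as below. The names CNode1, CNode2 and C_VS3 come from the example above; the other virtual server names (C_VS1, C_VS2, C_VS4) are assumed placeholders for the four services mentioned.

```python
# Hypothetical 2-node cluster: each virtual server keeps its own
# advertised name and IP; only the mapping to a physical node moves.
virtual_servers = {
    "C_VS1": "CNode1",
    "C_VS2": "CNode1",
    "C_VS3": "CNode2",
    "C_VS4": "CNode2",
}

def fail_over(failed, survivor):
    """Reassign every virtual server from the failed node to a survivor."""
    for vs, node in virtual_servers.items():
        if node == failed:
            virtual_servers[vs] = survivor

# CNode2 fails: its virtual servers (and their IP addresses) are
# taken over by CNode1; clients still reach C_VS3 by the same name.
fail_over("CNode2", "CNode1")
print(virtual_servers["C_VS3"])  # now hosted on CNode1
```

From the client's point of view nothing has changed: the virtual server's name and address are constant, and only the physical machine answering behind them has moved.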

Network Load Balancing Cluster

A Network Load Balancing (NLB) cluster is used literally for load balancing network traffic and processing load across multiple machines, e.g. a Web server farm, remote desktops, VPN and streaming media. This cluster type gives high availability and good scalability. Additional nodes can easily be added to the cluster to expand it, and if any single node fails the cluster automatically detects the failure and redistributes the extra load, presenting a seamless service to users. This is achieved by load balancing all incoming IP traffic. Some of the benefits include the ability to scale Web applications by quickly and incrementally adding additional servers in a rolling upgrade whilst ensuring Web sites are always online. The important distinction between this and the shared-resource cluster is that all the nodes are running the same service with the same data at the same time, e.g. web servers with the same website on their local storage or a common network location. There are several different solutions to NLB, each with advantages and disadvantages. The most common are round-robin DNS (RRDNS), a central dispatcher, and negotiated statistical mapping.

The simplest form of NLB is to use Round-Robin Domain Name Service (RRDNS), which, as the name suggests, simply issues IP addresses from a list in a round-robin fashion. For example, if you have four web servers all serving the same website, then you can enter an alias into your DNS pointing at each of the four nodes' IP addresses. When a client queries your DNS for the IP address of the site, they will be given the address of the first node in the list. The next query to the DNS will be given the address of the second node, and so on. This has the advantages of being cheap and easy. Also, you don't need any special equipment or nodes that are aware of clustering technologies. However, the disadvantages are that the load is not distributed fairly and there is no detection of failed nodes. Imagine if every fourth query required very heavy processing, and the rest were simple GET requests. One node would get hammered and the others would sit around spinning their wheels. Also, if one node fails, the DNS server will still send every fourth query to that node, as it doesn't know about the failure.
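The round-robin behaviour is easy to simulate. The addresses below are made-up documentation IPs, not from the original example, but the rotation logic is exactly what an RRDNS server does, including its main weakness: it keeps handing out an address even after that node has died.

```python
from itertools import cycle

# Hypothetical A records for the same alias; the DNS server just
# cycles through them, one address per query.
records = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
rotation = cycle(records)

def resolve():
    """Return the next IP address in round-robin order."""
    return next(rotation)

# Eight queries walk the list twice. Note there is no health check:
# a dead node's address is still handed out every fourth query.
answers = [resolve() for _ in range(8)]
print(answers)
```

The same effect is normally achieved simply by publishing multiple A records for one name in the zone file; no cluster-aware software is needed on the nodes at all.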

The second, and in many ways the most sophisticated, is the central dispatcher. This relies on a central device to receive all incoming requests and distribute them among a set of nodes. The central device does not do the processing itself; it merely distributes work fairly, and only to healthy nodes. The IP address of the central dispatcher is all that is advertised, but responses come from the nodes directly. The advantages of this are that the dispatcher knows the capabilities of each node, so it can distribute requests proportionately, and it knows the current workload of the nodes through querying them. So nodes won't get swamped, as the dispatcher will redistribute the workload. It also means that you can run different services on different subsets of nodes. The disadvantages are cost and a single point of failure - if your central dispatcher goes down, then the whole cluster is offline. Of course, you can have warm and hot-standby central dispatchers, but this can get very expensive.
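A dispatcher's core decision loop can be sketched as follows. The node table, capacities and health flags are invented for illustration; real dispatchers poll this state from the nodes, but the selection logic - pick the healthy node with the most spare capacity - is the essence of the technique.

```python
# Hypothetical node table, as a central dispatcher might maintain it
# by polling each node for health and load.
nodes = {
    "node1": {"capacity": 4, "load": 0, "healthy": True},
    "node2": {"capacity": 2, "load": 0, "healthy": True},
    "node3": {"capacity": 1, "load": 0, "healthy": False},  # failed node
}

def dispatch():
    """Send the next request to the healthy node with most spare capacity."""
    healthy = {n: s for n, s in nodes.items() if s["healthy"]}
    if not healthy:
        raise RuntimeError("cluster offline")  # the single point of failure
    name = max(healthy, key=lambda n: healthy[n]["capacity"] - healthy[n]["load"])
    nodes[name]["load"] += 1
    return name

# Seven requests: the failed node3 never receives work, and the
# larger node1 takes more of the load than the smaller node2.
assignments = [dispatch() for _ in range(7)]
print(assignments)
```

This also shows why subsets of nodes for different services are easy here: the dispatcher could simply filter the table by service before choosing a node.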

The final method to look at here is negotiated statistical mapping. In this scenario, there is no central dispatcher or single-point of failure. The nodes all negotiate what load they will take and answer requests based on a statistical view of requests. Each node in the cluster will have two IP addresses, one dedicated address for it alone and the common cluster IP address. It is this common address that is advertised to clients requesting connection. In this way, all nodes in the cluster will receive all the packets for the cluster. The node that this request is mapped to will respond and all the others will discard the packet. Mapping can be done by individual IP addresses, subnets, etc. If one node fails, then the cluster will renegotiate the mappings and converge to a new model excluding the failed node - this happens within a couple of seconds. Similarly, when adding a new node, re-convergence is triggered and traffic will be distributed to the new node as well. The advantages of this are that it can be cheap to implement, as you have standard server hardware and there is only specific software to configure and control the cluster, and it has no single point of failure. However, there are disadvantages, namely that one node could get hammered again, as the nodes don't know what requests will actually come in as this is all based on statistical models. Also, it is more difficult to have subsets of nodes for different services.
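The key idea above is that every node applies the same deterministic mapping to every packet, so exactly one node answers without any coordinator. A minimal sketch, assuming a simple hash of the client's source IP over the list of live members (real products use richer statistical models and their own convergence protocol; the addresses here are made up):

```python
import zlib

# Hypothetical cluster membership. Every node sees every packet sent
# to the shared cluster IP and applies the same mapping function.
members = ["nodeA", "nodeB", "nodeC"]

def owner(src_ip, live):
    """Map a client source IP to exactly one live node via a stable hash."""
    idx = zlib.crc32(src_ip.encode()) % len(live)
    return live[idx]

clients = ["198.51.100.7", "198.51.100.8", "198.51.100.9"]
before = {ip: owner(ip, members) for ip in clients}

# nodeB fails: the survivors re-converge on a new mapping that
# excludes it - no central dispatcher is involved.
survivors = [n for n in members if n != "nodeB"]
after = {ip: owner(ip, survivors) for ip in clients}
print(before, after)
```

Because the mapping is fixed in advance rather than driven by measured load, one node can still get hammered if its share of clients happens to send the heavy requests, which is exactly the disadvantage noted above.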


There are two types of clusters to consider for high availability, namely the Shared Resource Cluster and the Network Load Balancing Cluster, which can be used separately or together to provide a complete clustered solution. The selection of which to deploy is based on the requirements of your service. An example is an e-commerce website, which would usually employ both technologies together in a complete solution - the web server farm at the front end will run as an NLB cluster, whilst the backend database that they all access for live data will run on a Shared Resource cluster. Which vendor's technology and implementation you choose will depend on budget, platform and feature requirements.

