Skip to main content

The 5 Restoration Phases of a Secure and Dependable System

We all want our systems to be secure and dependable, indeed the two topics are interlinked. Dependability requires high availability management, which has several aspects to it. We can try to achieve Fault Avoidance, with fault prevention and fault removal, but this isn't actually possible in all cases. For example, hard disk drives will have physical wear out due to moving parts, power supplies do not run indefinitely, etc. Therefore, we move towards Fault Acceptance. Fault acceptance relies on fault forecasting, to try to determine the most likely causes of faults, and fault tolerance to enable the system to continue functioning in the event of a fault. With fault tolerance we build redundancy into the system so that faults do not result in system failures. However, there are times when even our most fault tolerant systems will fail. What do we do then? Well, obviously we need to recover as quickly as possible.

The 5 restoration phases of a system are as follows:
  1. Diagnostic Phase - find the fault, diagnose the problem and determine the appropriate course of action to recover
  2. Procurement Phase - identify, locate, transport and physically assemble replacement hardware, software and backup media
  3. Base Provisioning Phase - configure the system hardware and install the base Operating System (OS)
  4. Restoration Phase - restore the entire system from the backup media, including system files and user data
  5. Verification Phase - verify the correct functionality of the entire system as well as the integrity of the user data
It can sometimes be quite hard to diagnose the actual root cause of a fault, as certain faults will sometimes show up in confusing ways. A few months ago I was investigating a problem with a machine that I was told was an OS problem - "Windows keeps blue screening with errors; bl***y Microsoft!" However, on further investigation, it had absolutely nothing to do with the OS and therefore Microsoft. Actually, there was a memory fault. One bank of RAM was faulty and was causing so many errors that the OS couldn't recover. Simply changing the pair of RAM modules sorted the problem out and the machine has been running reliably since. The problem with this type of fault is that it manifests in such a way as to look like a different fault.
The procurement phase can be tricky, as it takes a long time to get components delivered. This is where fault forecasting come in. If we know the most likely faults, then we can keep stock of those hardware components and make sure that any system recovery media is at hand. Obviously, we need backup media to be stored off site as well, but we will need copies on site for quick restoration when the building hasn't suffered damage. Of course, many hardware vendors will offer Service Level Agreements (SLA) as to how quickly they can replace or repair your hardware, but some things you may want to deal with in-house as it will still be quicker than the standard 4 hour fix (or longer).
The final three phases are all centred around restoration of data from backup media. This brings about the point that you must backup your system, not just your data. How long will it take you to install the OS from scratch, install all the additional services and make all the configuration changes? This will take too long. If you have backed up your system state, then this can be restored onto a base OS very much more quickly and without making mistakes or omissions. Another aspect to think about is how have you backed your system up? You need to choose a backup scenario that suits the amount of data and the speed at which you need to recover. Remember that a server with 200GB of data backed up onto a DLT tape drive that supports an average transfer rate of 5MBps, will take over 11 hours to restore, assuming that you restore from the full backup only. Any incremental backups taken since the last full backup will also have to be restored. OK, we can go faster than this, even with tape backup, but the solution needs to fit your system and a live backup solution may be required. This is where virtualisation of servers can help dramatically. Virtualisation is more often sold as 'green IT', being more efficient and cheaper. However, a major benefit is the ability to snapshot running machines and redeploy them in seconds. If one hardware box fails, you can migrate the virtual machine onto another box until the first is fixed. This can take just a few seconds or minutes, depending on the architecture of your solution.
Availability is inversely proportional to the total downtime in the period covered and is usually expressed as a percentage, e.g. 99.9% availability, which equates to around 8 hours 45 minutes downtime per annum. The downtime is the sum of all outages in that period. Therefore, we need to decrease the frequency and length of those outages. The frequency is reduced by building fault tolerant systems and the length by having good restoration policies and practices. It is vital that IT staff know the restoration policy and have practiced it. You need to set out a clear timeline of what has to happen during restoration. Certain systems and services will need to be restored first. Your database server may be the most important server to you, but without the network, DNS, DHCP and directory services no other machines will be able to connect to your server anyway and it may rely on some of those services during start up. Also, don't leave it until you are having to restore a failed system to see if the policy works. You must practice and test the policy to make sure that it does.

Comments

  1. I also had that problem hard disk drives have physical wear out due to moving parts, power supplies do not run indefinitely, but now I think this post will help me out to shortout my problems next time.

    ReplyDelete
  2. Glad it might help. Incidentally, I have another post called ‘How Reliable is RAID?’, which shows you how to calculate the reliability of your RAID solution and predict how many disk failures and system failures you are likely to suffer in a given period.

    ReplyDelete

Post a Comment

Popular Posts

Trusteer or no trust 'ere...

...that is the question. Well, I've had more of a look into Trusteer's Rapport, and it seems that my fears were justified. There are many security professionals out there who are claiming that this is 'snake oil' - marketing hype for something that isn't possible. Trusteer's Rapport gives security 'guaranteed' even if your machine is infected with malware according to their marketing department. Now any security professional worth his salt will tell you that this is rubbish and you should run a mile from claims like this. Anyway, I will try to address a few questions I raised in my last post about this. Firstly, I was correct in my assumption that Rapport requires a list of the servers that you wish to communicate with; it contacts a secure DNS server, which has a list already in it. This is how it switches from a phishing site to the legitimate site silently in the background. I have yet to fully investigate the security of this DNS, however, as most

Web Hosting Security Policy & Guidelines

I have seen so many websites hosted and developed insecurely that I have often thought I should write a guide of sorts for those wanting to commission a new website. Now I have have actually been asked to develop a web hosting security policy and a set of guidelines to give to project managers for dissemination to developers and hosting providers. So, I thought I would share some of my advice here. Before I do, though, I have to answer why we need this policy in the first place? There are many types of attack on websites, but these can be broadly categorised as follows: Denial of Service (DoS), Defacement and Data Breaches/Information Stealing. Data breaches and defacements hurt businesses' reputations and customer confidence as well as having direct financial impacts. But surely any hosting provider or solution developer will have these standards in place, yes? Well, in my experience the answer is no. It is true that they are mostly common sense and most providers will conform

Trusteer's Response to Issues with Rapport

I have been getting a lot of hits on this blog relating to Trusteer's Rapport, so I thought I would take a better look at the product. During my investigations, I was able to log keystrokes on a Windows 7 machine whilst accessing NatWest. However, the cause is as yet unknown as Rapport should be secure against this keylogger, so I'm not going to share the details here yet (there will be a video once Trusteer are happy there is no further threat). I have had quite a dialogue with Trusteer over this potential problem and can report that their guys are pretty switched on, they picked up on this very quickly and are taking it extremely seriously. They are also realistic about all security products and have many layers of security in place within their own product. No security product is 100% secure - it can't be. The best measure of a product, in my opinion, is the company's response to potential problems. I have to admit that Trusteer have been exemplary here. Why do I