I’ve been extolling the virtues of backup a lot recently, which has led to a few interesting conversations (and a lot of people rolling their eyes and going back to talking about climbing).
One of these conversations was with Glyn from the amazing Open Energy Monitor project, who are wrestling with a backup implementation for their forums. Their requirements are very similar to those of many small businesses that rely on hosted solutions, so in this post I’m going to talk about small-scale backup strategies and give some suggestions on how to approach the task.
In a following post I will describe how to set up a simple backup solution using Rsnapshot for machines which don’t have SSH or rsync available.
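A good place to start is figuring out the level of backup you actually need. There are a couple of industry terms which are worth considering here: the Recovery Time Objective (RTO), which is the longest you can tolerate a service being unavailable while it is restored, and the Recovery Point Objective (RPO), which is the largest window of recent data you can afford to lose.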
Once you’ve defined your RTO and RPO for each service you’re aiming to protect, you’re in a good position to decide on your backup and recovery strategy. If you’ve decided that you have an RPO of less than 30 minutes and an RTO of less than 1 hour, the solutions described below aren’t suitable for you. Take some time to look up high availability clusters, live database replication, or a plethora of other options.
The first thing most people do when realising they need a backup is to copy their files to a USB pen drive or another computer. Maybe a week later the files are copied again, overwriting the original backup. 10 minutes later you realise you deleted something really important before you did the second backup. Disaster!
Fortunately most modern backup solutions can be configured to back up only the files which have changed since the previous backup, retaining deleted files until they are older than a specified age.
How many copies of the backups do you need? What are the chances that the backup will fail just when you need it most? I’m pretty paranoid, so even for my home backups I have at least two copies in different locations, and the machine hosting my primary backup uses RAID 1 to duplicate the data onto two disks.
Rsnapshot does some clever hard-linking to give you the appearance of many independent backups without having to store many copies of the same files. In the example below the same directory has been backed up two days in a row (daily.0 is the most recent). Even though file1.txt has been deleted at some point and is missing from the latest backup, an older version is still in the system. If file2.exe and file3.xml haven’t changed they will only be stored on disk once.
backup_root
|- daily.0
|  |- directory1
|     |- file2.exe
|     |- file3.xml
|
|- daily.1
   |- directory1
      |- file1.txt
      |- file2.exe
      |- file3.xml
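To make the hard-linking idea concrete, here is a minimal Python sketch that rebuilds the layout above in a temporary directory. It is not how Rsnapshot itself is implemented; the function names are made up for illustration, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile


def snapshot_with_hard_links(previous: str, current: str) -> None:
    """Mirror the tree under `previous` into `current` using hard links."""
    for dirpath, _dirnames, filenames in os.walk(previous):
        rel = os.path.relpath(dirpath, previous)
        target_dir = os.path.normpath(os.path.join(current, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            # os.link adds a second directory entry for the same data on disk
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))


def demo() -> dict:
    """Build daily.1, link it into daily.0, then drop file1.txt from daily.0."""
    with tempfile.TemporaryDirectory() as root:
        older = os.path.join(root, "daily.1")
        newer = os.path.join(root, "daily.0")
        os.makedirs(os.path.join(older, "directory1"))
        for name in ("file1.txt", "file2.exe", "file3.xml"):
            with open(os.path.join(older, "directory1", name), "w") as fh:
                fh.write("contents of " + name)

        snapshot_with_hard_links(older, newer)
        # Simulate file1.txt having been deleted before the latest backup ran
        os.remove(os.path.join(newer, "directory1", "file1.txt"))

        old_stat = os.stat(os.path.join(older, "directory1", "file2.exe"))
        new_stat = os.stat(os.path.join(newer, "directory1", "file2.exe"))
        return {
            "same_inode": old_stat.st_ino == new_stat.st_ino,
            "link_count": new_stat.st_nlink,
            "file1_in_daily1": os.path.exists(
                os.path.join(older, "directory1", "file1.txt")),
            "file1_in_daily0": os.path.exists(
                os.path.join(newer, "directory1", "file1.txt")),
        }
```

Calling demo() should report that file2.exe in both snapshots shares one inode (so it occupies disk space once), while file1.txt survives only in daily.1.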
Disk space is pretty inexpensive these days so it’s feasible for most small-scale scenarios to retain quite a long history. An example of a reasonable strategy is:
This has raised some eyebrows. Why instigate a strategy for backing up five years’ worth of data? Why retain so many backups for the latest day? It seems like overkill. But a better question is why not? If your business is still going in 5 years and has grown to 100x its original size then shelling out for a couple of disks is the least of your worries. If you do run out of space older backups can be removed, but why do it prematurely?
The space requirements for the above strategy can be calculated pretty easily. Let’s say that we start with a 100GB website, and every day users upload 1GB of new videos.
|Initial data size||100GB|
|Daily change (additions or modifications)||1GB|
In two years the total size of the data will be no more than 2TB. In reality some content will be removed so the final value is likely to be lower. To calculate the backup space required:
|Backups||Number retained||Size per increment||Total|
So the maximum space requirement for the above would be 2 terabytes. That means a total spend of less than £250 for a couple of 3TB disks to run in RAID 1 (another copy is always good!), and most use cases will have more modest requirements.
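The arithmetic behind that estimate can be sketched in a few lines of Python. This is a rough upper bound under stated assumptions: unchanged files are stored once (as with hard-linked snapshots), every day's 1GB is brand new data that is never removed, a year is 365 days, and filesystem overhead is ignored. The function name is made up for illustration.

```python
def backup_space_gb(initial_gb: float, daily_change_gb: float, years: int) -> float:
    """Worst-case space: the initial data plus every daily increment retained."""
    days = years * 365
    return initial_gb + daily_change_gb * days


# 100GB to start with, 1GB of new data per day, five years of history
five_years = backup_space_gb(100, 1, 5)  # 1925 GB, just under 2TB
```

Changing the `years` argument shows how gently the requirement grows, which is why a long retention period is affordable at this scale.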
The backup will most likely contain a lot of sensitive information, so it’s worth taking some time to consider the implications of your backup particularly for:
Encryption helps mitigate the risks for both scenarios (although you then need to think about how to protect your encryption keys or passphrases).
Here you have a couple of options: encrypt the data before transmitting it or encrypt it once it’s been received.
There is a plethora of drive encryption tools out there, and I have worked with and can recommend TrueCrypt for both Linux and Windows platforms. Ubuntu natively supports dm-crypt and LUKS, which makes the setup and management a little easier. If the entire drive is encrypted you’ll need to enter a passphrase or use a key to mount the device each time it’s connected (hint: a YubiKey can solve this issue elegantly), but if the drive is lost or stolen there’s almost no way for the data to be recovered by an attacker, assuming a reasonable encryption algorithm has been used. A word of warning though: the same is true if you forget the passphrase or lose the key.
Another option is to encrypt the archive files rather than the entire disk using GPG.
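As a rough illustration, the sketch below wraps a symmetric (passphrase-based) GPG encryption of a single archive file. It assumes the gpg command-line tool is installed and on the PATH; the function names and the archive file name are hypothetical, and gpg will prompt for the passphrase itself.

```python
import subprocess
from pathlib import Path


def build_gpg_command(archive: Path) -> list[str]:
    """Return the gpg invocation that encrypts `archive` to `archive`.gpg."""
    encrypted = archive.with_name(archive.name + ".gpg")
    return [
        "gpg",
        "--symmetric",             # passphrase-based, no key pair needed
        "--cipher-algo", "AES256",
        "--output", str(encrypted),
        str(archive),
    ]


def encrypt_archive(archive: Path) -> Path:
    """Run gpg on the archive and return the path of the encrypted copy."""
    command = build_gpg_command(archive)
    subprocess.run(command, check=True)  # raises if gpg exits with an error
    return Path(command[command.index("--output") + 1])


# Hypothetical usage:
# encrypt_archive(Path("site-backup.tar.gz"))
```

Remember that the unencrypted archive is still on disk afterwards, so it needs to be removed once the encrypted copy has been verified.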
If you have encrypted the data before transmitting it then half the problem is already solved. However, it is easy to overlook the username and password used to make the connection.
If you have access to a shared hosting instance using FTP (which is unencrypted), check if the provider offers FTPS (FTP over SSL). Similarly, anyone using telnet should look for SSH. Any good hosting service will provide secure access, and it is a good idea to disable the insecure counterparts once you have switched.
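For a sense of what the FTPS side looks like in practice, here is a minimal sketch using Python's standard ftplib module. It assumes the host offers explicit FTPS on the default port; the function name, host, credentials, and paths are placeholders rather than anything from a real setup.

```python
from ftplib import FTP_TLS


def download_over_ftps(host: str, user: str, password: str,
                       remote_path: str, local_path: str) -> None:
    """Fetch one file over FTPS so credentials and data are both encrypted."""
    with FTP_TLS(host) as ftps:
        # login() secures the control connection with TLS before the
        # username and password are sent
        ftps.login(user, password)
        # Switch the data connection to TLS as well; without this call the
        # file contents would still travel in the clear
        ftps.prot_p()
        with open(local_path, "wb") as fh:
            ftps.retrbinary("RETR " + remote_path, fh.write)


# Hypothetical usage:
# download_over_ftps("ftp.example.com", "backupuser", "secret",
#                    "backups/site.tar.gz.gpg", "site.tar.gz.gpg")
```

The call to prot_p() is the step that is easiest to forget: the login is protected either way, but the data channel is only encrypted once it has been requested.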
This is by far the most overlooked aspect of any backup plan. Many people get their backups in a nice encrypted TAR or database dump, sit back and think the task is over, but what use is a backup if it doesn’t get you back up and running? It’s all too easy to omit a critical configuration file, or accidentally dump the data for your database but not the schema, so testing is critical.
The simplest way to test a website backup is to use a fresh machine and build the database and files from the backup and check that the site loads and doesn’t throw errors.
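A couple of the easy mistakes mentioned above (a missing configuration file, a database dump without its schema) can be caught with a quick script before the restored site is even started. The sketch below is only a starting point and not a substitute for loading the site; the function names and file names are hypothetical, and the schema check simply looks for CREATE TABLE statements in a plain-text SQL dump.

```python
import re
import tempfile
from pathlib import Path


def check_restore(restore_dir: Path, required_files: list[str],
                  sql_dump: str) -> list[str]:
    """Return a list of problems found in a restored backup (empty if none)."""
    problems = []
    for rel in required_files:
        if not (restore_dir / rel).is_file():
            problems.append(f"missing file: {rel}")

    dump_path = restore_dir / sql_dump
    if not dump_path.is_file():
        problems.append(f"missing database dump: {sql_dump}")
    else:
        text = dump_path.read_text(errors="replace")
        # A data-only dump has no CREATE TABLE statements
        if not re.search(r"CREATE\s+TABLE", text, re.IGNORECASE):
            problems.append(f"no schema found in dump: {sql_dump}")
    return problems


def example() -> list[str]:
    """Check a deliberately incomplete restore built in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        restore_dir = Path(tmp)
        (restore_dir / "index.php").write_text("<?php echo 'hello'; ?>")
        # Data without schema, and no config/settings.php at all
        (restore_dir / "db.sql").write_text("INSERT INTO users VALUES (1);")
        return check_restore(
            restore_dir, ["index.php", "config/settings.php"], "db.sql")
```

Here example() reports two problems: the missing settings file and the dump that contains data but no schema.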
Be sure to test often though. It’s easy for errors to creep in, or for files to be added to the server without being added to the backup scheme. The subject of regression testing backups and automating the backup verification process is definitely out of the scope of this post!
In the next post I’ll detail how to set up a simple Rsnapshot server to back up one or more remote sites over FTPS.