4.5, admins, and backup/restore

By | February 21, 2007

ZCS 4.5 helps make admins’ jobs easier, in some cases a lot easier. This post discusses advanced search in the admin console, backup and restore in the admin console, backup performance improvements, and good policies for creating a recoverable system.

Advanced Search for Users, Domains, Servers
We’ve added an advanced search capability to the admin console. It includes a search builder similar to the one that has been available in the end user client. We had heard from many customers that they wanted to create complex searches, such as “show me all domains with xyz in them”, “show me all users on server 3”, or “show me all users with last name xyz on server 5”. You can now construct all of these searches and quite a few more, and since the search uses LDAP indexes it’s fast. This is AJAX put to work for admins.

Backup and Restore in the Admin Console
For the ZCS 4.5 Network Edition release we’ve also done a lot to help your success with backup and restore. First, we’ve extended the admin console’s ability to manage the backup and restore process. From the admin console you can now review all the backups, both fulls and incrementals, that exist on your system, and whether they ran successfully or not. You can also initiate an immediate backup. As in previous versions you can restore a single account, either to itself (the same account name) or to a new account. You can also choose whether you want to restore to a particular full or to “now”, applying all the data that is available for that account.

For the next major release we’ll add point-in-time recovery in the admin console. This will enable you to restore any mailbox (or set of mailboxes) to any point in time for which you have backups. For example, you could restore Sally’s mailbox to Sally_restored using data from 3 PM last Tuesday, when she knows she had the key message she needs. Note that point-in-time recovery is already available with the CLI (zmrestore).

Backup Performance Improvements
We’ve also made some big improvements to backup performance on the server in 4.5. The main source of the improvements is an increased ability to issue multiple concurrent I/O requests to the system when copying data. After some experimentation we decided to copy with a pool of worker threads, each of which takes responsibility for the serial copy of a particular file. There are, of course, enough files to copy that this provides as much parallelism as the disk system can handle. If your system has I/O bandwidth to spare while running 4.0’s backups, you will see a decrease in backup time. Our tests showed solid improvements: initial full backups on systems with 1000 users (3 U320 10K RPM disks) took 40% of the time they took with 4.0, and subsequent full backups were even faster, taking only 18% of the time required by the 4.0 code. Your mileage may vary of course, but we think you’ll like it.
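To make the worker-pool idea concrete, here is a rough shell sketch of the same pattern. It is an illustration only, not how ZCS implements backups: the function name parallel_copy is made up for this example, and it relies on the -print0, -0, and -P options found in GNU and BSD find and xargs. Each cp copies one file serially, while xargs keeps a fixed number of copies in flight at once.

```shell
#!/bin/sh
# Illustration only (hypothetical helper, not ZCS code): copy every regular
# file under $1 into $2 using up to $3 concurrent workers (default 4).
parallel_copy() {
    pc_src=$(cd "$1" && pwd) || return 1
    mkdir -p "$2" || return 1
    pc_dest=$(cd "$2" && pwd) || return 1
    pc_workers=${3:-4}

    # Recreate the directory tree first so the workers never race on mkdir.
    (cd "$pc_src" && find . -type d -exec mkdir -p "$pc_dest/{}" \;) || return 1

    # Hand one file at a time to a pool of up to $pc_workers cp processes.
    # Each cp copies its file serially; the parallelism is across files.
    (cd "$pc_src" && find . -type f -print0 |
        xargs -0 -P "$pc_workers" -I{} cp {} "$pc_dest/{}")
}
```

Raising the worker count only helps while the disks have spare I/O bandwidth; past that point the extra workers just queue behind each other.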

Backup Management and Risks
Speaking of backups, you are running them, aren’t you? I’ve been surprised at the support cases we see where customers have either never set up backups or are writing backups onto the same disk (even the same partition) as their data. Remember that there are bugs in every piece of software (filesystem, drivers, firmware, even ZCS), and you need to protect against them as well as against disk failure. Keeping your backup on the same partition as your data leaves you vulnerable, even if you have a RAID backing store.

To check what backup schedule you have, run the following (Note: all commands that follow should be run as the zimbra user):

zmschedulebackup -q

You’ll (hopefully) see something like:

Current Schedule:
f 0 1 * * 0
i 0 1 * * 1-6
d 1m 0 0 * * *

The results show when fulls (f), incrementals (i), and deletions (d) will run, using standard crontab syntax. (In fact, this information is pulled from crontab for the zimbra user; cron invokes zmbackup for all these operations.) If you don’t get any output back, you don’t have backups running! The schedule above shows that fulls run on Sundays at 1 AM and incrementals run on each of the other six days (Monday through Saturday) at 1 AM. It also says that every day at midnight any backups older than 1 month will be purged.
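If you’d like a monitoring script to make this check for you, something along these lines might do as a starting point. The helper name has_full_backup is hypothetical, and it assumes the query output looks like the sample above, with each full backup on a line beginning with f.

```shell
#!/bin/sh
# Hypothetical helper: reads schedule text (in the format shown above) on
# stdin and succeeds only if at least one full-backup ("f") line is present.
has_full_backup() {
    grep -Eq '^[[:space:]]*f[[:space:]]'
}

# Example use as the zimbra user (not executed here):
#   zmschedulebackup -q | has_full_backup || echo "WARNING: no full backup scheduled"
```

Empty output fails the check as well, which is the case that matters most.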

If you don’t have backups configured, or don’t understand what you do have, there is a simple fix:

zmschedulebackup -D

This one command will set the backup schedule and deletion schedule to the default, which is what is shown above. That is all you have to do to make sure you have a reasonable backup schedule! The default schedule should be fine for smaller sites. It will put backups into /opt/zimbra/backup. It’s your job to make sure that is on a different disk and partition than your data.
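One quick way to sanity-check that last point is to compare the mount points that df reports for the two directories. The sketch below is a minimal version under a few assumptions: same_filesystem is a made-up helper name, /opt/zimbra/store stands in for wherever your live data actually is, and mount points are assumed not to contain spaces. It only tells you whether two paths share a filesystem; it can’t tell you whether two different partitions sit on the same physical disk.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if both paths are on the same mounted
# filesystem, judged by the "Mounted on" column of POSIX-format df output.
same_filesystem() {
    sf_a=$(df -P "$1" 2>/dev/null | awk 'NR==2 {print $NF}')
    sf_b=$(df -P "$2" 2>/dev/null | awk 'NR==2 {print $NF}')
    # An empty result means df could not resolve the path; treat as "not same".
    [ -n "$sf_a" ] && [ "$sf_a" = "$sf_b" ]
}

# Example (the store path is an assumption about your layout; not executed here):
#   same_filesystem /opt/zimbra/backup /opt/zimbra/store \
#       && echo "WARNING: backups share a filesystem with live data"
```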

Custom Schedules
You can use zmschedulebackup if you do need to set up a different schedule. Putting 3 cron lines into a single command line can be a little messy, so you may want to dump the schedule to a file, edit the file, and then copy-paste the desired schedule into the command.

zmschedulebackup -s > /tmp/sched.txt
vi /tmp/sched.txt

sched.txt will have something like

f "0 1 * * 6" i "0 1 * * 0-5" d 1m "0 0 * * *"

The cron timing follows the f/i/d letter. Deletion (d) is a little different: the age of backups to preserve sits between the d and the cron-style time to run. Once in the editor, make your desired change. For example, to keep only backups younger than 8 days, change the “1m” to “8d”. Then copy-paste your file’s contents into a zmschedulebackup -R command (R for replace existing schedule):

zmschedulebackup -R  f "0 1 * * 6" i "0 1 * * 0-5" d 8d "0 0 * * *"
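If you’d rather not open an editor for a one-field change, the same edit can be sketched with sed. The helper name set_retention is hypothetical, and it assumes the one-line schedule format shown in sched.txt above, where the retention age is a number followed by a single letter right after the d.

```shell
#!/bin/sh
# Hypothetical helper: rewrites the retention age that follows the "d" in a
# one-line schedule, leaving the cron timings untouched.
# $1 = new age (for example 8d); the schedule text is read from stdin.
set_retention() {
    sed -E "s/(^|[[:space:]])d[[:space:]]+[0-9]+[a-z]/\1d $1/"
}

# Example (not executed here):
#   set_retention 8d < /tmp/sched.txt
```

The function only prints the edited schedule; you still review it and paste it after zmschedulebackup -R yourself.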

You can use the -A (append) option to add more timings to the backup schedule, building up as complex a ruleset as you need. There’s a wiki page describing zmschedulebackup if you’d like to learn more.

Disk Layout and Filesystems
As I mentioned earlier, we sometimes see systems with all the data, redologs, and backups on one disk or one partition. A full discussion of disk layout would take more than a blog post, but in the context of reliability here are a few quick tips:

  • Put your redologs (/opt/zimbra/redolog) on a different disk and partition than your live data (mail store, indexes, and MySQL data). If you don’t, and you lose both the live data and the redologs, the latest point you can restore to is your most recent backup (full or incremental) that wasn’t also lost. That means data loss. Now consider the cases where the redologs and the live data are separate. If you lose the live data, then by using backups and redologs you will be able to restore to the point in time of the crash. If you lose just the redologs, the server will halt immediately. In that case you will probably need to call support when the server comes back up to check for any MySQL/filesystem inconsistencies, but you should not lose any data.

  • You can put your redologs and backups on the same disk and partition, provided you also move the full backups somewhere else. This is not the best for performance, but it’s OK for reliability.

  • We run ext3 in data=ordered mode for the vast majority of our testing. There’s a good article (based on the 2.4 kernel, and courtesy of IBM) on ext3 here. Going forward I would like to see us do more testing in this area, but for now this is a safe path with reasonable performance.

We’d like to make backup and restore as easy as possible for admins. If you have an idea for an improvement please either drop us a note or file an enhancement request.

