Below is a slightly modified version of a post that I made to my high school batch mailing list. We were having a discussion about OS vulnerabilities and the importance of making backups.
--------------------------------------
Years ago, I was a full-time database administrator. If there is an unforgivable sin in database administration, it is losing data. Database performance may be slow... that's okay, that's fixable. But if I lose data, I should probably be making sure that my resume is up to date. :)
And so I tried to learn everything I could about database backup and data protection. Full backup, incremental backup, cold backup, hot backup, replication, standby databases, mirroring disks, breaking the mirrors, hardware RAID, software RAID, volume managers, hot spares, etc. I studied and practiced everything until I was satisfied that I knew what I needed to know. But not only that, I also made sure that I knew how to recover in case of disaster. I would occasionally take one of my backups and try to do a restore, just to make sure that I could really recover should the need arise. To put things in perspective, I was dealing with a database that was approaching 2 terabytes in size, so this was not a simple backup/recovery scenario.
But as one of us pointed out, disks fail. If one of my disks should fail, my hope was that it would fail in an unmistakable way. I hoped that the other parts of the system would see a clear signal (like an I/O failure) so that the safeguards we had put in place (like hot spares) would kick in. If this kind of failure happened, we could deal with it and we would know how to fix it.
But what kept me up at night was the thought that, at some point in time, my data might have experienced what we call silent corruption. This can happen with hard disks. But it does not stop there. Silent corruption can also happen when you have a hard disk controller failure. I can also imagine that there could be bugs in device drivers. I also know that some disks have firmware and that, too, can be a source of problems. Again, if these errors result in a clear I/O error, then that's the best thing that could ever happen because then we would know that there is a problem and we could deal with it. But what if it is an undetected error? And what if it goes undetected for a long period of time?
So if my database had a silent corruption, how do I know when it happened? How do I know which of my backups is good? A silent corruption could mean that all my backups are useless because it is possible that the problem is older than my oldest backup. Sure, there are database utilities to verify a database for inconsistencies but when does one run these utilities against a 2TB database? Before I do a backup? How long will that take? Will it fit into the backup window?
Anyway, enough with databases. The problem with silent data corruption is obviously not unique to databases. It is a problem that can happen with any type of data.
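To make the idea concrete, here is a minimal Python sketch of one generic way to notice silent corruption in any kind of data: record a checksum while the data is known to be good, then recompute and compare later. The file names and helper functions are made up for illustration, and this is not how any particular database utility works.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a chunk of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict) -> dict:
    """Record a checksum for every file while the data is known to be good."""
    return {name: checksum(data) for name, data in files.items()}


def find_corrupt(files: dict, manifest: dict) -> list:
    """Return the names whose current contents no longer match the manifest."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in files or checksum(files[name]) != digest
    )


# Hypothetical data set: names and contents are invented for this example.
files = {"users.dat": b"alice,bob", "orders.dat": b"order-1,order-2"}
manifest = build_manifest(files)

# Simulate silent corruption: flip a single bit without raising any I/O error.
damaged = bytearray(files["orders.dat"])
damaged[0] ^= 0x01
files["orders.dat"] = bytes(damaged)

print(find_corrupt(files, manifest))
```

The catch, of course, is the same one I worried about with the 2TB database: somebody has to store the checksums somewhere safe and actually run the comparison, and reading everything back takes time.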
So how does one safeguard against this type of problem? There could be other ways but one way is to use Solaris with ZFS. :-) ZFS has self-healing features (look at slides 13-15). There's also a screencast demo here. Some web sites like Strongspace are already using it even before its official release.
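For those who like to see the idea in code, here is a toy Python model of what self-healing on a mirror means. It is only a sketch under my own assumptions, not ZFS code: the `MirroredBlock` class is hypothetical, and all it shares with the real thing is the principle that a checksum is kept apart from the data, every read is verified against it, and a bad copy is repaired from a good one.

```python
import hashlib


class MirroredBlock:
    """Toy two-way mirror: two copies of the data plus one trusted checksum."""

    def __init__(self, data: bytes):
        # The checksum is stored separately from the copies it protects.
        self.expected = hashlib.sha256(data).digest()
        self.copies = [bytearray(data), bytearray(data)]

    def _is_good(self, copy: bytearray) -> bool:
        return hashlib.sha256(copy).digest() == self.expected

    def read(self) -> bytes:
        """Return verified data, repairing any copy that fails its checksum."""
        for copy in self.copies:
            if self._is_good(copy):
                good = bytes(copy)
                # Overwrite every bad copy with the verified one.
                for i, other in enumerate(self.copies):
                    if not self._is_good(other):
                        self.copies[i] = bytearray(good)
                return good
        # No copy matches the checksum: fail loudly instead of returning garbage.
        raise IOError("all copies failed checksum verification")


block = MirroredBlock(b"payroll record 42")
block.copies[0][0] ^= 0x01  # simulate silent corruption on one side of the mirror
print(block.read())
print(bytes(block.copies[0]) == bytes(block.copies[1]))
```

The point of the sketch is that the corruption is caught on an ordinary read, so the problem does not get a chance to sit unnoticed until it is older than the oldest backup.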
How stable is it considering that it is not even officially released? Read this blog post from the lead test engineer:
Testing ZFS reminds me of when I was working for IBM at NASA on the Space Shuttle Flight Software Project....
And here's one more from another test engineer:
All in all, this means that we put our code through more abuse in about 20 seconds than most users will see in a lifetime. We run this test suite every night as part of our nightly testing, as well as when we're developing code in our private workspaces. And it obeys Bill's first law of test suites: the effectiveness of a test suite is directly proportional to its ease of use. We simply just type "ztest" and off it goes.
For those who want to learn more, here is a good starting point. The ZFS home page is here.