Sunday, June 04, 2006

Silent data corruption

Below is a slightly modified version of a post that I made to my high school batch mailing list. We were having a discussion about OS vulnerabilities and the importance of making backups.

--------------------------------------

Years ago, I was a full-time database administrator. If there is an unforgivable sin in database administration, it is losing data. Slow database performance is okay... that's fixable. But if I lose data, I should probably make sure that my resume is up to date. :)

And so I tried to learn everything I could about database backup and data protection: full backups, incremental backups, cold backups, hot backups, replication, standby databases, mirroring disks, breaking the mirrors, hardware RAID, software RAID, volume managers, hot spares, etc. I studied and practiced everything until I was satisfied that I knew what I needed to know. But not only that, I also made sure that I knew how to recover in case of disaster. I would occasionally take some of my backups and try a restore, just to make sure that I could really recover should there be a need to do so. To put things in perspective, I was dealing with a database approaching 2 TB in size, so this was not a simple backup/recovery scenario.

But as one of us pointed out, disks fail. If one of my disks failed, my hope was that it would fail in an unmistakable way, such as a hard I/O error that the rest of the system could see, so that the safeguards we had put in place (like hot spares) would kick in. That kind of failure we could deal with, and we would know how to fix it.

But what kept me up at night was the thought that at some point my data might have experienced what we call silent corruption. This can happen with hard disks, but it does not stop there. Silent corruption can also come from a failing disk controller; device drivers can have bugs; and disk firmware, too, can be a source of problems. If these errors resulted in a clear I/O error, that would be the best thing that could happen, because then we would know there is a problem and could deal with it. But what if the error goes undetected? And what if it stays undetected for a long period of time?
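
To make this concrete, here is a small sketch (in Python, purely for illustration; the file name and block size are made up) of why an end-to-end checksum is the only thing that notices this kind of error: the corrupted read itself "succeeds", so nothing else complains.

    import hashlib

    BLOCK_SIZE = 8192  # arbitrary block size for this sketch

    def checksum(data):
        return hashlib.sha256(data).digest()

    # Write a block to a file together with its checksum.
    block = b"important record".ljust(BLOCK_SIZE, b"\x00")
    with open("datafile", "wb") as f:
        f.write(block)
        f.write(checksum(block))

    # Simulate silent corruption: flip one bit inside the block.
    # Neither the disk nor the OS reports any error for this.
    with open("datafile", "r+b") as f:
        f.seek(100)
        b0 = f.read(1)[0]
        f.seek(100)
        f.write(bytes([b0 ^ 0x01]))

    # The read below also "succeeds" -- only the checksum mismatch
    # tells us the data is no longer what we wrote.
    with open("datafile", "rb") as f:
        data = f.read(BLOCK_SIZE)
        stored = f.read(32)
    if checksum(data) != stored:
        print("checksum mismatch: silent corruption detected")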

So if my database suffered a silent corruption, how would I know when it happened? How would I know which of my backups is good? A silent corruption could mean that all my backups are useless, because the problem may be older than my oldest backup. Sure, there are database utilities that verify a database for inconsistencies, but when does one run them against a 2 TB database? Before every backup? How long would that take? Would it fit into the backup window?

Anyway, enough about databases. The problem of silent data corruption is obviously not unique to databases; it can happen to any type of data.

So how does one safeguard against this type of problem? There could be other ways, but one way is to use Solaris with ZFS. :-) ZFS has self-healing features (look at slides 13-15). There's also a screencast demo here.
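
To give a feel for what "self-healing" means (a toy model in Python, not how ZFS is actually implemented): ZFS stores each block's checksum in the block's parent, so on a mirrored pool it can tell which copy of a block is good, return that copy, and quietly repair the bad one.

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).digest()

    def self_healing_read(copies, expected):
        """Toy model of a self-healing mirrored read.

        `copies` stands in for the sides of a mirror; `expected` is the
        block's checksum as stored in its parent block pointer.
        """
        for data in copies:
            if checksum(data) == expected:
                # Found a good copy: repair any bad copies in place.
                for other in copies:
                    if checksum(other) != expected:
                        other[:] = data
                return bytes(data)
        raise IOError("all mirror copies are corrupt")

    good = b"hello, world"
    expected = checksum(good)
    # Side 0 has been silently corrupted; side 1 is intact.
    mirror = [bytearray(b"hellX, world"), bytearray(good)]

    assert self_healing_read(mirror, expected) == good
    assert bytes(mirror[0]) == good  # the bad copy was repaired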

Some web sites, like Strongspace, are already using it even before its official release.

How stable is it, considering that it is not even officially released? Read this blog post from the lead test engineer:

Testing ZFS reminds me of when I was working for IBM at NASA on the Space Shuttle Flight Software Project....
And here's one more from another test engineer.
All in all, this means that we put our code through more abuse in about 20 seconds than most users will see in a lifetime. We run this test suite every night as part of our nightly testing, as well as when we're developing code in our private workspaces. And it obeys Bill's first law of test suites: the effectiveness of a test suite is directly proportional to its ease of use. We simply just type "ztest" and off it goes.

For those who want to learn more, here is a good starting point. The ZFS home page is here.

3 comments:

Alex Gorbachev said...

Thanks for the nice ZFS links.
Regarding databases, and specifically Oracle: by using RMAN to back up your files, you can check block consistency at the same time. You can even run RMAN on your split-mirror backups (that's what we do, for example).
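For example, something along these lines (syntax from memory - check the RMAN documentation for your release):

    RMAN> BACKUP CHECK LOGICAL DATABASE;
    # reads every block while backing it up; any corrupt blocks
    # found are recorded in V$DATABASE_BLOCK_CORRUPTION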

Alex Gorbachev said...

Heh... just read the presentation you referenced. It's great! Thanks!
Just a few comments:
"Simple administration
● “You're going to put a lot of people out of work.”
– Jarod Jenson, ZFS beta customer"

It reminds me of Oracle's marketing about the self-managing 10g database putting all DBAs out of their jobs... well, the reality is quite the opposite! ;-)

I also didn't get the slide on RAID-Z - the description is quite puzzling.

jforonda said...

Hi Alex,

I'm glad to hear that you liked the slides/pointers. I don't have time to talk about RAID-Z right now... perhaps I will write something like a "simplified RAID-Z" post sometime next week or so, as I am quite busy this week.

Regarding the "putting lots of people out of work" comment by Jarod, maybe he was referring to the simplicity and reliability of ZFS. I've worked with several volume managers, and ZFS is far simpler than any of them.

I also plan to make a post about ZFS features that may be of interest to DBAs. My only problem is where to get the time to write. :)

BTW, Solaris 10 Update 2 just came out last week, and it now includes ZFS. This means that ZFS is now fully supported. Before Update 2, ZFS was available only on OpenSolaris, where it was first released sometime in November of last year.


James