Wednesday, January 31, 2007

Faulty FC port meets ZFS

Right now I'm on a project that is taking up a lot of my time, so I am behind on my usual reading. Whenever I'm in this situation, my mailing list subscriptions are usually the first to suffer.

Anyway, I just saw this post from the zfs-discuss mailing list at opensolaris.org. It was posted about two months ago, but I think it is worth repeating here, especially since a lot of my ZFS posts have been getting encouraging levels of page hits.

The post describes how ZFS caught a situation where the different layers of the I/O stack trusted each other so completely that data was being silently corrupted. Even the administrators placed that much trust in their hardware. The post says (all emphasis mine):

there was no mirroring or protection at the server level, we had delegated that function to the DMX3500

I've seen this kind of "trust" at some sites. But it turned out to be a bad decision:

As it turns out our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. DMX3500 faithfully wrote the bad data and returned normal ACKs back to the server, thus all our servers reported no storage problems.
...
ZFS was the first one to pick up on the silent corruption
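For anyone wondering how a filesystem can notice what the array and the servers did not: ZFS keeps a checksum for every block, stored apart from the block itself, and verifies it on each read. Here is a minimal sketch of that idea in Python. It is only an illustration and not how ZFS is actually implemented; the class names `FlakyDisk` and `ChecksummedStore` are made up for this example, and SHA-256 is simply a convenient checksum to demonstrate with.

```python
import hashlib


class FlakyDisk:
    """Hypothetical storage path that ACKs every write, even when a
    faulty link has altered the data on its way to the disk."""

    def __init__(self, corrupt_writes=False):
        self.blocks = {}
        self.corrupt_writes = corrupt_writes

    def write(self, addr, data):
        if self.corrupt_writes and data:
            # Simulate the bad port: flip one bit in the first byte.
            data = bytes([data[0] ^ 0x01]) + data[1:]
        self.blocks[addr] = data
        return True  # always a "normal ACK", corrupted or not

    def read(self, addr):
        return self.blocks[addr]


class ChecksummedStore:
    """Hypothetical layer above the disk that does not trust it."""

    def __init__(self, disk):
        self.disk = disk
        self.checksums = {}  # kept apart from the data blocks

    def write(self, addr, data):
        # Checksum is computed before the data enters the I/O path.
        self.checksums[addr] = hashlib.sha256(data).hexdigest()
        return self.disk.write(addr, data)

    def read(self, addr):
        data = self.disk.read(addr)
        if hashlib.sha256(data).hexdigest() != self.checksums[addr]:
            raise IOError(f"checksum mismatch at block {addr}")
        return data
```

With `FlakyDisk(corrupt_writes=True)` underneath, the write still returns a normal ACK, but the later read raises an error instead of quietly handing back bad data. Without a mirror there is nothing to repair from, so detection is all you get, which matches the situation in the post.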

Ah, silent data corruption, the kind of problem that used to keep me up at night. :)
