Wednesday, January 31, 2007

Faulty FC port meets ZFS

Right now, I'm in a project that is taking a lot of my time and, as such, I am behind on my normal reading. Whenever I'm in this situation, my mailing list subscriptions are usually the first ones to suffer.

Anyway, I just saw this post from the zfs-discuss mailing list at opensolaris.org. It was posted about two months ago, but I think it is worth posting here as well, especially since a lot of my ZFS posts have been getting encouraging levels of page hits.

The post talks about how ZFS caught a situation where the different layers of the I/O stack trusted each other to the point where data was being silently corrupted. Even the administrators trusted their hardware completely. The post says (all emphasis mine):

there was no mirroring or protection at the server level, we had delegated that function to the DMX3500

I've seen this kind of "trust" at some sites. In this case, it turned out to be a bad decision:

As it turns out our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. DMX3500 faithfully wrote the bad data and returned normal ACKs back to the server, thus all our servers reported no storage problems.
...
ZFS was the first one to pick up on the silent corruption

Ah, silent data corruption, the kind of problem that used to keep me up at night. :)
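The reason ZFS can catch this is that it checksums every block end to end, so bad data coming back from the storage fails verification instead of being silently handed to the application. Here is a toy Python sketch of that idea (my own illustration, not ZFS's actual on-disk format):

```python
import hashlib

# Toy model: pair each data block with a checksum at write time,
# and verify the checksum at read time before returning the data.
def write_block(data: bytes):
    return data, hashlib.sha256(data).digest()

def read_block(data: bytes, checksum: bytes) -> bytes:
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("silent corruption detected")
    return data

data, csum = write_block(b"important database page")

# Simulate a flaky FC port flipping one bit on the way to disk.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]

read_block(data, csum)           # the intact block reads back fine
try:
    read_block(corrupted, csum)  # the corrupted block is caught
except IOError as e:
    print(e)                     # → silent corruption detected
```

A mirrored pool goes one step further: on a checksum mismatch, ZFS can read the other copy and repair the bad one, which is exactly the protection the poster had delegated to the DMX3500.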

Tuesday, January 30, 2007

What a pleasant surprise (the power of google)

I use Google Analytics and FeedBurner SiteStats to monitor the traffic and subscriptions of this blog. This site gets very modest traffic, but for the past few months I have noticed that more and more readers are landing on it from a Google search. I also noticed that most of those searches were for silent data corruption, a topic I covered in a previous entry. Currently, that entry is number two in Google's search results for the term silent data corruption.

When I tried this search today, I got this:



No wonder. What a pleasant surprise! :)

Sunday, January 14, 2007

ZFS Compression

This post is the result of a thread on oracle-l. It shows the basics of ZFS compression.

This was done on my single core Sun Ultra 20 workstation running Solaris. The output has been slightly reformatted to make it fit the screen.


Note that this is a very simple case on a machine that is not doing any significant amount of work. Please don't use the information here as the basis of any decision you make, especially for critical systems. Your mileage will most likely vary.

Here's a ZFS pool named p1 whose compression property is turned on.


root@u20# zfs get compression p1
NAME  PROPERTY     VALUE  SOURCE
p1    compression  on     local
root@u20#

Let's create two filesystems under this pool and name them u (for uncompressed) and c (for compressed).

root@u20# zfs create p1/u
root@u20# zfs create p1/c

By default, their compression property is turned on because they inherit it from their parent (p1).

root@u20# zfs get compression p1/u p1/c
NAME  PROPERTY     VALUE  SOURCE
p1/c  compression  on     inherited from p1
p1/u  compression  on     inherited from p1
root@u20#

Let's turn off the compression property for the u (uncompressed) filesystem.

root@u20# zfs set compression=off p1/u

Let's verify that it was actually turned off.

root@u20# zfs get compression p1/u p1/c
NAME  PROPERTY     VALUE  SOURCE
p1/c  compression  on     inherited from p1
p1/u  compression  off    local
root@u20#

Let's get a reasonably sized test file.

root@u20# pwd
/10046/tmp
root@u20# ls -l
total 1589543
-r--r--r-- 1 jforonda staff 2962342454 Sep 11 2004 test_ora_17255_t.trc
root@u20#

There... that file is an Oracle extended SQL trace file that is around 2.8G in size.

Let's make two copies of this file. First, to the uncompressed filesystem:

root@u20# ptime cp /10046/tmp/test_ora_17255_t.trc /p1/u
real 1:20.604
user 0.002
sys 14.613
root@u20#

Then, to the compressed filesystem:

root@u20# ptime cp /10046/tmp/test_ora_17255_t.trc /p1/c
real 1:20.228
user 0.002
sys 12.637
root@u20#

Before going further, note the ptime output for each copy. The difference between the two is less than half a second out of roughly 80 seconds per copy. Also note that the copy to the compressed filesystem was actually slightly faster. Hmm... interesting.

Doing an ls on the two files will show that they have exactly the same number of bytes.

root@u20# ls -l /p1/[uc]/*
-r--r--r-- 1 root root 2962342454 Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 2962342454 Jan 13 14:21 /p1/u/test_ora_17255_t.trc
root@u20#

And in fact, the Solaris digest command tells us that they have exactly the same contents.

root@u20# digest -a md5 -v /p1/[uc]/*
md5 (/p1/c/test_ora_17255_t.trc) = 97f86fcfdfc3f21a68ffc1892a945e77
md5 (/p1/u/test_ora_17255_t.trc) = 97f86fcfdfc3f21a68ffc1892a945e77
root@u20#

But the amount of space they occupy on disk is not the same. The file on the compressed filesystem takes up a bit over a quarter of the space of the file on the uncompressed filesystem.

root@u20# du -sh /p1/[uc]/*
776M /p1/c/test_ora_17255_t.trc
2.8G /p1/u/test_ora_17255_t.trc
root@u20#

A better way to see the compression ratio is to use the zfs get command:

root@u20# zfs get compressratio p1/u p1/c
NAME  PROPERTY       VALUE  SOURCE
p1/c  compressratio  3.64x  -
p1/u  compressratio  1.00x  -
root@u20#
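As a quick sanity check, the 3.64x figure agrees with the earlier ls and du outputs. A small Python calculation, with the byte counts copied from those listings (du's 776M is rounded, so the match is only approximate):

```python
logical = 2962342454             # file size in bytes, from ls -l
on_disk = 776 * 1024 * 1024      # the 776M reported by du, in bytes
ratio = logical / on_disk
print(round(ratio, 2))           # → 3.64
```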

Applications don't have to know that a file is compressed -- ZFS compresses and decompresses on the fly. Applications can read the files normally:

root@u20# tail -5 /p1/u/test_ora_17255_t.trc
END OF STMT
PARSE #21:c=0,e=123,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621053
BINDS #21:
EXEC #21:c=0,e=259,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621393
EXEC #3:c=0,e=3416,p=0,cr=1,cu=3,mis=0,r=1,dep=0,og=4,tim=740525622047
root@u20#

root@u20# tail -5 /p1/c/test_ora_17255_t.trc
END OF STMT
PARSE #21:c=0,e=123,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621053
BINDS #21:
EXEC #21:c=0,e=259,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621393
EXEC #3:c=0,e=3416,p=0,cr=1,cu=3,mis=0,r=1,dep=0,og=4,tim=740525622047
root@u20#
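The same read-through-a-compressing-layer behavior can be sketched with Python's gzip module. This is just an analogy I'm drawing, not how ZFS is implemented: the application writes and reads plain text while the layer underneath compresses and decompresses on the fly.

```python
import gzip
import os
import tempfile

line = "EXEC #3:c=0,e=3416,p=0,cr=1,cu=3,mis=0,r=1,dep=0,og=4\n"
path = os.path.join(tempfile.mkdtemp(), "trace.gz")

# The "filesystem" compresses on write...
with gzip.open(path, "wt") as f:
    f.write(line * 1000)

# ...and decompresses on read; the application sees the original text.
with gzip.open(path, "rt") as f:
    assert f.read() == line * 1000

print(os.path.getsize(path))     # far smaller than the 1000 * len(line) logical bytes
```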

Now, how good is the default ZFS compression compared to something like gzip? Let's find out by compressing the file that is in the uncompressed filesystem.

root@u20# ptime gzip /p1/u/test_ora_17255_t.trc
real 1:27.206
user 1:19.614
sys 2.985
root@u20#

As expected, gzip achieves a much better compression ratio.

root@u20# ls -hl /p1/[uc]/*
-r--r--r-- 1 root root 2.8G Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 99M Jan 13 14:21 /p1/u/test_ora_17255_t.trc.gz
root@u20#

root@u20# ls -l /p1/[uc]/*
-r--r--r-- 1 root root 2962342454 Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 104023266 Jan 13 14:21 /p1/u/test_ora_17255_t.trc.gz
root@u20#

root@u20# du -sh /p1/[uc]/*
776M /p1/c/test_ora_17255_t.trc
99M /p1/u/test_ora_17255_t.trc.gz
root@u20#

The above shows the following:

Size of original file: 2.8GB
Size of file compressed by ZFS: 776MB
Size of file compressed by gzip: 99MB

Why is this so? Well, programs like gzip can take their time and optimize for compression ratio at the expense of elapsed time. ZFS, on the other hand, is concerned with other things besides compression ratio, so it has to strike the right balance between speed and compression ratio. For the record, the ZFS man page says:

compression=on | off | lzjb


Controls the compression algorithm used for this dataset. There is currently only one algorithm, "lzjb", though this may change in future releases.
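The same ratio-versus-time tradeoff shows up in any general-purpose compressor. A rough illustration with Python's zlib, where levels 1 and 9 stand in for "fast" and "thorough" (the absolute numbers have nothing to do with the lzjb and gzip figures above):

```python
import zlib

# Highly repetitive input, standing in for a SQL trace file.
data = b"PARSE #21:c=0,e=123,p=0,cr=0,cu=0,mis=0,r=0,dep=1\n" * 50000

fast = zlib.compress(data, level=1)   # favors speed
best = zlib.compress(data, level=9)   # favors compression ratio

print(len(data), len(fast), len(best))  # level 9 produces the smallest output
```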

Saturday, January 13, 2007

The day is coming...

This is in reference to my previous entry titled "Is the day coming...".

Sun's ThinGuy says it is coming soon. You can see a demo here.

And yes, it is called win4solaris.

Thursday, January 11, 2007

Is the day coming...

... when you will be able to run Windows on top of Solaris x64, just like you can run Windows on top of Linux using VMWare today? As of today, Solaris is supported only as a guest operating system under VMWare.

If Solaris can be a host operating system to another OS, then it becomes possible to use some Solaris features on the hosted OS. For example, today it is possible to use DTrace on a Linux application running in a Solaris BrandZ zone. In a similar fashion, if one can run Windows on top of Solaris, then Windows can indirectly take advantage of features such as DTrace and ZFS. And with things like resource management, it becomes possible to cap the resources used by that instance of the Windows OS. The thought of being able to do this excites me no end. :)

Anyway, I've always wondered what Sun was going to do in this area. A few years ago, we had a product called SunPCi, a combination of hardware and software. The hardware component was a card that went into a SPARC box and carried an AMD processor. With this hardware/software combination, one could install Windows and some Linux distros on top of Solaris. It was similar to VMWare in many ways. I used SunPCi a lot and I loved it.

When Solaris x64 was announced, I thought that Sun would take the SunPCi software component and make it run on Solaris x64. But that has not happened.

Then, last October, ThinGuy mentioned that Sun is working with Win4Lin. Today, I see this.

Hmmm.... win4solaris, anyone?