Friday, June 16, 2006

What's wrong with this picture?

Every now and then, I get pulled into computer performance improvement projects. In a lot of places I've been to, some people try to solve performance issues by throwing hardware at the problem. Most of the time, this does not solve their original problem. Sometimes, it even makes things worse.

What I usually see is that computer systems are doing too many unnecessary things. They are bogged down by doing things that they are not supposed to be doing in the first place. When I get into this situation, I focus on unnecessary workload reduction. In most cases, reducing or removing unnecessary workload solves the problem.

So when I saw this, my first reaction was "Why?".

At Sun, we take a different approach. We use energy efficient servers and desktops. The result is that not only do we save power to run our computers, we also don't need as much power to cool the space.

Unnecessary power consumption reduction.

Wednesday, June 07, 2006

Clearing some misconceptions about Sun and Solaris

I've been following some threads on the Oracle-L list recently and I noticed that some members of the Oracle-L community have misconceptions about Sun and Solaris. I made some posts to set the record straight. Let me just summarize them here:

1. In a thread titled "Anyone used 10g Release 2 (10.2.0.1.0) for Solaris Operating System (x86-64)", a member of the list asked:

I have been told I may have to install and support Oracle Database 10g Release 2 (10.2.0.1.0) for Solaris Operating System (x86-64). I have plenty of experience with Solaris for Sparc but the database for the x86 version (64 bit) was just released in March I believe. I am curious about the experiences of anyone that may have this configuration. Any information would be greatly appreciated.
and one of the replies was:

You're paying the cost of using exotic platforms. I sincerely doubt that anybody will be able to respond to your question.

This surprised me because I know that late last year, Oracle chose the Solaris 10 Operating System as its preferred open source 64-bit development and deployment environment. So I replied and provided a link to the press release.
I also pointed out that Solaris is open source because I get the impression that to a lot of people, "open source = Linux" and that Linux is the only Unix-like OS that can run on x86/64 hardware.

The original question also implies that some people think that Solaris 10 SPARC and Solaris 10 x86/64 are entirely different products. The fact is that both are built from the same source tree and so experience with Solaris SPARC will carry over nicely to the x86/64 platform. I know this for a fact because I use both Solaris SPARC and Solaris x86/64.

2. In another thread with the subject "Oracle's relationships with expert DBAs (and the rest of us mere mortals)", another member of Oracle-L said:

Also, I'd suggest you replace those 20 Sun servers with a couple of Opteron-loaded standard boxes. Just for your main production environment, mind you. Might be easier to handle a couple of standard PC boxes than all those refrigerator-size monsters out in the big, cold room.

I understand the reference to "refrigerator-size monsters" for it is true that Sun has some of these high-end machines and I have used some of them in the past. So I pointed out that maybe the person had this kind of machine in mind. But if, for whatever reason, one prefers Opteron-based machines or smaller, less expensive, and cooler running machines in general, then Sun has its Galaxy (from $745) and CoolThreads (from $2,995) lines of servers.

My experience is not unique. Sun's new CEO, Jonathan Schwartz, recently said in his blog:

I was with a big potential customer yesterday - in the Fortune 100. After a day of briefings from our technical folks, I joined the meeting to see how we were doing. I asked him and his team how much of what they'd seen was new to them.

He said, "about 70% was a complete surprise."

Ouch. That's not good.

To test, I asked, "before today, did you know that Solaris was open source, or ran on Dell, HP and IBM hardware, not just Sun's?" "Nope."

And like I said, this was a Fortune 100 opportunity.



Sunday, June 04, 2006

Silent data corruption

Below is a slightly modified version of a post that I made to my high school batch mailing list. We were having a discussion about OS vulnerabilities and the importance of making backups.

--------------------------------------

Years ago, I was a full-time database administrator. If there is an unforgivable sin in database administration, it is losing data. Database performance may be slow... that's okay, that's fixable. But if I lose data, I should probably be making sure that my resume is up to date. :)

And so I tried to learn everything I could about database backup and data protection. Full backup, incremental backup, cold backup, hot backup, replication, standby databases, mirroring disks, breaking the mirrors, hardware RAID, software RAID, volume managers, hot spares, etc. I made sure I studied and practiced everything until I was satisfied that I knew what I needed to know. But not only that, I also made sure that I knew how to recover in case of disaster. I would occasionally take some of my backups and try to do a restore just to make sure that I could really recover should there be a need to do so. To put things in perspective, I was dealing with a database that was approaching 2 terabytes in size, so this was not a simple backup/recovery scenario.

But as one of us pointed out, disks fail. If one of my disks should fail, my hope was that it would fail in an unmistakable way. I hoped that if a disk failed, the other parts of the system would know in an unmistakable way (like an I/O failure) such that the safeguards that we put in place (like hot spares) would kick in. If this kind of failure happened, we could deal with it and we would know how to fix it.

But what kept me up at night was the thought that at some point in time, my data might have experienced what we call silent corruption. This can happen with hard disks. But it does not stop there. Silent corruption can also happen when you have a hard disk controller failure. I can also imagine that there could be bugs in device drivers. I also know that some disks have firmware and that, too, can be a source of problems. Again, if these errors result in a clear I/O error, then that's the best thing that could ever happen because then, we would know that there is a problem and we could deal with it. But what if it is an undetected error? And what if it goes undetected for a long period of time?

So if my database had a silent corruption, how do I know when it happened? How do I know which of my backups is good? A silent corruption could mean that all my backups are useless because it is possible that the problem is older than my oldest backup. Sure, there are database utilities to verify a database for inconsistencies but when does one run these utilities against a 2TB database? Before I do a backup? How long will that take? Will it fit into the backup window?

Anyway, enough with databases. The problem with silent data corruption is obviously not unique to databases. It is a problem that can happen with any type of data.

So how does one safeguard against this type of problem? There could be other ways but one way is to use Solaris with ZFS. :-) ZFS has self healing features (look at slides 13-15). There's also a screencast demo here.
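
To make the self healing idea a bit more concrete, here is a toy sketch in Python. It is only an illustration of the general technique (keep a checksum for every block, verify it on every read, and repair a bad copy from a good mirror copy), not how ZFS is actually implemented. The class name MirroredStore, the use of SHA-256, and the in-memory "disks" are all my own assumptions for the example.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Digest used to detect corruption (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).digest()


class MirroredStore:
    """Toy mirrored block store. Hypothetical, for illustration only.

    Each "disk" is a plain dict of block_id -> bytes. Checksums are kept
    apart from the data so that a corrupted block cannot vouch for itself.
    """

    def __init__(self, copies: int = 2):
        self.disks = [dict() for _ in range(copies)]
        self.checksums = {}

    def write(self, block_id: int, data: bytes) -> None:
        # Record the checksum, then write the same data to every mirror.
        self.checksums[block_id] = checksum(data)
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        good = None
        bad_disks = []
        # Verify every copy against the stored checksum.
        for disk in self.disks:
            data = disk[block_id]
            if checksum(data) == expected:
                if good is None:
                    good = data
            else:
                bad_disks.append(disk)
        if good is None:
            # No copy verifies: fail loudly instead of returning bad data.
            raise OSError(f"block {block_id}: all copies failed checksum")
        # "Self healing": overwrite corrupted copies with the good one.
        for disk in bad_disks:
            disk[block_id] = good
        return good


store = MirroredStore()
store.write(0, b"payroll record")

# Simulate silent corruption on the first disk: no I/O error is raised,
# the bytes are simply wrong.
store.disks[0][0] = b"payroll recorc"

# The read still returns the original data and repairs the bad copy.
assert store.read(0) == b"payroll record"
assert store.disks[0][0] == b"payroll record"
```

The point of the sketch is that the corruption is no longer silent: either a read returns data that matches its checksum, or it fails in an unmistakable way, which is exactly the kind of failure I said I could deal with.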

Some web sites like strongspace are already using it even before its official release.

How stable is it considering that it is not even officially released? Read this blog from the lead test engineer:

Testing ZFS reminds me of when I was working for IBM at NASA on the Space Shuttle Flight Software Project....
And here's one more from another test engineer.
All in all, this means that we put our code through more abuse in about 20 seconds than most users will see in a lifetime. We run this test suite every night as part of our nightly testing, as well as when we're developing code in our private workspaces. And it obeys Bill's first law of test suites: the effectiveness of a test suite is directly proportional to its ease of use. We simply just type "ztest" and off it goes.

For those who want to learn more, here is a good starting point. ZFS home page is here.

Saturday, June 03, 2006

Hello world

My name is James Foronda. I work for Sun Microsystems' US Data Center Solutions - Enterprise Migration practice.