Sunday, March 25, 2007

Blog facelift in progress

I've finally gotten around to the blog facelift that I have been wanting to do for a while now.

First is the change in the blog template. I liked the look of the old template (Scribe) but the narrow post body made it problematic to format posts that include source code or screen output. The new template is a slightly modified version of the Stretch Denim Light template. As a result of the new template, I will be modifying some posts that include source code or screen output to make them look better. These modifications will be purely cosmetic, and subscribers who have already read the affected posts can safely ignore them in their feed readers.

Next, I have finally started a blogroll (shame on me) :).

Finally, I added links to websites that I hope readers of this blog will find useful. If you are running Oracle on Solaris, you should definitely check out the links to the Solaris Internals website.

Thursday, March 08, 2007

OraSRP v3.0b - screenshots for discussion

March 9, 2007
1. I changed the title of this post from OraSRP v3.0b - short program to approximate session elapsed time to OraSRP v3.0b - screenshots for discussion.
2. Added another screenshot of a short program to look for the minimum and maximum values of tim.

This entry was created to support some points being made in the comment section of the entry titled OraSRP v3.0b - questionable session elapsed time. It exists for the sole purpose of displaying the screenshots below, because it looks like Blogspot does not support screenshots in the comment section.

1. A short program to approximate session elapsed time of the trace file being discussed.

2. A short program to find the minimum and maximum tim values of the trace file being discussed.

Comments are disabled for this entry. If you want to join the discussion, please go here.

Saturday, March 03, 2007

OraSRP v3.0b - questionable session elapsed time

Note to subscribers: If your reader does not properly display some formatted content like bold and italics, I suggest you visit the blog site itself.

For those who may just be passing by, this is a continuation of two previous blog entries:

OraSRP v3.0b - a very short initial impression
OraSRP author replies

If you have not read these two entries, I suggest that you do so first, especially the first one, since that is where the possible problem with the calculation of session elapsed time was initially described. I also suggest that you read Egor's comment on that entry.

As the title says, this entry is about an observation that I made regarding the discrepancy between the session elapsed time reported by OraSRP and the contents of the raw Oracle Extended SQL Trace File (simply referred to as the trace file from here on). Earlier, I said that OraSRP reports a number that is around 16 seconds more than what I can readily see and approximate from the trace file. In his comment, Egor explained:

It depends on how to calculate elapsed time.
For example, OraSRP ignores 'tim' in 'PARSING IN CURSOR' lines.
Also, if trace file starts from WAIT lines then OraSRP will sum up elapsed times of such waits and subtract it from the first 'tim' value.

Thank you, Egor, for that explanation. But please allow me to be more specific.

Session Start Timestamp
Here's a screenshot that shows the top of the trace file (sensitive information has been crossed out, please click to enlarge):

From that screenshot, I make the following observations:

  • There are no WAIT lines before the first 'PARSING IN CURSOR' line, so the mention of WAIT lines in Egor's comment is not applicable to this specific situation.
  • Let's follow Egor's comment and ignore the tim value from the 'PARSING IN CURSOR' line (line 22). The next tim value after line 22 comes from the EXEC line (line 25). The value of tim for that line is 3928564402472. Those who are closely following the values of the timestamps may notice that the value of tim from the EXEC line (line 25) is less than the value of tim from the 'PARSING IN CURSOR' line. In this specific case, the EXEC tim is 488 microseconds (or 488/1,000,000 of a second) before the tim from 'PARSING IN CURSOR'. This may seem a little bit odd but if you happen to have the Optimizing Oracle Performance book, the reason why this is so is explained on pages 100-101 in the section titled "Oracle Release 9".
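For readers following along at home, the 488-microsecond difference can be checked with bc, using the two tim values quoted above:

```shell
# tim from the 'PARSING IN CURSOR' line (line 22) minus tim from the
# EXEC line (line 25), both taken from the trace file shown above
echo '3928564402960 - 3928564402472' | bc
# 488  (microseconds)
```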

At this point, I think it is safe to make the following statements regarding the start timestamp:

  • The actual start timestamp can not be greater than 3928564402472. By that I mean that it can be less, but not 16,000,000 microseconds less.

  • I believe that the earliest timestamp that one can get from this specific trace file is the one in the 'SESSION ID' line (line 19). In fact, the Optimizing Oracle Performance book describes this line as the one that "identifies the time and the session at which the first trace line was emitted" (page 104). Purists will probably disagree with me, but personally, I can tolerate not using the value in line 19 as the start time, especially in cases where the session elapsed time is big (like in this case, where it is more than 14,171 seconds). I say this for the following reasons:

    • The 'SESSION ID' time is in a different format and converting it to the tim format is not always straightforward. Of course, if one can figure out a reliable way to convert the 'SESSION ID' time to tim format, then this is the time that should be used as session start time.

    • The difference between the times represented by lines 19 and 25 is likely very small, just like the difference between lines 22 and 25 is small (488 microseconds). I can personally tolerate this difference knowing that it is a small, one-time occurrence. And in the context of what is being discussed here, I cannot imagine a valid case where this difference would go as high as 16 seconds (or 16,000,000 microseconds).

Session End Timestamp

Here's a screenshot that shows the bottom of the trace file (sensitive information has been crossed out, please click to enlarge):

From that screenshot, I make the following observations:
  • The last line that has a tim is the EXEC line (line 2156805) and the value of tim from this line is 3942736181428.
  • I said in the first post that the actual end timestamp can not be less than the tim value in the EXEC line above. By that, I implied that it can be more, but not 16,000,000 microseconds more. In this specific case, there are several WAIT lines that follow this EXEC line. The ela values for these WAIT lines must be accounted for.

The total amount of time that can be attributed to these WAIT lines is the sum of their ela values:

Line #        ela
---------   -----
2156806         1
2156807       560
2156809       917
2156810         2
2156811       518
---------   -----
Total       1,998

jforonda@u20$ bc
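The bc session output is not visible above, but summing the ela values from the table is a one-liner:

```shell
# sum of the ela values of the five trailing WAIT lines listed above
echo '1 + 560 + 917 + 2 + 518' | bc
# 1998  (microseconds)
```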

To get the session's end timestamp, simply add the last tim value (3942736181428 from line 2156805) to the total ela of the WAIT lines above (see Clock Walk Formulas, p101-102, Optimizing Oracle Performance).

jforonda@u20$ bc
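Again, the bc output itself is not shown, but reconstructing the addition from the numbers in the text gives:

```shell
# last tim value (EXEC line 2156805) plus the total ela of the
# WAIT lines that follow it
echo '3942736181428 + 1998' | bc
# 3942736183426
```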

To the best of my knowledge, that's the maximum value of the end timestamp that one can get from this specific trace file. If anyone can see anything that I've missed, I welcome your corrections.

Session Elapsed Time

The session elapsed time should just be the difference between the end timestamp and start timestamp:

jforonda@u20$ bc
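Reconstructing that calculation from the start and end timestamps derived above (tim is in microseconds in this Oracle9i trace file, hence the division by 1,000,000):

```shell
# (end timestamp - start timestamp) / 1,000,000 = elapsed seconds
echo 'scale=6; (3942736183426 - 3928564402472) / 1000000' | bc
# 14171.780954
```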

So there you go. That is how I would calculate this specific trace file's session elapsed time. Is this how others would do it? If what I presented above is wrong, where did it go wrong? If the method above is wrong, is it so far off that it cannot be relied on, or is it still reasonably close enough to be within the bounds of what is acceptable? Comments are welcome, whether you agree or disagree. Moderation is on to control spam.

Assuming that the calculation above is correct (or at least reasonably close enough to what it should be), it means that nothing much has changed from the initial observations that I made in the first post -- that is, there is a discrepancy of around 16 seconds from the value reported by OraSRP:

jforonda@u20$ bc
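The discrepancy itself is just the difference between OraSRP's reported elapsed time and the elapsed time derived above:

```shell
# OraSRP's reported elapsed time minus the trace-file-derived value
echo '14187.9512 - 14171.780954' | bc
# 16.170246  (seconds)
```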

Again, assuming that the calculation above is correct, some people might then ask: why bother with 16 seconds, especially since it is just around 0.11% of the total? I'm very interested to hear what others think about this. Does this 16-second discrepancy tell you anything? At the very least, does it suggest anything? What kinds of questions come to mind as a result of this? If you are a person who does Oracle performance analysis and improvement in high-profile situations, is this 16-second discrepancy acceptable to you? If you have some level of understanding of the trace file and a programming background at the same time, is this discrepancy acceptable to you? Why? Why not? Again, your comments are welcome. If there are no comments, I will take this up in a future post when my schedule permits.

If you found this series of posts worth your time, I would be honored if you subscribed.

Thursday, March 01, 2007

OraSRP author replies

Note to subscribers: If your reader does not properly display some formatted content like bold and italics (I know that at the very least, Bloglines has this problem), I suggest you visit the blog site itself.

First of all, I would like to thank OraSRP's author, Egor Starostin, for dropping by and making a comment on yesterday's blog entry, OraSRP v3.0b - a very short initial impression. I've decided to put my reply in a separate blog entry so that readers who do not subscribe to comment feeds can pick it up.

With regards to my observation that the number of lines that OraSRP reports is not equal to the actual number of lines in the raw trace file, Egor says:

OraSRP ignores some lines like
'===' or '*** 2007-02-28 12:45:08.424' and doesn't report them as 'processed'. So, 'Total lines processed in trace file' != 'Total lines in trace file'.

That's good to know, and I suspected as much. But without that clarification, a differing line count may suggest that some relevant lines are being missed. And when relevant lines are missed, the output may be affected to the point that it no longer presents a correct profile. When the profile is not correct, it may lead people to try to solve problems that are not the real ones to begin with. This is very important to me because the real strength of a correct profile, at least as far as I'm concerned, is its ability to pinpoint the root cause of a problem.

Of course, I have no problem with bypassing irrelevant lines, and since this is already being done, I respectfully suggest that the number of lines in the raw trace file still be reported, even as a separate line item, so that people don't get thrown off by differing numbers.

Now, let's double check the numbers and take into account the types of lines that Egor mentioned above. Let me deal with the '^===' lines first:

jforonda@u20$ egrep '^===' 4071.trc | head -5

That looks good enough. Now, let's count them:

jforonda@u20$ egrep '^===' 4071.trc | wc -l

So we have 2,165 lines that match this pattern.

For the date/timestamp lines:

jforonda@u20$ egrep '\*\*\* 200' 4071.trc | head -5
*** 2004-10-18 19:51:28.453
*** 2004-10-18 19:55:50.302
*** 2004-10-18 19:56:07.899
*** 2004-10-18 19:56:48.561
*** 2004-10-18 19:57:41.135

jforonda@u20$ egrep '\*\*\* 200' 4071.trc | wc -l

That's another 106 lines. If we add these two numbers to the total reported by OraSRP, we get a total of 2,152,611:

jforonda@u20$ bc
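The bc session output is missing above, but the addition uses the three counts already established: the lines OraSRP reported as processed, the '===' lines, and the '*** <timestamp>' lines:

```shell
# OraSRP's processed count plus the two kinds of bypassed lines
echo '2150340 + 2165 + 106' | bc
# 2152611
```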

This means that there are still 4,239 lines that are not accounted for:

jforonda@u20$ bc
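That figure comes from subtracting the adjusted total from the raw line count established earlier with wc -l:

```shell
# total lines in the raw trace file minus the adjusted total above
echo '2156850 - 2152611' | bc
# 4239
```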

This is still a lot of lines left unaccounted for. Now, Egor's comment said "OraSRP ignores some lines like..." and that makes me think that there are other types of lines, aside from those already mentioned, that are being bypassed on purpose. If Egor gives me a complete list of the types of lines that are bypassed on purpose, I'd be happy to check my trace file against it.

With regard to the different totals between the "with idle" and "without idle" runs, Egor said:

As for different times with/without idle waits recognition -- I believe it's a bug. Unfortunately, I can't reproduce it yet, so I can't fix it yet.

As much as I want to help, I will say now that I cannot give this trace file to Egor. As most people reading this probably know, 10046 trace files contain confidential information. This is also the reason why I crossed out the instance and node names from the screenshots that I posted yesterday. If there are readers who can share their trace files with Egor, I suggest that they try to do what I've done here and see if they can reproduce the problems that I have described.

Now, there's still that issue about the total elapsed time but I have to attend to other things so I will stop for now. I will post another entry to talk about the elapsed time within the next day or so. If you want to follow this topic, it will make it easier if you subscribe. :)

Tuesday, February 27, 2007

OraSRP v3.0b - a very short initial impression

Update: This has evolved into a series of posts so I added links at the bottom to guide readers who may want to read the entire series of posts.

Last Friday, February 23, Doug Burns mentioned in his blog that OraSRP v3.0b (beta) has been released. I have a special interest in the Oracle Extended SQL Trace File because I have used it to help me solve a lot of performance problems. My interest with the trace file started after reading the excellent book Optimizing Oracle Performance by Cary Millsap and Jeff Holt of Hotsos.

Anyway, I've tried earlier versions of OraSRP but I had some problems. Off the top of my head:
  • It could not finish processing certain files. This had nothing to do with speed; OraSRP was simply not able to finish processing large files with a lot of BIND lines in them. I had two big files (44+ and 66+ million lines) where OraSRP would just stall. I ran OraSRP against these files on several platforms (Solaris, Linux, Windows), and I had to kill the runs after two days. The workaround for this type of problem was to bypass or limit the processing of BIND lines.
  • I compared the output to that of the Hotsos profiler (v4.0.16). There were some numbers that were too far apart -- so far that they cannot be attributed to rounding. This does not mean, of course, that OraSRP was wrong. I'm just saying the outputs of the two profilers were vastly different in some sections.
Again, those two bullet points are for an earlier version of OraSRP. I also want to make clear that I did not have time to prove which output is correct and so I could not conclude that OraSRP was wrong; after all, it is entirely possible that there are some bugs in the Hotsos profiler itself. But I have to say that the big differences in the numbers made me very uncomfortable not only because I know that the people at Hotsos wrote THE book on the Oracle Extended SQL Trace File, but also because I have used earlier versions of the Hotsos profiler to help me solve some very nasty performance problems at some of Sun's big customers.

And so it is natural for me to want to see what is new with OraSRP v3.0b. This new version is a complete rewrite from Python to the D programming language (not to be confused with a similarly named language for Solaris DTrace). It is also in binary-only format for Linux and Windows.

These days, my main workstation runs Solaris and I no longer have a machine that runs Linux so I had to use the Windows version. Here's a screenshot of the first two sections of the report:

Please note the second arrow pointing at "Total lines processed in trace file". OraSRP is reporting 2,150,340 lines. That did not look right to me. So I went back to the raw trace file and did this:

jforonda@u20$ wc -l 4071.trc
2156850 4071.trc

jforonda@u20$ bc
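The bc session output is not shown above, but the subtraction uses the two line counts just mentioned (the wc -l total and the count OraSRP reported):

```shell
# total lines per wc -l minus the lines OraSRP reports as processed
echo '2156850 - 2150340' | bc
# 6510
```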

Hmm... I know that this specific trace file is 'clean' and it does not have things like those "dump from memory" lines. So it seems like there are 6,510 lines that are somehow not accounted for. What does this mean? Is it a big deal? I'll let the reader draw their own conclusions.

Next, I looked at the total elapsed time for the session. In the image above, that information would be where I drew an arrow at the very bottom of the page. For the total elapsed time, OraSRP reports 14,187.9512 seconds. Again, that didn't look right to me. So again, I went back to the raw trace file. To get a quick approximation of the session start timestamp, I did this:

jforonda@u20$ grep tim 4071.trc | head -1
PARSING IN CURSOR #5 len=69 dep=1 uid=413 oct=42 lid=413 tim=3928564402960 hv=2004533713 ad='ab3a9520'

And to get a quick approximation of the session end timestamp:

jforonda@u20$ grep tim 4071.trc | tail -1
EXEC #15:c=0,e=1077,p=0,cr=3,cu=1,mis=0,r=1,dep=0,


I used the term "quick approximation" because I understand that those values may not be the absolute start and end timestamps. However, for the purpose of this blog entry, they should be good enough. I think it is safe to say the following:

  • the actual start timestamp can not be greater than 3928564402960
  • the actual end timestamp can not be less than 3942736181428
So I went back to my trusty old command line calculator. :) Note that I set the scale to 6 and divided by 1,000,000 because this is an Oracle9i trace file and so the tim value is expressed in microseconds.

jforonda@u20$ bc
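Reconstructing that session from the two tim values quoted above:

```shell
# rough elapsed time: (last tim - first tim) / 1,000,000 seconds;
# scale=6 keeps the microsecond precision in the result
echo 'scale=6; (3942736181428 - 3928564402960) / 1000000' | bc
# 14171.778468
```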

So according to the raw trace file, the elapsed time is approximately 14,171 seconds. In fact, I can probably conclude at this point that the elapsed time can not be less than 14,171 seconds. OraSRP reports 14,187 seconds or around 16 seconds more. Okay, OraSRP's elapsed time is not less than my rough estimate but a 16-second difference does not look quite right. Hmm...

Then I noticed that OraSRP has an option to recognize idle 'SQL*Net message from client' events which was turned on in my initial run. So I made one more run with this option turned off. Here's a screenshot:

Now, the title of this blog entry partly reads "a very short initial impression" and it is getting too long :). So... can you spot the difference between the two screenshots?

OraSRP v3.0b posts to date

OraSRP v3.0b - a very short initial impression
OraSRP author replies
OraSRP v3.0b - questionable session elapsed time

Wednesday, January 31, 2007

Faulty FC port meets ZFS

Right now, I'm in a project that is taking a lot of my time and as such, I am behind my normal readings. Whenever I'm in this situation, my mailing list subscriptions are usually the first ones to suffer.

Anyway, I just saw this post from the zfs-discuss mailing list. It was posted about two months ago, but I think it is worth posting here as well, especially since a lot of my ZFS posts have been getting some encouraging levels of page hits.

The post talks about how ZFS caught a situation where different layers of the I/O stack were trusting each other to the point where data was being silently corrupted. In fact, even the administrators placed complete trust in their hardware. The post says (all emphasis mine):

there was no mirroring or protection at the server level, we had delegated that function to the DMX3500

I've seen this kind of "trust" in some sites. But that turned out to be a bad decision:

As it turns out our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. DMX3500 faithfully wrote the bad data and returned normal ACKs back to the server, thus all our servers reported no storage problems.
ZFS was the first one to pick up on the silent corruption

Ah, silent data corruption, the kind of problem that used to keep me up at night. :)

Tuesday, January 30, 2007

What a pleasant surprise (the power of google)

I use Google Analytics and FeedBurner SiteStats to monitor this blog's traffic and subscriptions. This site gets very modest traffic, but for the past few months I have noticed that more and more readers are landing on this site from a Google search. I also noticed that most of the searches were for silent data corruption, a topic I covered in a previous entry. Currently, that entry is the number-two Google search result for the term silent data corruption.

When I tried this search today, I got this:

No wonder. What a pleasant surprise! :)

Sunday, January 14, 2007

ZFS Compression

This post is a result of a thread in oracle-l. It is provided to show the basics of ZFS compression.

This was done on my single core Sun Ultra 20 workstation running Solaris. The output has been slightly reformatted to make it fit the screen.

Note that this is a very simple case on a machine that is not doing any significant amount of work. Please don't use the information here as the basis of any decision you will make, especially for critical systems. Your mileage will most likely vary.

Here's a ZFS pool named p1 whose compression bit is turned on.

root@u20# zfs get compression p1
NAME  PROPERTY     VALUE  SOURCE
p1    compression  on     local

Let's create two filesystems under this pool and name them u (for uncompressed) and c (for compressed).

root@u20# zfs create p1/u
root@u20# zfs create p1/c

By default, their compression bits are turned on because they inherit this characteristic from their parent (p1).

root@u20# zfs get compression p1/u p1/c
NAME  PROPERTY     VALUE  SOURCE
p1/c  compression  on     inherited from p1
p1/u  compression  on     inherited from p1

Let's turn off the compression bit for the u (uncompressed) filesystem.

root@u20# zfs set compression=off p1/u

Let's verify that it was actually turned off.

root@u20# zfs get compression p1/u p1/c
NAME  PROPERTY     VALUE  SOURCE
p1/c  compression  on     inherited from p1
p1/u  compression  off    local

Let's go get some reasonably-sized test file.

root@u20# pwd
root@u20# ls -l
total 1589543
-r--r--r-- 1 jforonda staff 2962342454 Sep 11 2004 test_ora_17255_t.trc

There... that file is an Oracle extended SQL trace file that is around 2.8G in size.

Let's make two copies of this file. First, to the uncompressed filesystem:

root@u20# ptime cp /10046/tmp/test_ora_17255_t.trc /p1/u
real 1:20.604
user 0.002
sys 14.613

Then, to the compressed filesystem:

root@u20# ptime cp /10046/tmp/test_ora_17255_t.trc /p1/c
real 1:20.228
user 0.002
sys 12.637

Before going further, please note the output of ptime for each of the times we copied the files. There's less than one second difference between the two out of an approximate duration of 80 seconds for each copy operation. Also, note that the copy operation to the compressed filesystem was actually faster. Hmm... interesting.

Doing an ls on the two files will show that they have exactly the same number of bytes.

root@u20# ls -l /p1/[uc]/*
-r--r--r-- 1 root root 2962342454 Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 2962342454 Jan 13 14:21 /p1/u/test_ora_17255_t.trc

And in fact, the Solaris digest command tells us that they have exactly the same contents.

root@u20# digest -a md5 -v /p1/[uc]/*
md5 (/p1/c/test_ora_17255_t.trc) = 97f86fcfdfc3f21a68ffc1892a945e77
md5 (/p1/u/test_ora_17255_t.trc) = 97f86fcfdfc3f21a68ffc1892a945e77

But the amount of space that they occupy on disk is not the same. The file residing in the compressed filesystem is around 1/3 the size of the file that resides in the uncompressed filesystem.

root@u20# du -sh /p1/[uc]/*
776M /p1/c/test_ora_17255_t.trc
2.8G /p1/u/test_ora_17255_t.trc

A better way to see the compression ratio is to use the zfs get command:

root@u20# zfs get compressratio p1/u p1/c
NAME  PROPERTY       VALUE  SOURCE
p1/c  compressratio  3.64x  -
p1/u  compressratio  1.00x  -
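As a sanity check, that 3.64x figure is consistent with the byte counts we already have. The du output of 776M is rounded, so treating it as 776 × 1,048,576 bytes gives only an approximation, but it lands right on the reported ratio:

```shell
# logical file size divided by the (rounded) on-disk size from du
echo 'scale=2; 2962342454 / (776 * 1048576)' | bc
# 3.64
```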

Applications don't have to know that a file is compressed -- ZFS does the compression and decompression on the fly. Applications can read the files normally, like this:

root@u20# tail -5 /p1/u/test_ora_17255_t.trc
PARSE #21:c=0,e=123,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621053
BINDS #21:
EXEC #21:c=0,e=259,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621393
EXEC #3:c=0,e=3416,p=0,cr=1,cu=3,mis=0,r=1,dep=0,og=4,tim=740525622047

root@u20# tail -5 /p1/c/test_ora_17255_t.trc
PARSE #21:c=0,e=123,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621053
BINDS #21:
EXEC #21:c=0,e=259,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=740525621393
EXEC #3:c=0,e=3416,p=0,cr=1,cu=3,mis=0,r=1,dep=0,og=4,tim=740525622047

Now, how good is the default ZFS compression compared to something like gzip? Let's find out by compressing the file that is in the uncompressed filesystem.

root@u20# ptime gzip /p1/u/test_ora_17255_t.trc
real 1:27.206
user 1:19.614
sys 2.985

As expected, gzip is much better.

root@u20# ls -hl /p1/[uc]/*
-r--r--r-- 1 root root 2.8G Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 99M Jan 13 14:21 /p1/u/test_ora_17255_t.trc.gz

root@u20# ls -l /p1/[uc]/*
-r--r--r-- 1 root root 2962342454 Jan 13 14:23 /p1/c/test_ora_17255_t.trc
-r--r--r-- 1 root root 104023266 Jan 13 14:21 /p1/u/test_ora_17255_t.trc.gz

root@u20# du -sh /p1/[uc]/*
776M /p1/c/test_ora_17255_t.trc
99M /p1/u/test_ora_17255_t.trc.gz

The above shows the following:

Size of original file: 2.8GB
Size of file compressed by ZFS: 776MB
Size of file compressed by gzip: 99MB
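Putting rough numbers on that comparison (the gzip size of 104,023,266 bytes is exact, from ls -l; the ZFS ratio of 3.64x came from zfs get compressratio):

```shell
# gzip compression ratio: original bytes / gzipped bytes
echo 'scale=2; 2962342454 / 104023266' | bc
# 28.47  (versus roughly 3.64x for the default ZFS compression)
```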

Why is this so? Well, programs like gzip can take their time and optimize for compression ratio at the expense of elapsed time. ZFS, on the other hand, has other concerns besides compression ratio, so it has to strike the right balance between speed and compression. That said, the ZFS man page says:

compression=on | off | lzjb

Controls the compression algorithm used for this dataset. There is currently only one algorithm, "lzjb", though this may change in future releases.

Saturday, January 13, 2007

The day is coming...

This is in reference to my previous entry titled "Is the day coming...".

Sun's ThinGuy says it is coming soon. You can see a demo here.

And yes, it is called win4solaris.

Thursday, January 11, 2007

Is the day coming...

... when you will be able to run Windows on top of Solaris x64, just like you can run Windows on top of Linux using VMWare today? As of today, Solaris is supported only as a guest operating system under VMWare.

If Solaris can be a host operating system to another OS, then it is possible to use some of the Solaris features on the hosted OS. For example, today, it is possible to use DTrace on a Linux system running on a Solaris BrandZ Zone. In a similar fashion, if one can run Windows on top of Solaris, it is then possible for Windows to indirectly take advantage of features such as DTrace and ZFS. And with things like resource management, it then becomes possible to cap the resources used by that instance of the Windows OS. The thought of being able to do this excites me no end. :)

Anyway, I've always wondered what Sun is going to do in this area. A few years ago, we had a product called SunPCi, a combination of hardware and software. The hardware component was a card with an AMD processor that went into a SPARC box. With this hardware/software combination, one could install Windows and some Linux distros on top of Solaris. It was similar to VMWare in many ways. I used SunPCi a lot and I loved it.

When Solaris x86/64 was announced, I thought that Sun would then take the SunPCi software component and make it run on Solaris x64/86. But that has not happened.

Then, last October, ThinGuy mentioned that Sun is working with Win4Lin. Today, I see this.

Hmmm.... win4solaris, anyone?