Saturday, March 03, 2007

OraSRP v3.0b - questionable session elapsed time

Note to subscribers: If your reader does not properly display some formatted content like bold and italics, I suggest you visit the blog site itself.



For those who may just be passing by, this is a continuation of two previous blog entries:


If you have not read these two entries, I suggest that you do so first, specially the first one, since that is where the possible problem with the calculation of session elapsed time was initially described. I also suggest that you read Egor's comment for that entry.

As the title says, this entry is about an observation that I made regarding the discrepancy between the session elapsed time reported by OraSRP, and the contents of the raw Oracle Extended SQL Trace File (simply referred to as trace file from hereon). Earlier, I said that OraSRP reports a number that is around 16 seconds more than what I can readily see and approximate from the trace file. In his comment, Egor explained:

It depends on how to calculate elapsed time.
For example, OraSRP ignores 'tim' in 'PARSING IN CURSOR' lines.
Also, if trace file starts from WAIT lines then OraSRP will sum up elapsed times of such waits and substract it from the first 'tim' value.

Thank you, Egor, for that explanation. But please allow me to be more specific.

Session Start Timestamp
Here's a screenshot that shows the top of the trace file (sensitive information has been crossed out, please click to enlarge):


From that screenshot, I make the following observations:

  • There are no WAIT lines before the first 'PARSING IN CURSOR' line so the mention of WAIT lines in Egor's comments is not applicable to this specific situation.
  • Let's follow Egor's comment and ignore the tim value from the 'PARSING IN CURSOR' line (line 22). The next tim value after line 22 comes from the EXEC line (line 25). The value of tim for that line is 3928564402472. Those who are closely following the values of the timestamps may notice that the value of tim from the EXEC line (line 25) is less than the value of tim from the 'PARSING IN CURSOR' line. In this specific case, the EXEC tim is 488 microseconds (or 488/1,000,000 of a second) before the tim from 'PARSING IN CURSOR'. This may seem a little bit odd but if you happen to have the Optimizing Oracle Performance book, the reason why this is so is explained on pages 100-101 in the section titled "Oracle Release 9".

At this point, I think it is safe to make the following statements regarding the start timestamp:

  • The actual start timestamp can not be greater than 3928564402472. By that I mean that it can be less, but not 16,000,000 microseconds less.

  • I believe that the earliest timestamp that one can get from this specific tracefile is the one in the 'SESSION ID' line (line 19). In fact, the Optimizing Oracle Performance book describes this line as the line that "identifies the time and the session at which the first trace line was emitted" (page 104). Purists will probably disagree with me but personally, I can tolerate not using the value in line 19 as start time, specially in cases where the session elapsed time is big (like in this case where it is more than 14,171 seconds). I say this for the following reasons:

    • The 'SESSION ID' time is in a different format and converting it to the tim format is not always straightforward. Of course, if one can figure out a reliable way to convert the 'SESSION ID' time to tim format, then this is the time that should be used as session start time.

    • The difference between the times that are represented between lines 19 and 25 is likely very small, just like the difference between lines 22 and 25 is also small (488 microseconds). I can personally tolerate this difference knowing that it is a small one-time occurence. And in the context of what is being discussed here now, I can not imagine a valid case where this difference would go as high as 16 seconds (or 16,000,000 microseconds).


Session End Timestamp

Here's a screenshot that shows the bottom of the trace file (sensitive information has been crossed out, please click to enlarge):


From that screenshot, I make the following observations:
  • The last line that has a tim is the EXEC line (line 2156805) and the value of tim from this line is 3942736181428.
  • I said in the first post that the actual end timestamp can not be less than the tim value in the EXEC line above. By that, I implied that it can be more, but not 16,000,000 microseconds more. In this specific case, there are several WAIT lines that follow this EXEC line. The ela values for these WAIT lines must be accounted for.

The total amount of time that can be attributed to these WAIT lines is the sum of their ela values:

Line # ela
------- -----
2156806 1
2156807 560
2156809 917
2156810 2
2156811 518
------- -----
Total 1,998

jforonda@u20$ bc
1+560+917+2+518
1998
jforonda@u20$

To get the session's end timestamp, simply add the last tim value (3942736181428 from line 2156805) to the total ela of the WAIT lines above (see Clock Walk Formulas, p101-102, Optimizing Oracle Performance).

jforonda@u20$ bc
3942736181428+1998
3942736183426
jforonda@u20$

To the best of my knowledge, that's the maximum value of the end timestamp that one can get from this specific tracefile. If anyone can see anything that I've missed, I will welcome your corrections.


Session Elapsed Time

The session elapsed time should just be the difference between the end timestamp and start timestamp:

jforonda@u20$ bc
scale=6
(3942736183426-3928564402472)/1000000
14171.780954
jforonda@u20$

So there you go. That is how I would calculate this specific trace file's session elapsed time. Is this how others would do it? If what I presented above is wrong, where did it go wrong? If the method above is wrong, is it so way off to a point that it can not be relied on? Is it even reasonably close enough to be within the bounds of what is acceptable? Comments are welcome, whether you agree or disagree. Moderation is on to control spam.


Assuming that the calculation above is correct (or at least reasonably close enough to what it should be), it means that nothing much has changed from the initial observations that I made in the first post -- that is, there is a discrepancy of around 16 seconds from the value reported by OraSRP:

jforonda@u20$ bc
scale=6
14187.9512-14171.780954
16.170246
jforonda@u20$


Again, assuming that the calculation above is correct, some people might then ask: why bother with 16 seconds specially since it is just around 0.11% of the total? I'm very interested to hear what others think about this. Does this 16-second discrepancy tell you anything? At the very least, does it suggest anything? What kinds of questions come up on your mind as a result of this? If you are a person who does Oracle performance analysis and improvement in high profile situations, is this 16-second discrepancy acceptable to you? If you have some level of understanding of the trace file and you have a programming background at the same time, is this discrepancy acceptable to you? Why? Why not? Again, your comments are welcome. If there are no comments, I will take this up in a future post when my schedule permits.

If you find this series of posts was worth your time, I would be honored if you subscribe.

17 comments:

Brandon Allen said...

I'm not sure what would explain the 16 second difference, but I would say that whether or not it is acceptable would depend on the duration of the problematic situation I was tracing. For example, if I was working on something that was running for 30 seconds but used to run in only 1 second, then I'd be very concerned about an inaccuracy/discrepancy of 16 seconds, but if I was working on something that was running for 30 minutes, I wouldn't be too concerned about losing 16 seconds.

jforonda said...

Thanks for the comment, Brandon. I will refrain from making further comments for the time being and let others join in, if they want.

Thanks for dropping by.

egor said...

James,
sorry for the late answer.

Here's another bit of info how OraSRP calculates session elapsed time.
For every cursor OraSRP calculates min/max times. And session mintime is a minimum of all cursors mintimes.

Would you please to download http://oracledba.ru/orasrp/orasrp.exe and run it with --debug flag? It will print to stderr a couple of lines (among many others) 'min/max time in trace'.

Norman Dunbar said...

Hi James.

Nice work on testing out OraSRP and while you have raised some good points about potential problems I have to agree with Brandon that whether the missing 16 seconds is a bad thing 'depends'.

I have to admit to being a big fan of OraSRP especially as I tried to write a 10046 parser myself (might even finish it one day) and it is incredibly difficult.

I shall have a go with some of my own trace files - none as big as 44-66 million lines though and see how V3 matches up to V2.

I know that I found a couple of foibles with v2 - log file sync waits were being included as part of the cursor following them rather than as part of the commit (xctend) that preceeded them. However, it does save a huge amount of time analysinng traces.

Thanks for the work you have done here.

Cheers,
Norman.

jforonda said...

Guys,

Thank you all for your comments. They are all appreciated.

Like I said earlier, I will hold off my comments regarding the significance of the perceived 16-second discrepancy for now.

At this time, I would like to take a moment to clear out certain things:

> Here's another bit of info how OraSRP calculates session elapsed time.
> For every cursor OraSRP calculates min/max times. And session mintime is a minimum of all cursors mintimes.


Let me draw attention to the 'tim' value that was used in the calculation. That 'tim' value comes from the EXEC line in line 25. That EXEC line is a result of the statement that was parsed in line 22. Line 22's SQL text is in line 23. That SQL text is the 'alter session...' command -- the command that activated the trace. So my question is: Could there be a 'tim' value within this trace file that could come 16,000,000 microseconds before that?

But let's put aside OraSRP for a while and talk about the way that the calculation was done in the body of this post. I put out my calculation because I want it to go under the scrutiny of people like yourselves. If you have the time, I would appreciate to have your feedback on this. Does anyone disagree with the way the calculation for this specific trace file was done? If you disagree, why? If it is not correct, do you see any case where it would be off by 16,000,000 microseconds?

In a nutshell, all I've done there is to say:
- here's the start time
- here's the end time
- subtract start from end and that is elapsed time


> Would you please to download http://oracledba.ru/orasrp/orasrp.exe and run it with --debug flag? It will print to stderr a couple of lines (among many others) 'min/max time in trace'.


The next paragraph is a copy of parts of my answer to Egor in the other post (some emphasis added):

The debug option might be good for development purposes but for the purpose of independently checking the output of OraSRP (which is what I've done), I really prefer using separate, well-known, and well-accepted tools, if possible. That is the reason why in my posts, I used things like grep and wc. After all, who would question the output of something like 'wc -l'? Some people may question it... but I won't.

But what I would like to add to that is that I don't want to drag myself towards this direction. If others can duplicate the things I have described in this series of posts, then my hope is that someone would do this -debug thing. Like most of us, I too have time constraints and so I would like to limit what I do here to the description of what I have observed and believe to be suspect areas. I've tried to do that with this 'session elapsed time' post as best as I could by putting out the information that I think is needed and showing how the calculation is done. Others who have their own trace files and who have the time and inclination to do so should be able to do the things I've done, if they want to go to that direction.

> I tried to write a 10046 parser myself (might even finish it one day) and it is incredibly difficult.
It is good that you mention that, Norman. I myself have a programming background and although I don't claim to be an expert on the trace file, I do see myself as one who probably has an above-average knowledge on the trace file. I have played out this problem in my mind and I have to say that it is not a simple program to write. And in my mind, the fact that it is not simple makes independent verification all the more important.

egor said...


> In a nutshell, all I've done there is to say:
> - here's the start time
> - here's the end time
> - subtract start from end and that is elapsed time

Why do you think that you calculate (actually you just look at the start/end of trace) start/end times correctly?
What if your trace file contain (min tim) somewhere around 100th line?

Or what if your trace file contain info from more than one session? In that case when you try to calculate session time by substracting (last line tim) from (first line tim) then you calculate it wrong.

Or maybe it's just a bug in OraSRP.

Since you are unable to send me trace file, I try to understand where is the problem via debug information. Since you don't want to send me debug info then all I can say is "yes, you can't calculate elapsed time by just substracting last tim value from first tim value because sometimes it gives you wrong results".

jforonda said...

Egor,

> Why do you think that you calculate (actually you just look at the start/end of trace) start/end times correctly?

Perhaps we should talk about start and end times separately. We can start with the min tim and then talk about the end tim once we come into an agreement about the min tim. I hope that is okay with you. Please let me know if you want to do it another way.

>What if your trace file contain (min tim) somewhere around 100th line?

I'm actually tempted to write a short script now to simply find the min and max tim. But before I do that, I would like to repeat something I said earlier and hope that it will put the start time issue to rest:

Let me draw attention to the 'tim' value that was used in the calculation. That 'tim' value comes from the EXEC line in line 25. That EXEC line is a result of the statement that was parsed in line 22. Line 22's SQL text is in line 23. That SQL text is the 'alter session...' command -- the command that activated the trace. So my question is: Could there be a 'tim' value within this trace file that could come 16,000,000 microseconds before that?

Please allow me to rephrase: the start tim used in the calculation comes from the command that activated the 10046 trace.

With that, I would now like to ask if you still have any objections about using that tim value as the start timestamp. If you have any objections, can you please give us the reason why?

So that's all I can say about the start tim for now. I did not see any specific objections about the end timestamp. Should I now assume that you agree with the end timestamp?

>Or what if your trace file contain info from more than one session? In that case when you try to calculate session time by substracting (last line tim) from (first line tim) then you calculate it wrong.

I collected this trace file myself and I can assure you that it comes from a single session.

>Since you are unable to send me trace file, I try to understand where is the problem via debug information. Since you don't want to send me debug info then all I can say is "yes, you can't calculate elapsed time by just substracting last tim value from first tim value because sometimes it gives you wrong results".

I did already explain my reasons for these. My hope is that other readers may be able to duplicate the things that I have described here and help you out with the -debug.

If there are readers who can duplicate the things I've described here, I urge you to speak up. If you can not help with the -debug (for whatever reason), a simple "yeah, I see it too" will be a big help in this conversation.

Thank you very much.

James

jforonda said...

Egor,

Something just crossed my mind and I realized that I should probably have mentioned it earlier. The way that the trace file was activated was by inserting the 'alter session...' command in the program's source code, right after the connection to the database was established. IOW, it was not activated from a 3rd-party session.

I hope that gives this more context.

James

jforonda said...

Egor,

I create a new post just for the purpose of displaying a screenshot. Please have a look at it here:

http://jforonda.blogspot.com/2007/03/orasrp-v30b-short-program-to.html

First, please let me know if there is anything you want to change in that short Python program.

Please note that the line numbers and values of start and end time are consistent with what has been discussed in the body of this post.

Please also note that the "0.001998 of a second" difference shown by 'bc' is consistent with the total of the ela from the WAIT lines after the last EXEC (also discussed in the body of this post).

I hope this helps.

Thank you.

James

egor said...


I hope this helps.
All I want is either explain the way OraSRP calculates elapsed time for your trace file or just confirm that it's a bug in OraSRP. So, for me, your python program doesn't help at all.

Since you are unable to send me trace file and you don't want to send me debug info then all I can say is "yes, you can't calculate elapsed time by just substracting last tim value from first tim value because sometimes it gives you wrong results".

Let's look at the following tiny example:

PARSING IN CURSOR #1 len=16 dep=0 uid=1 oct=3 lid=0 tim=1000000 hv=1000000 ad='5ff9ad80'
cursor 1000000 dep0
END OF STMT
PARSE #1:c=1000,e=1000,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=1000000
EXEC #1:c=1000,e=1000,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=1001000
WAIT #1: nam='SQL*Net message from client' ela= 1000000 p1=1952673792 p2=1 p3=0


According to your python program elapsedtime=1000 (mintime=1000000 and maxtime=1001000). The same result will be in case if you use well known tools like grep/head/tail/bc.
According to OraSRP elapsedtime=1002000.
Which result is correct?
I believe that 1002000 is more precise answer here than 1000.

jforonda said...

Egor,

> So, for me, your python program doesn't help at all.

I'm sorry if the first one didn't help. I probably was not clear enough about the intent and looking back now, the title of the post being "OraSRP v3.0b - short program to approximate session elapsed time" probably added to it being taken out of context.

So please allow me to try one more time. I updated the page that I used to display the screenshots for this specific discussion. It is still here:

http://jforonda.blogspot.com/2007/03/orasrp-v30b-short-program-to.html

I added the second screenshot (the first one is still there so it is preserved). The second screenshot shows a slightly modified version of the first program. This time, the program does not calculate elapsed time. It simply looks for min and max tim values and prints them together with the total lines read.

The intent of this program is to attempt to answer one of your earlier questions, namely:

>What if your trace file contain (min tim) somewhere around 100th line?

The output of the program shows that the minimum tim found in the trace file is the one in line 25, the same one that was used in the calculation shown in the body of this post.

Now, on top of that, please note other things I've mentioned earlier so that everything is in context:

- The way that the trace file was activated was by inserting the 'alter session...' command in the program's source code, right after the connection to the database was established. IOW, it was not activated from a 3rd-party session.

- The start tim used in the calculation comes from the command that activated the 10046 trace.

- The second screenshot shows the minimum tim as the same one as that used in the calculations.

I hope this helps.

James

egor said...

So, your first tim line is 25, last tim line is 2156805. What are the cursor# values in these lines?
Also, in order to provide valuable info you simple program need to print all WAIT lines (actually only cursor# values and ela values are necessary) before line 25 and after line 2156805.

jforonda said...

> So, your first tim line is 25, last tim line is 2156805. What are the cursor# values in these lines?
Also, in order to provide valuable info you simple program need to print all WAIT lines (actually only cursor# values and ela values are necessary) before line 25 and after line 2156805.


The answers to these can be found in the two screenshots in the body of this post. The first screenshot shows the top of the tracefile, and the second one shows the bottom.

The answers are:

Cursor #:
line 25 - cursor# 5
line 2156805 - cursor #15

WAIT lines:
Before line 25 - No WAIT lines precedes this line. Please see screenshot of top of tracefile.

After line 2156805 - there's several of these. I used them as part of the calculation to determine end timestamp. Those are the ones that total to 1,998 microseconds. Again, please refer to body of post and the screenshot that represents the bottom of the tracefile.

egor said...

So, the chances that it's a bug in OraSRP are increasing.
You can write a small program (or just use grep) to grep lines 'PARSING IN CURSOR/PARSE/EXEC/FETCH' (i.e. no cursors text, no binds, just numbers) and send them to me.

jforonda said...

>So, the chances that it's a bug in OraSRP are increasing.

I'm glad to hear that.

>You can write a small program (or just use grep) to grep lines 'PARSING IN CURSOR/PARSE/EXEC/FETCH' (i.e. no cursors text, no binds, just numbers) and send them to me.

I'm on the road right now but let me just quickly say that I think I've given you all the information that is needed to calculate the session elapsed time for this specific trace file. At this point, I believe that you should now be able to say for sure if the calculation that was done in the body of this post is correct or not.

Please note that as I mentioned earlier, I want to put a limit to what I do here, and I really don't want to be drawn into the debugging phase.

egor said...

> I believe that you should now be able to say for sure if the calculation that was done in the body of this post is correct or not.
Strictly speaking -- no. But I think that with high probability this calculation is correct.

> I really don't want to be drawn into the debugging phase.
Sure, I'll try not to bother you at all.

jforonda said...

Egor,

First of all, I'm sorry for the delay in moderating the comments. I can see that you posted your comment last Saturday and I just approved it now (Tuesday). As I briefly mentioned in my previous comment, I'm currently on the road.

Anyway...

>Strictly speaking -- no.
Can you be more specific? What information is missing for you to be sure that the calculation described in the main post is correct or not? Again, please keep in mind that we are just talking about this specific trace file.

>But I think that with high probability this calculation is correct.
It would be nice if we can get to a point where you have no doubts whatsoever. However, I also recognize that that is not always possible.