Monday, August 18, 2008

I'm alive

Yes I am :D ... just in case my inactivity here raised any doubts... and I won't be surprised if it did because, despite several reminders from my mentor that a blog update was pending, I kept putting it off until I had something substantial to report (or... was it my laziness?). Now that the pencils-down date has arrived, I see no further excuse for postponing it.

Since the last time..
  • I got my disk and CPU info without much use of guru mode code (I used the queue_stats tapset, though I had to create my own copy of it so that I could change the default time unit to milliseconds instead of microseconds).
The tapset says..
# qstats.stp: Queue statistics gathering tapset
# -------------------------------------------------------------------------
# The default timing function: microseconds. This function could
# go into a separate file (say, qstats_qs_time.stp), so that a user
# script can override it with another definition.

function qs_time () { return gettimeofday_ms () }

# -------------------------------------------------------------------------

Until that is done, I might have to stick with my own copy of queue_stats.
  • My renderer module is up. It supports only SVG for now, but I'll extend it to support other formats very soon.
  • I gather per-process CPU statistics which show how much system and user time each process takes (got this idea from bootprobe).
  • I trace sys_open and gather statistics such as which process reads from or writes to which file.
  • I also trace the blockIO (in this case I just provide a way to bring out the blockIO information, as gathered by the tapset, in XML format).
The idea behind tracing points 3, 4 and 5 is to capture as much information as possible, at least in text format, so that even what cannot be rendered (it would terribly clutter the graph) is still available in detail.
As all the above information is timestamped, correlation is very easy.
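Since every record carries a timestamp, lining up events from the different traces is essentially a sort by time. Here is a minimal Java sketch of that idea. The `TimelineSketch` and `Event` names, the fields and the sample events are all made up for illustration and are not taken from the bootlimn source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical types for illustration; bootlimn's real classes may look different.
final class TimelineSketch {

    /** One timestamped observation from any trace (CPU, sys_open, block IO). */
    static final class Event {
        final long timeMs;    // timestamp in milliseconds
        final String source;  // which trace produced the event
        final String detail;  // free-form description

        Event(long timeMs, String source, String detail) {
            this.timeMs = timeMs;
            this.source = source;
            this.detail = detail;
        }

        @Override
        public String toString() {
            return timeMs + "ms [" + source + "] " + detail;
        }
    }

    /** Merges several per-trace event lists into one list ordered by timestamp. */
    static List<Event> correlate(List<List<Event>> traces) {
        List<Event> merged = new ArrayList<>();
        for (List<Event> trace : traces) {
            merged.addAll(trace);
        }
        // List.sort is stable, so events with equal timestamps keep their input order.
        merged.sort(Comparator.comparingLong((Event e) -> e.timeMs));
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data, one list per trace.
        List<Event> cpu = Arrays.asList(
                new Event(5, "cpu", "init: user time sample"),
                new Event(40, "cpu", "udevd: system time sample"));
        List<Event> opens = Arrays.asList(new Event(12, "sys_open", "init opens a file"));
        List<Event> blockIo = Arrays.asList(new Event(15, "blockio", "read request issued"));

        for (Event e : correlate(Arrays.asList(cpu, opens, blockIo))) {
            System.out.println(e);
        }
    }
}
```

Once the events share one timeline, anything that cannot be drawn on the graph can still be looked up in the text output by its timestamp.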
  • The bootlimn (as it has been named tentatively) installs and uninstalls very cleanly.
  • A jar file is packaged along with the source code. It can be run simply by executing ./bootlimn.sh .
  • A build.xml (to be used with ant) is also available.
  • Failure of one part of bootlimn does not crash the entire application; it still tries to give as much output as possible. For example, if one of the XML files cannot be parsed, the others are not affected, and neither is the renderer module unless that file is *very* critical for creating the graph. Even if the generated XML is malformed, bootlimn still renders up to the first occurrence of an improper entry.
  • The user can specify where to stop by changing the -c option in stpcaller.
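The per-file failure isolation mentioned in the list above can be sketched roughly like this. It is a minimal Java illustration and not bootlimn's actual code: the class name `IsolatedParsing`, the method `parseAll` and the file names are hypothetical, and the standard JAXP DOM parser stands in for whatever parser bootlimn really uses. Each input is parsed inside its own try/catch, so one malformed document is reported and skipped while the rest are still passed on.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only.
final class IsolatedParsing {

    /** Parses each named XML string independently; failures are reported and skipped. */
    static Map<String, Document> parseAll(Map<String, String> xmlByName) {
        Map<String, Document> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : xmlByName.entrySet()) {
            try {
                DocumentBuilder builder =
                        DocumentBuilderFactory.newInstance().newDocumentBuilder();
                byte[] bytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
                parsed.put(entry.getKey(), builder.parse(new ByteArrayInputStream(bytes)));
            } catch (Exception e) {
                // One bad input must not stop the others from being processed.
                System.err.println("Skipping " + entry.getKey() + ": " + e.getMessage());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Made-up inputs; the second one is deliberately truncated.
        Map<String, String> inputs = new LinkedHashMap<>();
        inputs.put("cpu.xml", "<cpu><sample t=\"10\"/></cpu>");
        inputs.put("blockio.xml", "<blockio><req");
        inputs.put("open.xml", "<open/>");

        System.out.println("Parsed: " + parseAll(inputs).keySet());
    }
}
```

This only covers the "one bad file does not affect the others" part. Rendering up to the first improper entry inside a single file would need a streaming parser instead of a DOM parser, which is not shown here.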

Known bugs (taken care of):
  • The XML created sometimes has negative timestamps (see update 3).
  • The IOblock parser gives errors at times, again because of the malformed XML (see update 1).

What needs more work:
  • All the unique processes are rendered. The user as of now has no control over the degree of detail.
  • The state transitions can be improved.
  • The CPU wait stats can also be added (the code is already present in the stps; it just requires a slight modification to the XSD and corresponding changes to the parser and renderer.. will do it soon).
  • Other image formats need to be supported (next task at hand after debugging).
  • Header information needs to be added (this will be done soon too).

And anything more that is suggested when the code is reviewed (the code can be checked out from here).


Update 1: A temporary workaround is to define the sector, bdev etc. fields (which get some funny values on very rare occasions) as a String type, so that a single instance of screwed-up XML does not hinder the parsing of the entire file. It is not the best solution, just a stopgap, but it is safe because no further calculation is based on these fields and the only function that uses them is a 'tostring', which converts them to a string anyway.
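To show why the String workaround helps, here is a small Java sketch. The class and method names are hypothetical, and in bootlimn the field types are declared in the XSD rather than in hand-written code like this. A numeric field rejects an odd value outright, whereas a String field accepts whatever the trace produced and simply echoes it back.

```java
// Hypothetical record type for illustration; not taken from the bootlimn source.
final class BlockIoRecordSketch {
    // Kept as raw text: the values are only ever printed, never used in arithmetic.
    final String sector;
    final String bdev;

    BlockIoRecordSketch(String sector, String bdev) {
        this.sector = sector;
        this.bdev = bdev;
    }

    /** Strict alternative: returns false for anything Long.parseLong would reject. */
    static boolean parsesAsLong(String value) {
        try {
            Long.parseLong(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    @Override
    public String toString() {
        return "sector=" + sector + " bdev=" + bdev;
    }

    public static void main(String[] args) {
        String odd = "2048x"; // made-up example of a "funny" value
        System.out.println("numeric field would accept it: " + parsesAsLong(odd));
        System.out.println("string field keeps it as-is:   " + new BlockIoRecordSketch(odd, "8"));
    }
}
```

The trade-off is that bad values are no longer caught at parse time, which is acceptable only as long as nothing computes with these fields.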

Update 2: Egads!!! Revision 14 is sort of broken.. I am rectifying it.. please don't check out the code now.

Update 3: The negative timestamps error has been solved ***phew***. The code can now be checked out. Logging now goes to disk rather than being held in memory (see comments for further details).