Re: profiling results

Earl Hood <earl@xxxxxxxxxxxx> · Fri, 13 Apr 2007 15:57:40 -0500

(I'm back from out-of-town and catching up on email)

On April 12, 2007 at 20:34, "Jeff Breidenbach" wrote:

> Does this look reasonable to people? Anything obviously
> weird?

> Total Elapsed Time = 7.334524 Seconds
>   User+System Time = 3.794524 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>  20.3   0.773  1.455      7   0.1104 0.2079  mhonarc::sort_messages

Sorting does not surprise me.  MHonArc does not keep a persistent
sorted data structure, so it re-sorts every time new messages are added
(under the assumption that messages may arrive in arbitrary order).

This can definitely be painful if one updates an archive on-the-fly
versus using a queued-batch model.  In the latter, multiple messages
may be added in a single invocation, avoiding a re-sort for
each message added.
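
To illustrate the difference, here is a minimal Python sketch (not
MHonArc's actual Perl code).  The Archive class, its add() method, and
the (timestamp, id) tuples are all hypothetical stand-ins; the only
point is that the number of full sorts equals the number of
invocations, not the number of messages.

```python
class Archive:
    """Hypothetical stand-in for an archive with no persistent sorted structure."""

    def __init__(self):
        self.messages = []   # list of (timestamp, message_id) tuples
        self.sort_calls = 0  # how many full sorts have been performed

    def add(self, new_messages):
        """Model one invocation: add messages, then re-sort the whole archive."""
        self.messages.extend(new_messages)
        self.messages.sort(key=lambda m: m[0])
        self.sort_calls += 1


def per_message(incoming):
    """On-the-fly model: one invocation (and one full sort) per message."""
    archive = Archive()
    for msg in incoming:
        archive.add([msg])
    return archive


def batched(incoming):
    """Queued-batch model: all queued messages added in a single invocation."""
    archive = Archive()
    archive.add(list(incoming))
    return archive


# 100 messages with distinct, out-of-order timestamps (made-up values).
incoming = [(1176400000 + (i * 7919) % 1000, f"msg{i}") for i in range(100)]

on_the_fly = per_message(incoming)
queued = batched(incoming)
```

Both approaches end with the same sorted archive, but the on-the-fly
model pays for a full sort on every message.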

Do you invoke mhonarc for each new message for a list or do you
queue up messages for a given list (over a specified period) before
invoking mhonarc for the list?

Note, sorting includes thread sorting, which is the most complicated.
Some speed increase may be possible by disabling SUBJECTTHREADS
(this is mentioned in the Performance Tips doc).  However, disabling
SUBJECTTHREADS may have a usability impact for messages that fail
to define the proper reference headers.

For large scale usage, a (robust) persistent data structure is
needed.  However, such a structure would require a redesign of
mhonarc internals.

>  18.6   0.707  0.707 446811   0.0000 0.0000  mhonarc::get_time_from_index

This is due to the Perl 4 legacy code base.  The unique index for
each message also contains the date-time stamp applicable for the
message.

It may be possible to add a new hash that maintains just the date-time
information, avoiding the split() operation each time get_time_from_index
is invoked.  This will increase the database size (and the in-memory
size), but the increase may be negligible in the grand scheme of
things.
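
A rough Python sketch of that idea follows (again, not MHonArc code).
The key layout "timestamp + separator + message number", the SEP
value, and all function and variable names are assumptions made for
illustration only.

```python
SEP = "\x00"  # hypothetical separator between timestamp and message number


def make_index(timestamp, msgnum):
    """Build a combined index key that embeds the timestamp (assumed layout)."""
    return f"{timestamp}{SEP}{msgnum}"


def time_from_index_split(index):
    """Current approach: split the key on every call to recover the time."""
    return int(index.split(SEP)[0])


# Proposed extra hash: index key -> timestamp, filled in once per message.
index_time = {}


def add_message(timestamp, msgnum):
    """Register a message and record its time in the extra hash."""
    index = make_index(timestamp, msgnum)
    index_time[index] = timestamp
    return index


def time_from_index_hashed(index):
    """Proposed approach: a single hash lookup, no split()."""
    return index_time[index]


# Made-up timestamps and message numbers.
keys = [add_message(1176400000 + n, n) for n in range(5)]
```

The trade-off is the one noted above: one extra hash entry per message
in exchange for dropping a split() from every lookup.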

I think when mhonarc was first written (and it was not called mhonarc),
I favored reducing the number of hashes used over performance
gains (performance was not a real issue, since I did not foresee
mhonarc being used at such a large scale).

>  14.7   0.558  0.558   4805   0.0001 0.0001  MHonArc::RFC822::tokenise

This code is non-trivial since it does full RFC-822 parsing.
Older versions of mhonarc used a simpler parsing routine,
but a more robust routine was required as mhonarc evolved (and
to address bugs in email name and address extraction).

>  14.4   0.548  2.264  13800   0.0000 0.0002  mhonarc::replace_li_var

Minimizing variable usage in resource files is the main way to
reduce the calls to this routine.  However, resource file maintenance
concerns may trump any performance gain.

>  5.09   0.193  0.193  13037   0.0000 0.0000  mhonarc::compute_msg_pos

This is part of resource variable resolution.  See
<http://www.mhonarc.org/MHonArc/doc/guides/performance.html#mesg_spec>
on how to minimize the performance impact of this routine.

>  4.77   0.181  0.561   9538   0.0000 0.0001  MHonArc::UTF8::Encode::clip

This is actually more efficient than using the default CHARSETCONVERTERS
model.

I.e., encoding everything to UTF-8 is more efficient (assuming
proper resource settings).  In MHonArc's default configuration,
charset conversion can be very costly when dealing with non-ASCII
messages.

Years ago, I discovered this while doing my own profiling tests
on MHonArc, after performance complaints were raised when more
extensive charset routines were added.

>  4.48   0.170  0.319      1   0.1700 0.3193  mhonarc::get_resources

This loads in the resource file(s).

--ewh