Re: Feature request

Earl Hood <earl@xxxxxxxxxxxx> · Tue, 20 Dec 2005 16:37:10 -0600

On December 20, 2005 at 15:27, Ken Bass wrote:

>   When I saw that API callback the other day, I was initially excited. 
> But when I looked in detail, it did not seem like I had access to the 
> message header or body. It would have been useful if the API passed in 
> some type of hash/assoc array so user defined fields/comments could be 
> passed back into the message being converted. I had to abandon this 
> route.

Which API calls did you look at? $mhonarc::CBMessageConverted
provides header info along with the filename info.  I'm guessing that
CBMessageConverted may be too late for you?  It appears you want the
filename info in one of the header-read-based callbacks.  Correct?

> > Agree on the last part.  If you are processing news spools, why
> > are there no message-ids?
> 
>   That is my dilemma. My archive is from 1996 to present. For certain 
> years the messages were from a mailing list and other years a newsgroup. 
> I recently reorganized/expanded my archive and upgraded to the latest 
> version. In the process, I added hundreds of thousands of messages. When 
> I viewed the chronological view, there were some entries that had empty 
> bodies with subject of '[no subject]', author 'Unknown', with today's 
> date. Without a way to map them, I have no way to trace to the input and 
> see what is wrong. For the cases of no message IDs, I found some 'temp 
> files' among these messages and some 0 length messages. Those files were 
> processed by mhonarc and resulted in some of the mystery entries.

One option is to add a message-id to each message before passing
the data to mhonarc.  I.e., do some pre-processing on the data to
clean things up before passing it to mhonarc.  The pre-processing could
include deleting 0 byte files.
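
A minimal sketch of such a pre-processing pass, written in Python and
not part of mhonarc.  It assumes one message per file (as in a news
spool or MH folder, with no mbox "From " line), and the function name
preprocess and the domain archive.invalid are made up for illustration.
The generated message-id is derived from the file's path, so an entry
in the archive can be traced back to its input file.

```python
import hashlib
import os
from email.parser import BytesHeaderParser


def preprocess(spool_dir, domain="archive.invalid"):
    """Delete 0-byte files and add a Message-ID to files lacking one.

    Returns (removed_paths, [(path, generated_message_id), ...]).
    Assumes each file holds exactly one message.
    """
    removed, tagged = [], []
    for root, _dirs, files in os.walk(spool_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
                continue
            with open(path, "rb") as fh:
                raw = fh.read()
            headers = BytesHeaderParser().parsebytes(raw)
            if headers["Message-ID"] is None:
                # Hypothetical scheme: hash the relative path so the id
                # is stable across runs and maps back to the input file.
                rel = os.path.relpath(path, spool_dir)
                digest = hashlib.sha1(rel.encode("utf-8")).hexdigest()
                msgid = "<%s@%s>" % (digest, domain)
                with open(path, "wb") as fh:
                    fh.write(b"Message-ID: " + msgid.encode("ascii") + b"\n" + raw)
                tagged.append((path, msgid))
    return removed, tagged
```

Run it on a copy of the spool first, since it rewrites files in place.
The returned list of (path, message-id) pairs can be saved as a lookup
table for later.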

> Some of the other 'input problems' I encountered during this archive 
> rebuild were:
> 
> Warning: Unrecognized character set: x-user-defined

See charsetaliases and charsetconverters.  If a charset is not
recognized, mhonarc falls back to the default charset, us-ascii (which
can be changed via a resource).

> Warning: Unrecognized time zone, ","

I'm guessing a strange date format.

> Warning: Unrecognized time zone, "-5:00"

Numeric time offsets are not supposed to have a ':'.
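
If you are pre-processing anyway, offsets like this can be rewritten
into the four-digit form (sign followed by hhmm) before mhonarc sees
them.  A rough sketch in Python; normalize_offset is a made-up helper,
not anything in mhonarc, and it only handles the numeric case:

```python
import re

# Sign, one or two hour digits, optional colon, two minute digits.
_OFFSET = re.compile(r"^([+-])(\d{1,2}):?(\d{2})$")


def normalize_offset(zone):
    """Return the offset as sign + hhmm, or None if it is not numeric."""
    m = _OFFSET.match(zone.strip())
    if m is None:
        return None
    sign, hours, minutes = m.groups()
    return "%s%02d%s" % (sign, int(hours), minutes)


print(normalize_offset("-5:00"))   # -0500
print(normalize_offset("+05:30"))  # +0530
print(normalize_offset(","))       # None
```

Anything that returns None (like the "," case above) still needs a
look at the original Date header by hand.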

> Warning: No end boundary delimiter found in message body

A MIME multipart with no end boundary.  The code is pretty good
at dealing with this, so usually the warning can be ignored.

> Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl 
> line 91, <GEN70164> line 18.

Base64 encoded data is badly formatted.
(Side note: You should upgrade from Perl 5.8.0.  I think 5.8.0 is
kind of buggy).

> Even with a message id available, grepping through hundreds of thousands 
> of messages for each warning takes a while and really slowed down the 
> process.

Yep.
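
One way to cut down on the grepping is to scan the inputs once and
keep a message-id to file index around, then look warnings up in that.
A sketch in Python, again assuming one message per file;
build_msgid_index is a hypothetical name, not an mhonarc facility:

```python
import os
from email.parser import BytesHeaderParser


def build_msgid_index(spool_dir):
    """Map each Message-ID value to the list of files that carry it."""
    index = {}
    parser = BytesHeaderParser()
    for root, _dirs, files in os.walk(spool_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                headers = parser.parse(fh)
            msgid = headers["Message-ID"]
            if msgid:
                # str() guards against header values returned as objects.
                index.setdefault(str(msgid).strip(), []).append(path)
    return index
```

A list is kept per id because duplicate message-ids do turn up in
merged list/newsgroup archives.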

I agree that a more verbose operating mode will be useful, like a -debug
that prints out much more detail about what is going on.  I've had
a few cases myself where it would have been handy.

Of course, such a mode is not handy in cases where a problem is discovered
later on and you want to know which input message maps to a given
HTML file.

--ewh