Re: Feature request

Ken Bass <kbass@xxxxxxxxxxx> · Tue, 20 Dec 2005 15:27:45 -0500

Earl Hood wrote:

On December 18, 2005 at 00:11, Ken Bass wrote:

2) Add a resource that contains the 'input' filename. Basically, I would
like there to be a mapping between the generated data and the input
file. I am thinking MHonArc should add something like:

<!--X-Input-Filename: /var/archive/mbox/file1.txt -->

This has been discussed in the past.  The input filename is
not always known, and it varies based on the style of input: mbox
or mh.  What does exist are callbacks (see the API appendix and
$mhonarc::CBMessageConverted) that provide hooks.  The callbacks
were added due to a request by a user.

When I saw that API callback the other day, I was initially excited.
But when I looked in detail, it did not seem like I had access to the
message header or body. It would have been useful if the API passed in
some type of hash/assoc array so user-defined fields/comments could be
passed back into the message being converted. I had to abandon this
route.

Another option I started to implement was to add an '$INPUTFILE$'
resource variable (kind of like $MSG$ but for the input). This would
keep the feature optional and allow the name to be used in many
ways: meta tags, comments, URLs, etc. Users could simply add their own
tags however they want in their output.

I also looked at annotation, but that didn't pan out either unless
messages were added one at a time.

When converting large archives, sometimes there are errors in the
processing, and at least by examining the HTML source you could see
which input file caused them. The message ID is not always useful,
especially when MHonArc generates its own ID.

Agree on the last part.  If you are processing news spools, why
are there no message-ids?

That is my dilemma. My archive is from 1996 to present. For certain
years the messages were from a mailing list and for other years a
newsgroup. I recently reorganized/expanded my archive and upgraded to
the latest version. In the process, I added hundreds of thousands of
messages. When I viewed the chronological view, there were some entries
that had empty bodies with a subject of '[no subject]', author
'Unknown', and today's date. Without a way to map them, I have no way to
trace them to the input and see what is wrong. For the cases with no
message IDs, I found some 'temp files' among these messages and some
0-length messages. Those files were processed by mhonarc and resulted in
some of the mystery entries.
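
A pre-scan of the input tree is one way to catch the 0-length files before
they reach mhonarc. The following is only a minimal sketch in Python,
assuming one message per file under a root directory; the function name
find_empty_files is hypothetical and not part of MHonArc:

```python
import os


def find_empty_files(root):
    """Return sorted paths of zero-length regular files under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip anything that is not a regular file (e.g. broken symlinks).
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

Calling find_empty_files on the archive root returns the list of files that
would likely turn into '[no subject]' entries. It does not detect the 'temp
files', since their naming depends on whatever produced them.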

Some of the other 'input problems' I encountered during this archive
rebuild were:

Warning: Unrecognized character set: x-user-defined
Warning: Unrecognized time zone, ","
Warning: Could not parse date for message
Warning: Unrecognized character set: utf-7
Warning: Unrecognized character set: ibm850
Warning: Unrecognized character set: ibm864
Warning: Unrecognized time zone, "-5:00"
Warning: Bad year (1956) using current
Warning: No end boundary delimiter found in message body

Premature end of base64 data at /usr/lib/perl5/site_perl/5.8.0/base64.pl line 91, <GEN70164> line 18.

Even with a message ID available, grepping through hundreds of thousands
of messages for each warning takes a while and really slowed down the
process.
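
Where a message ID is present, one alternative to grepping once per warning
is to walk the input tree a single time and build a lookup table. This is a
rough sketch in Python, again assuming one message per file; the function
name build_message_id_index is hypothetical, and IDs that MHonArc generates
itself will not appear in the input and so cannot be found this way:

```python
import os
from email.parser import BytesHeaderParser


def build_message_id_index(root):
    """Map each Message-ID found under root to the file that contains it."""
    parser = BytesHeaderParser()
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fp:
                # Only the header block is parsed; the body is left alone.
                headers = parser.parse(fp)
            msg_id = headers.get("Message-ID")
            if msg_id is not None:
                # Keep the first file seen for an ID; later duplicates are ignored.
                index.setdefault(str(msg_id).strip(), path)
    return index
```

Each later lookup is then a dictionary access instead of another pass over
the whole archive.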

For the normal everyday additions from a news spool, I agree that there
should be a message ID. In my case I was processing older messages,
which led me down this path.

A problem with tracking the filename is it increases the amount of
data stored in the dbfile.  The callback API could be used to track
the info for those interested.  Alternatively, mhonarc could be
modified to have diagnostic data in message pages that are preserved
during edits so the info is not lost (which would require a new
delimiting token to preserve such info).

I modified my mhonarc and added a '%InputFile' hash which stores the
filename. I set it after the read_mail_header() call when the input is a
directory. During output_mail(), if it is defined (which it won't be for
single adds or adds from stdin), I add a
'<!--X-InputFile: /var/archive/mbox/file1.txt -->' comment. I did not add
it to the database, which I guess means it could not be recreated? This
could probably be used with a $INPUTFILE$ resource variable, but I could
not understand the code with respect to mapping the index to the key
during variable substitution.
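
With that comment in the generated pages, the page-to-input mapping can be
pulled back out of the HTML. This is a small sketch in Python, assuming the
pages end in .html and carry the X-InputFile comment from my local patch
(stock MHonArc does not write it); the function name map_pages_to_inputs is
hypothetical:

```python
import os
import re

# Matches the comment written by the local patch, e.g.
# <!--X-InputFile: /var/archive/mbox/file1.txt -->
INPUT_FILE_RE = re.compile(r"<!--X-InputFile:\s*(.*?)\s*-->")


def map_pages_to_inputs(html_dir):
    """Return {html page path: input filename} for pages carrying the comment."""
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(html_dir):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fp:
                match = INPUT_FILE_RE.search(fp.read())
            if match:
                mapping[path] = match.group(1)
    return mapping
```

Pages added singly or from stdin have no comment and are simply left out of
the result.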

Of course, if such changes were made, the feature would be optional
since revealing such information could be a security concern for
users.

I thought about that too, but in my specific case it did not bother me.
The filenames are just numbers organized in Year/Mon directories (though
it does expose the name of a user account). Others might be concerned
with this, of course. With this mapping, when I see a problem message in
the archive, I simply 'view page source' and can see immediately what
file caused it. Due to the size of the archive, I'm considering putting
a 'report this message' link in the TOPLINKS of each message so that
users can report odd stuff (or illegal content/porn/etc). Being able to
map from the page the user visits to the original file would be helpful
in this case also.