Re: Duplicate Message Clean-up

"James Roman" <jay@directlink.net> · Thu, 6 Sep 2001 15:38:21 -0500

I can take the 16,000 individual messages and add them to a separate (empty)
archive.  Doing this weeds out all the extra messages, and only the 4,200
messages appear in the archive.  The problem is that MHonArc cannot properly
handle the messages in this form (HTML).

The messages got duplicated because I never empty the source mailbox and the
archive database got reset.  So, when the daily processing ran, it
interpreted all the messages in the mailbox as new messages.  This ended up
creating all of the duplicate message files, though only 4,200 messages show
up via the scan command.

Jay

http://www.shadow-lands.com/orb
http://www.shadow-lands.com/sml

----- Original Message -----
From: "Earl Hood" <ehood@hydra.acs.uci.edu>
To: <mhonarc@ncsa.uiuc.edu>
Sent: Thursday, September 06, 2001 1:38 PM
Subject: Re: Duplicate Message Clean-up

> On September 6, 2001 at 07:35, "James Roman" wrote:
>
> > I have over 16,000 message files in my current database and only about
> > 4,200 of those messages are valid.  The rest are all duplicates.  I have
> > read the messages about testing for duplicates and I know what the
> > problem was.  Now for the hard part.
> >
> > Does anyone know how I can quickly clean-up all of the extra messages?
>
> Since the "dups" have separate message-ids (which must be the case
> since MHonArc would have prevented the duplicates from being archived
> if message-ids matched), the real task is gathering the list of dups.
> The RMM resource can be used to remove messages, but you must come up
> with the list.
>
> A possible approach to your problem is to write a Perl script that
> takes each message and computes its MD5 checksum (there is a Perl module
> that does MD5 checksums), BUT only computing the checksum for the data
> between the following comment declarations in each message page:
>
> <!--X-Head-of-Message-->
> ...
> <!--X-Head-of-Message-End-->
> and,
> <!--X-Body-of-Message-->
> ...
> <!--X-Body-of-Message-End-->
>
> Just maintain a hash where the keys are the MD5 checksums and the
> values are the files.  Therefore, when a checksum is computed for a
> message page, the hash can be checked to see if there is another file
> that has the same checksum.  If so, you have a duplicate.
>
> Note, the above is only useful if the dups you are talking about are
> real dups, i.e. byte-for-byte they are the same with the exception of
> the message-ID given to them.  Also, the script logic could be
> complicated if you have dups, but some other inconsequential message
> headers could vary.  If so, you may have to play with how you handle
> the <!--X-Head-of-Message--> part.
>
> --ewh
>
>
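For anyone who finds this thread later, the checksum approach quoted above
could be sketched roughly as follows.  This is only an illustration, written
in Python rather than the Perl that was suggested.  The msg*.html file
pattern, the function names, and the rule that drops "Message-id" lines from
the head section are all assumptions made for the example, not anything
MHonArc itself provides.  Pages with neither marker pair are skipped so they
do not all hash to the same value.

```python
import hashlib
import re
import sys
from collections import defaultdict
from pathlib import Path

# Comment declarations quoted in the message above: (start, end, is_head).
SECTIONS = [
    ("<!--X-Head-of-Message-->", "<!--X-Head-of-Message-End-->", True),
    ("<!--X-Body-of-Message-->", "<!--X-Body-of-Message-End-->", False),
]

# Assumption: head lines mentioning the message-id may differ between
# otherwise identical copies, so they are left out of the checksum.
IGNORE_HEAD_LINE = re.compile(r"message-id", re.IGNORECASE)


def extract_sections(page):
    """Return the head and body text of a message page, or None if absent."""
    parts = []
    for start, end, is_head in SECTIONS:
        i = page.find(start)
        if i < 0:
            continue
        i += len(start)
        j = page.find(end, i)
        if j < 0:
            continue
        text = page[i:j]
        if is_head:
            kept = [ln for ln in text.splitlines()
                    if not IGNORE_HEAD_LINE.search(ln)]
            text = "\n".join(kept)
        parts.append(text)
    return "\n".join(parts) if parts else None


def find_duplicates(pages):
    """Map each first-seen file name to the later files with the same MD5.

    `pages` maps file name -> page text (hypothetical input shape).
    """
    seen = {}                      # checksum -> first file with that checksum
    dups = defaultdict(list)
    for name in sorted(pages):     # sorted, so the lowest-named copy is kept
        data = extract_sections(pages[name])
        if data is None:           # not a message page (e.g. an index page)
            continue
        digest = hashlib.md5(data.encode("utf-8")).hexdigest()
        if digest in seen:
            dups[seen[digest]].append(name)
        else:
            seen[digest] = name
    return dict(dups)


if __name__ == "__main__" and len(sys.argv) > 1:
    archive = Path(sys.argv[1])    # archive directory given on the command line
    pages = {p.name: p.read_text(errors="replace")
             for p in archive.glob("msg*.html")}   # assumed file pattern
    for kept, extras in find_duplicates(pages).items():
        print(kept, "<-", " ".join(extras))
```

Each output line names the copy that would be kept, followed by the files
that hash the same.  That second group is the list of candidates to check by
hand and then remove with RMM.  If other headers vary between copies, the
IGNORE_HEAD_LINE pattern is the place to adjust.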