Re: cleaning up yahoogroups messages?
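I'm guessing you're hitting unescaped From lines. In an mbox file a line
beginning with "From " marks the start of a new message, so one inside a
body cuts that message short.
This is my script for processing individual messages. It escapes such
lines in the body and leaves the headers alone.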
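#!/bin/sh
# Escape body lines that start with "From" or "." in each message file.
pid=$$
if [ $# -eq 0 ]
then
    exit
fi
for f in "$@"
do
    # Headers: everything up to and including the first blank line.
    sed -e '/^$/ q' "$f" >head.$pid
    # Body: everything after the first blank line.
    sed -e '1,/^$/ d' "$f" >tmp.$pid
    # Escape leading "From" and leading dots in the body only.
    sed -e 's/^From/>From/' -e 's/^\./ \./' tmp.$pid >body.$pid
    # Write the result back over the original message file.
    cat head.$pid body.$pid >"$f"
done
rm -f head.$pid tmp.$pid body.$pid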
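The same escaping can probably be done in one sed pass without the temp files. This is only a sketch, assuming a POSIX sed; the function name escape_body is made up for illustration and is not part of the script above. It treats everything through the first blank line as headers, same as the script does.

```shell
# escape_body is a hypothetical helper name, not part of the original script.
# Reads one message on stdin, writes it to stdout with body lines escaped.
escape_body() {
    # Lines 1 through the first blank line are headers; leave them alone.
    # On every other line, escape a leading "From" and a leading dot.
    sed -e '1,/^$/!{' -e 's/^From/>From/' -e 's/^\./ ./' -e '}'
}

# Example: the "From:" header is untouched, the body lines are escaped.
printf 'From: a@example.com\nSubject: hi\n\nFrom here on\n.dot\n' | escape_body
```

To rewrite a file in place you would still need to go through a temporary file, for example escape_body <msg >msg.new && mv msg.new msg.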
>Hi Folks,
>
>I've been trying to migrate a collection of messages from yahoogroups to
>sympa (which uses mhonarc as its archiving engine).
>
>There's a great little script, yahoo2mbox, that pulls messages from
>yahoogroups and aggregates them into an mbox file - ideal for processing
>by mhonarc.
>
>Unfortunately, when I run mhonarc on the mbox file, it seems to cut out
>the bodies of a lot of, but not all of the messages - leaving the header
>intact. It seems like messages that originated with MS Outlook are
>particularly likely to end up with empty bodies.
>
>Now I've read the archives of this list, and this seems to be a known
>problem with mhonarc filtering out malformed HTML, but I haven't seen
>any recent traffic indicating a solution of any sort.
>
>So... has anybody come up with a straightforward way to clean up an mbox
>file sufficiently for mhonarc to process? (e.g. a way to run the mbox
>file through HTML Tidy or some such)? Or can anybody offer some
>suggestions, recipes, recent experiences, etc.?
>
>Thanks much,
>
>Miles Fidelman
>
>
--
PEG Manager
pegmgr at peg dot com