
Re: XML, control characters and MHonArc



On Fri, 05 Oct 2007, Earl Hood <earl@xxxxxxxxxxxx> wrote:

On October 5, 2007 at 08:45, Chris Hastie wrote:

Mostly this is working fine, but I have the occasional problem with
control characters in badly formatted emails. Specifically, a QP email
with the string =12 - MHonArc outputs the associated control character
to the XML. These characters are not valid in XML and the XML parser
chokes on them.
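
To make the failure concrete, here is a minimal sketch in Python (not MHonArc code). It assumes the message body is decoded from quoted-printable and dropped unchanged into a hypothetical <body> element; the sample text and the helper names are made up for illustration.

```python
import quopri
import xml.etree.ElementTree as ET


def qp_to_xml_fragment(raw: bytes) -> str:
    """Decode a quoted-printable body and wrap it in a hypothetical <body> element."""
    # "=12" decodes to the single byte 0x12 (a C0 control character).
    text = quopri.decodestring(raw).decode("latin-1")
    return "<body>" + text + "</body>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if an XML parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


fragment = qp_to_xml_fragment(b"price list=12attached")
print(repr(fragment))
print(is_well_formed(fragment))
```

The decoded fragment contains the raw 0x12 character, which XML 1.0 does not allow anywhere in a document, so the parser rejects it.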

Have you tried out the TEXTENCODE resource to see how the
control characters are handled?  If generating XML, you may
want to use TEXTENCODE to normalize all character data to UTF-8.
See manual for examples.

I did experiment with TEXTENCODE. It produced some surprising results, but I may
have been getting the wrong end of the stick.

I started by taking everything to UTF-8 and then through mhonarc::htmlize. My list
is UK based, so '£' occurs quite often. This looked fine in the XML
output (viewing with Notepad++), but the final output failed to display it
correctly. I presumed that some issue with PHP reading UTF-8 was to blame. It
was noticeable, however, that Notepad++ reported the file as being encoded as
ANSI.

I then tried taking everything to UTF-8 with TEXTENCODE and passing it through
MHonArc::CharEnt::str2sgml. The result was that my '£' got encoded as something
very odd, &#xFFFD; IIRC.
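
One possible explanation for both symptoms, offered as a guess rather than a diagnosis of MHonArc itself, is a mismatch between the bytes actually present and the encoding a later step assumes. The Python sketch below only demonstrates the general effect; the helper to_numeric_entities is hypothetical and is not str2sgml.

```python
pound = "\u00a3"  # the '£' character

utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3

# Case 1: a lone Latin-1 byte read as UTF-8 is invalid, so a lenient
# decoder substitutes U+FFFD, the Unicode replacement character.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: genuine UTF-8 bytes shown by a viewer that assumes a
# single-byte Windows code page appear as two characters.
misread_as_cp1252 = utf8_bytes.decode("cp1252")


def to_numeric_entities(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as hex character references."""
    return "".join(ch if ord(ch) < 128 else "&#x%04X;" % ord(ch) for ch in text)


print(to_numeric_entities(misread_as_utf8))
print(misread_as_cp1252)
```

If the '£' was still a single Latin-1 byte when it reached the entity conversion, case 1 would account for the replacement character, and it would also fit with Notepad++ reporting the file as ANSI.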

I'm sure outputting UTF-8 is the 'correct' way to go, it just seems to cause me
some headaches with later processing.

I see a quick mention of a similar problem back in 2000:
http://www.mhonarc.org/archive/html/mhonarc-users/2000-07/msg00040.html

Have things changed? Is there any way short of writing a custom filter,
or hacking/patching an existing one, that I can persuade MHonArc to
strip out XML illegal control characters?

Check the minimal API documented in an appendix of the manual.  There
is a callback you can register after a message has been converted.
Your callback can check for invalid characters and remove them.


Thanks, I'll take a look at that. At the moment I'm stripping non-legal XML
characters in my PHP script before passing the XML to the parser.
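
For anyone wanting to do the same kind of stripping, here is a minimal sketch of the idea in Python rather than PHP. It assumes the text has already been decoded to Unicode and removes everything outside the character ranges that XML 1.0 permits; the function name is hypothetical.

```python
import re

# XML 1.0 allows only: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches every character outside those ranges.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)


print(repr(strip_illegal_xml_chars("price list\x12attached")))
```

Tab, line feed and carriage return are kept, as are ordinary non-ASCII characters such as '£'; only the disallowed control characters and other excluded code points are dropped.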

P.S. Please post your resource settings for creating XML.  Others
may be interested and it may be something to include in the docs.

Will try to tidy them up enough to be useful to someone else in the next day or
two. Where should they be posted - to this list? As attachments?

Thanks
--
Chris Hastie


