Nick Kew wrote:
On Wed, 08 Nov 2006 00:48:39 -0500 mickg <mickg@xxxxxxxxx> wrote:

Just to put my money where my mouth is, I have implemented a (stupid) prototype that does the following: if the detected charset is not one that libxml2 handles natively, a recompiled version of mod_proxy_html now uses iconv (eventually via the xmlFindCharEncodingHandler function) to convert from the source encoding to UTF-8.

Interesting. You've gone one up on my aliasing proposal, for what looks like rather less work than I thought that would take. I might snarf the basic idea for Version 3.
Do you want the full working code once I clean up the memory problem? It is, after all, GPL, so it would be in good spirit for me to release the modified source. :)

Although, to be truly honest, what the thing is doing IS somewhat backwards. The dataflow would be as follows (I am more familiar with Python code, as the next snippet will show):

    data comes in
    if ctxt.encoder == None:
        obtain charset
        if need iconv to convert charset:
            ctxt.encoder = charset
            return enc = UTF-8
        else:
            return enc

    prior to processing buf:
    if ctxt.encoder != None:
        convert(buf)    # convert if encoder is set (non-null)

This guarantees that the data is either in an encoding known to libxml2, or was UTF-8 to begin with, or was converted to UTF-8, or the conversion failed miserably (and the miserable failure was logged).
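For anyone who wants to poke at that dataflow outside of Apache, here is a rough, runnable Python sketch of it. It is not the mod_proxy_html patch: Python's codecs module stands in for iconv, and the Ctxt class, the function names, and the NATIVE_ENCODINGS list are all made up for illustration (what libxml2 really handles natively depends on how it was built).

```python
import codecs

# Assumption: a stand-in for "encodings libxml2 handles natively".
# Names are normalised through codecs.lookup so aliases compare equal.
NATIVE_ENCODINGS = {
    codecs.lookup(n).name for n in ("utf-8", "utf-16", "iso-8859-1", "ascii")
}


class Ctxt:
    """Hypothetical per-request context, mirroring ctxt in the pseudocode."""

    def __init__(self):
        self.encoder = None  # source charset, set only when conversion is needed
        self.decoder = None  # incremental decoder for that charset


def choose_encoding(ctxt, charset):
    """Return the encoding name to hand to the parser."""
    name = codecs.lookup(charset).name  # raises LookupError for unknown charsets
    if name in NATIVE_ENCODINGS:
        return name  # the parser can cope on its own
    # Conversion needed: remember the source charset and report UTF-8.
    ctxt.encoder = name
    ctxt.decoder = codecs.getincrementaldecoder(name)(errors="replace")
    return "utf-8"


def prepare(ctxt, buf, final=False):
    """Convert buf to UTF-8 if an encoder is set; otherwise pass it through."""
    if ctxt.encoder is None:
        return buf
    # The incremental decoder keeps state between calls, so a multi-byte
    # character split across two buffers is still decoded correctly.
    return ctxt.decoder.decode(buf, final).encode("utf-8")
```

Calling choose_encoding(ctxt, "windows-1251") returns "utf-8" and arms the converter, after which each incoming buffer goes through prepare() before it is parsed. With errors="replace" a bad byte becomes U+FFFD instead of aborting the request; logging the failure, as the prototype does, is left out of the sketch.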
If no encoding info is specified, it assumes windows-1251 (yes, stupid, but still).

But not stupid if we make it a configurable default!
Yeah, preferably via a directive such as HTMLSourceDefaultEnc windows-1251 or some such.
It does work on my _own_ website, where it quite happily converts win-1251 to utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.

See http://www.apachetutor.org/dev/pools for an easy way to deal with the memory.

And a great many thanks to Nick Kew for getting me off my lazy ... to start coding (which, honestly, I am better at than administering systems).

:-)

BTW, I still have no clue why I cannot do this with mod_charset_lite.

Neither have I. But a closer look at mod_charset_lite has been on my TODO list for so long it's probably on a permanent back-burner. Did you also look at the full mod_charset? AIUI it was written by Russian developers, so Cyrillic was presumably important to them.
The thing about mod_charset is that it assumes no iconv and does all translation internally, with translation settings and weird maps where needed. This seems a bit insane to me, unless needed. I believe the reason was that we had: win1251 read as koi8, transcoded into LATIN1. Now, we need to make sense of *that*. Also, it does not cleanly support utf8 translation (it does not support translation back from utf8). iconv does.

Honestly, remaking mod_proxy_html into mod_proxy_charset_convert would be trivial now, IMO. And maybe that's the better idea. Although that does duplicate mod_charset_lite, at least I know it'll work. And it would use libxml2 where possible, not iconv.

mickg

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.