Just to put my money where my mouth is, I have implemented a (stupid) prototype that does the following: if the detected charset is not one that libxml2 supports natively, a recompiled version of mod_proxy_html now uses iconv (ultimately via the xmlFindCharEncodingHandler function) to convert from the source encoding to UTF-8. If no encoding info is specified, it assumes windows-1251 (yes, stupid, but still).

The main work is done by adding a const char * enc_from to ctxt; this specifies, in iconv-compatible terms, the source encoding. sniff_encoding is modified to return 0 when it encounters a non-native encoding, and to set ctxt->enc_from (ctxt is added as a parameter to it).

The function:

    size_t ConvertCtxtBuffer(const char *buf, char **newbuf, size_t bytes,
                             saxctxt *ctxt, ap_filter_t *f)
    {
        size_t len = 0;

        if (ctxt->enc_from) {
            if (!xmlFindCharEncodingHandler(ctxt->enc_from)) {
                ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                              "ConvertInput: no encoding handler found for '%s'",
                              ctxt->enc_from);
                /* no handler: pass the input through unchanged */
                *newbuf = (char *) buf;
                return bytes;
            } else {
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                              "ConvertInput: bytes: %lu, ", (unsigned long) bytes);
                len = ConvertInput(buf, newbuf, (int) bytes, f->r, ctxt->enc_from);
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                              "ConvertInput: len: %lu, ", (unsigned long) len);
                /* len is unsigned, so "len < 0" can never be true.
                   ConvertInput signals failure with -1, which arrives
                   here as (size_t) -1. */
                if (len == (size_t) -1) {
                    ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
                                  "ConvertInput: conversion failed from '%s'",
                                  ctxt->enc_from);
                    *newbuf = (char *) buf;
                    return bytes;
                }
                buf = *newbuf;
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
                              "ConvertInput: encoding handler found for '%s'", buf);
                return len;
            }
        } else {
            /* native encoding: nothing to convert */
            *newbuf = (char *) buf;
            return bytes;
        }
    }

calls the actual conversion.
The function

    /* Needs <iconv.h> and a libxml2 built with iconv support
       (handler->iconv_in). */
    size_t ConvertInput(const char *in, char **newbuf, int size, void *r,
                        const char *encoding)
    {
        xmlChar *out;
        xmlChar *oldout;
        int ret;
        int out_size;
        int temp;
        xmlCharEncodingHandlerPtr handler;

        if (in == 0)
            return 0;

        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z1");
        handler = xmlFindCharEncodingHandler(encoding);
        /* check the handler before dereferencing it */
        if (!handler) {
            ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
                          "ConvertInput: no encoding handler found for '%s'",
                          encoding ? encoding : "");
            return 0;
        }
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z2 %d %d %d",
                      handler->input != NULL, handler->output != NULL,
                      handler->iconv_in != NULL);

        out_size = (size + 1) * 2 - 1;
        out = (unsigned char *) xmlMalloc((size_t) out_size);
        oldout = out;   /* remember the start of the buffer for realloc/free */
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z4 %d %d %s",
                      size, out_size, encoding ? encoding : "");

        if (out != 0) {
            temp = size;
            if (handler->input) {
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5");
                /* handler->input leaves the number of bytes written
                   in out_size and the bytes consumed in temp */
                ret = handler->input(out, &out_size, (const xmlChar *) in, &temp);
            } else {
                /* iconv() wants size_t counters, advances both pointers
                   and leaves the bytes *remaining* in the counters */
                size_t in_left = (size_t) size;
                size_t out_left = (size_t) out_size;
                char *in_ptr = (char *) in;
                char *out_ptr = (char *) out;

                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5a");
                ret = (iconv(handler->iconv_in, &in_ptr, &in_left,
                             &out_ptr, &out_left) == (size_t) -1) ? -1 : 0;
                temp = size - (int) in_left;        /* bytes consumed */
                out_size = out_size - (int) out_left; /* bytes written */
            }
            ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z6 %d %d %d",
                          ret, temp, out_size);
            if (ret < 0) {
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
                              "ConvertInput: conversion wasn't successful. "
                              "Converted %i octets.", temp);
                xmlFree(oldout);
                out = 0;
                out_size = -1;
            } else {
                out = (unsigned char *) xmlRealloc(oldout, out_size + 1);
                if (out == 0) {
                    xmlFree(oldout);
                    out_size = -1;
                } else {
                    out[out_size] = 0; /* null terminating out */
                    ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
                                  "len(OUT): %d", out_size);
                }
            }
        } else {
            ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "No memory!");
            out_size = -1;
        }
        *newbuf = (char *) out;
        /* on failure, -1 is returned as (size_t) -1 */
        return out_size;
    }

does the actual conversion.
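For anyone who wants to try the iconv step on its own, outside Apache and libxml2, here is a minimal standalone sketch. It is not the code from the module: the helper name convert_to_utf8 is made up for illustration, it uses plain malloc/free rather than xmlMalloc, and it assumes the local iconv knows the charset under the name passed in (glibc accepts "WINDOWS-1251"; other iconv implementations may want a different spelling).

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mod_proxy_html or libxml2): convert
 * `len` bytes of `in` from `from_enc` to UTF-8 using plain iconv.
 * Returns a malloc'ed, NUL-terminated string that the caller must free(),
 * or NULL on any failure.
 */
char *convert_to_utf8(const char *in, size_t len, const char *from_enc)
{
    assert(in != NULL && from_enc != NULL);

    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t) -1)
        return NULL;                /* charset unknown to this iconv */

    /* Generous bound: a UTF-8 character is at most 4 bytes and every
       input character occupies at least 1 byte. */
    size_t out_cap = len * 4 + 1;
    char *out = malloc(out_cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() advances both pointers and decrements both counters,
       so the start of the output buffer stays in `out`. */
    char *in_ptr = (char *) in;
    char *out_ptr = out;
    size_t in_left = len;
    size_t out_left = out_cap - 1;  /* reserve room for the NUL */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t) -1) {        /* EILSEQ, EINVAL or E2BIG */
        free(out);
        return NULL;
    }

    *out_ptr = '\0';
    return out;
}
```

Calling convert_to_utf8(buf, bytes, "WINDOWS-1251") on a chunk of windows-1251 text should hand back the UTF-8 equivalent. The point of the sketch is the bookkeeping: iconv takes size_t counters and moves the pointers it is given, so the original output pointer has to be kept somewhere for the later free.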
It currently outputs a bit too much log info, and I suspect a memory leak from xmlMalloc. I honestly do not know enough about Apache to figure out when to free it (especially at 1AM).

Oh, also, the proxy_html_filter function is modified at 4 points, so that

    bytes=ConvertCtxtBuffer(buf,&buf,bytes,ctxt,f);

is called, so that the conversion actually takes place, and so that when sniff_... returns 0, the return value is converted to XML_CHAR_ENCODING_UTF8.

******************************************************************************
*                !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                 *
*IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.    *
*Also, I am not sure of the security implications of passing the decoding off*
*to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
*specially crafted file in a particular encoding?)                           *
******************************************************************************

Also, I am not sure what this code will do to GET and PUT method data.

It does work on my _own_ website, where it quite happily converts win-1251 to utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.

And a great many thanks to Nick Kew for getting me off my lazy ... to start coding (which, honestly, I am better at than administering systems).

Hopefully this helps someone.

BTW, I still have no clue why I cannot do this with mod_charset_lite.

mickg wrote:
> Nick Kew wrote:
>> On Tue, 07 Nov 2006 17:49:25 -0500
>> mickg <mickg@xxxxxxxxx> wrote:
>>> 2 questions:
>>
>> I think I'd have to play with that hands-on to figure it out
>> with your attempted configuration.
>
> Was that an offer :) If yes, please say so, and shell account will be
> provided. (As the system is a VM, I will just clone it, and give access
> to that, so, if you mess it up, no problem).
>
>> Well it could be, if you have the budget for my time. That's your
>> most expensive option.
>
> Understood :)
>
>> It might be worth trying mod_line_edit instead of mod_proxy_html.
>> You sacrifice the markup support, but in your case the markup isn't
>> properly supported anyway, and you probably benefit from the fact
>> that it is also unaware of charsets.
>
> Hmm. Did not know about that module. Any idea where I can get the .so ?
>
>> Same place you get the mod_proxy_html.so. Except I guess you got
>> that from a third-party package. I supply binaries and basic support
>> to registered users.
>
> Or an ubuntu package? Or how to compile the source, given a development
> environment?
>
>> Read the apache docs on apxs. You'll probably need an apache-dev
>> package on ubuntu. It's simpler than mod_proxy_html, because it
>> doesn't rely on additional libraries.
>
> Understood, will do. Thank you!
>
>> I should add that today's correspondence has prompted me to blog
>> about mod_proxy_html 3.0, which will enable you to fix that charset
>> problem by aliasing an unsupported charset to a similar supported one
>> (windows cyrillic is probably similar enough to ISO cyrillic - aka
>> ISO-8859-5 - for that to work). I'm inviting blog comments from
>> anyone with great ideas for the next major release of mod_proxy_html.
>
> Actually, I think the characters are different in the upper register.
> What about letting mod_proxy do its own transcoding, via iconv or some
> such? Maybe even a filter architecture of its own? As in, given a
> match, apply this filter to it? Although, that may be overkill for a
> simple matcher.
> mickg

---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
   "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx
(Solved!)