On Wed, 08 Nov 2006 00:48:39 -0500
mickg <mickg@xxxxxxxxx> wrote:

> Just to put my money where my mouth is, I have implemented a (stupid)
> prototype that does the following: if the detected charset is not one
> that libxml2 handles natively, a recompiled version of mod_proxy_html
> now uses iconv (eventually via the xmlFindCharEncodingHandler
> function) to convert from the source encoding to UTF-8.
>
> If no encoding info is specified, it assumes windows-1251 (yes,
> stupid, but still).
>
> The main work is done by adding a
>     const char *enc_from
> to ctxt. This specifies, in iconv-compatible terms, the source
> encoding.
>
> sniff_encoding is modified to return 0 when it encounters a
> non-native encoding, and to set ctxt->enc_from (ctxt is added as a
> parameter to it).
>
> The function:
>
> size_t ConvertCtxtBuffer(const char *buf, char **newbuf, size_t bytes,
>                          saxctxt *ctxt, ap_filter_t *f)
> {
>     int len;    /* signed, so a negative error return is detectable */
>
>     if (!ctxt->enc_from) {
>         /* charset is native to libxml2: pass the data through */
>         *newbuf = (char *) buf;
>         return bytes;
>     }
>     if (!xmlFindCharEncodingHandler(ctxt->enc_from)) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
>                       "ConvertInput: no encoding handler found for '%s'",
>                       ctxt->enc_from);
>         *newbuf = (char *) buf;
>         return bytes;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: bytes: %" APR_SIZE_T_FMT, bytes);
>     len = ConvertInput(buf, newbuf, (int) bytes, f->r, ctxt->enc_from);
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: len: %d", len);
>     if (len < 0) {
>         ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,
>                       "ConvertInput: conversion failed from '%s'",
>                       ctxt->enc_from);
>         *newbuf = (char *) buf;    /* fall back to the unconverted data */
>         return bytes;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,
>                   "ConvertInput: encoding handler found for '%s'",
>                   ctxt->enc_from);
>     return (size_t) len;
> }
>
> calls the actual conversion.
>
> The function
>
> int ConvertInput(const char *in, char **newbuf, int size, request_rec *r,
>                  const char *encoding)
> {
>     xmlChar *out;
>     int ret;
>     int capacity;
>     int out_size;
>     xmlCharEncodingHandlerPtr handler;
>
>     *newbuf = NULL;
>     if (in == 0)
>         return 0;
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z1");
>
>     handler = xmlFindCharEncodingHandler(encoding);
>     if (!handler) {    /* check before touching the handler's fields */
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
>                       "ConvertInput: no encoding handler found for '%s'",
>                       encoding ? encoding : "");
>         return -1;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z2 input=%d",
>                   handler->input != NULL);
>
>     capacity = (size + 1) * 2 - 1;
>     out = (xmlChar *) xmlMalloc((size_t) capacity + 1);  /* +1 for NUL */
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z4 %d %d %s",
>                   size, capacity, encoding ? encoding : "");
>     if (out == 0) {
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "No memory!");
>         return -1;
>     }
>
>     if (handler->input) {
>         /* libxml2 converters store the bytes written in out_size */
>         int in_len = size;
>         out_size = capacity;
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5");
>         ret = handler->input(out, &out_size, (const xmlChar *) in,
>                              &in_len);
>     } else {
>         /* iconv advances its pointers and counts down the bytes left */
>         char *src = (char *) in;
>         char *dst = (char *) out;
>         size_t in_left = (size_t) size;
>         size_t out_left = (size_t) capacity;
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z5a");
>         ret = (iconv(handler->iconv_in, &src, &in_left, &dst, &out_left)
>                == (size_t) -1) ? -1 : 0;
>         out_size = capacity - (int) out_left;
>     }
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "z6 %d %d", ret, out_size);
>
>     if (ret < 0) {
>         ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,
>                       "ConvertInput: conversion wasn't successful");
>         xmlFree(out);
>         return -1;
>     }
>     out[out_size] = 0;    /* NUL-terminate the output */
>     ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r, "len(OUT): %d", out_size);
>     *newbuf = (char *) out;    /* caller owns this and must xmlFree() it */
>     return out_size;
> }
>
> does the actual conversion. It currently outputs a bit too much log
> info, and I suspect a memory leak from xmlMalloc. I honestly do not
> know enough about Apache to figure out when to free it (especially at
> 1AM).
>
> Oh, also, the proxy_html_filter function is modified at 4 points, so
> that
>     bytes=ConvertCtxtBuffer(buf,&buf,bytes,ctxt,f);
> is called, so that the conversion actually takes place, and so that
> when sniff_... returns 0, the return value is converted to
> XML_CHAR_ENCODING_UTF8.
>
> ******************************************************************************
> *                !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                 *
> *IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.    *
> *Also, I am not sure of the security implications of passing the decoding off*
> *to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
> *specially crafted file in a particular encoding?)                           *
> ******************************************************************************
>
> Also, I am not sure what this code will do to GET and PUT method data.
>
> It does work on my _own_ website, where it quite happily converts
> win-1251 to utf-8. Once I fix the memory leak (any help appreciated),
> I'll be happy.
>
> And a great many thanks to Nick Kew for getting me off my lazy ... to
> start coding (which, honestly, I am better at than administering
> systems).
>
> Hopefully this helps someone.
>
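
On the buffer sizing and the suspected leak in the quoted ConvertInput:
the (size+1)*2-1 estimate can come up short, since a few windows-1251
characters (the euro sign at 0x88, for one) take three bytes in UTF-8.
Below is a minimal standalone sketch of one way around that, assuming
plain iconv(3) and malloc outside of Apache. The name convert_to_utf8
is hypothetical and is not part of mod_proxy_html or libxml2. It grows
the output buffer whenever iconv reports E2BIG, and it has exactly one
owner for the result, which makes the free point obvious.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `len` bytes of `in` from charset `from`
 * to UTF-8. Returns a malloc'd, NUL-terminated buffer that the caller
 * must free(), and stores its length in *out_len. Returns NULL on any
 * failure (unknown charset, invalid input, out of memory). */
char *convert_to_utf8(const char *in, size_t len, const char *from,
                      size_t *out_len)
{
    assert(in != NULL && from != NULL && out_len != NULL);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t) -1)
        return NULL;                   /* charset unknown to iconv */

    size_t cap = len * 2 + 16;         /* first guess, grown on E2BIG */
    char *out = malloc(cap + 1);       /* +1 for the trailing NUL */
    char *src = (char *) in;           /* iconv does not write to input */
    size_t in_left = len, used = 0;

    while (out != NULL) {
        char *dst = out + used;
        size_t out_left = cap - used;
        size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
        used = cap - out_left;
        if (rc != (size_t) -1)
            break;                     /* all input consumed */
        if (errno != E2BIG) {          /* invalid or truncated input */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                      /* output full: grow and resume */
        char *bigger = realloc(out, cap + 1);
        if (bigger == NULL)
            free(out);
        out = bigger;
    }

    iconv_close(cd);
    if (out != NULL) {
        out[used] = '\0';
        *out_len = used;
    }
    return out;
}
```

In the filter itself, one option (an assumption on my part, untested
here) would be to copy the converted data into memory obtained with
apr_palloc from f->r->pool and release the temporary buffer right
away, so that Apache frees everything when the request ends.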
>
> BTW, I still have no clue why I cannot do this with mod_charset_lite.
>
> mickg wrote:
> > Nick Kew wrote:
> >> On Tue, 07 Nov 2006 17:49:25 -0500
> >> mickg <mickg@xxxxxxxxx> wrote:
> >>
> >>> 2 questions:
> >>>> I think I'd have to play with that hands-on to figure it out
> >>>> with your attempted configuration.
> >>> Was that an offer :) If yes, please say so, and shell account
> >>> will be provided. (As the system is a VM, I will just clone it,
> >>> and give access to that, so, if you mess it up, no problem).
> >>
> >> Well it could be, if you have the budget for my time.
> >> That's your most expensive option.
> >>
> > Understood :)
> >>>> It might be worth trying
> >>>> mod_line_edit instead of mod_proxy_html. You sacrifice the
> >>>> markup support, but in your case the markup isn't properly
> >>>> supported anyway, and you probably benefit from the fact that
> >>>> it is also unaware of charsets.
> >>>>
> >>> Hmm. Did not know about that module. Any idea where I can get
> >>> the .so ?
> >>
> >> Same place you get the mod_proxy_html.so. Except I guess you
> >> got that from a third-party package. I supply binaries and
> >> basic support to registered users.
> >>
> >>> Or an ubuntu package?
> >>>
> >>> Or how to compile the source, given a development environment?
> >>
> >> Read the apache docs on apxs. You'll probably need an apache-dev
> >> package on ubuntu. It's simpler than mod_proxy_html, because it
> >> doesn't rely on additional libraries.
> >>
> > Understood, will do. Thank you!
> >> I should add that today's correspondence has prompted me to blog
> >> about mod_proxy_html 3.0, which will enable you to fix that
> >> charset problem by aliasing an unsupported charset to a similar
> >> supported one (windows cyrillic is probably similar enough to
> >> ISO cyrillic - aka ISO-8859-5 - for that to work). I'm inviting
> >> blog comments from anyone with great ideas for the next major
> >> release of mod_proxy_html.
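
On the aliasing idea in that last quoted paragraph: windows-1251 and
ISO-8859-5 place the basic Cyrillic letters at different byte values,
so treating one as the other would scramble the text. The sketch below
is only an illustration of that offset. The function names are
hypothetical, and it covers just the contiguous А..я letter block of
each charset rather than the full tables.

```c
#include <assert.h>

/* Unicode code point for a byte read as windows-1251. Only the letter
 * block 0xC0-0xFF (U+0410..U+044F) is handled; anything else returns 0,
 * meaning "not covered by this sketch". */
unsigned cp1251_letter(unsigned char b)
{
    return b >= 0xC0 ? 0x0410u + (b - 0xC0u) : 0;
}

/* The same byte read as ISO-8859-5, where the letter block U+0410..U+044F
 * sits at 0xB0-0xEF instead. */
unsigned iso8859_5_letter(unsigned char b)
{
    return (b >= 0xB0 && b <= 0xEF) ? 0x0410u + (b - 0xB0u) : 0;
}

/* Count the bytes in 0xC0-0xFF that name a different character
 * depending on which of the two charsets is assumed. */
int count_mismatches(void)
{
    int n = 0;
    for (unsigned b = 0xC0; b <= 0xFF; b++)
        if (cp1251_letter((unsigned char) b)
            != iso8859_5_letter((unsigned char) b))
            n++;
    return n;
}

/* Byte 0xC0 is U+0410 (capital A) in windows-1251 but U+0420 (capital
 * Er) in ISO-8859-5: the letter blocks are shifted by sixteen places. */
void check_example(void)
{
    assert(cp1251_letter(0xC0) == 0x0410);
    assert(iso8859_5_letter(0xC0) == 0x0420);
}
```

An alias between the two would therefore only be safe for the ASCII
half of the table, which is why real transcoding is needed for this
pair.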
> >>
> > Actually, I think the characters are different in the upper
> > register.
> >
> > What about letting mod_proxy do its own transcoding, via iconv or
> > some such?
> > Maybe even a filter architecture of its own?
> > As in, given a match, apply this filter to it?
> > Although, that may be overkill for a simple matcher.
> >
> > mickg
> >
> > ---------------------------------------------------------------------
> > The official User-To-User support forum of the Apache HTTP Server
> > Project. See <URL:http://httpd.apache.org/userslist.html> for more
> > info.
> > To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
> >    "   from the digest: users-digest-unsubscribe@xxxxxxxxxxxxxxxx
> > For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx
>
> (Solved!)

--
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/