Re: [users@httpd] Question about mod_charset_light and mod_proxy_html (Solved!)

mickg <mickg@xxxxxxxxx> · Wed, 08 Nov 2006 00:48:39 -0500

Just to put my money where my mouth is, I have implemented a (stupid) prototype
that does the following: if the detected charset is not one that libxml2 handles
natively, a recompiled version of mod_proxy_html now uses iconv (ultimately via the
xmlFindCharEncodingHandler function) to convert from the source encoding to UTF-8.

If no encoding info is specified, it assumes windows-1251 (yes, stupid, but still).
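
For readers who want to see the conversion step in isolation, here is a minimal sketch that uses plain POSIX iconv rather than libxml2's handler lookup. The helper name to_utf8 is hypothetical and is not part of mod_proxy_html; the windows-1251 fallback mirrors the prototype's default, and the 4x buffer is an assumption that is generous for single-byte source encodings.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of mod_proxy_html: convert `len` bytes of
 * `in` from encoding `from` (windows-1251 when NULL) to UTF-8.
 * Returns a malloc'ed, NUL-terminated buffer, or NULL on failure. */
char *to_utf8(const char *in, size_t len, const char *from, size_t *out_len)
{
    iconv_t cd;
    char *out, *src, *dst;
    size_t cap, in_left, out_left, rc;

    assert(in != NULL && out_len != NULL);
    cd = iconv_open("UTF-8", from ? from : "WINDOWS-1251");
    if (cd == (iconv_t) -1)
        return NULL;                  /* encoding unknown to iconv */

    cap = len * 4 + 1;                /* assumed upper bound, plus terminator */
    out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    src = (char *) in;                /* iconv advances both pointers */
    dst = out;
    in_left = len;
    out_left = cap - 1;
    rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t) -1 || in_left != 0) {
        free(out);                    /* invalid input or buffer too small */
        return NULL;
    }
    *out_len = (size_t) (dst - out);
    out[*out_len] = '\0';
    return out;
}
```

If the output buffer is too small, iconv stops and reports an error instead of writing past the end, so the helper fails cleanly in that case.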

The main work is done by adding a
const char * enc_from  to ctxt,
	which specifies, in iconv-compatible terms, the source encoding.

sniff_encoding is modified to return 0 when it encounters a non-native encoding,
and to set ctxt->enc_from (ctxt is added as a parameter to it).
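
The contract described above can be sketched roughly as follows. Everything here is a stand-in: sketch_ctxt, native_names, and sketch_classify_encoding are hypothetical names, the list of "native" encodings is only an assumption (the real set depends on how libxml2 was built), and the real sniff_encoding returns an xmlCharEncoding value rather than a plain flag.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Stand-in for the module's saxctxt; only the field the prototype adds. */
typedef struct {
    const char *enc_from;   /* iconv-style source encoding name, or NULL */
} sketch_ctxt;

/* Assumed list of natively handled names, for illustration only. */
static const char *const native_names[] = {
    "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"
};

/* Returns 1 when `name` is handled natively. Otherwise returns 0 and
 * records the name in ctxt->enc_from so the caller can convert via iconv.
 * A missing name falls back to windows-1251, as the prototype does. */
int sketch_classify_encoding(const char *name, sketch_ctxt *ctxt)
{
    size_t i;

    assert(ctxt != NULL);
    ctxt->enc_from = NULL;
    if (name == NULL)
        name = "windows-1251";
    for (i = 0; i < sizeof native_names / sizeof native_names[0]; i++) {
        if (strcasecmp(name, native_names[i]) == 0)
            return 1;
    }
    ctxt->enc_from = name;
    return 0;
}
```

The caller then only needs to test ctxt->enc_from to decide whether a buffer has to go through the converter.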

The function:
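/* Converts buf to UTF-8 when a source encoding was sniffed into ctxt->enc_from.
 * On any failure the original buffer is passed through unchanged. */
size_t ConvertCtxtBuffer(const char * buf, char ** newbuf, size_t bytes, saxctxt *ctxt, ap_filter_t *f) {
        size_t len=0;
        xmlCharEncodingHandlerPtr handler;
        if (ctxt->enc_from) {
            handler = xmlFindCharEncodingHandler(ctxt->enc_from);
            if (!handler) {
                ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,"ConvertInput: no encoding handler found for '%s'", ctxt->enc_from);
                *newbuf=(char *)buf;
                return bytes;
            } else {
                /* this lookup was only a probe; release it (frees iconv-backed handlers) */
                xmlCharEncCloseFunc(handler);
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: bytes: %lu", (unsigned long)bytes);
                len=ConvertInput(buf,newbuf,bytes,f->r,ctxt->enc_from);
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: len: %lu", (unsigned long)len);
                /* size_t is unsigned, so "len<0" can never be true:
                 * failure shows up as (size_t)-1 or as a null buffer */
                if (len==(size_t)-1 || *newbuf==0) {
                        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, f->r,"ConvertInput: conversion failed from '%s'", ctxt->enc_from);
                        *newbuf=(char *)buf;
                        return bytes;
                }
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, f->r,"ConvertInput: encoding handler found for '%s'", ctxt->enc_from);
                return len;
            }
        } else {
                *newbuf=(char *)buf;
                return bytes;
        }
}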

calls the actual conversion.

The function
/* needs <iconv.h> and <libxml/encoding.h>
 * On success *newbuf gets an xmlMalloc'ed, null-terminated UTF-8 buffer and
 * its length is returned. On failure *newbuf is 0 and (size_t)-1 is returned. */
size_t
ConvertInput(const char *in, char ** newbuf, int size, request_rec * r, const char *encoding)
{
  xmlChar *out;
  int ret;
  int capacity;
  int out_size = -1;
  int temp;
  xmlCharEncodingHandlerPtr handler;

  *newbuf = 0;
  if (in == 0)
    return (size_t) -1;
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z1") ;

  handler = xmlFindCharEncodingHandler(encoding);

  /* check the handler before touching any of its fields */
  if (!handler) {
        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"ConvertInput: no encoding handler found for '%s'",
           encoding ? encoding : "");
    return (size_t) -1;
  }
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z2 input=%d iconv_in=%d",handler->input != 0, handler->iconv_in != 0) ;

  capacity = (size+1) * 2 - 1;
  out = (xmlChar *) xmlMalloc((size_t) capacity + 1);  /* +1 for the terminator */
        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z4 %d %d %s",size,capacity,encoding) ;
        if (out != 0) {
                temp = size ;
                out_size = capacity;
                if (handler->input) {
                        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z5") ;
                        /* libxml2 converter: on return out_size holds the bytes written */
                        ret = handler->input(out, &out_size, (const unsigned char *) in, &temp);
                }
                else {
                        /* iconv takes size_t counters and advances its pointers, so work on copies */
                        char *src = (char *) in;
                        char *dst = (char *) out;
                        size_t in_left = (size_t) size;
                        size_t out_left = (size_t) capacity;
                        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z5a") ;
                        ret = (iconv(handler->iconv_in,&src,&in_left,&dst,&out_left) == (size_t) -1) ? -1 : 0;
                        temp = size - (int) in_left;            /* bytes consumed */
                        out_size = (int) (dst - (char *) out);  /* bytes written */
                }
                ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"z6 %d %d %d",ret,temp,out_size) ;
                if (ret < 0) {
                        ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"ConvertInput: conversion wasn't successful. Converted %i of %i octets.",temp,size) ;
                        xmlFree(out);
                        out = 0;
                        out_size=-1;
                } else {
                        out[out_size] = 0;  /* null terminating out */
                        ap_log_rerror(APLOG_MARK, APLOG_INFO, 0, r,"len(OUT): %d",out_size) ;
                }
        } else {
                ap_log_rerror(APLOG_MARK, APLOG_ERROR, 0, r,"No memory!") ;
        }
  xmlCharEncCloseFunc(handler);  /* frees iconv-backed handlers */
  *newbuf=(char *) out;
  return (size_t) out_size;
}

does the actual conversion. It currently outputs a bit too much log info, and I
suspect a memory leak from xmlMalloc. I honestly do not know enough about Apache
to figure out when to free it (especially at 1AM).
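
One possible way to close the leak, sketched here under a stated assumption: if whatever consumes the converted chunk copies it before returning (libxml2's push parser is understood to buffer its input, but that is worth verifying for the version in use), then the temporary can be freed immediately after the call. The names sink_t, sink_push, demo_convert, and feed_chunk are hypothetical stand-ins for the parser, its push call, ConvertInput, and the filter code; the "conversion" is just an upper-casing placeholder.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the parser: keeps its own copy of everything pushed to it. */
typedef struct { char data[64]; size_t len; } sink_t;

static void sink_push(sink_t *s, const char *buf, size_t n)
{
    assert(s->len + n < sizeof s->data);
    memcpy(s->data + s->len, buf, n);   /* the consumer copies the chunk */
    s->len += n;
    s->data[s->len] = '\0';
}

/* Stand-in for ConvertInput: returns a freshly malloc'ed copy, or NULL. */
static char *demo_convert(const char *in, size_t n, size_t *out_n)
{
    size_t i;
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        out[i] = (char) toupper((unsigned char) in[i]);
    out[n] = '\0';
    *out_n = n;
    return out;
}

/* Convert one chunk, push it, and free the temporary straight away.
 * Falls back to the raw chunk if conversion fails. Returns 1 if converted. */
int feed_chunk(sink_t *s, const char *in, size_t n)
{
    size_t out_n = 0;
    char *tmp = demo_convert(in, n, &out_n);
    if (tmp == NULL) {
        sink_push(s, in, n);
        return 0;
    }
    sink_push(s, tmp, out_n);
    free(tmp);   /* in the module this would be xmlFree() on ConvertInput's buffer */
    return 1;
}
```

If the buffer must outlive the call instead, registering a cleanup on the request pool with apr_pool_cleanup_register would be another option to look at.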

Oh, also, the proxy_html_filter function is modified at four points so that
bytes=ConvertCtxtBuffer(buf,&buf,bytes,ctxt,f);
is called and the conversion actually takes place, and so that when
sniff_... returns 0, the return value is replaced with XML_CHAR_ENCODING_UTF8.

******************************************************************************
*              !!!THIS CODE IS *NOT* PRODUCTION QUALITY!!!                   *
*IT HAS AT LEAST ONE MEMORY LEAK, AND LOGS WAY TOO MUCH TO THE ERROR LOG.    *
*Also, I am not sure of the security implications of passing the decoding off*
*to iconv (Are there any buffer overflows in it? Could it be exploited by a  *
*specially crafted file in a particular encoding?)                           *
******************************************************************************

Also, I am not sure what this code will do to GET and PUT method data.

It does work on my _own_ website, where it quite happily converts win-1251 to
utf-8. Once I fix the memory leak (any help appreciated), I'll be happy.

And a great many thanks to Nick Kew for getting me off my lazy ... to start
coding (which, honestly, I am better at than administering systems).

Hopefully this helps someone.

BTW, I still have no clue why I cannot do this with mod_charset_lite.

mickg wrote: