Re: proxy_html / xml2enc won't handle certain HTML entities

Antonio Suárez Pozuelo <a.suarez@xxxxxxxxxxxxxxx> · Thu, 14 May 2020 14:36:25 +0200 (CEST)

Hi, Nick. I'm afraid we're still having an issue with this.

Currently our conf is:

        ProxyPreserveHost       on
        ProxyHTMLEnable         on
        ProxyHTMLExtended       on

And our pages display fine, but non-English characters entered into <input type="text"> form fields are posted incorrectly (badly encoded) to our backend server. This doesn't happen with ProxyHTMLCharsetOut set to "*" or explicitly to "ISO-8859-1"; but that configuration, as you know, takes us back to the starting point.

Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1 response into UTF-8, which is fine. When submitting a form, I guess the browser will also encode its contents in UTF-8, but maybe proxy_html won't reverse-translate that into ISO-8859-1 before relaying it to the backend server. This can be enforced by adding an accept-charset="ISO-8859-1" attribute to the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that attribute to <form> tags automagically when parsing and translating HTML content?
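To illustrate the mismatch I suspect, here is a minimal Python sketch. It is only an analogy: it does not touch Apache or proxy_html, the sample value "Suárez" is just an example, and it assumes the backend decodes form bodies as ISO-8859-1 while the browser percent-encodes them in the charset of the page it received (or in the accept-charset one, if given).

```python
from urllib.parse import quote, unquote

text = "Suárez"  # example form value with a non-ASCII character

# What a browser would send for a form on a page delivered as UTF-8.
utf8_body = quote(text, encoding="utf-8")

# What it would send if the form carried accept-charset="ISO-8859-1".
latin1_body = quote(text, encoding="iso-8859-1")

# Assumption: the backend always decodes request bodies as ISO-8859-1.
misread = unquote(utf8_body, encoding="iso-8859-1")
correct = unquote(latin1_body, encoding="iso-8859-1")

print(utf8_body, "->", misread)    # UTF-8 bytes read as Latin-1: mojibake
print(latin1_body, "->", correct)  # Latin-1 bytes read as Latin-1: intact
```

If that assumption holds, the two-byte UTF-8 sequence for "á" is read as two separate Latin-1 characters, which would match the badly encoded values we see on the backend.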

Just speculating, I really don't know the internals of it. But I guess you do :)

Thanks in advance. Best regards,

Antonio

----- Original Message -----
From: "Nick Kew" <niq@xxxxxxxxxx>
To: "users" <users@xxxxxxxxxxxxxxxx>
Sent: Friday, 8 May 2020 9:22:40
Subject: Re: proxy_html / xml2enc won't handle certain HTML entities

> On 8 May 2020, at 07:28, Antonio Suárez Pozuelo <a.suarez@xxxxxxxxxxxxxxx> wrote:
> 
> Hi Nick,
> 
> Your glass of wine was inspiring: just removed
> 
>>       ProxyHTMLCharsetOut     *   # Backend (Tomcat) charset is ISO-8859-1
> 
> and the problem's gone!

OK, thanks for confirming it.  I'm pretty sure now what's happening.

Libxml2 uses unicode (utf-8) internally, so for i18n to work, your iso-8859-1
gets converted before feeding to the parser.  But HTML entities are not
preserved: they get converted to their unicode representations.

ProxyHTMLCharsetOut is kind-of an afterthought: it converts unicode to
your choice of encoding.  But it doesn't deal with HTML entities.  So when
it encounters unicode sequences for your "&rarr;" et al, it just tries to
convert unicode to latin-1, and fails when there is no latin-1 representation.
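The failure mode can be reproduced outside Apache with a small Python sketch. This is an analogy using Python's own codecs, not the code path of libxml2 or mod_xml2enc; the numeric-reference fallback at the end is just one conceivable way to handle such characters, not something the module is known to do.

```python
import html

# The parser resolves entities into the characters they stand for.
arrow = html.unescape("&rarr;")       # U+2192 RIGHTWARDS ARROW
accented = html.unescape("&eacute;")  # U+00E9, which exists in Latin-1

# A character that Latin-1 can represent converts without trouble.
accented_bytes = accented.encode("iso-8859-1")

# The arrow has no Latin-1 representation, so a plain conversion fails.
try:
    arrow.encode("iso-8859-1")
    arrow_fits = True
except UnicodeEncodeError:
    arrow_fits = False

# Hypothetical fallback: emit a numeric character reference instead.
fallback = arrow.encode("iso-8859-1", errors="xmlcharrefreplace")

print(accented_bytes, arrow_fits, fallback)
```

Entities such as "&eacute;" survive the round trip because their characters exist in Latin-1; only those outside it, like the arrow, hit the conversion failure.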

As far as I know this doesn't really matter: unicode support is pretty-near
universal, so just leaving it in place has no real downside.  I'll think about
whether there's an easy fix to ProxyHTMLCharsetOut for cases like this,
but will more likely just add a note to the docs about the limitation.

> FYI, by increasing LogLevel to INFO, error log shows:

Basically just shows the problem isn't your backend.  My first reply was
leading to "if the debug info doesn't tell us what's wrong, I'll ask for a
test case to try and replicate the problem".  No need for that now!

Thanks for the report!

-- 
Nick Kew
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@xxxxxxxxxxxxxxxx
For additional commands, e-mail: users-help@xxxxxxxxxxxxxxxx
