Search Postgresql Archives

Re: How to convert HEX to ASCII?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@xxxxxxxxxxxxxxxxxxx> wrote:
> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
> chars. This, for example, should return a 'ö':
>
> # SELECT chr(x'C3B6'::int);
>  chr
> -----
>
> (1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
  bin bytea = '';
  byte text;
BEGIN
  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
    IF length(byte) = 3 THEN
      bin = bin || decode(substring(byte, 2, 2), 'hex');
    ELSE
      bin = bin || byte::bytea;
    END IF;
  END LOOP;
  RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
  url_decode
--------------
 Hellö World!

db=# select url_decode('%EC%8E%B6');
 url_decode
------------
 쎶

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR:  invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT:  PL/pgSQL function "url_decode" line 13 at RETURN

----

On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> set client_encoding to latin1;

> postgres=# select unencode('Hell%C3%B6%20World%21');
>   unencode
> ---------------
>  Hellö World!
> (1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux