Hello all just note 9.1 will have a bytea_agg aggregate regards Pavel Stehule 2011/12/2 Marti Raudsepp <marti@xxxxxxxxx>: > On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff > <foo@xxxxxxxxxxxxxxxxxxx> wrote: >> But i clearly have a missunderstanding of other chars, like umlauts or utf-8 >> chars. This, for example, should return a 'ö': >> >> # SELECT chr(x'C3B6'::int); >> chr >> ----- >> 쎶 >> (1 row) > > That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded > and actually decodes to the codepoint 00F6. > > There is a fundamental problem that a decoded URL may actually be a > binary string -- it might not have a textual representation at all. > But if text is what you want, RFC3986 strongly suggests using UTF-8 > for encoding text strings in URLs, and that works almost always in the > real world. > > So the *right* way is to first convert the URL to a binary "bytea" > type by fixing all the % escapes, then convert that to UTF-8 encoding > to handle multibyte characters. > > What I came up with is far from elegant because PostgreSQL lacks > convenient functions for bytea manipulation (no bytea_agg, etc). > Stealing a little from Merlin, this is what it looks like: > > CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text > LANGUAGE plpgsql IMMUTABLE STRICT AS $$ > DECLARE > bin bytea = ''; > byte text; > BEGIN > FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP > IF length(byte) = 3 THEN > bin = bin || decode(substring(byte, 2, 2), 'hex'); > ELSE > bin = bin || byte::bytea; > END IF; > END LOOP; > RETURN convert_from(bin, 'utf8'); > END > $$; > > db=# select url_decode('Hell%C3%B6%20World%21'); > url_decode > -------------- > Hellö World! > > db=# select url_decode('%EC%8E%B6'); > url_decode > ------------ > 쎶 > > This will break for binary-encoded data in URLs, though. > db=# select url_decode('%fa%fa%fa'); > ERROR: invalid byte sequence for encoding "UTF8": 0xfa > CONTEXT: PL/pgSQL function "url_decode" line 13 at RETURN > > ---- > > On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@xxxxxxxxx> wrote: >> set client_encoding to latin1; > >> postgres=# select unencode('Hell%C3%B6%20World%21'); >> unencode >> --------------- >> Hellö World! >> (1 row) > > Sorry, but AFAICT this makes a mess of encodings and only works by > pure luck. The server thinks it's sending the client LATIN1 text, but > it's actually UTF8-encoded and the last decoding step is done by your > terminal. > > Regards, > Marti > > -- > Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general