RFC: Character set handling in OpenConnect

dwmw2 at infradead.org (David Woodhouse) · Wed, 30 Jul 2014 11:27:56 +0100

OpenConnect development started in 2008, on a modern Linux box. I hadn't
really operated a Linux box where the system locale was anything other
than UTF-8 for a number of years by then, and it seemed reasonable
enough for the charset handling to be pretty much non-existent, based on
the assumption that "everything is UTF-8, all of the time".

Now, however, OpenConnect has been ported to a number of systems where
that assumption isn't valid, so I've made an attempt to deal with this.

I think the Java, GNOME and KDE GUIs *were* all using UTF-8 anyway,
although I'm not sure about Shimo (Fabian?).

So all I've done so far is add conversion in main.c: write_progress()
will convert the string from UTF-8 to the current locale before printing
it, while read_stdin() is now used for all user input and will convert
*to* UTF-8. And the command line arguments are likewise converted to
UTF-8 before being given to libopenconnect.
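
For illustration, the conversion in both directions could be sketched with
iconv() roughly as follows. This is a minimal sketch under stated
assumptions, not the actual main.c code: convert_string() is a hypothetical
helper name, and the output buffer bound is an assumption that should be
generous enough for common legacy charsets.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an OpenConnect function): convert the
 * NUL-terminated string 'in' from charset 'from' to charset 'to'.
 * Returns a malloc()ed string the caller must free(), or NULL on error. */
static char *convert_string(const char *to, const char *from, const char *in)
{
    iconv_t ic = iconv_open(to, from);
    if (ic == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outleft = inleft * 4 + 16;   /* assumed generous bound */
    char *out = malloc(outleft + 1);
    if (!out) {
        iconv_close(ic);
        return NULL;
    }

    char *inp = (char *)in;   /* iconv() takes char **, but does not modify the input */
    char *outp = out;

    /* Convert the input, then flush any pending shift sequence. */
    if (iconv(ic, &inp, &inleft, &outp, &outleft) == (size_t)-1 ||
        iconv(ic, NULL, NULL, &outp, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(ic);
        return NULL;
    }

    *outp = '\0';
    iconv_close(ic);
    return out;
}
```

Under these assumptions, something like write_progress() would call
convert_string(nl_langinfo(CODESET), "UTF-8", msg) before printing, and
something like read_stdin() would call it with the two charset arguments
swapped.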

Now it can use non-ASCII passwords on non-UTF-8 systems. And I even have
things working under Windows having renamed my tun device to 'TAP?' and
specifying '--interface TAP?' on the command line.

The approach is still for *libopenconnect* to assume that everything is
UTF-8, and so far nothing's changed for users of the library.

However, that probably needs to change. At the very least, we need to
convert file names from UTF-8 to the legacy encoding before trying to open
them. I think this was already broken for GNOME and KDE users, where the
strings (including filenames) will always have been UTF-8: if the files
are actually stored on the file system under names in a legacy charset,
then the lack of conversion may already have been an issue.

So I think I need an internal function open_utf8() which will convert a
UTF-8 filename to legacy encoding before trying to open it. The legacy
encoding will be automatically discovered by nl_langinfo(CODESET), and
perhaps we'll want a new openconnect_set_legacy_charset() function to
allow the user to override that.
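
To make the proposal concrete, here is a rough sketch of what open_utf8()
might look like. Everything in it is an assumption for discussion: the
legacy_charset_override variable merely stands in for whatever
openconnect_set_legacy_charset() would store, the buffer bound is a guess,
and nl_langinfo(CODESET) only reports the real locale charset once the
application has called setlocale(LC_CTYPE, "").

```c
#include <errno.h>
#include <fcntl.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for the value a future
 * openconnect_set_legacy_charset() would store.
 * NULL means "ask nl_langinfo(CODESET)". */
static const char *legacy_charset_override;

/* Sketch: open a file whose name is given in UTF-8, converting the
 * name to the legacy charset first. Returns an fd, or -1 with errno set. */
static int open_utf8(const char *fname, int flags, mode_t mode)
{
    const char *charset = legacy_charset_override;
    if (!charset)
        charset = nl_langinfo(CODESET);

    /* Nothing to convert if the legacy charset is UTF-8 anyway. */
    if (!strcmp(charset, "UTF-8"))
        return open(fname, flags, mode);

    iconv_t ic = iconv_open(charset, "UTF-8");
    if (ic == (iconv_t)-1)
        return -1;                      /* errno set by iconv_open() */

    size_t inleft = strlen(fname);
    size_t outleft = inleft * 4 + 16;   /* assumed generous bound */
    char *legacy = malloc(outleft + 1);
    if (!legacy) {
        iconv_close(ic);
        errno = ENOMEM;
        return -1;
    }

    char *inp = (char *)fname;  /* iconv() does not modify the input */
    char *outp = legacy;
    int fd = -1;

    /* Convert, flush any shift state, then open the converted name. */
    if (iconv(ic, &inp, &inleft, &outp, &outleft) != (size_t)-1 &&
        iconv(ic, NULL, NULL, &outp, &outleft) != (size_t)-1) {
        *outp = '\0';
        fd = open(legacy, flags, mode);
    }

    int err = errno;            /* preserve errno across cleanup */
    iconv_close(ic);
    free(legacy);
    errno = err;
    return fd;
}
```

With this shape, a name that cannot be represented in the legacy charset
fails with the errno from iconv() (typically EILSEQ) rather than silently
opening a differently named file.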

Does that sound reasonable? In the 20th-century world of legacy locales,
is it reasonable to assume that the filename used in open() is in the
charset specified by LC_CTYPE? (Actually it's a per-filesystem thing,
but systems which predate UTF-8 aren't going to be coping with that
anyway, are they?)

And is there anything *else* I've missed? I suppose the tun device name
under Linux might also need to be converted, although Linux really
*ought* to be using UTF-8. Perhaps the $CISCO_BANNER environment
variable passed to the vpnc-script? Anything else?

-- 
dwmw2