Re: man/man7/pathname.7: Correct handling of pathnames

Alejandro Colomar <alx@xxxxxxxxxx> · Mon, 27 Jan 2025 18:37:40 +0100

[CC += наб]

Hi Jason,

On Mon, Jan 27, 2025 at 12:14:43PM -0500, Jason Yundt wrote:
> On Mon, Jan 27, 2025 at 04:53:10PM +0100, Alejandro Colomar wrote:
> > Right.  But then, when do you need to do encoding?
> 
> Personally, my preference is that programs use the locale’s codeset
> because I can override the locale codeset in the rare event that UTF-8
> isn’t the correct option.  In my previous example, I was able to set the
> LANG environment variable to ja_JP.SJIS so that I could run that old
> software in an environment where pathnames were encoded in Shift-JIS.
> If everything just always assumed a particular character encoding for
> pathnames, then I wouldn’t have been able to do that.

But if the program handles arbitrary strings, just like the kernel does,
that would work too.
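
A minimal sketch of what handling a pathname as an arbitrary string can
look like.  The kernel gives special meaning only to the '/' byte and
the terminating null byte, so a program can do the same and never decode
anything.  The helper name last_component() is hypothetical, made up for
this illustration.

```c
#include <string.h>

/*
 * Return a pointer to the last component of path.
 * Only the '/' byte is inspected; the other bytes are never decoded,
 * so the result does not depend on the locale.
 * (last_component() is a hypothetical helper, not an existing API.)
 */
const char *
last_component(const char *path)
{
	const char  *slash = strrchr(path, '/');

	return (slash == NULL) ? path : slash + 1;
}
```

Note that for a path ending in '/', this sketch returns an empty string;
a real program may want to strip trailing slashes first.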

> > > > -  Accept anything, but reject control characters.
> > > > -  Accept anything, just like the kernel.
> > > 
> > > These last two also aren’t quite complete recommendations.  If a GUI
> > > program wants to display a pathname on the screen, then what character
> > > encoding should it use when decoding the bytes?
> > 
> > Just print them as they got in.  No decoding.  Send the raw bytes to
> > write(2) or printf(3) or whatever.
> 
> I don’t think that printing is a good way for GUI applications to
> display text.  I don’t normally run GUI applications in a terminal, so
> I’m not normally able to see a GUI application’s stdout or stderr.  Most
> of the GUI applications that I use display pathnames as part of a larger
> window.  In order to do that, the GUI application needs to know which
> characters the bytes in the pathname represent so that the GUI
> application can draw those characters on the screen.

In a GUI, I would do exactly what command-line programs do: pass the
raw string to whatever API displays it.  If the string makes sense in
the current locale, it will be shown nicely.  If it doesn't, it will
display as weird characters, but that's not a terrible issue.  Just run
the program again with the appropriate locale.
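
For the command-line case, passing the raw string along could be
sketched as follows.  This is only an illustration under assumptions:
put_pathname() is a hypothetical helper, and the only real API it relies
on is write(2).  The bytes go out exactly as they came in; whether they
look right is up to whatever interprets them on the other end.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Write the bytes of path to fd exactly as they are, without decoding.
 * Returns 0 on success, or -1 on error with errno set by write(2).
 * (put_pathname() is a hypothetical helper, not an existing API.)
 */
int
put_pathname(int fd, const char *path)
{
	size_t  len = strlen(path);

	while (len > 0) {
		ssize_t  n = write(fd, path, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			return -1;
		}
		/* write(2) may write fewer bytes than requested. */
		path += n;
		len -= (size_t) n;
	}
	return 0;
}
```

A caller would use it as put_pathname(STDOUT_FILENO, path), with no
setlocale(3) call needed for the bytes to pass through intact.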

For example, in the git repository of the Linux man-pages project, there
are commits authored by наб <nabijaczleweli@xxxxxxxxxxxxxxxxxx>.
Whenever I run git-log(1) on one of my systems with the C locale, I see
weird characters.  I just need to re-run it with the C.UTF-8 locale.

But git(1) handles the bytes correctly, even if they don't make sense to
the system.  If it failed whenever a string didn't make sense in the
current locale, the repo would be corrupted sooner rather than later.

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>