Re: man/man7/pathname.7: Correct handling of pathnames

Jason Yundt <jason@jasonyundt.email> · Mon, 27 Jan 2025 18:07:30 -0500

On Mon, Jan 27, 2025 at 06:37:40PM +0100, Alejandro Colomar wrote:
> [CC += наб]
> 
> Hi Jason,
> 
> On Mon, Jan 27, 2025 at 12:14:43PM -0500, Jason Yundt wrote:
> > On Mon, Jan 27, 2025 at 04:53:10PM +0100, Alejandro Colomar wrote:
> > > Right.  But then, when do you need to do encoding?
> > 
> > Personally, my preference is that programs use the locale’s codeset
> > because I can override the locale codeset in the rare event that UTF-8
> > isn’t the correct option.  In my previous example, I was able to set the
> > LANG environment variable to ja_JP.SJIS so that I could run that old
> > software in an environment where pathnames were encoded in Shift-JIS.
> > If everything just always assumed a particular character encoding for
> > pathnames, then I wouldn’t have been able to do that.
> 
> But if the program handles arbitrary strings, just like the kernel does,
> that would work too.
> 
> > > > > -  Accept anything, but reject control characters.
> > > > > -  Accept anything, just like the kernel.
> > > > 
> > > > These last two also aren’t quite complete recommendations.  If a GUI
> > > > program wants to display a pathname on the screen, then what character
> > > > encoding should it use when decoding the bytes?
> > > 
> > > > Just print them as they came in.  No decoding.  Send the raw bytes to
> > > write(2) or printf(3) or whatever.
> > 
> > I don’t think that printing is a good way for GUI applications to
> > display text.  I don’t normally run GUI applications in a terminal, so
> > I’m not normally able to see a GUI application’s stdout or stderr.  Most
> > of the GUI applications that I use display pathnames as part of a larger
> > window.  In order to do that, the GUI application needs to know which
> > characters the bytes in the pathname represent so that the GUI
> > application can draw those characters on the screen.
> 
> I would do in a GUI exactly the same as what command-line programs do:
> pass the raw string to whatever API prints them.  If the string makes
> sense in the current locale, it will be shown nicely.  If it doesn't
> make sense, it will display weird characters, but that's not a terrible
> issue.  Just run again with the appropriate locale.

OK, but how does that API figure out what characters to display?  What
character encoding should that API use when drawing the characters?  I
think that it’s OK to replace the current recommendation, but
pathname(7) should really explain how such an API would figure out what
characters need to be drawn on the screen.
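
To make the question concrete, here is a minimal sketch (in Python, purely
for illustration) of what I mean.  The helper names and the sample filename
are made up for this example; it only assumes that the drawing API
eventually has to turn bytes into characters with *some* codeset, and that
the locale's codeset is one candidate.

```python
import locale


def locale_codeset() -> str:
    """Return the codeset Python reports for the current locale."""
    return locale.getpreferredencoding(False)


def decode_for_display(raw: bytes, codeset: str) -> str:
    """Decode pathname bytes for drawing on screen.

    Bytes that are invalid in `codeset` become U+FFFD, so this is for
    display only; the original bytes must be kept for system calls.
    """
    return raw.decode(codeset, errors="replace")


# Hypothetical pathname created by old software that used Shift-JIS.
raw_name = "日本語.txt".encode("shift_jis")

# Decoded with the codeset the name was written in: readable text.
as_sjis = decode_for_display(raw_name, "shift_jis")

# Decoded with the wrong codeset: replacement characters appear.
as_utf8 = decode_for_display(raw_name, "utf-8")

# What a program following the locale would pick for this process.
as_locale = decode_for_display(raw_name, locale_codeset())
```

The same bytes give different characters depending on the codeset passed
in, which is why I think pathname(7) needs to say where that codeset
comes from.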

> For example, in the git repository of the Linux man-pages project, there
> are commits authored by наб <nabijaczleweli@xxxxxxxxxxxxxxxxxx>.
> Whenever I look at git-log(1) output on one of my systems with the C
> locale, I see weird characters.  I just need to re-run with the C.UTF-8
> locale.
> 
> But it handles the bytes correctly, even if they don't make sense to the
> system.  If git(1) failed whenever a string doesn't make sense in the
> current locale, the repo would be corrupted sooner rather than later.
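
I agree that the bytes should survive untouched.  As a rough sketch of how
a program could do that and still have a string to hand to a text API,
here is the byte-preserving round trip that Python's "surrogateescape"
error handler provides.  The function names are hypothetical, and using
UTF-8 as the internal codeset is an assumption for the example, not a
recommendation for pathname(7).

```python
def to_internal(raw: bytes) -> str:
    """Decode as UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_raw(text: str) -> bytes:
    """Reverse to_internal(), restoring the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Bytes that are not valid UTF-8 (they happen to be Shift-JIS).
raw_name = b"\x93\xfa\x96\x7b.txt"

internal = to_internal(raw_name)   # usable as a str inside the program
round_tripped = to_raw(internal)   # identical to raw_name
```

This keeps the pathname intact, but it still does not say which
characters to draw for the bytes that did not decode.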