Re: man/man7/pathname.7: Correct handling of pathnames

Hi Jason, Florian,

On Mon, Jan 27, 2025 at 09:50:49AM -0500, Jason Yundt wrote:
> On Mon, Jan 27, 2025 at 12:22:49PM +0100, Alejandro Colomar wrote:
> > Hi Jason,
> > 
> > I think the recommendation to use the current locale for handling
> > pathnames isn't good.
> > 
> > If I use the C locale (and I do have systems with the C locale), then
> > programs running on that system would corrupt files that go through that
> > system.
> 
> I agree.  I think that this is just a limitation of the design of
> UNIX-like systems.  As long as users are allowed to choose different
> locale codesets, mojibake will always be possible.  Sometimes, you just
> have to temporarily switch to a different locale in order to make things
> work.  For example, I was trying to run some old Japanese software a
> while ago, and I had to add a Shift-JIS locale to my system in order to
> get it to work.
> 
> > Let's say you send me María.song, and I download it on a system
> > using the C locale.  Programs would fail to copy the file.
> 
> Not necessarily.  pathname(7) says “Paths should be encoded and decoded
> using the current locale’s codeset in order to help prevent mojibake.”
> In many cases, you don’t need to encode or decode a pathname.  Here’s a
> program that copies a file without encoding or decoding any pathnames:

Right.  But then, when do you need to do encoding?  A program will
either receive the pathname from the command line, read it from some
file, or create one of its own.

When a program creates a path of its own, it should restrict itself to
the Portable Filename Character Set, so encoding shouldn't be a problem.
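
To make that concrete, here's a rough sketch of how a program could
check a filename it is about to create.  The helper name
is_portable_filename() is made up for this example.  It compares raw
bytes against a literal table instead of calling isalnum(3), so the
result doesn't depend on the current locale.  It also rejects a leading
hyphen, which POSIX says a portable filename shall not have.

```c
#include <stdbool.h>
#include <string.h>

/* The POSIX Portable Filename Character Set: A-Z a-z 0-9 . _ - */
static const char portable_set[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789._-";

/*
 * Hypothetical helper (not part of any library): return true if name
 * is a non-empty filename (a single path component) made only of
 * portable characters and not starting with a hyphen.
 */
bool
is_portable_filename(const char *name)
{
    if (name[0] == '\0' || name[0] == '-')
        return false;

    /* strspn(3) returns the length of the leading run of bytes that
     * are in portable_set; the name is fine if that run reaches the
     * terminating null byte. */
    return name[strspn(name, portable_set)] == '\0';
}
```

With this, "song.txt" passes, while "María.song", "a/b", and "-rf" are
rejected, whatever LC_ALL happens to be.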

When reading pathnames, they'll already be encoded suitably.

> > Instead, I think a good recommendation would be to behave in one of the
> > following ways:
> > 
> > -  Accept only the POSIX Portable Filename Character Set.
> 
> This one isn’t quite a complete recommendation.  The POSIX Portable
> Filename Character Set is just a character set.  It’s not a character
> encoding.  If we go with this one, then we would need to say something
> along the lines of “Encode and decode paths using ASCII and only accept
> characters that are in the POSIX Portable Filename Character Set.”
> 
> > -  Assume UTF-8, but reject control characters.
> > -  Assume UTF-8.
> 
> > -  Accept anything, but reject control characters.
> > -  Accept anything, just like the kernel.
> 
> These last two also aren’t quite complete recommendations.  If a GUI
> program wants to display a pathname on the screen, then what character
> encoding should it use when decoding the bytes?

Just print them as they came in.  No decoding.  Send the raw bytes to
write(2) or printf(3) or whatever.
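
Something along these lines is what I mean, as a minimal sketch.  The
helper name write_pathname() is invented for the example.  It never
looks at the locale and never interprets the bytes; it only loops to
cope with short writes and EINTR.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send a pathname to fd byte for byte, with no
 * decoding or re-encoding.  Returns 0 on success, or -1 on error with
 * errno set by write(2).
 */
int
write_pathname(int fd, const char *path)
{
    size_t  len = strlen(path);

    while (len > 0) {
        ssize_t  n = write(fd, path, len);

        if (n == -1) {
            if (errno == EINTR)
                continue;  /* interrupted before writing; retry */
            return -1;
        }
        path += n;  /* skip the bytes already written */
        len -= n;
    }
    return 0;
}
```

Calling write_pathname(STDOUT_FILENO, argv[1]) hands the terminal (or
the next program in the pipeline) exactly the bytes that the kernel
stored, and whatever consumes them decides how to render them.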

> > The current locale should actively be ignored when handling pathnames.
> > 
> > I've modified the example in the manual page to use a filename that's
> > non-ASCII, to make it more interesting.  See how it fails:
> > 
> > 
> > What do you think?
> 
> I honestly don’t know what the recommendation should be.  Here’s what I
> would need to know in order to figure out what the recommendation should
> be.  A while ago, I asked this question on the Unix & Linux Stack
> Exchange [1]:
> 
> > What does a locale’s codeset get used for?
> >
> > According to glibc’s manual:
> >
> > > Most locale names follow XPG syntax and consist of up to four parts:
> > >
> > > ```
> > > language[_territory[.codeset]][@modifier]
> > > ```
> >
> > For example, you could have a locale named zh_CN.GB18030 which would
> > use the GB 18030 character encoding, or you could have a locale named
> > zh_CN.UTF-8 which would use the UTF-8 character encoding.
> >
> > Here’s where I’m confused: let’s say that I switch from a zh_CN.UTF-8
> > locale to a zh_CN.GB18030 locale. In that situation, some things that
> > used to be encoded in UTF-8 are now going to be encoded in GB 18030.
> > Which things will now be encoded in GB 18030? Will stdin, stdout and
> > stderr use GB 18030? What about argv? What about filesystem paths?
> >
> > Technically, a program can do whatever it wants and ignore the locale
> > completely, but let’s assume that programs are doing the correct thing
> > here. What is supposed to be encoded in GB 18030 if I use a
> > zh_CN.GB18030 locale?
> 
> I didn’t get an answer to that question, so I asked it again on the
> libc-help mailing list [2].  I got one response that was super helpful
> [3].  That response clearly said that paths should be encoded using the
> locale’s codeset.  If you think that answer was incorrect, then I would
> like a very specific list of things that should and should not be
> encoded using the locale’s codeset so that I can add it to the glibc
> manual (and maybe the POSIX standard if I can figure out how to
> contribute to that).
> 
> [1]: <https://unix.stackexchange.com/q/780404/316181>
> [2]: <https://sourceware.org/pipermail/libc-help/2024-August/006736.html>
> [3]: <https://sourceware.org/pipermail/libc-help/2024-August/006737.html>

Maybe Florian can comment.


Have a lovely day!
Alex

> 
> > Have a lovely day!
> > Alex
> > 
> > -- 
> > <https://www.alejandro-colomar.es/>

-- 
<https://www.alejandro-colomar.es/>
