Re: [PATCH v2] man/man7/path-format.7: Add file documenting format of pathnames

Jason Yundt <jason@jasonyundt.email> · Tue, 14 Jan 2025 16:00:46 -0500

On Tue, Jan 14, 2025 at 02:14:42PM +0100, Alejandro Colomar wrote:
> Maybe we should call the page pathname(7)?

I don’t really have an opinion one way or the other.  If you want, I can
submit a new version of this patch that changes it to pathname(7).

> There's a specific term for this: string.
> 
> Which means you don't need to explain so much about the null byte.
> It is understood that a string cannot contain null bytes (except for the
> terminator itself).

I purposefully avoided the term string because I thought that it would
make the manual harder to understand.  The term is associated with
several different concepts, and those concepts would hinder someone’s
understanding of paths:

• The term string is often used to refer to counted strings, and counted
  strings can contain null bytes.  I’m more used to counted strings than
  null-terminated strings personally because I have more experience with
  Java, Python and Rust than I do with languages that default to using
  null-terminated strings.  I know that the Linux man-pages mainly focus
  on the C programming language, but paths in particular are something
  that applies to all programming languages.

• Even in the context of the C programming language, the term string can
  still refer to counted strings.  The Windows kernel has three
  different string structures: ANSI_STRING [1], OEM_STRING [2] and
  UNICODE_STRING [3].  All three of them are counted and can contain
  null bytes.  As a result, it’s possible to create valid paths on
  Windows that contain NUL characters [4].  When I wrote this manual
  page, I wanted to make it clear that this was one of the ways that the
  Linux kernel differs from the Windows kernel.

• People often think of strings as sequences of characters.  In
  programming languages like Python, this is literally true (you have to
  convert a str object into a bytes object if you want to work with
  bytes instead of characters).  To have the best possible understanding
  of how the kernel handles paths, you should think of them as sequences
  of bytes, not as sequences of characters, and the term string makes
  people think of sequences of characters.

• When I’m writing code in C or C++ and I see a char *, I assume that
  it’s supposed to contain characters that are encoded in the execution
  character set.  That is not a good assumption for paths.
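
To make the bytes-versus-characters distinction concrete, here is a
minimal Python sketch.  The file name and the helper names
(RAW_NAME, is_valid_utf8, has_null_byte) are made up for illustration,
and the explicit "surrogateescape" handler is only an assumption about
how such a name would typically be carried through a str on Linux:

```python
# Hypothetical file name: a legal sequence of bytes for a Linux path
# component, but not valid UTF-8 (it is Latin-1 for "café.txt").
RAW_NAME = b"caf\xe9.txt"


def is_valid_utf8(name: bytes) -> bool:
    """Report whether the bytes happen to decode as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_null_byte(name: bytes) -> bool:
    """A counted byte string can hold a null byte; a Linux path cannot."""
    return b"\x00" in name


# Carry the bytes through a str and back.  The undecodable byte is
# mapped to a lone surrogate and restored on the way out, so no
# information is lost even though the name is not valid text.
as_str = RAW_NAME.decode("utf-8", "surrogateescape")
round_trip = as_str.encode("utf-8", "surrogateescape")
```

Here is_valid_utf8(RAW_NAME) is false while round_trip still equals
RAW_NAME, which is the sense in which the name is a sequence of bytes
first and a sequence of characters only sometimes.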

When I first tried to figure out the character encoding of paths on
Linux, I found stuff like this post [5].  That post (among others)
really helped me understand paths better because it specifically
describes paths as sequences of bytes rather than strings.

[1]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-string>
[2]: <https://learn.microsoft.com/en-us/previous-versions/windows/hardware/kernel/ff558741(v=vs.85)>
[3]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-_unicode_string>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <https://unix.stackexchange.com/a/39179/316181>

> I think I would skip this.  It is implicit by the fact that the only
> forbidden character in a filename is '/'.

OK, I’ll submit a v3 that removes that part.

> It might be good to mention that some filesystems restrict the valid
> characters in a filename.

OK, I’ll submit a v3 that adds an example of a filesystem that puts
restrictions on the bytes that can be in filenames.

> Do we want to recommend that?  IMHO, for maximum portability, programs
> should assume the Portable Filename Character Set (or at most some
> subset of ASCII), and fail hard outside of that, which will itself favor
> that users self-restrict to portable file names.

I have a concern about programs failing hard when paths contain
non-ASCII characters.  I have a lot of songs and medleys saved on my
computer.  The paths for over 10,000 of them contain non-ASCII
characters.  Most of those non-ASCII characters come from Chinese,
Japanese or Korean characters that are in the titles of songs or
medleys.  If programs failed hard on paths that contain non-ASCII
characters, what impact would that have on my music collection?

Even if we were to only use a subset of ASCII characters, I would still
be concerned about programs failing hard when paths contain characters
outside of the POSIX Portable Filename Character Set.  I dual boot Linux
and Windows.  When I installed Windows, it automatically created
partitions named “Microsoft reserved partition” and “Basic data
partition”.  At the moment, I can access those partitions using the
paths /dev/disk/by-partlabel/Microsoft\x20reserved\x20partition and
/dev/disk/by-partlabel/Basic\x20data\x20partition.  If programs failed
hard on paths that contain characters outside of the POSIX Portable
Filename Character Set, would I have to fall back to using /dev/sda1 and
/dev/sda2?
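
As a rough illustration of what such a check would reject, here is a
small Python sketch.  The function name is_portable_filename and the
constants are hypothetical, and I am assuming that the \x20 in the
partition label paths above is the literal four-character sequence, as
written.  It only tests membership in the POSIX Portable Filename
Character Set (A-Z, a-z, 0-9, period, underscore and hyphen) for a
single path component, nothing more:

```python
import string

# POSIX Portable Filename Character Set, stored as byte values.
PORTABLE = frozenset(
    (string.ascii_letters + string.digits + "._-").encode("ascii")
)


def is_portable_filename(name: bytes) -> bool:
    """True if the component is non-empty and uses only portable bytes."""
    return len(name) > 0 and all(byte in PORTABLE for byte in name)


# Raw bytes literal: the backslash, 'x', '2' and '0' are kept as four
# separate bytes rather than being turned into a single space.
LABEL = rb"Microsoft\x20reserved\x20partition"

# Hypothetical song file name containing a non-ASCII character.
SONG = "歌.mp3".encode("utf-8")
```

With these definitions, is_portable_filename(b"sda1") is true, while
both LABEL (because of the backslashes) and SONG (because of the
non-ASCII bytes) are rejected.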