On Tue, Jan 14, 2025 at 02:14:42PM +0100, Alejandro Colomar wrote:
> Maybe we should call the page pathname(7)?

I don’t really have an opinion one way or the other. If you want, I can submit a new version of this patch that changes it to pathname(7).

> There's a specific term for this: string.
>
> Which means you don't need to explain so much about the null byte.
> It is understood that a string cannot contain null bytes (except for the
> terminator itself).

I purposefully avoided using the term string because I thought that it would make the manual harder to understand. The term string is associated with several different concepts, and those concepts would hinder someone’s understanding of paths:

• The term string is often used to refer to counted strings, and counted strings can contain null bytes. I’m more used to counted strings than null-terminated strings personally because I have more experience with Java, Python and Rust than I do with languages that default to using null-terminated strings. I know that the Linux man-pages mainly focus on the C programming language, but paths in particular are something that applies to all programming languages.

• Even in the context of the C programming language, the term string can still refer to counted strings. The Windows kernel has three different structures: ANSI_STRING [1], OEM_STRING [2] and UNICODE_STRING [3]. All three of them are counted and can contain null bytes. As a result, it’s possible to create valid paths on Windows that contain NUL characters [4]. When I wrote this manual page, I wanted to make it clear that this was one of the ways that the Linux kernel differs from the Windows kernel.

• People often think of strings as sequences of characters. In programming languages like Python, this is literally true (you have to convert a str object into a bytes object if you want to work with bytes instead of characters). To have the best possible understanding of how the kernel handles paths, you should think of them as sequences of bytes, not as sequences of characters, and the term string makes people think of sequences of characters.

• When I’m writing code in C or C++ and I see a char *, I assume that it’s supposed to contain characters that are encoded in the execution character set. That is not a good assumption for paths.

When I first tried to figure out the character encoding of paths on Linux, I found stuff like this post [5]. That post (among others) really helped me understand paths better because it specifically describes paths as sequences of bytes rather than strings.

[1]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-string>
[2]: <https://learn.microsoft.com/en-us/previous-versions/windows/hardware/kernel/ff558741(v=vs.85)>
[3]: <https://learn.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-_unicode_string>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <https://unix.stackexchange.com/a/39179/316181>

> I think I would skip this. It is implicit by the fact that the only
> forbidden character in a filename is '/'.

OK, I’ll submit a v3 that removes that part.

> It might be good to mention that some filesystems restrict the valid
> characters in a filename.

OK, I’ll submit a v3 that adds an example of a filesystem that puts restrictions on the bytes that can be in filenames.

> Do we want to recommend that? IMHO, for maximum portability, programs
> should assume the Portable Filename Character Set (or at most some
> subset of ASCII), and fail hard outside of that, which will itself favor
> that users self-restrict to portable file names.

I have a concern about programs failing hard when paths contain non-ASCII characters. I have a lot of songs and medleys saved on my computer. The paths for over 10,000 of them contain non-ASCII characters. Most of those non-ASCII characters come from Chinese, Japanese or Korean characters that are in the titles of songs or medleys. If programs failed hard on paths that contain non-ASCII characters, what impact would that have on my music collection?

Even if we were to only use a subset of ASCII characters, I would still be concerned about programs failing hard when paths contain characters outside of the POSIX Portable Filename Character Set. I dual boot Linux and Windows. When I installed Windows, it automatically created partitions named “Microsoft reserved partition” and “Basic data partition”. At the moment, I can access those partitions using the paths /dev/disk/by-partlabel/Microsoft\x20reserved\x20partition and /dev/disk/by-partlabel/Basic\x20data\x20partition. If programs failed hard on paths that contain characters outside of the POSIX Portable Filename Character Set, would I have to fall back to using /dev/sda1 and /dev/sda2?
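
To make that concern concrete, here is a minimal Python sketch of the kind of check being discussed. It is only an illustration: the function names are made up, it is not taken from any existing program, and it assumes that the \x20 sequences in the by-partlabel paths above are literal bytes in the filename. It treats each path as a sequence of bytes and reports the components that fall outside of the POSIX Portable Filename Character Set (A-Z, a-z, 0-9, period, underscore and hyphen).

```python
import os

# POSIX Portable Filename Character Set: A-Z a-z 0-9 . _ -
# Iterating over a bytes object yields integers, so this is a set of byte values.
PORTABLE_BYTES = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789._-"
)


def is_portable_filename(name: bytes) -> bool:
    """Return True if a single, non-empty filename uses only portable bytes."""
    return len(name) > 0 and all(byte in PORTABLE_BYTES for byte in name)


def nonportable_components(path) -> list:
    """Return the components of a path that a strict check would reject.

    The path may be given as str or bytes; os.fsencode() converts it to
    bytes, because the check operates on bytes rather than characters.
    """
    raw = os.fsencode(path)
    return [
        component
        for component in raw.split(b"/")
        if component and not is_portable_filename(component)
    ]


if __name__ == "__main__":
    for example in (
        "/dev/sda1",
        r"/dev/disk/by-partlabel/Basic\x20data\x20partition",
        "音楽.flac".encode("utf-8"),  # hypothetical song title, as UTF-8 bytes
    ):
        print(example, "->", nonportable_components(example))
```

With a check like this, /dev/sda1 passes, but both the by-partlabel path (because of the backslashes) and a filename with a UTF-8 encoded CJK title are rejected, which is the situation I’m worried about.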