> On Wed, 20 Jun 2018 23:37:51 -0000, Michael Witten wrote:
> On Thu, 21 Jun 2018 00:28:27 +0200, Jakub Wilk wrote:
>
>> POSIX actually requires that CHAR_BIT is exactly 8.
>>
>> It seems that at least some parts of the man page you wrote
>> are based on the incorrect assumption that CHAR_BIT can be
>> greater than 8 on POSIX systems.
>
> Thanks for the reply.
>
> There are 2 threads in the man page:
>
>   * Documentation of the fact that Linux/POSIX define a byte
>     to be 8 bits, which I took to imply that CHAR_BIT on POSIX
>     is indeed supposed to be 8.
>
>   * Documentation of the fact that in general, there's a
>     discrepancy in definitions that needs to be taken into
>     account when dealing with the portability of both programs
>     and their data.
>
> Of course, perhaps the man page should be a little more
> explicit in documenting that POSIX conformance demands
> CHAR_BIT be 8 (e.g., by quoting the constraints placed on
> <limits.h>); at one point, I did have a code example
> demonstrating a test for POSIX compatibility as a means by
> which to imply an 8-bit char, but I removed it because I
> thought it was superfluous.
>
> That being said, POSIX compatibility is not the same as
> POSIX-like; to be extra careful, it's probably better not to
> rely on a resemblance to POSIX, but instead to use testing
> that is more direct, which is why the examples don't rely on
> just POSIX for determining CHAR_BIT (after all, you can just
> test CHAR_BIT itself). More to the point, it is not intended
> to help programmers who are targeting solely POSIX.
>
> Is there something in particular that is outright incorrect?

This email provides an intermediate patch to be applied on top of
the original patch (I'll re-submit a final patch at some point in
the future); the result is hopefully less confusing about the
issue raised.
Just save this entire email as "/path/to/email" and then apply it to
the repository with the following command:

    git apply /path/to/email

In particular, the content of the man page has changed in the
following notable ways, described using a pseudo-diff format that was
produced by comparing renderings of the man page (the actual patch of
the groff source comes after this pseudo-diff):

  BYTE(7)                Linux Programmer's Manual                BYTE(7)

  NAME
-        byte - 8 bits; the smallest addressable unit in the kernel
+        byte - exactly 8 bits; the smallest addressable unit in the kernel
         char - at least 8 bits; the smallest addressable unit in C

  SYNOPSIS
         byte   A set of exactly 8 bits.

         char   A set of at least 8 bits.

+               Under strict conformance to POSIX, a char comprises exactly 8 bits,
+               and is thus equivalent to a byte.
+
  DESCRIPTION
     THEORY
         Linux has been designed to process data composed of bytes, such that
         each byte has a width of exactly 8; Linus Torvalds documented this fact
         explicitly (see REFERENCES, [0]; page 21):

           As far as the kernel is concerned, all data is a stream of 8-bit
           bytes, and the interpretation of those bytes (possibly by
           combining two or more bytes into a wider character) is left to
           the user programs.

         This same design decision is specified by at least POSIX.1-2004 and its
-        successors, POSIX.1-2008 and POSIX.1-2017 (see REFERENCES, [1–3]):
+        successors, POSIX.1-2008 and POSIX.1-2017 (see REFERENCES, [1–3]a):

           3.84 Byte

           An individually addressable unit of data storage that is exactly
           an octet, used to store a character or a portion of a
           character[...] A byte is composed of a contiguous sequence of
           8 bits. The least significant bit is called the "low-order" bit;
           the most significant is called the "high-order" bit.

           Note: The definition of byte from the ISO C standard is broader
           than the above and might accommodate hardware architectures with
           different sized addressable units than octets.

           [...]
-          3.254 Octet [3.249 Octet under POSIX.1-2004]
+          3.254 Octet ["3.249" under POSIX.1-2004]

           Unit of data representation that consists of eight contiguous
           bits.

         In contrast, and as foreshadowed by POSIX, the C programming language
         has been designed to process data composed of chars, such that each
         char has a width of exactly CHAR_BIT, where CHAR_BIT is an integer that
-        is at least 8 (see REFERENCES, [4]; pages 4, 27, and 44):
+        is at least 8 (see REFERENCES, [4]; pages 4, 27, 41, and 44):

             [...]

             — number of bits for smallest object that is not a bit-field (byte)
                 CHAR_BIT    8

             [...]

+            6.2.5 Types
+
+            [...]
+
+            15  The three types char, signed char, and unsigned char are collec‐
+                tively called the character types. The implementation shall
+                define char to have the same range, representation, and behavior
+                as either signed char or unsigned char.
+
+            [...]
+
             ...

             4   Values stored in non-bit-field objects of any other object type
                 consist of n×CHAR_BIT bits, where n is the size of an object of
                 that type, in bytes. The value may be copied into an object of
                 type unsigned char [n] (e.g., by memcpy); the resulting set of
                 bytes is called the object representation of the value[...]

-        Furthermore, in the C programming language, the keyword “char” is used
-        to specify an integral type; depending on the implementation of C, it
-        may be signed or unsigned (see REFERENCES, [4]; page 50):
-
-            6.2.5 Types
-
-            [...]
-
-            15  The three types char, signed char, and unsigned char are collec‐
-                tively called the character types. The implementation shall
-                define char to have the same range, representation, and behavior
-                as either signed char or unsigned char.
+        Fortunately, strict conformance to POSIX resolves the discrepancy
+        between Linux's “byte” and C's “char”; it does so by demanding that an
+        implementation of C define CHAR_BIT to be 8 (see REFERENCES, [1–3]b):
+
+          {CHAR_BIT}
+            Number of bits in a type char.
+            [CX] ⇒ Value: 8 ⇐

     PRACTICE
         In modern times, it is practically a standard that Linux's “byte” is
-        synonymous with C's “char”; it is practically a standard that the value
-        of CHAR_BIT is exactly 8, but this cannot be guaranteed in general.
+        synonymous with C's “char”; it is practically a standard that C's macro
+        CHAR_BIT is defined to be exactly 8, but this cannot be guaranteed in
+        general.

         ...

         For maximum portability, it is intended that every C program produce
         output by converting its “internal”, binary, implementation-specific
         representations of data into an “external”, character-based, largely
         human-readable “text stream”; such external data may then be parsed
         with corresponding input functions in order to recover a suitable
         internal representation (see REFERENCES, [4]; page 298; section 7.21.2,
-        "Streams"). For example:
+        "Streams").
+
+        For example, here is a C program that should run as intended on any
+        system that adheres to the C99 standard (provided the input matches the
+        format produced by the output of this program when run on that same
+        system):

         ...

         However, a text stream is not always practical, particularly when
         computational resources are at a premium; in that case, there is little
         alternative but to work more directly with an internal representation
         through what the C standard calls “binary streams”. As this involves
         details specific to a particular implementation of C, it becomes
         extraordinarily important to be cognizant of the sizes of data types,
         and the layout of multibyte data.

-        Be prepared to handle various corner cases. For example, consider the
-        venerable I/O functions read(2) and fread(3), which are often used to
-        process binary data in ways similar to the following:
+        For the sake of discussion, assume there exists a system that resembles
+        POSIX, but is not strictly conformant, such that it provides most of
+        the expected facilities, but without constraining the macro CHAR_BIT.
+        For example, consider the venerable I/O functions read(2) and fread(3),
+        which are often used to process binary data in ways similar to this:

         ...

  REFERENCES
         [0]  Linux: a Portable Operating System
              ⟨https://www.cs.helsinki.fi/u/kutvonen/index_files/linus.pdf⟩.
              1997-01-31. Linus Torvalds. Master's Thesis at University of
              Helsinki. MD5: 5a9073ee2d3bb0d68f5895857e9cf9ca.

         [1]  POSIX.1-2004; simultaneously "IEEE Std 1003.1™-2004" and "The
              Open Group Technical Standard Base Specifications, Issue 6".
-             Base Definitions (Volume 1): "Chapter 3. Definitions".
-             2004 edition
-             ⟨http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html⟩.
-             The Open Group.
+             The Open Group. 2004 edition
+             ⟨http://pubs.opengroup.org/onlinepubs/009695399/⟩.
+
+             [1]a Base Definitions (Volume 1): “Chapter 3. Definitions”.
+                  2004 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_84⟩.
+
+             [1]b Header <limits.h>. 2004 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html#tag_13_24_03_06⟩.

         [2]  POSIX.1-2008; simultaneously "IEEE Std 1003.1™-2008" and "The
              Open Group Technical Standard Base Specifications, Issue 7".
-             Base Definitions (Volume 1): "Chapter 3. Definitions".
-             2008 edition
-             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/V1_chap03.html⟩,
-             2013 edition
-             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/V1_chap03.html⟩,
-             and 2016 edition
-             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html⟩.
-             The Open Group.
+             The Open Group. 2008 edition
+             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/⟩,
+             2013 edition
+             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/⟩, and
+             2016 edition
+             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/⟩.
+
+             [2]a Base Definitions (Volume 1): “Chapter 3. Definitions”.
+                  2008 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/V1_chap03.html#tag_03_84⟩,
+                  2013 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/V1_chap03.html#tag_03_84⟩,
+                  and 2016 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html#tag_03_84⟩.
+
+             [2]b Header <limits.h>. 2008 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/limits.h.html#tag_13_23_03_06⟩,
+                  2013 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/limits.h.html#tag_13_23_03_06⟩,
+                  and 2016 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/limits.h.html#tag_13_23_03_06⟩.

         [3]  POSIX.1-2017; simultaneously "IEEE Std 1003.1™-2017" and "The
              Open Group Technical Standard Base Specifications, Issue 7".
-             Base Definitions (Volume 1): "Chapter 3. Definitions".
-             2018 edition
-             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap03.html⟩.
-             The Open Group.
+             The Open Group. 2018 edition
+             ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/⟩.
+
+             [3]a Base Definitions (Volume 1): “Chapter 3. Definitions”.
+                  2018 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap03.html#tag_03_84⟩.
+
+             [3]b Header <limits.h>. 2018 edition
+                  ⟨http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/limits.h.html#tag_13_23_03_06⟩.

         [4]  C11 Draft Standard. WG14 N1570
              ⟨http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf⟩.
              ISO/IEC 9899:201x, Programming languages — C. 2011-04-12.
              JTC1/SC22/WG14. Publically available working-draft of the C11
-             standard, ISO/IEC 9899:2011.
+             standard (ISO/IEC 9899:2011).
              MD5: 658f5f4490464255b11e1d5502474deb.

         ...

And, now, for the actual intermediate patch that can be fed to
`git apply':

diff --git a/man7/byte.7 b/man7/byte.7
index 24830e470..b044607da 100644
--- a/man7/byte.7
+++ b/man7/byte.7
@@ -33,7 +33,7 @@
 .\"
 .TH BYTE 7 2018-06-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
-byte \- 8 bits; the smallest addressable unit in the kernel
+byte \- exactly 8 bits; the smallest addressable unit in the kernel
 .br
 char \- at least 8 bits; the smallest addressable unit in C
 .SH SYNOPSIS
@@ -45,9 +45,20 @@ bits.
 .PP
 .B char
 .IP "" 2
+.RS
 A set of
 .I at least
 8 bits.
+.PP
+Under strict conformance to
+.IR POSIX ,
+a
+.I char
+comprises
+.I exactly
+8\ bits, and is thus equivalent to a
+.IR byte .
+.RE
 .SH DESCRIPTION
 .SS THEORY
 .I Linux
@@ -71,7 +82,7 @@ and its successors,
 .IR POSIX.1-2008 " and " POSIX.1-2017
 (see
 .BR REFERENCES ,
-[1\[en]3]):
+[1\[en]3]a):
 .IP "" 2
 .RS
 .B 3.84 Byte
@@ -94,8 +105,7 @@ the above and might accommodate hardware architectures with
 [...]
 .PP
 .B 3.254 Octet
-.RB [ "3.249 Octet"
-under
+["3.249" under
 .IR POSIX.1-2004 ]
 .PP
 Unit of data representation that consists of eight contiguous bits.
@@ -119,7 +129,7 @@ is an integer that is
 (see
 .BR REFERENCES ,
 [4];
-pages 4, 27, and 44):
+pages 4, 27, 41, and 44):
 .IP "" 2
 .RS
 .IP "" 4
@@ -136,8 +146,8 @@ of the basic character set of the execution environment
 is possible to express the address of each individual byte of an
 object uniquely.
 .IP 3
-.IR "NOTE\ 2" \h'2m'A
-byte is composed of a contiguous sequence of bits, the number of
+.IR "NOTE\ 2" \h'2m'A\ byte
+is composed of a contiguous sequence of bits, the number of
 which is implementation-defined.
 The least significant bit is called the
 .IR "low-order bit" ;
@@ -184,6 +194,24 @@ number of bits for smallest object that is not a bit-field (byte)
 .PP
 [...]
 .IP "" 4
+.B 6.2.5 Types
+.IP
+[...]
+.IP 15
+The three types
+.BR char ", " "signed char" ", and " "unsigned char"
+are collectively called
+the
+.IR "character types" .
+The implementation shall define
+.B char
+to have the same range, representation, and behavior as either
+.B signed char
+or
+.BR "unsigned char" .
+.PP
+[...]
+.IP "" 4
 .B 6.2.6 Representations of types
 .IP
 .B 6.2.6.1 General
@@ -215,38 +243,31 @@ the resulting set of bytes is called the
 of the value[...]
 .RE
 .PP
-Furthermore, in the
+Fortunately, strict conformance to
+.I POSIX
+resolves the discrepancy between
+.IR Linux 's
+\[lq]byte\[rq] and
+.IR C 's
+\[lq]char\[rq]; it does so by demanding that an implementation
+of
 .I C
-programming language, the keyword \[lq]char\[rq] is used
-to specify an integral type;
-depending on the implementation of
-.IR C ,
-it may be
-.IR signed " or " unsigned
-(see
+define
+.B CHAR_BIT
+to be 8 (see
 .BR REFERENCES ,
-[4];
-page 50):
+[1\[en]3]b):
 .IP "" 2
 .RS
+{CHAR_BIT}
+.PD 0
 .IP "" 4
-.B 6.2.5 Types
-.IP
-[...]
-.IP 15
-The three types
-.BR char ", " "signed char" ", and " "unsigned char"
-are collectively called
-the
-.IR "character types" .
-The implementation shall define
-.B char
-to have the same range, representation, and behavior as either
-.B signed char
-or
-.BR "unsigned char" .
+.PD
+Number of bits in a type
+.BR char .
+.br
+[CX] \[rA]\~Value:\ 8\~\[lA]
 .RE
-.PP
 .SS PRACTICE
 In modern times, it is practically a standard that
 .IR Linux 's
@@ -254,9 +275,10 @@ In modern times, it is practically a standard that
 .IR C 's
 \[lq]char\[rq];
 it is practically a standard that
-the value of
-.I CHAR_BIT
-is
+.IR C 's
+macro
+.B CHAR_BIT
+is defined to be
 .IR "exactly 8" ,
 but this cannot be guaranteed in general.
 .PP
@@ -289,7 +311,14 @@ functions in order to recover a suitable internal representation
 [4];
 page 298;
 section 7.21.2, "Streams").
-For example:
+.PP
+For example, here is a
+.I C
+program that should run as intended on any system that adheres to
+the
+.I C99
+standard (provided the input matches the format produced by
+the output of this program when run on that same system):
 .IP "" 2
 .EX
 #include <limits.h> // CHAR_BIT
@@ -362,13 +391,20 @@ of
 it becomes extraordinarily important to be cognizant of the
 sizes of data types, and the layout of multibyte data.
 .PP
-Be prepared to handle various corner cases.
-For example, consider the venerable I/O functions
+For the sake of discussion, assume there exists a system that
+resembles
+.IR POSIX ,
+but is not strictly conformant,
+such that it provides most of the expected
+facilities, but without constraining the macro
+.BR CHAR_BIT .
+For example, consider the venerable I/O
+functions
 .BR read (2)
 and
 .BR fread (3),
 which are often used to process binary data in ways similar
-to the following:
+to this:
 .IP "" 2
 .EX
 #define _POSIX_C_SOURCE 200809L // Required before headers.
@@ -766,44 +802,91 @@ C89, C99, C11, POSIX.1-2004, POSIX.1-2008, POSIX.1-2017.
 .IR Linus\~Torvalds .
 Master's Thesis at University of \%Helsinki.
 MD5:\~5a9073ee2d3bb0d68f5895857e9cf9ca.
-.IP [1] 4
+.IP [1]
 .BR POSIX.1-2004 ;
 simultaneously "IEEE\~Std\~1003.1\[tm]-2004" and
 "The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 6".
+.IR The\~Open\~Group .
+.UR http://pubs.opengroup.org/onlinepubs/009695399/
+2004\ edition
+.UE .
+.RS
+.IP [1]a 5
 .IR "Base Definitions" " (" Volume\~1 ):
-"Chapter\ 3.\~Definitions".
-.UR http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html
+.RB \[lq] Chapter\ 3.\~Definitions \[rq] .
+.UR http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_84
 2004\ edition
 .UE .
-.IR The\~Open\~Group .
-.IP [2] 4
+.IP [1]b
+Header \f[BI]<limits.h>\f[].
+.UR http://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html#tag_13_24_03_06
+2004\ edition
+.UE .
+.RE
+.IP [2]
 .BR POSIX.1-2008 ;
 simultaneously "IEEE\~Std\~1003.1\[tm]-2008" and
 "The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 7".
+.IR The\~Open\~Group .
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/
+2008\ edition
+.UE ,
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/
+2013\ edition
+.UE ,
+and
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/
+2016\ edition
+.UE .
+.RS
+.IP [2]a 5
 .IR "Base Definitions" " (" Volume\~1 ):
-"Chapter\ 3.\~Definitions".
-.UR http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/V1_chap03.html
+.RB \[lq] Chapter\ 3.\~Definitions \[rq].
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/V1_chap03.html#tag_03_84
 2008\ edition
 .UE ,
-.UR http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/V1_chap03.html
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/V1_chap03.html#tag_03_84
 2013\ edition
 .UE ,
 and
-.UR http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html#tag_03_84
 2016\ edition
 .UE .
-.IR The\~Open\~Group .
-.IP [3] 4
+.IP [2]b
+Header \f[BI]<limits.h>\f[].
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/limits.h.html#tag_13_23_03_06
+2008\ edition
+.UE ,
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/limits.h.html#tag_13_23_03_06
+2013\ edition
+.UE ,
+and
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/limits.h.html#tag_13_23_03_06
+2016\ edition
+.UE .
+.RE
+.IP [3]
 .BR POSIX.1-2017 ;
 simultaneously "IEEE\~Std\~1003.1\[tm]-2017" and
 "The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 7".
+.IR The\~Open\~Group .
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/
+2018\ edition
+.UE .
+.RS
+.IP [3]a 5
 .IR "Base Definitions" " (" Volume\~1 ):
-"Chapter\ 3.\~Definitions".
-.UR http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap03.html
+.RB \[lq] Chapter\ 3.\~Definitions \[rq].
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap03.html#tag_03_84
 2018\ edition
 .UE .
-.IR The\~Open\~Group .
-.IP [4] 4
+.IP [3]b
+Header \f[BI]<limits.h>\f[].
+.UR http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/limits.h.html#tag_13_23_03_06
+2018\ edition
+.UE .
+.RE
+.IP [4]
 .BR "C11 Draft Standard" .
 WG14
 .UR http://www.open\-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf
@@ -815,8 +898,8 @@ Programming\ languages\ \[em]\ C.
 JTC1/SC22/WG14.
 Publically available working-draft of the
 .I C11
-standard,
-.IR ISO/IEC\~9899:2011 .
+standard
+.RI ( ISO/IEC\~9899:2011 ).
 MD5:\~658f5f4490464255b11e1d5502474deb.
 .ad b
 .SH SEE ALSO