This man page defines what "byte" means in the context of Linux programming; it draws on various authoritative references, namely Linus Torvalds's master's thesis, POSIX, and the C11 [draft] standard. Each of these references is properly cited. The content has been laid out to render well in a pager that provides at least 80 columns of monospace characters; it is best viewed by `man' with at least one of the following environment variable definitions: COLUMNS=80 MANWIDTH=80 Signed-off-by: Michael Witten <mfwitten@xxxxxxxxx> --- NOTE: The following page: https://www.kernel.org/doc/man-pages/patches.html requests that "long" patches be sent "both inline and as an attachment". I experimented with doing that, but it doesn't really work all that well; when `git am' is given the entire email (including the attachment), it decodes the attachment and thus tries to apply both the inline patch and the attached patch (which are the same), and there is apparently no way to turn off this decoding. Therefore, I've decided not to provide as an attachment a copy of this patch; instead, simply feed this entire email to `git am'. Sincerely, Michael Witten man7/byte.7 | 826 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ man7/units.7 | 2 + 2 files changed, 828 insertions(+) create mode 100644 man7/byte.7 diff --git a/man7/byte.7 b/man7/byte.7 new file mode 100644 index 000000000..24830e470 --- /dev/null +++ b/man7/byte.7 @@ -0,0 +1,826 @@ +.\" Copyright (c) 2018, Michael Witten <mfwitten@xxxxxxxxx> +.\" 2018-06-20T16:25:27+00:00 +.\" +.\" Authorship is recorded in the git history; the copyrights of +.\" each author are implied thereby. +.\" +.\" This man page contains various excerpts from other sources; +.\" any such material for which a copyright may be asserted is +.\" used strictly for scholarly or educational purposes under +.\" the Fair Use rule. 
+.\" +.\" %%%LICENSE_START(VERBATIM) +.\" Permission is granted to make and distribute verbatim copies of this +.\" manual provided the copyright notice and this permission notice are +.\" preserved on all copies. +.\" +.\" Permission is granted to copy and distribute modified versions of this +.\" manual under the conditions for verbatim copying, provided that the +.\" entire resulting derived work is distributed under the terms of a +.\" permission notice identical to this one. +.\" +.\" Since the Linux kernel and libraries are constantly changing, this +.\" manual page may be incorrect or out-of-date. The author(s) assume no +.\" responsibility for errors or omissions, or for damages resulting from +.\" the use of the information contained herein. The author(s) may not +.\" have taken the same level of care in the production of this manual, +.\" which is licensed free of charge, as they might when working +.\" professionally. +.\" +.\" Formatted or processed versions of this manual, if unaccompanied by +.\" the source, must acknowledge the copyright and authors of this work. +.\" %%%LICENSE_END +.\" +.TH BYTE 7 2018-06-20 "Linux" "Linux Programmer's Manual" +.SH NAME +byte \- 8 bits; the smallest addressable unit in the kernel +.br +char \- at least 8 bits; the smallest addressable unit in C +.SH SYNOPSIS +.B byte +.IP "" 2 +A set of +.I exactly 8 +bits. +.PP +.B char +.IP "" 2 +A set of +.I at least 8 +bits. +.SH DESCRIPTION +.SS THEORY +.I Linux +has been designed to process data composed of +.IR bytes , +such that each +.I byte +has a width of exactly 8; +Linus Torvalds documented this fact explicitly (see +.BR REFERENCES , +[0]; page 21): +.IP "" 2 +As far as the kernel is concerned, all data is a +stream of 8-bit bytes, and the interpretation of those bytes +(possibly by combining two or more bytes into a wider character) +is left to the user \%programs. 
+.PP +This same design decision is specified by at least +.I POSIX.1-2004 +and its successors, +.IR POSIX.1-2008 " and " POSIX.1-2017 +(see +.BR REFERENCES , +[1\[en]3]): +.IP "" 2 +.RS +.B 3.84 Byte +.PP +An individually addressable unit of data storage that is exactly +an octet, used to store a character or a portion of a character[...] +A\~byte is composed of a contiguous sequence +of 8\ bits. +The least significant bit is called the "low-order" bit; +the most significant is called the "high-order" bit. +.PP +.B Note: +.PD 0 +.IP "" 5 +.PD +The definition of byte from the ISO C standard is broader than +the above and might accommodate hardware architectures with +\%different sized addressable units than octets. +.PP +[...] +.PP +.B 3.254 Octet +.RB [ "3.249 Octet" +under +.IR POSIX.1-2004 ] +.PP +Unit of data representation that consists of eight contiguous bits. +.RE +.PP +In contrast, and as foreshadowed by +.IR POSIX , +the +.I C +programming language has been designed to process data composed +of +.IR chars , +such that each +.I char +has a width of exactly +.IR CHAR_BIT , +where +.I CHAR_BIT +is an integer that is +.I "at least 8" +(see +.BR REFERENCES , +[4]; +pages 4, 27, and 44): +.IP "" 2 +.RS +.IP "" 4 +.B 3.6 +.PD 0 +.IP 1 +.PD +.B byte +.br +addressable unit of data storage large enough to hold any member +of the basic character set of the execution environment +.IP 2 +.IR "NOTE\ 1" \h'2m'It +is possible to express the address of each individual byte of an +object uniquely. +.IP 3 +.IR "NOTE\ 2" \h'2m'A +byte is composed of a contiguous sequence of bits, the number of +which is implementation-defined. +The least significant bit is called the +.IR "low-order bit" ; +the most significant bit is called the +.IR "high-order bit" . 
+.IP +.B 3.7 +.PD 0 +.IP 1 +.PD +.B character +.br +.I \[la]abstract\[ra] +member of a set of elements used for the organization, control, +or representation of data +.IP +.B 3.7.1 +.PD 0 +.IP 1 +.PD +.B character +.br +single-byte character +.br +.I \[la]C\[ra] +bit representation that fits in a byte +.PP +[...] +.IP "" 4 +.B 5.2.4.2.1\h'2m'Sizes of integer types <limits.h> +.IP 1 +The values given below shall be replaced by constant expressions +suitable for use in +.B #if +preprocessing directives[...] +Their implementation-defined values shall be equal or greater in +magnitude (absolute value) to those shown, with the same sign. +.RS +.IP \[em] 2 +number of bits for smallest object that is not a bit-field (byte) +.br +.B CHAR_BIT\~8\h'5m'\p +.RE +.PP +[...] +.IP "" 4 +.B 6.2.6 Representations of types +.IP +.B 6.2.6.1 General +.IP 1 +The representations of all types are unspecified except as stated +in this subclause. +.IP 2 +Except for bit-fields, objects are composed of contiguous +sequences of one or more bytes, the number, order, and +encoding of which are either explicitly specified or +implementation-defined. +.IP 3 +Values stored in unsigned bit-fields and objects of type +.B unsigned char +shall be represented using a pure binary notation. +.IP 4 +Values stored in non-bit-field objects of any other object type +consist of +.IR n \[mu]\f[B]CHAR_BIT\f[] +bits, where +.I n +is the size of an object of that type, in bytes. +The value may be copied into an object of type +.BI "unsigned char [" n ] +(e.g., by +.BR memcpy ); +the resulting set of bytes is called the +.I object representation +of the value[...] +.RE +.PP +Furthermore, in the +.I C +programming language, the keyword \[lq]char\[rq] is used +to specify an integral type; +depending on the implementation of +.IR C , +it may be +.IR signed " or " unsigned +(see +.BR REFERENCES , +[4]; +page 50): +.IP "" 2 +.RS +.IP "" 4 +.B 6.2.5 Types +.IP +[...] 
+.IP 15 +The three types +.BR char ", " "signed char" ", and " "unsigned char" +are collectively called +the +.IR "character types" . +The implementation shall define +.B char +to have the same range, representation, and behavior as either +.B signed char +or +.BR "unsigned char" . +.RE +.PP +.SS PRACTICE +In modern times, it is practically a standard that +.IR Linux 's +\[lq]byte\[rq] is synonymous with +.IR C 's +\[lq]char\[rq]; +it is practically a standard that +the value of +.I CHAR_BIT +is +.IR "exactly 8" , +but this cannot be guaranteed in general. +.PP +The implication of this discrepancy is that a programmer who is +concerned about portability must be careful not to conflate these +data types, especially when working with an interface that exists +between systems based on different sets of definitions; +naturally, the most perilous interfaces are procedures that +perform input or output (I/O). +.PP +The simplest way to avoid problems is to exchange data according +to well-defined protocols of serialization, especially protocols that +encode data as a stream of atoms (such as bytes) in a predetermined +sequence. +In\~fact, this is the major intention of +.IR C 's +standard I/O functions, which are defined primarily in terms of +character semantics. +.PP +For maximum portability, it is intended that every +.I C +program produce output by converting its \[lq]internal\[rq], +binary, implementation-specific representations of data into +an \[lq]external\[rq], character-based, largely human-readable +\[lq]text\ stream\[rq]; +such external data may then be parsed with corresponding input +functions in order to recover a suitable internal representation +(see +.BR REFERENCES , +[4]; +page 298; +section 7.21.2, "Streams"). 
+For example: +.IP "" 2 +.EX +#include <limits.h> // CHAR_BIT +#include <stdlib.h> // exit, EXIT_FAILURE +#include <stdio.h> // FILE, fopen, fscanf, fprintf, fclose + +typedef struct { unsigned char red, green, blue; } Color; + +_Bool read_color(FILE *f, Color *c) +{ + return 3 == fscanf( + f, + "%hhx %hhx %hhx\\n", + &c\->red, &c\->green, &c\->blue + ); +} + +void write_color(FILE *f, const Color *c) +{ + fprintf(f, "%hhx %hhx %hhx\\n", c\->red, c\->green, c\->blue); +} + +#if CHAR_BIT > 8 + #define SUM_AND_WRAP(primary) \\ + sum\->primary = (sum\->primary + c\->primary) % 256 +#else + #define SUM_AND_WRAP(primary) \\ + sum\->primary += c\->primary +#endif + +void sum_colors(Color *sum, const Color *c) +{ + SUM_AND_WRAP(red); + SUM_AND_WRAP(green); + SUM_AND_WRAP(blue); +} + +_Bool is_black(const Color *c) +{ + return (c\->red == 0) && (c\->green == 0) && (c\->blue == 0); +} + +int main() +{ + FILE *f = fopen("/tmp/colors", "a+"); + if (!f) + exit(EXIT_FAILURE); + + Color c, sum = {0}; + while (read_color(f, &c)) + sum_colors(&sum, &c); + + if (is_black(&sum)) + sum.red = sum.green = sum.blue = 1; + + write_color(f, &sum); + fclose(f); +} +.EE +.PP +However, a text stream is not always practical, particularly when +\%computational resources are at a premium; +in that case, there is little alternative but to work more +directly with an internal representation through what the +.I C +standard calls \[lq]binary\ streams\[rq]. +As this involves details specific to a particular implementation +of +.IR C , +it becomes extraordinarily important to be cognizant of the sizes +of data types, and the layout of multibyte data. +.PP +Be prepared to handle various corner cases. +For example, consider the venerable I/O functions +.BR read (2) +and +.BR fread (3), +which are often used to process binary data in ways similar +to the following: +.IP "" 2 +.EX +#define _POSIX_C_SOURCE 200809L // Required before headers. 
+ +#include <unistd.h> // _POSIX_VERSION +#include <fcntl.h> // open, O_RDONLY +#include <stdlib.h> // exit, EXIT_FAILURE +#include <unistd.h> // read, close +#include <stdio.h> // fdopen (POSIX), fread + +#if _POSIX_VERSION < 200112L + #error "Function \[aq]fdopen()\[aq] requires at least POSIX.1-2001." +#endif + +void delete_all_user_data(void); // Defined somewhere else. + +void read_anything_into_a_bitfield(int fd) +{ + struct { + unsigned int red : 8, + green : 8, + blue : 8; + } buffer; + (void)read(fd, &buffer, 3); + if (buffer.blue != 42) // MISTAKE W: + delete_all_user_data(); // Could be called inadvertently. +} + +void read_a_byte_into_a_char(int fd) +{ + char buffer; + (void)read(fd, &buffer, 1); + if (buffer != \[aq]!\[aq]) // MISTAKE X: + delete_all_user_data(); // Could be called inadvertently. +} + +void read_a_char_into_anything_else(int fd) +{ + FILE* f = fdopen(fd, "rb"); + unsigned short buffer; // MISTAKE Y: + (void)fread(&buffer, 2, 1, f); // Potential buffer overflow. + if (buffer != 666) // MISTAKE Z (and X, again): + delete_all_user_data(); // Could be called inadvertently. +} + +int main() +{ + int fd = open("/path/to/data", O_RDONLY); + if (fd == \-1) + exit(EXIT_FAILURE); + + read_anything_into_a_bitfield(fd); + read_a_byte_into_a_char(fd); + read_a_char_into_anything_else(fd); + + close(fd); +} +.EE +.PP +Each of the above mistakes results from a +.I good +assumption about the size or layout of a data type: +.IP "" 2 +.RS +.IR "MISTAKE W" ": Reading anything into a bit-field" +.IP "" 2 +.RS +In dealing with the precise layout of data, it's tempting to +use the +.I bit-field +construct of the +.I C +programming language, which allows for defining a member of a +.I struct +as representing a certain number of contiguous bits within +the underlying object.
+Unfortunately, the +.I C +standard makes very few guarantees about the way consecutive +bit-fields are mapped to the bits of an object (see +.BR REFERENCES , +[4]; +pages 112\[en]114): +.IP "" 2 +.RS +.IP "" 4 +.B 6.7.2 Type Specifiers +.IP +.IP 5 +[...] for bit-fields, it is implementation-defined whether the +specifier +.B int +designates the same type as +.B signed\ int +or the same type as +.BR unsigned\ int . +.IP +[...] +.IP +.B 6.7.2.1 Structure and union specifiers +.IP +[...] +.IP 5 +A bit-field shall have a type that is a qualified or unqualified version of +.BR _Bool ", " signed\ int ", " unsigned\ int , +or some other implementation-defined type. +It is implementation-defined whether atomic types are permitted. +.IP +[...] +.IP 11 +An implementation may allocate any addressable storage unit +large enough to hold a bit-field. +If enough space remains, a bit-field that immediately follows +another bit-field in a structure shall be packed into adjacent +bits of the same unit. +If insufficient space remains, whether a bit-field that does not +fit is put into the next unit or overlaps \%adjacent units is +implementation-defined. +The order of \%allocation of bit-fields within a unit +(high-order to low-order or low-order to high-order) is +implementation-defined. +The alignment of the +addressable storage unit is unspecified. +.RE +.PP +Clearly, it's necessary to write implementation-specific code +when using bit-fields, and thus bit-fields should be avoided +in general. +.RE +.PP +.IR "MISTAKE X" ": Reading a byte into a char" +.IP "" 2 +.RS +Here, the +.I POSIX +function +.BR read (2) +stores a 1-byte value (a value that is +.I exactly +8\ bits) into +.IR buffer , +which is an object of type +.IR char ; +certainly, +.IR C 's +.I char +data type provides enough storage for this purpose, because the +.I C +standard requires that a +.I char +provide +.I at least +8\ bits of storage. 
+.PP +This fact leads one to make a very good assumption in practice, +namely that a +.I char +provides +.I exactly +8\ bits. +However, it is permissible for an implementation of +.I C +to define +.I char +as a data type that represents +.I more +than 8\ bits. +.PP +Consequently, in such a situation, the +call to the function +.BR read (2) +dutifully fills the lowest 8\ bits of +.IR buffer , +but may leave higher bits untouched and therefore indeterminate. +This mistake comes into full force when the entire +.I buffer +object is interpreted as a single value, thereby allowing those +indeterminate higher-order bits to contribute to the calculation. +One solution to this issue is to "zero out" the +.I buffer +before using it: +.IP "" 2 +.EX +char buffer = 0; +(void)read(fd, &buffer, 1); +.EE +.PP +Of course, on a system where a +.I char +does indeed provide exactly 8\ bits of storage, this extra step is +unnecessary; +it might be worthwhile to provide tailored code: +.IP "" 2 +.EX + +#include <limits.h> // CHAR_BIT + +#if CHAR_BIT > 8 // If a char is larger than the + #define clear(b) (b = 0) // minimum 8 bits, then it\[aq]s +#else // necessary to clear those "extra" + #define clear(b) // higher-order bits; otherwise, do +#endif // nothing at all. + +void read_a_byte_into_a_char(int fd) +{ + char buffer; + clear(buffer); // Clear bits (if necessary). + (void)read(fd, &buffer, 1); + if (buffer != \[aq]!\[aq]) + delete_all_user_data(); +} +.EE +.RE +.PP +.IR "MISTAKE Y" ": Reading a char into anything else" +.IP "" 2 +.RS +Here, the standard +.I C +function +.BR fread (3) +is used to store a 2-char value into +.IR buffer , +which is an object of type +.IR unsigned\~short . +.PP +Yet, does +.I buffer +provide enough storage for this purpose?\ If a +.I char +represents +.I exactly +8\ bits, then it does indeed provide enough, because +the +.I C +standard mandates that an +.I unsigned\ short +must provide +.I at\ least +16 bits of storage.
+.PP +Of course, this means that the +.I buffer +should be adequately prepared so as to avoid the same problem +evoked by +.IR MISTAKE\ X ; +it would be nice to be able to write something +like the following: +.IP "" 2 +.EX +#if sizeof(buffer) > 2 // This doesn't work. + buffer = 0; +#endif +.EE +.PP +Alas, the +.I C preprocessor +is unaware of data types, variables, or the +.I sizeof +operator; so, it's necessary to draw the correct conclusion from a +more indirect calculation, which tests the maximum value +.RB ( USHRT_MAX ) +that is representable by an object of type +.IR unsigned\ short : +.IP "" 2 +.EX +#include <limits.h> // USHRT_MAX +[...] + #if USHRT_MAX > 65535 + buffer = 0; + #endif +[...] +.EE +.PP +However, consider an implementation of +.I C +that defines both a +.I char +and an +.I unsigned\ short +to represent +.I exactly +16\ bits. +In such a case, an +.I unsigned\ short +represents not 2\ chars, but only 1\ char, yielding +.IR MISTAKE\ Y ; +in such a case, the function +.BR fread (3) +is being asked to store 2\ chars into 1\ char, which would +cause a buffer overflow, clobbering 1\ char of storage. +.PP +The only sensible way to handle this is to be explicit about the +cases that are covered, making sure to define both input and +output procedures so as to adhere to these cases; +for this purpose, it is often helpful to use +.IR C 's +built-in +.I sizeof +operator (the size of any data type, such as +.IR unsigned\ short , +is guaranteed to be an integer multiple of the size of a +.IR char ): +.IP "" 2 +.EX +(void)fread(&buffer, sizeof(buffer), 1, f); +.EE +.RE +.PP +.IR "MISTAKE Z" ": Trusting the source of data" +.IP "" 2 +.RS +Ultimately, data may be read well only if it was previously +written well; +input and output are intimately linked. 
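One way to picture that link is to define the writer and the reader side by side, so that both agree on an external format of exactly two octets in big-endian order. The following is only a sketch: the names write_u16_be and read_u16_be are hypothetical, and it assumes that the stream was opened in binary mode and that each fputc/fgetc call transfers one octet, as it does on Linux.

```c
#include <assert.h>  // assert
#include <stdio.h>   // FILE, fputc, fgetc, EOF

// Hypothetical helper: write the low 16 bits of v as two octets,
// most significant octet first.  Returns 1 on success, 0 on failure.
int write_u16_be(FILE *f, unsigned int v)
{
    assert(f != NULL);
    if (fputc((int)((v >> 8) & 0xFFu), f) == EOF)
        return 0;
    if (fputc((int)(v & 0xFFu), f) == EOF)
        return 0;
    return 1;
}

// Hypothetical helper: read two octets produced by write_u16_be() and
// rebuild the value.  Returns 1 on success, 0 on end-of-file or error.
int read_u16_be(FILE *f, unsigned int *out)
{
    assert(f != NULL && out != NULL);
    int hi = fgetc(f);
    if (hi == EOF)
        return 0;
    int lo = fgetc(f);
    if (lo == EOF)
        return 0;
    // Mask each octet to 8 bits so the result does not depend on CHAR_BIT.
    *out = (((unsigned int)hi & 0xFFu) << 8) | ((unsigned int)lo & 0xFFu);
    return 1;
}
```

Because the value is assembled with shifts and masks rather than copied from memory, neither the host's byte order nor the width of unsigned short enters into the result.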
+.PP +This is particularly true of the standard +.I C +functions, which leave many details to the discretion of the +particular implementation; +even on one machine, it is possible that a program could yield +incompatible results when run under different implementations. +.PP +Portability is eased not by leaving aspects of a +program undefined (as the +.I C +standard is wont to do), but rather by defining as much of a +program as possible, so that it's a straightforward process to +identify mismatches in expectations. +.PP +For this case, it might be best just to identify basic criteria, +and only worry about shoring up incompatibilities when they +arise, if\ ever; +for example, the code can refuse to compile unless a +.I char +provides exactly 8\ bits and a +.I short +exactly 16\ bits, and it can require that multibyte binary data +be exchanged in big-endian (\[lq]network\[rq]) byte order, and +it can explicitly identify those places in the program where +portability is threatened: +.IP "" 2 +.EX +[...] +#include <limits.h> // CHAR_BIT, USHRT_MAX +#include <arpa/inet.h> // ntohs + +#define PORTABILITY_THREAT + +#if (CHAR_BIT != 8) || (USHRT_MAX > 65535) + #error "This program has not yet been ported to this system." +#endif + +[...] + +void read_a_char_into_anything_else(int fd) +{ + FILE* f = fdopen(fd, "rb"); + unsigned short buffer; + PORTABILITY_THREAT (void)fread(&buffer, 2, 1, f); + buffer = ntohs(buffer); // network\-to\-host endian conversion. + if (buffer != 666) + delete_all_user_data(); +} + +[...] +.EE +.PP +Such tactics are aided by +.IR C99 's +\[lq]fixed\ width\[rq] integer types, each of which is defined +in +.I <stdint.h> +only when an implementation explicitly supports it (and must be +defined if the implementation provides any integer type with the +desired properties); +these types include +.I uint8_t +for declaring an object comprised of +.I exactly +8\ bits, and +.I uint16_t +for 16\ bits, and so on. 
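As a rough sketch of how those types might be combined with a compile-time check, the hypothetical helpers below convert between a uint16_t and a two-element buffer of uint8_t in big-endian order. The function names are assumptions made for illustration; static_assert comes from C11's <assert.h>. Since uint8_t can exist only where CHAR_BIT is 8, the assertion mostly documents what the type already implies.

```c
#include <assert.h>  // static_assert (C11)
#include <limits.h>  // CHAR_BIT
#include <stdint.h>  // uint8_t, uint16_t

// Refuse to compile on a system this sketch was not written for.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit chars");

// Hypothetical helper: combine two octets, most significant first.
uint16_t decode_u16_be(const uint8_t octets[2])
{
    // Widen to unsigned int before shifting to avoid signed overflow
    // on implementations where int is only 16 bits wide.
    return (uint16_t)(((unsigned int)octets[0] << 8) | octets[1]);
}

// Hypothetical helper: split a value into two octets, most significant first.
void encode_u16_be(uint16_t value, uint8_t octets[2])
{
    octets[0] = (uint8_t)(value >> 8);
    octets[1] = (uint8_t)(value & 0xFF);
}
```

A buffer filled by fread(3) could be passed to decode_u16_be() in place of the ntohs(3) call, which removes the dependency on <arpa/inet.h> at the cost of one extra function.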
+.RE +.RE +.SH CONFORMING TO +C89, C99, C11, POSIX.1-2004, POSIX.1-2008, POSIX.1-2017. +.SH REFERENCES +.ad l +.IP [0] 4 +.UR https://www.cs.helsinki.fi/u/kutvonen/index_files/linus.pdf +.B Linux: a Portable Operating System +.UE . +1997-01-31. +.IR Linus\~Torvalds . +Master's Thesis at University of \%Helsinki. +MD5:\~5a9073ee2d3bb0d68f5895857e9cf9ca. +.IP [1] 4 +.BR POSIX.1-2004 ; +simultaneously "IEEE\~Std\~1003.1\[tm]-2004" and +"The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 6". +.IR "Base Definitions" " (" Volume\~1 ): +"Chapter\ 3.\~Definitions". +.UR http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html +2004\ edition +.UE . +.IR The\~Open\~Group . +.IP [2] 4 +.BR POSIX.1-2008 ; +simultaneously "IEEE\~Std\~1003.1\[tm]-2008" and +"The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 7". +.IR "Base Definitions" " (" Volume\~1 ): +"Chapter\ 3.\~Definitions". +.UR http://pubs.opengroup.org/onlinepubs/9699919799.2008edition/basedefs/V1_chap03.html +2008\ edition +.UE , +.UR http://pubs.opengroup.org/onlinepubs/9699919799.2013edition/basedefs/V1_chap03.html +2013\ edition +.UE , +and +.UR http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html +2016\ edition +.UE . +.IR The\~Open\~Group . +.IP [3] 4 +.BR POSIX.1-2017 ; +simultaneously "IEEE\~Std\~1003.1\[tm]-2017" and +"The\ Open\ Group Technical\ Standard Base\ Specifications, Issue\ 7". +.IR "Base Definitions" " (" Volume\~1 ): +"Chapter\ 3.\~Definitions". +.UR http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap03.html +2018\ edition +.UE . +.IR The\~Open\~Group . +.IP [4] 4 +.BR "C11 Draft Standard" . +WG14 +.UR http://www.open\-std.org/JTC1/SC22/WG14/www/docs/n1570.pdf +.I N1570 +.UE . +.IR ISO/IEC\ 9899:201x , +Programming\ languages\ \[em]\ C. +2011-04-12. +JTC1/SC22/WG14. +Publicly available working-draft of the +.I C11 +standard, +.IR ISO/IEC\~9899:2011 . +MD5:\~658f5f4490464255b11e1d5502474deb.
+.ad b +.SH SEE ALSO +.BR bswap (3), +.BR byteorder (3), +.BR endian (3), +.BR units (7) diff --git a/man7/units.7 b/man7/units.7 index 3df0f28c8..8cb6d5b32 100644 --- a/man7/units.7 +++ b/man7/units.7 @@ -127,3 +127,5 @@ hda: 120064896 sectors (61473 MB) w/2048KiB Cache .in .PP the MB are megabytes and the KiB are kibibytes. +.SH SEE ALSO +.BR byte (7) -- 2.15.1