Re: [PATCH] New Page: byte.7: Document what "byte" means, theory and practice

On Thu, 21 Jun 2018 21:46:03 +0200, Michael Kerrisk wrote:
>On 06/20/2018 06:25 PM, Michael Witten wrote:
>> This man page  defines what "byte" means in the  context of Linux
>> programming; it draws on various authoritative references, namely
>> Linus  Torvalds's master's  thesis,  POSIX, and  the C11  [draft]
>> standard. Each of these references is properly cited.
>>
>> The content  has been  laid out  to render well  in a  pager that
>> provides at least 80 columns  of monospace characters; it is best
>> viewed by  `man' with at least one  of the  following environment
>> variable definitions:
>>
>>   COLUMNS=80
>>   MANWIDTH=80
>
> Thanks for sending this, but  what's missing in this cover message
> is some explanation of why the page is needed. It's not clear to me.
> Nor is the rationale clear from reading the start of the page. So,
> why is the page needed?

A programmer needs to hook into various interfaces to make things
work.  Linux provides an interface, POSIX provides an interface, and
the C standard provides an interface; and, of course, there are many
other interfaces, some of which haven't even been built yet, but for
which a programmer might want to be fully prepared, and which might
themselves target one of those Big interfaces while neglecting another.

Though these Big Three interfaces  are related, they're not actually
coupled  all that  strongly  together; there's  plenty  of room  for
disagreement both  now and in  the future,  which is one  reason why
Linus Torvalds  writes in his  master's thesis  about  the size of a
byte and the  nature of data  (the quote in the new man page is from
the  section "Unresolved  Issues", where  he details  concerns about
portability).

For one thing, standards are written to be ignored.

When has anything of moderate complexity in this world, let alone in
computing, ever really done what it's supposed to do? That's why the
digital world has been (was?) built by hackers;  it took clever folk
who weren't afraid  to connect  things  together, but their intrepid
spirit came  not  from saying "Hey, it compiled without error when I
pressed the 'play' button!", but  rather from knowing exactly how to
connect things, especially on a low level.

Indeed, you don't  have much of a programming  environment without a
way to think about bytes or the sizes of data types.  I suspect that
most programs only  "work" in the sense that they  "happen to work";
these days, 32-bit computers are routinely dropped by software
projects and labeled "obsolete", despite being perfectly adequate
machines that had been supported without trouble for years, if not
decades.

Why?

Because that software sucks,  and  was  written  without  a shred of
respect for the  sizes (or layouts) of data types.  On the LKML, you
can find  people commiserating over  the horrors of  bit-fields, for
the simple reason that they do not behave like they should according
to "Common Sense" (nevertheless,  they satisfy every  one  of  the C
standard's specifications...  or lack thereof).  Similarly, how many
problems have been  caused over the years by a  lack  of respect for
endianness?
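
As a quick illustration (a minimal sketch; the struct and the helper
below are purely hypothetical), the C standard leaves the allocation
order and padding of bit-fields up to the implementation, just as the
hardware decides the order of bytes within a multi-byte integer; the
only portable way to put a value on the wire is to assemble the
octets yourself:

  #include <stdint.h>
  #include <stdio.h>

  /* Convenient, but the in-memory layout of these bit-fields may
   * differ between compilers and architectures: */
  struct flags {
          unsigned int ready : 1;
          unsigned int error : 1;
          unsigned int code  : 6;
  };

  /* Serializing a 16-bit value octet-by-octet, on the other hand,
   * produces the same byte sequence on every host, regardless of
   * that host's endianness: */
  static void put_be16(uint8_t *out, uint16_t value)
  {
          out[0] = (uint8_t)(value >> 8);   /* most significant  */
          out[1] = (uint8_t)(value & 0xff); /* least significant */
  }

  int main(void)
  {
          uint8_t wire[2];
          put_be16(wire, 0x1234);
          printf("%02x %02x\n", wire[0], wire[1]); /* always: 12 34 */
          return 0;
  }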

Such failures  to meet  the demands of  an interface are  a cultural
phenomenon that manifests from the  dearth of documentation on these
topics, particularly now that computing  has reached ever more lofty
heights of abstraction, shielding many a budding programmer from the
trial-by-fire that is coding near the metal.

Sure, Linux targets POSIX  (or  maybe  POSIX now targets Linux), but
only so far.  Only so far!  The nature of  the Linux kernel  is such
that it is at best "POSIX-like", rather than "POSIX-compliant"; it's
driven more by backwards compatibility than adherence to the digital
dictates of a committee.

There's nothing stopping a determined  soul from porting Linux to an
unusual architecture that does not have an 8-bit primitive;  for the
sake of  compatibility, that  port  would undoubtedly require  a few
hacks to emulate an 8-bit interface, but that's just the kernel! The
user space is an entirely different domain, which might eschew POSIX
compliance (targeting instead  just the looser constraints  of the C
standard),  and  thereby  place  on the  programmer  the  burden  of
structuring data properly.

Even if there were the  *strictest* compliance to POSIX, guess what?
An `unsigned short' under POSIX  ain't necessarily 16 bits; like the
C standard,  POSIX requires only that an `unsigned short' be capable
of representing *at least* 16 bits:

  http://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/limits.h.html#tag_13_23_03_06
  {USHRT_MAX}
      Maximum value for an object of type unsigned short.
      Minimum Acceptable Value: 65 535

All of your code that uses an uninitialized `unsigned short' to read
in a single 2-byte datum is wrongheaded,  even under POSIX;  it just
"happens to work",  at least  for now.  You've  got  to clear  those
"extra" higher-order bits if you  don't want them used inadvertently
in your calculation.  That is, you've got to write a program that is
aware of the sizes of even basic data types.
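
For example (a minimal sketch; the two-octet, little-endian wire
format is an assumption made purely for illustration), instead of
read(2)ing two raw bytes on top of an `unsigned short' and hoping
that the type is exactly two bytes wide, a program can assemble the
value explicitly, which both pins down the byte order and guarantees
that any "extra" higher-order bits are zero:

  /* Assemble a 16-bit little-endian datum from two octets.  The
   * result is built from the octets themselves, so every bit above
   * bit 15 is guaranteed to be zero, however wide `unsigned short'
   * happens to be on this implementation. */
  static unsigned short get_le16(const unsigned char buf[2])
  {
          return (unsigned short)((unsigned)buf[0]
                                  | ((unsigned)buf[1] << 8));
  }

  /* The fragile pattern warned about above:
   *
   *     unsigned short s;        // possibly wider than 16 bits
   *     read(fd, &s, 2);         // writes only 2 of its bytes
   *
   * If sizeof(s) > 2, the remaining bytes are never written, and
   * whatever garbage they hold leaks into later arithmetic unless
   * the program clears it (e.g., s &= 0xffff), or, better, avoids
   * the pattern entirely. */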

As described by Linus's master's thesis, that's why the Linux kernel
targets a header-based "virtual machine" that provides architecture-
specific implementations of integer types with precise widths (e.g.,
`u8', `u16', or `u32');  similarly, that's why this man page mentions
C99's fixed-width integer types (e.g., `uint8_t', `uint16_t', etc.).
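
A minimal user-space parallel might look like this (a sketch only;
the typedef and struct names are hypothetical, not taken from any
real header): where the kernel's headers provide `u8', `u16', and
`u32', a portable program can get the same exact-width guarantee from
C99's <stdint.h>, and can even check its layout assumptions at
compile time with C11's `_Static_assert':

  #include <stdint.h>

  /* Each of these is *exactly* the stated width, or it does not
   * exist at all, in which case the program fails to compile rather
   * than silently miscomputing. */
  typedef uint8_t  octet;     /* analogous to the kernel's u8  */
  typedef uint16_t halfword;  /* analogous to the kernel's u16 */
  typedef uint32_t word;      /* analogous to the kernel's u32 */

  /* A hypothetical wire-format header; the assertion documents and
   * enforces the assumption that it packs into exactly 4 octets on
   * whatever implementation compiles this code: */
  struct wire_header {
          octet    version;
          octet    flags;
          halfword length;
  };
  _Static_assert(sizeof(struct wire_header) == 4,
                 "wire_header must occupy exactly 4 octets");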

Please recall that the compiler used to build the kernel need not be
the compiler used to build normal programs; while the aforementioned
port may require a hacked implementation of C to emulate `u8' in the
kernel source, there's no reason to suspect that such a hack is also
available for the user-space compiler.

The new man page explicitly discusses issues like this, but it
concentrates more on the narrow topic of what a "byte" or a "char"
is.  Perhaps the purpose of this man page would be more obvious if
other data types (like `short') were also listed in the SYNOPSIS and
further discussed in the DESCRIPTION.  Perhaps the man page should
also delve into Linux's integer types.

What do you think?

The world is messy, and a programmer (more than anybody else) needs
to be aware of just how  messy  it is; a good  programmer *wants* to
know how messy the world is, and a superb programmer enjoys thinking
about how to keep things tidy.

In short, I was scratching an itch.

I wrote a  man page that I personally wish  had already been written
when  I  was beginning to think about such things;  certainly,  even
writing this man page has helped me crystallize and organize my
thoughts on the topic of  compatibility, portability, and dependable
exchange of data across interfaces.  I  hold in mind  a picture of a
programmer,  sitting  eagerly  before  the terminal  of  a  software
development environment,  ready and willing to read about  the tools
at one's disposal.

What's a Linux Programmer's Manual without a page on bits and bytes?

Sincerely,
Michael Witten