Re: F40 and breaking file command change

Tim via users <users@xxxxxxxxxxxxxxxxxxxxxxx> · Fri, 03 May 2024 03:34:25 +0930

George N. White III:
> In some cases, "established behaviour" means text files using ASCII
> character sets, which creates problems for the majority of the world,
> and should be considered "broken".   In this day and age, we need to
> pay attention to text encodings.   

The notion of "plain text" has been broken for decades.  Virtually
every computer system used an encoding peculiar to itself, with no
content type identification in the file.  It was generally presumed
that your text file used the same encoding as the rest of the text on
your system, but that failed badly as soon as you exchanged files with
foreigners, or with anyone on a different kind of system.

These days we usually use UTF-8, which is only barely compatible with
ASCII.  If a file only uses characters from 0 to 127, it is.  Any
characters higher than that are not ASCII, and Unicode's repertoire
runs to well over a hundred thousand of them (which is why I said
ASCII and UTF-8 are *barely* compatible).  Many computers using 7-bit
text falsely described their own non-ASCII encoding as being ASCII.
Or described their 8-bit non-ASCII encoding as ASCII.  And there are
various different UTF schemes, too (UTF-8, UTF-16, UTF-32).
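
To make the "barely compatible" point concrete, here's a rough Python
sketch.  The helper name is_ascii_only is just something I made up for
illustration; the encode calls are standard Python.

```python
# Pure-ASCII text produces identical bytes in ASCII and UTF-8.
plain = "plain text"
assert plain.encode("ascii") == plain.encode("utf-8")

# Anything above code point 127 becomes a multi-byte sequence in UTF-8,
# and every byte of that sequence has its high bit set.
accented = "caf\u00e9"  # "cafe" with an acute accent on the e
utf8_bytes = accented.encode("utf-8")
print(utf8_bytes)  # b'caf\xc3\xa9'

# The same character looks completely different in another UTF scheme.
utf16_bytes = "\u00e9".encode("utf-16-le")
print(utf16_bytes)  # b'\xe9\x00'


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0 to 127)."""
    return all(b < 128 for b in data)
```

A file that passes is_ascii_only reads the same whether you call it
ASCII or UTF-8.  One that fails needs its real encoding known before
you can trust what you're reading.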

Some computers had an associated meta file that did have info about the
file (Amigas and their ".info" files alongside the file you're
interested in, Macs and their old file system with separate data and
resource forks).  There was a certain amount of logic in that, but it
made sending files a nuisance: you had to remember to send both files,
or the system had to manage that for you.  And you're still dependent
on the recipient being able to handle it, which they may not.

UTF can be determined by looking at a few bytes at the start (the
BOM) and, if that's missing, parsing more of the file to try and guess
what it might be by looking for some common code sequences (web
browsers have done that for many years, and got it wrong for many
years, too).  But something that doesn't check for that and presumes
ASCII will be surprised by extraneous content.
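
As a rough sketch of that kind of check in Python: the BOM constants
come from the standard codecs module, but the function name
sniff_encoding and its fallback guessing are my own simplification,
nowhere near what a browser or the file command really does.

```python
import codecs

# Check longer BOMs first: the UTF-32 LE BOM starts with the same two
# bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess an encoding from a BOM, falling back to crude heuristics."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: bytes all below 128 are consistent with plain ASCII.
    if all(b < 128 for b in data):
        return "ascii"
    # Otherwise see whether the bytes form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

Note that "unknown" is the honest answer for old 8-bit encodings.
Without a BOM or outside information, the bytes alone can't tell you
which of them was meant.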

Using a non-text format for data and config files is more robust.  It
can start with header info that unambiguously identifies itself (data
format, application it's intended for, etc), making a single
self-describing file.  For what it's worth, any binary file can
contain text directly; that's not precluded.  It can start with the
identifying header, followed by text that could be parsed by more than
the original application (for future-proofing).
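
A minimal sketch of what I mean, in Python.  The "XCFG" magic value,
the header layout and the function names are all invented for this
example; they don't correspond to any real format.

```python
import struct

MAGIC = b"XCFG"  # hypothetical 4-byte identifier for this made-up format
# Big-endian header: magic (4 bytes), version (u16), payload length (u32).
HEADER = struct.Struct(">4sHI")


def pack_file(text: str, version: int = 1) -> bytes:
    """Prefix UTF-8 text with a header that identifies the file."""
    payload = text.encode("utf-8")
    return HEADER.pack(MAGIC, version, len(payload)) + payload


def unpack_file(blob: bytes) -> tuple[int, str]:
    """Check the header, then return (version, text)."""
    if len(blob) < HEADER.size:
        raise ValueError("too short to contain a header")
    magic, version, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not an XCFG file")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("payload is truncated")
    # The header settles the encoding question, so no guessing is needed.
    return version, payload.decode("utf-8")
```

The payload after the header is still ordinary text, so another tool
that knows to skip the first ten bytes could read it too.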

Applications that save and use binary data can also handle versioning
better, if thought went into supporting that.  If the data has changed
format over time, and identifies which version it is, the application
can read old data the way it used to, differently from how it handles
the current format, and still get it right.
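
Roughly like this, in Python.  Both formats here are made up for the
sake of the example: I'm assuming version 1 stored "key=value" lines
and version 2 switched to JSON, and load_settings is a hypothetical
name.

```python
import json


def _parse_v1(payload: str) -> dict:
    """Old (hypothetical) format: one key=value pair per line."""
    result = {}
    for line in payload.splitlines():
        if line.strip():
            key, _, value = line.partition("=")
            result[key.strip()] = value.strip()
    return result


def _parse_v2(payload: str) -> dict:
    """Current (hypothetical) format: a JSON object."""
    return json.loads(payload)


# Map each known version number to the parser that understands it.
PARSERS = {1: _parse_v1, 2: _parse_v2}


def load_settings(version: int, payload: str) -> dict:
    """Pick the parser matching the version the data declares."""
    try:
        parser = PARSERS[version]
    except KeyError:
        raise ValueError(f"unsupported format version {version}") from None
    return parser(payload)
```

Since the data says what it is, the old parser stays around for old
files and nothing has to guess which layout it's looking at.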

-- 

NB:  All unexpected mail to my mailbox is automatically deleted.
I will only get to see the messages that are posted to the list.

The following system info data is generated fresh for each post:

uname -rsvp
Linux 6.2.15-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 11 16:51:53
UTC 2023 x86_64
--