--On Monday, September 18, 2017 18:16 -0500 Adam Roach <adam@xxxxxxxxxxx> wrote:

> I think we're talking at cross purposes here.
>
> Today, as we speak, I have a copy of the RFC repository on my
> hard drive. (To be precise, I have it on most of the hard
> drives of the various machines that I use). For my current
> workflow, I *think* all of them got there via rsync, although
> it's possible that some of them are still using an old
> wget-based setup. It's kind of immaterial how they got there,
> because a careful examination of them would show the same
> result between the two methods (and any others I could think
> of, including FTP mirroring and manually downloading via web
> browsers): it's a sequence of bytes, with a ".txt" file
> extension; identical, regardless of which tool downloaded
> them. There is nothing else about the file to indicate its
> encoding.[1]
>
> Okay. So, now, I open up the local file browser to that file
> on my hard drive, and double-click on an RFC. An application
> is launched. Let's say that application is Wordpad. How does
> it know which character encoding to use for this file?

It doesn't, and the presence or absence of a pair of octets it
might interpret as a BOM just feeds another heuristic. Keep in
mind that, if the content of that file were in 8859-1, that pair
could be interpreted as small thorn followed by small y with
diaeresis (the characters Unicode codes as U+00FE and U+00FF).
Of course, if it were coded in ISO 8859-5 (or 6, 7, 8, 11,
etc.), there would be different interpretations.

I note that several versions of Wordpad will get just about
equally confused if your rsync (or whatever) fetch results in an
object in your local, Windows-ish, file system with LF as an EOL
rather than CRLF.

Conventions about file names suffixed in ".txt" have worked as
well as they have only because it has been possible to assume
that ".txt" implies ASCII. From the early days of the net, even
that has not been perfect, not just because EOL=LF has existed
since early on (IIRC, the first version of ASCII required it),
but because there was that EBCDIC problem.

The only real solution to these problems is files that carry
their own descriptions (the idea of a two-part, or even
three-part, file where one part is a description predates its
adoption by Apple by many years). Otherwise, it is all
heuristics, and the other strong argument against the BOM as a
"this is UTF-8" clue is that it often won't work.

In the early 1970s, I got over believing that it was reasonable
to abstract file description information down to a few
characters and then embed it in the file name, but I obviously
lost that battle long ago. Perhaps, if we need to indicate what
is UTF-8 and what isn't, we should start suffixing files with
".utf8" or, if people like three-character suffixes, ".ut8" or
".uf8", rather than relying on in-file indicators that violate
the relevant standards, don't adequately identify the relevant
CCS, and invite assorted file concatenation problems, etc.

john
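To make the 8859-x ambiguity above concrete, a minimal Python
sketch (the byte values are the UTF-16 BOM pair discussed above;
the particular charsets chosen are just illustrative): the same
two octets decode without error under several single-byte
charsets, so their presence is at best a heuristic.

    # The two octets a reader might take for a UTF-16 BOM.
    data = b"\xfe\xff"

    # Each of these single-byte charsets decodes the pair without
    # error, just to different characters; nothing in the bytes
    # themselves says which reading is correct.
    for charset in ("latin-1", "iso8859-5", "cp1251", "cp1252"):
        print(charset, "->", data.decode(charset))

    # latin-1 and cp1252 yield thorn and y-with-diaeresis
    # (U+00FE, U+00FF); iso8859-5 and cp1251 yield Cyrillic
    # letters instead.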
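And a sketch of the concatenation problem mentioned at the end:
if each UTF-8 file begins with a BOM, naive concatenation leaves
a stray U+FEFF embedded in the middle of the result (the file
contents here are made up for illustration).

    bom = b"\xef\xbb\xbf"  # the three octets of the UTF-8 BOM
    part1 = bom + "first file\n".encode("utf-8")
    part2 = bom + "second file\n".encode("utf-8")

    # What a byte-level concatenation (e.g. cat(1)) would produce.
    combined = part1 + part2
    print(repr(combined.decode("utf-8")))
    # -> '\ufefffirst file\n\ufeffsecond file\n'
    # The interior U+FEFF is invisible in most editors but is
    # real data: it breaks byte-for-byte comparisons, checksums,
    # and line parsing.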