--On Monday, November 30, 2020 23:38 +0100 Carsten Bormann <cabo@xxxxxxx> wrote:

> Hi John,
>
> I believe a hard-earned piece of experience I took from using
> FTP is that you don't want to do conversions during transit.

I think we had figured that out by early in 1971. At least in
general, I think it is still true.

> In theory, the sending system knows its format, can convert
> that to a common format, and the receiving system knows its
> own format and can do the second conversion. That was
> certainly true when we had access methods, per-record byte
> counts etc.
>
> In practice, systems today have files of various formats lying
> around, no useful metadata so no idea what the actual source
> format is, so they are likely to botch the conversion.
> Getting the original bytes from the sending system and doing
> the conversion at the receiving end, outside the actual
> transfer, became the norm. I don't remember when ftp
> clients started to automatically send "TYPE I" on
> connection setup, but it sure made life so much easier (read:
> FTP became somewhat usable again).

Even then, if someone asks for "type ascii" or, potentially,
"type unicode", an error message, maybe with a 3yz or 4yz code,
to the effect of "I don't have a clue what this file actually
is; suggest you ask for Type I and sort it out yourself", would
be appropriate.

> This is probably one instance of the more general gateway
> fundamental. Instead of needing gateways between various mail
> systems, everything converged to SMTP, or being as close to
> SMTP as possible (i.e. requiring minimal
> gatewaying/conversion).
>
> This has also happened in text formats (with the exception of
> the remaining two line ends, and, with a lesser relevance,
> interpretation of HTs). Anything that tries to make life
> easier for systems that aren't UTF-8 yet is like adding
> another transition technology to IPv6: trying to be helpful,
> but in practice counterproductive.

There, and for the specific case of Unicode, we probably
disagree. Keep in mind that a number of contemporary operating
systems use UTF-16, or even UTF-32, with some byte ordering,
internally. They typically know they are doing that, if only to
be able to do an orderly conversion to UTF-8 for putting data
over the wire, or from UTF-8 or ASCII for incoming data. Now,
if I retrieve a UTF-16 object in image mode, there better be a
BOM and it better be accurate, or I'm probably in deep trouble.
And, if I retrieve UTF-32 and know that it is text but nothing
else, I better not be on a system that makes special use of
null bytes. So I would assume that "TYPE U", interpreted as
always encoded in UTF-8, would work fine coming out of most
systems, because they probably already know how to do the
conversion. And, if someone requests "TYPE A", I'd much rather
that a system which notices, e.g., that some of the octets in
the file have their high bit on, return an error indication
than make something up (like zeroing out all of the high-order
bits or discarding a bit in the middle).
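To make that concrete, here is a rough sketch of the
sending-side behavior I have in mind (Python; "TYPE U" is
hypothetical and the error text is only a placeholder, so none
of this is any real ftpd's code):

    # Illustrative sketch only. A sending system that knows its
    # files are UTF-16 internally can convert them to UTF-8
    # reliably, while a "TYPE A" request against non-ASCII data
    # should get an honest error instead of mangled octets.

    def prepare_for_transfer(raw: bytes, requested_type: str) -> bytes:
        if requested_type == "I":
            # Image mode: ship the original octets untouched.
            return raw

        if requested_type == "U":
            # The sender knows its own internal form (UTF-16 here);
            # the BOM, if present, selects the byte order.
            return raw.decode("utf-16").encode("utf-8")

        if requested_type == "A":
            # Don't zero high-order bits or otherwise make
            # something up; notice the problem and say so.
            # (The exact reply code is left open here.)
            if any(octet >= 0x80 for octet in raw):
                raise ValueError("Not plain ASCII; suggest TYPE I "
                                 "and sort it out yourself")
            return raw

        raise ValueError("504 Command not implemented for that parameter")

The conversion is easy precisely because the sending system,
and only the sending system, knows what its bytes actually are.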
Since you used an email analogy, let me do that too. While we
permit a MIME body part of application/octet-stream, which
basically translates into plain English as "I don't know or I'm
not willing to tell you -- your problem", we do not allow
text/plain with charset="I don't have a clue what this is or
how it was encoded, but I think it is text".

And, if the originating system knows enough to specify that a
body part is text/plain and to specify a charset, it presumably
knows enough to respond intelligently to an FTP request for
TYPE A, E, or, presumably, U, whether it knows how to do the
conversion or not. It is a little different because the sending
user or MUA might know even if the operating system doesn't,
but still... And most of the MUAs I know of will make an
educated guess and, in practice, usually get it right.

best,
  john