--On Monday, November 30, 2020 23:38 +0100 Carsten Bormann <cabo@xxxxxxx> wrote:

> Hi John,
>
> I believe a hard-earned piece of experience I took from using
> FTP is that you don't want to do conversions during transit.

I think we had figured that out by early in 1971. At least in
general, I think it is still true.

> In theory, the sending system knows its format, can convert
> that to a common format, and the receiving system knows its
> own format and can do the second conversion. That was
> certainly true when we had access methods, per-record byte
> counts etc.
>
> In practice, systems today have files of various formats lying
> around, no useful metadata so no idea what the actual source
> format is, so they are likely to botch the conversion.
> Getting the original bytes from the sending system and doing
> the conversion at the receiving end, outside the actual
> transfer, became the norm. I don't remember when ftp
> clients started to automatically send "TYPE I" on
> connection setup, but it sure made life so much easier (read:
> FTP became somewhat usable again).

Even then, if someone asks for "type ascii" or, potentially,
"type unicode", an error message, maybe with a 3yz or 4yz code,
to the effect of "I don't have a clue what this file actually
is; suggest you ask for Type I and sort it out yourself", would
be appropriate.

> This is probably one instance of the more general gateway
> fundamental. Instead of needing gateways between various mail
> systems, everything converged to SMTP, or being as close to
> SMTP as possible (i.e. requiring minimal
> gatewaying/conversion).
>
> This has also happened in text formats (with the exception of
> the remaining two line ends, and, with a lesser relevance,
> interpretation of HTs). Anything that tries to make life
> easier for systems that aren't UTF-8 yet is like adding
> another transition technology to IPv6: trying to be helpful,
> but in practice counterproductive.

There, and for the specific case of Unicode, we probably
disagree. Keep in mind that a number of contemporary operating
systems use UTF-16, or even UTF-32, with some byte ordering,
internally. They typically know they are doing that, if only to
be able to do an orderly conversion to UTF-8 for putting data
over the wire, or from UTF-8 or ASCII for incoming data. Now,
if I retrieve a UTF-16 object in image mode, there better be a
BOM and it better be accurate, or I'm probably in deep trouble.
And, if I retrieve UTF-32 and know that it is text but nothing
else, I better not be on a system that makes special use of
null bytes. So I would assume that "TYPE U", interpreted as
always encoded in UTF-8, would work fine coming out of most
systems, because they probably already know how to do the
conversion. And, if someone requests "TYPE A", I'd much rather
that a system which notices, e.g., that some of the octets in
the file have their high bit on, return an error indication
than make something up (like zeroing out all of the high-order
bits or discarding a bit in the middle).
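To make that concrete, here is a rough sketch of the
sending-side behavior I have in mind (Python; "TYPE U" is
hypothetical and the error text is only a placeholder, so none
of this is any real ftpd's code):

    # Illustrative sketch only. A sending system that knows its
    # files are UTF-16 internally can convert them to UTF-8
    # reliably, while a "TYPE A" request against non-ASCII data
    # should get an honest error instead of mangled octets.

    def prepare_for_transfer(raw: bytes, requested_type: str) -> bytes:
        if requested_type == "I":
            # Image mode: ship the original octets untouched.
            return raw

        if requested_type == "U":
            # The sender knows its own internal form (UTF-16 here);
            # the BOM, if present, selects the byte order.
            return raw.decode("utf-16").encode("utf-8")

        if requested_type == "A":
            # Don't zero high-order bits or otherwise make
            # something up; notice the problem and say so.
            # (The exact reply code is left open here.)
            if any(octet >= 0x80 for octet in raw):
                raise ValueError("Not plain ASCII; suggest TYPE I "
                                 "and sort it out yourself")
            return raw

        raise ValueError("504 Command not implemented for that parameter")

The conversion is easy precisely because the sending system,
and only the sending system, knows what its bytes actually are.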
Since you used an email analogy, let me do that too. While we
permit a MIME body part of application/octet-stream, which
basically translates into plain English as "I don't know or I'm
not willing to tell you -- your problem", we do not allow
text/plain with charset="I don't have a clue what this is or
how it was encoded, but I think it is text".

And, if the originating system knows enough to specify that a
body part is text/plain and to specify a charset, it presumably
knows enough to respond intelligently to an FTP request for
TYPE A, E, or, presumably, U, whether it knows how to do the
conversion or not. It is a little different because the sending
user or MUA might know even if the operating system doesn't,
but still... And most of the MUAs I know of will make an
educated guess and, in practice, usually get it right.

best,
  john