Re: Comments on draft-shafranovich-mime-sql-03

Yakov Shafranovich <ietf@xxxxxxxxxxx> · Tue, 5 Feb 2013 20:11:06 -0500

I would agree that in this scenario the ISO standard would not help
since it would only govern how the SQL client talks to the db server,
not when it is placed on a web server.

I think the situation is actually very similar to the one described in
RFC 6657, where there may be a conflict between the charset parameter on
the outside and the charset declared inside the payload:

   In order to improve interoperability with deployed agents, "text/*"
   media type registrations SHOULD either

   a.  specify that the "charset" parameter is not used for the defined
       subtype, because the charset information is transported inside
       the payload (such as in "text/xml"), or

   b.  require explicit unconditional inclusion of the "charset"
       parameter, eliminating the need for a default value.

   In accordance with option (a) above, registrations for "text/*" media
   types that can transport charset information inside the corresponding
   payloads (such as "text/html" and "text/xml") SHOULD NOT specify the
   use of a "charset" parameter, nor any default value, in order to
   avoid conflicting interpretations should the "charset" parameter
   value and the value specified in the payload disagree.

While this draft is in the "application/*" tree rather than "text/*",
going with option (a) would essentially drop the "charset" parameter
and, in your example, leave implementors to figure out the charset
from the payload itself.
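
As a rough sketch of what "figuring it out from the payload" could look
like for a receiver, assuming a MySQL-style dump that announces its
encoding with a SET NAMES statement (possibly wrapped in a versioned
comment, as in your Wikimedia example). The function name
sniff_sql_charset and the 4096-byte scan window are made up for
illustration, and dumps from other database tools would need their own
rules:

```python
import re

# Matches e.g. "SET NAMES utf8" or "/*!40101 SET NAMES utf8 */;"
# (assumption: the charset name is a plain identifier, optionally quoted).
_SET_NAMES = re.compile(rb"SET\s+NAMES\s+'?([A-Za-z0-9_]+)'?", re.IGNORECASE)


def sniff_sql_charset(payload: bytes, window: int = 4096):
    """Return the charset named by the first SET NAMES statement found in
    the first `window` bytes of the payload, or None if there is none."""
    match = _SET_NAMES.search(payload[:window])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Note that the value returned is the database's own name for the
encoding (for example "utf8" or "latin1"), not necessarily a registered
charset label, so a receiver would still have to map it.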

The question is what happens when the SQL file itself carries no
charset information, such as when using "mysqldump" with the
"--skip-set-charset" option. According to MySQL, UTF-8 would be used
in v5.1+ and ASCII in versions prior to that. Perhaps we should leave
"charset" as an optional parameter for cases like these, and just take
out the default value.
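
To make the "optional parameter, no default" idea concrete, here is a
minimal sketch of how a receiver might resolve the encoding. The
function names, the "application/sql" type string used in the usage
lines, and the precedence (in-payload declaration first, then the
"charset" parameter, otherwise unknown) are assumptions for
illustration and are not something the draft specifies:

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")  # None when the parameter is absent
    return value.lower() if isinstance(value, str) else None


def effective_charset(content_type: str, sniffed=None):
    """Pick an encoding without ever falling back to a default.

    `sniffed` is whatever the receiver found inside the payload, if
    anything (hypothetical input; how it is obtained is out of scope).
    """
    if sniffed:
        return sniffed.lower()
    # May be None: the encoding is then simply unknown to the receiver.
    return declared_charset(content_type)
```

With this ordering, effective_charset("application/sql") returns None
instead of silently assuming US-ASCII, and
effective_charset("application/sql; charset=UTF-8") returns "utf-8".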

Yakov

On Tue, Feb 5, 2013 at 10:05 AM, Bjoern Hoehrmann <derhoermi@xxxxxxx> wrote:
> * Yakov Shafranovich wrote:
>>[...]
>
> I am interested in this situation:
>
>   -> Someone wants to publish database contents or schema
>   -> Use DB-specific dumping tool to create .sql file
>   -> Puts .sql file on web server
>   -> Server associates .sql with proposed media type
>
>   -> Someone else downloads this resource
>   -> Checks IANA registry for the media type
>   -> Finds proposed specification
>
> Note that there is no step "publisher of .sql file ensures that the dump
> tool generates US-ASCII encoded text, or otherwise makes sure the text's
> in a single character encoding and makes sure the web server includes
> the character encoding label in the `charset` parameter of the Content-Type
> header when serving the .sql file". Experience suggests that responses
> will include no or an incorrect label and downloaders are likely to
> ignore the charset parameter even if correctly specified. However, reading
> the draft the person in the scenario above would assume that he has got
> US-ASCII encoded text, even though that's fairly unlikely, especially in
> the future given "international text" and using UTF-8 without escapes is
> becoming increasingly common.
>
> Similarly, the draft would tell him to check some ISO standard for "the
> Structured Query Language", even though most likely he should instead
> identify which database software generated the file and check the manual
> for that software to find out about all the files. As a simple example,
> the dumps from <http://dumps.wikimedia.org/> read like this:
>
>   -- MySQL dump 10.13  Distrib 5.1.66, for debian-linux-gnu (x86_64)
>   --
>   -- Host: 10.0.6.76    Database: frrwiki
>   -- ------------------------------------------------------
>   -- Server version     5.1.53-wm-log
>
>   /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
>   /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
>   /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
>   /*!40101 SET NAMES utf8 */;
>   ...
>   --
>   -- Table structure for table `category`
>   --
>
>   DROP TABLE IF EXISTS `category`;
>   /*!40101 SET @saved_cs_client     = @@character_set_client */;
>   /*!40101 SET character_set_client = utf8 */;
>   ...
>
> They do not currently use the proposed type, but if they did, you would
> have to know the format of "MySQL dump" files and what the codes in the
> comments here mean to conclude that these are actually UTF-8 encoded
> files. Google will find other examples with `character_set_client` for
> other character encodings like "latin1". The ISO standard, as far as I
> am aware, will not help you there, and neither does the US-ASCII default
> proposed in the draft.
> --
> Björn Höhrmann · mailto:bjoern@xxxxxxxxxxxx · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/