--On Tuesday, October 15, 2013 08:03 -0700 SM <sm@xxxxxxxxxxxx> wrote:

> At 01:54 15-10-2013, John C Klensin wrote:
>> My reasoning is that, while the change seems fine, the
>> precedent seems atrocious.  If this is approved via
>> Independent Stream publication and the next case that comes
>> along is, unlike this one, generally hated by the community,
>> the amount of hair-splitting required to deny that one having
>> approved this one would be impressive... and bad for the IETF.

> I was unable to see whether this specification would be of use
> to data.gov as the site is still inaccessible.  There are
> opendata sites in Brazil ( http://dados.gov.br/ ), France (
> http://www.data.gouv.fr/ ) and several other countries.  The
> specification may be relevant to opendata, which is something
> of interest to governments.  It would have been better if the
> specification had been processed in the IETF Stream, but the
> community was not interested in taking it up (see msg-id:
> CAC4RtVAeTGpHFA01YX=PS7CYeOfYFS0Sc-g3wb05USnoWyUJMQ@mail.gmail.com).
>...

As long as the question is "should the IETF approve registration of an extension whose documentation is published via the ISE", most of the above is massively irrelevant.  Hence the change in subject line and the copy to Nevil (and I'm not likely to discuss it further on this list).

From the standpoint of retrieval of information from a database (opendata or otherwise), my experience suggests that CSV fragments, especially as specified here, are going to be fairly irrelevant.  Not harmful, and probably fine for those who think they have a use for them, just useless for lots of retrieval functions.  The problem is that, in general and with a dataset of any real size, people don't think of things in terms of row and column numbers (getting them to think that way is quite error-prone).  Especially when there are _lots_ of columns, retrieval by column number is usually a bad idea.  Using an example from the I-D to construct a different one,

   http://example.com/data.csv#col="temperature"

would make a lot more sense in many cases than

   http://example.com/data.csv#col=2

But the spec doesn't allow that, and it is perhaps better handled with a query rather than a fragment (although that opens the problem of where queries are processed, which has tied the URNbis WG in knots).

It is also not unusual with statistical and scientific databases (especially non-relational ones) to have named rows as well as named columns, but, while RFC 4180 allows for "; header=present", it makes no provision for 'rowNames="present"', much less what many data analysis packages would really like to see, which would be something like 'rowNames="present, col=NN"', with the latter designating the "column" (or columns) of the CSV file in which those names appear, perhaps borrowing from regular expressions and allowing "$" and "$-1" instead of NN.  There is no reason why one couldn't have those sorts of arrangements with a CSV format, and some systems do, but it isn't the text/csv of RFC 4180.

For datasets of non-trivial size and complexity, fragments as specified here are going to be really useful only when the application retrieves what used to be called a "codebook", first uses it to change row and column identifiers or other query-supporting information into row and column numbers, and then constructs the URI with this fragment ID.
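To make that two-step dance concrete, here is a rough sketch (Python, purely illustrative; the URL, the column name, and the idea of using the header row as a stand-in for a codebook are all my own assumptions, not anything the draft specifies):

   import csv
   import io
   import urllib.request

   # Hypothetical data file and column name, for illustration only.
   DATA_URL = "http://example.com/data.csv"
   WANTED_COLUMN = "temperature"

   # Step 1: fetch the file (or a codebook describing it) and read the
   # header row to learn which column carries the wanted name.
   with urllib.request.urlopen(DATA_URL) as resp:
       text = resp.read().decode("utf-8")
   header = next(csv.reader(io.StringIO(text)))

   # Step 2: translate the name into the 1-based column number that the
   # fragment syntax actually wants.
   col_number = header.index(WANTED_COLUMN) + 1

   # Step 3: only now can the fragment URI be constructed.
   fragment_uri = "%s#col=%d" % (DATA_URL, col_number)
   print(fragment_uri)   # e.g. http://example.com/data.csv#col=2

Of course, once the application has retrieved enough of the file (or its codebook) to do that mapping, there is not much left for the fragment to do, which is rather the point.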
I suspect that won't be common, for lots of reasons, starting from the observation that many modern database management and database access technologies discourage detached codebooks.

Those types of application situations also lead to another problem with the fragment approach.  Again going back to the example in the draft, if, instead of

   date,temperature,place
   2011-01-01,1,Galway

one had

   date, time,observed-melting-point
   2011-01-01,0900.3, 0.001
   2011-01-01,0901.2, 0.002
   2011-01-01,0901.8, -0.09

then many systems would do the conversion of the last column to floating point as the data were being read (and might convert times or dates as well).  Depending on how much information was kept, responding to

   http://www.example.com/melting-points.csv#col=3

with

   0.001
   0.002
   -0.09

rather than

   0.001
   0.002
   -0.090

or "0.1 x10**-2", etc., as the spec seems to require, might require significant work.  Some processors would care, others wouldn't, but you see the problem.

Again, I don't see a big problem with this addition, although there are a number of things I'd like to see either made clearer or explicitly warned about.  I don't see arguments about what problems it doesn't solve as relevant unless extravagant claims are made for it, and the current draft avoids such claims.  But those issues are quite separate from the issue of the IESG passing responsibility for documentation and evaluation of a registration modification request (for a registration for which I believe the IESG to be the "owner") off to the ISE.

best,
   john
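p.s. A minimal illustration of the representation point above, assuming (as is common but not universal) a processor that converts numeric columns to floats on read and has discarded the original text:

   # Values as they appear in the file, kept only as text.
   original_text = ["0.001", "0.002", "-0.09"]

   # A processor that converted the column on read has lost the original
   # textual form; re-serializing need not reproduce it exactly.
   as_floats = [float(v) for v in original_text]
   print(["%.3f" % v for v in as_floats])   # ['0.001', '0.002', '-0.090']
   print(["%g" % v for v in as_floats])     # ['0.001', '0.002', '-0.09']

Which of those renderings comes back for #col=3 depends entirely on what the processor happened to keep, not on the fragment spec.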