On Tue, Oct 23, 2012 at 10:05 PM, Ian Hickson <ian@xxxxxxxx> wrote:
> On Wed, 24 Oct 2012, Manger, James H wrote:
>>
>> Currently, I don't think url.spec.whatwg.org distinguishes between
>> strings that are valid URLs and strings that can be interpreted as URLs
>> by applying its standardised error handling. Consequently, error
>> handling cannot be at the option of the software developer as you cannot
>> tell which bits are error handling.
>
> Well first, the whole point of discussions like this is to work out what
> the specs _should_ say; if the specs were perfect then there wouldn't be
> any need for discussion.

Good! Let's have a discussion about what the spec should say.

> On Tue, 23 Oct 2012, David Sheets wrote:
>>
>> One algorithm? There seem to be several functions...
>>
>> - URI reference parsing (parse : scheme -> string -> raw uri_ref)
>> - URI reference normalization (normalize : raw uri_ref -> normal uri_ref)
>> - absolute URI predicate (absp : normal uri_ref -> absolute uri_ref option)
>> - URI resolution (resolve : absolute uri_ref -> _ uri_ref -> absolute uri_ref)
>
> I don't understand what your four algorithms are supposed to be.

Ian, these are common descriptors (and function signatures). Here are
(longer) prose descriptions for those unfamiliar with standard
functional notation:

*parse* is a function which takes the contextual scheme and a string to
be parsed and produces a structure of unnormalized reference components.

*normalize* is a function which takes a structure of unnormalized
reference components and produces a structure of normalized reference
components (lower-casing the scheme, lower-casing the host for some
schemes, collapsing default ports, coercing invalid codepoints, etc).

*absp* is a function which takes a structure of normalized reference
components and potentially produces a structure of normalized reference
components which is guaranteed to be absolute (or nothing: in JS, this
roughly corresponds to nullable).
*resolve* is a function which takes a URI structure and a reference
component structure and produces a URI structure corresponding to the
reference resolution of the second argument against the first (base)
argument.

See my original message for how these compose into your one_algorithm.

> There's just one algorithm as far as I can tell -- it takes as input an
> arbitrary string and a base URL object, and returns a normalised absolute
> URL object, where a "URL object" is a conceptual construct consisting of
> the components scheme, userinfo, host, port, path, query, and
> fragment, which can be serialised together into a string form.

How is the arbitrary string deconstructed? How is the result
normalized? What constitutes an absolute reference? How does a
reference resolve against a base URI?

>> Anne's current draft increases the space of valid addresses.
>
> No, Anne hasn't finished defining conformance yet. (He just started
> today.)

This is a political dodge to delay the inevitable discussion of address
space expansion.

From what I have read of WHATWG's intentions and discussed with you and
others, you are codifying current browser behavior for
'interoperability'. Current browsers happily consume and emit URIs that
are invalid per STD 66.

<http://url.spec.whatwg.org/#writing> presently says:

"A fragment is "#", followed by any URL unit that is not one of U+0009,
U+000A, and U+000D."

This is larger than STD 66's space of valid addresses.

>>> The de facto parsing rules are already complicated by de facto
>>> requirements for handling errors, so defining those doesn't increase
>>> complexity either (especially if such behaviour is left as optional,
>>> as discussed above.)
>>
>> *parse* is separate from *normalize* is separate from checking if a
>> reference is absolute (*absp*) is separate from *resolve*.
>
> No, it doesn't have to be. That's actually a more complicated way of
> looking at it than necessary, IMHO.
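To make the decomposition concrete, here is a rough OCaml sketch of the four functions and the pipeline they compose into. This is illustrative only: the component structure is deliberately simplified, the parser and resolver are toys, and none of these names come from any spec.

```ocaml
(* Illustrative sketch. The component structure is deliberately
   minimal; a real one also carries userinfo, query, and fragment. *)
type components = {
  scheme : string option;
  host   : string option;
  port   : int option;
  path   : string;
}

(* Wrappers marking the pipeline stage a value has reached. *)
type raw    = Raw of components
type normal = Normal of components

(* parse : scheme -> string -> raw uri_ref
   Toy parser: split off a leading "scheme:" if present. A real parser
   would use the contextual scheme for scheme-specific rules and would
   also extract host, port, query, and fragment. *)
let parse _ctx_scheme s =
  match String.index_opt s ':' with
  | Some i ->
      Raw { scheme = Some (String.sub s 0 i); host = None; port = None;
            path = String.sub s (i + 1) (String.length s - i - 1) }
  | None ->
      Raw { scheme = None; host = None; port = None; path = s }

(* normalize : raw uri_ref -> normal uri_ref
   Lower-case the scheme and collapse default ports. *)
let normalize (Raw c) =
  let scheme = Option.map String.lowercase_ascii c.scheme in
  let port =
    match scheme, c.port with
    | Some "http", Some 80 | Some "https", Some 443 -> None
    | _, p -> p
  in
  Normal { c with scheme; port }

(* absp : normal uri_ref -> normal uri_ref option
   Here a reference counts as absolute iff it carries a scheme. *)
let absp (Normal c as n) =
  match c.scheme with Some _ -> Some n | None -> None

(* resolve : absolute uri_ref -> _ uri_ref -> absolute uri_ref
   Toy resolution: an absolute reference wins outright; otherwise the
   reference inherits scheme, host, and port from the base. *)
let resolve (Normal base) (Normal r) =
  match r.scheme with
  | Some _ -> Normal r
  | None   -> Normal { r with scheme = base.scheme;
                              host = base.host; port = base.port }

(* The composition: the "one algorithm" is just this pipeline. *)
let one_algorithm base ctx_scheme s =
  let r = normalize (parse ctx_scheme s) in
  match absp r with
  | Some abs -> abs
  | None     -> resolve base r
```

The point of the `raw`/`normal` wrappers is that the types record which stage a value has reached; that is exactly the information a monolithic, single-function definition obscures.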
Why use several simple, flexible, sharp tools when you could use a
single complicated, monolithic, blunt tool? Why do you insist on
producing a single, brittle, opaque function when you could produce
several simply-defined functions that actually model the data type
transformations? Vendors are, of course, always free to implement an
optimized composition for their specific use cases.

>> Why don't we have a discussion about the functions and types involved
>> in URI processing?
>>
>> Why don't we discuss expanding allowable alphabets and production
>> rules?
>
> Personally I think this kind of open-ended approach is not a good way
> to write specs.

The specs already exist and use these formalisms successfully. Why do
you think discussions about the model of the problem space are
'open-ended'? Why are you trying to stop a potentially productive
discussion?

> Better is to put forward concrete use cases, technical data, etc, and
> let the spec editor take all that into account and turn it into a
> standard.

Is <https://github.com/dsheets/ocaml-uri/blob/master/lib/uri.ml#L108>
correct, or should safe_chars_for_fragment include '#'?

Whatever 'standard' you produce will require me to exert significant
effort on your One Giant Algorithm to factor it into its proper
components and reconcile it with competing standards for my users. I
have applications that use each of the above functions separately.

> Arguing about what precise alphabets are allowed and whether to spec
> something using prose or production rules is just bikeshedding.

I can only conclude that you understand neither the value of precision
nor the meaning of "bikeshedding". You are not constructing anything
remotely comparable to a nuclear reactor. I am expressing a genuine
desire to discuss the actual technical content of the relevant
specifications in the most precise and concise way possible.

I am losing confidence in your technical leadership.

Sincerely,

David Sheets
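P.S. To make the fragment-alphabet question concrete, here is STD 66's fragment grammar written out as an OCaml character predicate. This is a sketch: '%' is accepted without checking the two hex digits that pct-encoded requires. Note that '#' is in gen-delims, not in this set, whereas the WHATWG prose quoted above admits a broader class of strings.

```ocaml
(* STD 66 (RFC 3986): fragment = *( pchar / "/" / "?" )
   where pchar = unreserved / pct-encoded / sub-delims / ":" / "@".
   Sketch only: '%' is accepted without validating the two hex
   digits that pct-encoded requires. *)
let rfc3986_fragment_char = function
  | 'A'..'Z' | 'a'..'z' | '0'..'9'
  | '-' | '.' | '_' | '~'                      (* unreserved *)
  | '!' | '$' | '&' | '\'' | '(' | ')' | '*'
  | '+' | ',' | ';' | '='                      (* sub-delims *)
  | ':' | '@'                                  (* rest of pchar *)
  | '/' | '?' -> true                          (* allowed in fragment *)
  | '%' -> true                                (* pct-encoded start *)
  | _ -> false
```

A character like '#' or a space fails this predicate, while both survive the quoted WHATWG fragment production; that gap is the address-space expansion at issue.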