Questions about RFC6455 (implementing websockets)

"A. Rothman" <amichai2@xxxxxxxxxxxx> · Mon, 2 Sep 2024 21:40:46 +0300



    Hi,

    
    I'm currently working on adding websocket support to the JLHTTP
    server. I've already read RFC6455 in full before starting to
    implement it, and once again at the end to make sure nothing was
    missed (plus lots of re-reading specific sections in the middle),
    but still have some open questions. I hope I'm writing to the
    correct mailing list, if not then apologies and please direct me to
    the correct venue :-)

    
    Along the way I also took note of things that were unclear or
    confusing (whether due to the wording of the spec or my
    misunderstanding of it). I wanted to share the issues I came across,
    both to hopefully get some answers in order to write a correct
    implementation, but also since some issues may need to be reported
    as errata or for future updates or clarifications to the spec (any
    input on if and what is appropriate would be welcome). Here are the
    issues, in order of appearance:

    
    1: 1.8

    
       "At the time of writing of this

       specification, it should be noted that connections on ports 80
    and

       443 have significantly different success rates, with connections
    on

       port 443 being significantly more likely to succeed, though this
    may

       change with time."

    
       I could neither understand what this is trying to say (Is this a
    general statistic about port-scanning? Or a fun fact specific to
    websocket deployments?) nor why "it should be noted" - how is it
    relevant to the websocket spec? Would removing this sentence
    entirely make any difference in understanding, implementing or using
    websockets? I'm not sure if I'm missing an important point or this
    is really redundant.

    
    2: 4.2.1.3

    
       "An |Upgrade| header field containing the value "websocket""

    
       Does "containing" imply it can have other elements as well (as
    the Upgrade field is normally defined), or does it mean the entire
    value must exactly equal "websocket"? The former makes more sense,
    being standard, but I've seen interpretations both ways, and mention
    of clients that reject anything but the single exact value. Perhaps
    this can be clarified in the spec.

    
    3: 4.2.2.4 (regarding /origin/)

    
       "For more detail, refer to Section 10"

    
       Shouldn't this be 10.2 ("Origin Considerations")?

    
    4: 4.2.2.4 (regarding /key/)

    
      "It is not necessary for the server to base64-decode the
    |Sec-WebSocket-Key| value."

    
      Although technically correct, this is somewhere between a
    half-truth and misleading: section 4.2.1 states that

    
      "If the server, while reading the handshake, finds that the client
    did

       not send a handshake that matches the description below [...]

       the server MUST stop processing the client's handshake and return
    an HTTP

       response with an appropriate error code"

    
       and the "description below" of the key field is:

    
       "A |Sec-WebSocket-Key| header field with a base64-encoded (see

       Section 4 of [RFC4648]) value that, when decoded, is 16 bytes in

       length."

    
       So, although technically the base64 value does not need to be
    decoded, the server still MUST ensure it's a valid base64 string (to
    match the 'base64-encoded' description), and does have to check the
    padding bytes along with the total encoded string length to make
    sure the decoded base64 value is exactly 16 bytes in length (to
    match the "16 bytes in length" description). By this point you're
    only a few bit shifts away from fully decoding the value :-) In all
    but the most highly optimized implementations, these checks would
    most likely be implemented by just base-64 decoding the value,
    rather than implementing a separate 'base64 character validity,
    padding and length checker'. Or they would ignore/miss this
    requirement, due to the misleading "it is not necessary to decode"
    statement (as I did at first), and not perform this validation at
    all, which would violate the MUST in 4.2.1.

    
       I think either the misleading statement should be removed, or the
    description of the header read by the server should be loosened to
    just checking that the header exists and not describe its contents
    (thus allowing the server to indeed ignore it being base64 at all),
    or leave both as-is, but add an explicit exception to the rule in
    this header's description stating the value itself, as described,
    does not need to be validated at all.

    
    5: 4.2.2.5.4

    
      The Sec-WebSocket-Accept ABNF allows base64 of any length - why
    not limit it to the exact length actually required by this header
    (encoded SHA1 result length?)? Or at least add an ABNF comment about
    the required length (as is done in various other ABNF definitions in
    the spec)?

    
    6: 5.1

    
      "a client MUST mask all frames that it sends to the server" and

      "A server MUST NOT mask any frames that it sends to the client"

    
      However in 5.2, referring to the Mask field definition, it only
    says:

    
      "All frames sent from client to server have this bit set to 1."

    
      I think it should also say "All frames sent from server to
      client have this bit set to 0". It's strange to mention the
      requirement in one direction and not the other here, even though
      both are MUST (NOT) earlier. It would make sense to mention either
      both directions or none, not just one.
    7.1.5
      "_The WebSocket Connection Close Code_ is defined as the status
    code (Section
      7.4) contained in

      the first Close control frame received by the application
    implementing this protocol."

      "the first Close control frame received" implies there can be
      more than one (also in 7.1.6)? What scenario is this? Why not
      explicitly disallow sending more than one? How is the receiver
      supposed to react to the second one?
    7: 5.2

    
      frame-opcode            = frame-opcode-non-control /

                                frame-opcode-control /

                                frame-opcode-cont

      frame-opcode-cont       = %x0 ; frame continuation

      frame-opcode-non-control= %x1 ; text frame

                              / %x2 ; binary frame

                              / %x3-7

                              ; 4 bits in length,

                              ; reserved for further non-control frames

      frame-opcode-control    = %x8 ; connection close

                              / %x9 ; ping

                              / %xA ; pong

                              / %xB-F ; reserved for further control

                                      ; frames

                                      ; 4 bits in length

    
      These should all be 4 bits in length, but oddly it seems to
    mention this only for the reserved bit values - first of all, the
    indentation and order of comments is different between
    frame-opcode-non-control and frame-opcode-control, which is
    unnecessarily confusing, but also the length comment is between the
    reserved values and the "reserved for further non-control frames"
    comment (which implies it belongs only to those fields), and is
    missing entirely for frame-opcode-cont. One possible solution is to
    remove both length comments, and just add it instead to
    frame-opcode. Or, add it to the three individual types, but as a
    last comment in the indentation level of all possible values (not
    just reserved). [ The indentation in this email may get screwed up,
    so see the original RFC to see what I mean. ]

    
    8: 5.5.1

    
      "the endpoint typically echos the status code it received"

    
      should this be "the status code and reason"?

    
    9: 6.1.1

    
      "If at any point the state of the WebSocket connection changes,
    the endpoint MUST abort the following steps."

    
      It isn't entirely clear what "at any point" means here - in
    between the listed steps? within a step? more specifically, in step
    7 it transmits potentially multiple frames - is that an atomic
    operation? or MUST it abort also in between frames/fragments? or
    MUST it abort in the middle of a single (possibly very long) frame?
    This has significant implications to what happens next e.g. if
    aborted in the middle of a frame, or even in between frames in a
    multi-frame message, the framing protocol is then broken and no
    close frame can be sent, etc., whereas if step 7 is all-or-nothing
    with regards to the MUST abort, then the closing can continue
    gracefully and no data is lost). This should really be clarified, as
    this is a MUST that can be interpreted in significantly different
    ways.

    
    10: 7.1.7

    
      "An endpoint MUST NOT continue to attempt to process data
    (including a responding Close frame)"

    
      So maybe it should say "to process frames" instead of "to process
    data" (since data frames are explicitly defined, use of "data" here
    is confusing. Is a ping frame "data"?).

    
    11: 7.2.2

    
      "Certain algorithms require or recommend that the server _Abort
    the

      WebSocket Connection_ during the opening handshake.  To do so, the

      server MUST simply _Close the WebSocket Connection_"

    
      1 - The term _Abort the WebSocket Connection_ that is defined here
    doesn't appear anywhere else in the RFC. Searching for the word
    "abort" on its own, there is

      "abort these steps", "abort the connection attempt", "abort the
    connection", "abort the WebSocket handshake". It would be useful to
    use the defined term wherever it was intended to be used (I'm not
    sure which of the above), to disambiguate this MUST from
    coincidental use of the word "abort" in the prose.

    
      2 - More generally, it's not entirely clear why it would be
    required to close the socket in any case if the handshake fails,
    whether according to this section or other sections. To my
    understanding the handshake is still using the HTTP protocol, so
    sending an error status code should be enough, and the client should
    be free to reuse the connection to try again, or to send unrelated
    non-websocket requests for other HTTP resources on the same TCP
    connection. Specifically, section 4.4 describes such a scenario for
    protocol version negotiations - why should it be necessary to close
    the connection in the middle before the second request with the
    corrected version number is sent? (section 4.2.2, under /version/,
    requires aborting the handshake).

    
      3 - Finally, different headers and conditions use different
    terminology in discussing error conditions, instead of being limited
    to the well-defined terms. I think it would be super helpful to make
    more precise use of the overloaded terms 'close', 'fail', 'abort'
    throughout the spec, use the well-defined terms everywhere they are
    needed and avoid using the generic English words where they don't
    have the same well-defined meaning. I get that there was an
    intention to do so, such as in 7.2.2, but it didn't go all the way
    in being careful how those words are used elsewhere, resulting in
    confusion in what to actually do in each error condition (and there
    are lots of them in the spec).

    
    12: 7.4.1

    
      "1011 indicates that a server is terminating the connection
    because

      it encountered an unexpected condition that prevented it from

      fulfilling the request."

    
      There is no definition or even concept of 'request' in the
    websocket framing protocol (other than the HTTP handshake request,
    which would return an HTTP status code and not this). The websocket
    protocol is about a bi-directional stream of frames and messages,
    not a request-response model.

    
      I'm guessing maybe this is meant to be some sort of
    application-level thing if an application decides to use a
    request-response model on top of websockets? but that's beyond the
    scope of this specification. In short, I think the terminology
    and/or intention of when to use this status code should be
    clarified, because as it is currently phrased it is undefined.

    
    13: 9.1

    
      "If a value is received by either the client or

      the server during negotiation that does not conform to the ABNF

      below, the recipient of such malformed data MUST immediately _Fail

      the WebSocket Connection_."

    
      This is another specific instance of #11 above - it is strange to
    Fail the websocket connection (which entails closing the socket) for
    a malformed HTTP header... if Sec-WebSocket-Extensions does not
    conform to ABNF, shouldn't it just return a HTTP 400 status code and
    move on to the next request? Some headers require disconnecting and
    some don't... it's strange.

    
    I hope this was clear enough, and somebody in the know can help me
    figure these out and/or/if how else to report them to improve the
    spec.

    
    Many thanks!

    
    Amichai