I'm currently working on adding websocket support to the JLHTTP server. I've already read RFC6455 in full before starting to implement it, and once again at the end to make sure nothing was missed (plus lots of re-reading specific sections in the middle), but still have some open questions. I hope I'm writing to the correct mailing list, if not then apologies and please direct me to the correct venue :-)
Along the way I also took note of things that were unclear or confusing (whether due to the wording of the spec or my misunderstanding of it). I wanted to share the issues I came across, both to hopefully get some answers in order to write a correct implementation, but also since some issues may need to be reported as errata or for future updates or clarifications to the spec (any input on if and what is appropriate would be welcome). Here are the issues, in order of appearance:
1: 1.8
"At the time of writing of this
specification, it should be noted that connections on ports 80 and
443 have significantly different success rates, with connections on
port 443 being significantly more likely to succeed, though this may
change with time."
I could neither understand what this is trying to say (Is this a general statistic about port-scanning? Or a fun fact specific to websocket deployments?) nor why "it should be noted" - how is it relevant to the websocket spec? Would removing this sentence entirely make any difference in understanding, implementing or using websockets? I'm not sure if I'm missing an important point or this is really redundant.
2: 4.2.1.3
"An |Upgrade| header field containing the value "websocket""
Does "containing" imply it can have other elements as well (as the Upgrade field is normally defined), or does it mean the entire value must exactly equal "websocket"? The former makes more sense, being standard, but I've seen interpretations both ways, and mention of clients that reject anything but the single exact value. Perhaps this can be clarified in the spec.
3: 4.2.2.4 (regarding /origin/)
"For more detail, refer to Section 10"
Shouldn't this be 10.2 ("Origin Considerations")?
4: 4.2.2.4 (regarding /key/)
"It is not necessary for the server to base64-decode the |Sec-WebSocket-Key| value."
Although technically correct, this is somewhere between a half-truth and misleading: section 4.2.1 states that
"If the server, while reading the handshake, finds that the client did
not send a handshake that matches the description below [...]
the server MUST stop processing the client's handshake and return an HTTP
response with an appropriate error code"
and the "description below" of the key field is:
"A |Sec-WebSocket-Key| header field with a base64-encoded (see
Section 4 of [RFC4648]) value that, when decoded, is 16 bytes in
length."
So, although technically the base64 value does not need to be decoded, the server still MUST ensure it's a valid base64 string (to match the 'base64-encoded' description), and does have to check the padding bytes along with the total encoded string length to make sure the decoded base64 value is exactly 16 bytes in length (to match the "16 bytes in length" description). By this point you're only a few bit shifts away from fully decoding the value :-) In all but the most highly optimized implementations, these checks would most likely be implemented by just base-64 decoding the value, rather than implementing a separate 'base64 character validity, padding and length checker'. Or they would ignore/miss this requirement, due to the misleading "it is not necessary to decode" statement (as I did at first), and not perform this validation at all, which would violate the MUST in 4.2.1.
I think either the misleading statement should be removed, or the description of the header read by the server should be loosened to just checking that the header exists and not describe its contents (thus allowing the server to indeed ignore it being base64 at all), or leave both as-is, but add an explicit exception to the rule in this header's description stating the value itself, as described, does not need to be validated at all.
5: 4.2.2.5.4
The Sec-WebSocket-Accept ABNF allows base64 of any length - why not limit it to the exact length actually required by this header (encoded SHA1 result length?)? Or at least add an ABNF comment about the required length (as is done in various other ABNF definitions in the spec)?
6: 5.1
"a client MUST mask all frames that it sends to the server" and
"A server MUST NOT mask any frames that it sends to the client"
However in 5.2, referring to the Mask field definition, it only says:
"All frames sent from client to server have this bit set to 1."
I think it should also say "All frames sent from server to client have this bit set to 0". It's strange to mention the requirement in one direction and not the other here, even though both are MUST (NOT) earlier. It would make sense to mention either both directions or none, not just one.
7.1.5
"_The WebSocket Connection Close Code_ is defined as the status code (Section 7.4) contained inthe first Close control frame received by the application implementing this protocol."
"the first Close control frame received" implies there can be more than one (also in 7.1.6)? What scenario is this? Why not explicitly disallow sending more than one? How is the receiver supposed to react to the second one?
7: 5.2frame-opcode = frame-opcode-non-control /
frame-opcode-control /
frame-opcode-cont
frame-opcode-cont = %x0 ; frame continuation
frame-opcode-non-control= %x1 ; text frame
/ %x2 ; binary frame
/ %x3-7
; 4 bits in length,
; reserved for further non-control frames
frame-opcode-control = %x8 ; connection close
/ %x9 ; ping
/ %xA ; pong
/ %xB-F ; reserved for further control
; frames
; 4 bits in length
These should all be 4 bits in length, but oddly it seems to mention this only for the reserved bit values - first of all, the indentation and order of comments is different between frame-opcode-non-control and frame-opcode-control, which is unnecessarily confusing, but also the length comment is between the reserved values and the "reserved for further non-control frames" comment (which implies it belongs only to those fields), and is missing entirely for frame-opcode-cont. One possible solution is to remove both length comments, and just add it instead to frame-opcode. Or, add it to the three individual types, but as a last comment in the indentation level of all possible values (not just reserved). [ The indentation in this email may get screwed up, so see the original RFC to see what I mean. ]
8: 5.5.1
"the endpoint typically echos the status code it received"
should this be "the status code and reason"?
9: 6.1.1
"If at any point the state of the WebSocket connection changes, the endpoint MUST abort the following steps."
It isn't entirely clear what "at any point" means here - in between the listed steps? within a step? more specifically, in step 7 it transmits potentially multiple frames - is that an atomic operation? or MUST it abort also in between frames/fragments? or MUST it abort in the middle of a single (possibly very long) frame? This has significant implications to what happens next e.g. if aborted in the middle of a frame, or even in between frames in a multi-frame message, the framing protocol is then broken and no close frame can be sent, etc., whereas if step 7 is all-or-nothing with regards to the MUST abort, then the closing can continue gracefully and no data is lost). This should really be clarified, as this is a MUST that can be interpreted in significantly different ways.
10: 7.1.7
"An endpoint MUST NOT continue to attempt to process data (including a responding Close frame)"
So maybe it should say "to process frames" instead of "to process data" (since data frames are explicitly defined, use of "data" here is confusing. Is a ping frame "data"?).
11: 7.2.2
"Certain algorithms require or recommend that the server _Abort the
WebSocket Connection_ during the opening handshake. To do so, the
server MUST simply _Close the WebSocket Connection_"
1 - The term _Abort the WebSocket Connection_ that is defined here doesn't appear anywhere else in the RFC. Searching for the word "abort" on its own, there is
"abort these steps", "abort the connection attempt", "abort the connection", "abort the WebSocket handshake". It would be useful to use the defined term wherever it was intended to be used (I'm not sure which of the above), to disambiguate this MUST from coincidental use of the word "abort" in the prose.
2 - More generally, it's not entirely clear why it would be required to close the socket in any case if the handshake fails, whether according to this section or other sections. To my understanding the handshake is still using the HTTP protocol, so sending an error status code should be enough, and the client should be free to reuse the connection to try again, or to send unrelated non-websocket requests for other HTTP resources on the same TCP connection. Specifically, section 4.4 describes such a scenario for protocol version negotiations - why should it be necessary to close the connection in the middle before the second request with the corrected version number is sent? (section 4.2.2, under /version/, requires aborting the handshake).
3 - Finally, different headers and conditions use different terminology in discussing error conditions, instead of being limited to the well-defined terms. I think it would be super helpful to make more precise use of the overloaded terms 'close', 'fail', 'abort' throughout the spec, use the well-defined terms everywhere they are needed and avoid using the generic English words where they don't have the same well-defined meaning. I get that there was an intention to do so, such as in 7.2.2, but it didn't go all the way in being careful how those words are used elsewhere, resulting in confusion in what to actually do in each error condition (and there are lots of them in the spec).
12: 7.4.1
"1011 indicates that a server is terminating the connection because
it encountered an unexpected condition that prevented it from
fulfilling the request."
There is no definition or even concept of 'request' in the websocket framing protocol (other than the HTTP handshake request, which would return an HTTP status code and not this). The websocket protocol is about a bi-directional stream of frames and messages, not a request-response model.
I'm guessing maybe this is meant to be some sort of application-level thing if an application decides to use a request-response model on top of websockets? but that's beyond the scope of this specification. In short, I think the terminology and/or intention of when to use this status code should be clarified, because as it is currently phrased it is undefined.
13: 9.1
"If a value is received by either the client or
the server during negotiation that does not conform to the ABNF
below, the recipient of such malformed data MUST immediately _Fail
the WebSocket Connection_."
This is another specific instance of #11 above - it is strange to Fail the websocket connection (which entails closing the socket) for a malformed HTTP header... if Sec-WebSocket-Extensions does not conform to ABNF, shouldn't it just return a HTTP 400 status code and move on to the next request? Some headers require disconnecting and some don't... it's strange.
I hope this was clear enough, and somebody in the know can help me figure these out and/or/if how else to report them to improve the spec.
Many thanks!
Amichai