On 17/09/16 16:11, Chad Phillips wrote:
> > Was this packet capture done on the client side or the server side
> > or somewhere in the middle? There appears to be some messages
> > missing. In particular I don't see any CCS or Finished messages
> > being exchanged. Is the network this is over potentially noisy that
> > might explain packet loss?
>
> From the perspective of the DTLS handshake, my server hosting the
> Licode library is the client, and latest stable Chrome browser is the
> server, if I understand the terminology correctly. The packet capture
> was taken from the client (Licode) side.
>
> Would the CCS or Finished messages have gotten filtered out by the
> "dtls" filter I applied to the packet capture? I do have the full
> trace and can re-filter to just one complete connection over a
> specific UDP port as you suggested, let me know if that would be
> helpful

I took another look at the packet trace. I found the CCS/Finished
messages! They are actually there but Wireshark is not showing them for
some reason (at least my version of Wireshark isn't). At the end of the
packet which contains three Certificate fragments, the
ClientKeyExchange and the CertificateVerify, my Wireshark then says
"Malformed Packet". This is in relation to a load of data that is in
the packet after the CertificateVerify. Looking at it by eye the
packets look well formed so I'm not sure why Wireshark is complaining.
Anyway, after the CertificateVerify I can see the CCS, and an encrypted
handshake message - which will be the Finished.

What is odd is that we are seeing 3 Certificate fragments and 2
CertificateVerify fragments in a single network packet. OpenSSL will
only fragment if it thinks the MTU isn't big enough for anything
larger. It looks like Licode is then combining the multiple fragments
into a single packet anyway. This is probably something to do with the
way Licode is written meaning that OpenSSL is not getting the right MTU
value.
I assume the Licode developers are trying to compensate for that by
sending it all in one go anyway. That shouldn't cause any problems -
but it's a bit odd and it would be better to make sure OpenSSL gets the
right MTU in the first place.

I speculate that the reason I'm seeing the "malformed packet" is that,
normally, you'd only see a maximum of 5 DTLS handshake records in a
single packet. However, because we have fragmented the Certificate and
CertificateVerify messages we've got more than 5 DTLS records in a
packet. My guess is that there is a bug in Wireshark that fails if it
gets more than 5 records in one go. But that really is pure guesswork.

> I see these failures only in situations where browser users with slow
> and/or lossy internet are joining, and usually when the group size
> gets to be six or more participants. The particular testing scenario
> that generated the packets you saw was a user with 225kbps upload,
> 5120kbps down, 70ms delay, 0% packet loss.
>
> I'll grant you those network conditions aren't the best for group
> video chat, but if Google Hangouts can pull it off, I'd like to as
> well.
>
> > On receiving that the client should respond with a retransmit of
> > the Certificate/ClientKeyExchange/CertificateVerify/CCS/Finished
> > flight of messages. But it does not appear to do so - the
> > retransmit does not happen until after the encrypted alert.
>
> This sounds like it might be a bug in the Licode library, not
> resending the retransmit properly?

Possibly. It could be that or it could also feasibly be a bug in
OpenSSL. However I have a theory that might explain it (but it is just
a theory).

DTLS uses a timer to retransmit messages that may have got lost. If it
hasn't had the response it expects following the last set of messages
it sent by the time the timer expires, then it retransmits them.
I wrote above that "On receiving that the client should respond with a
retransmit of the Certificate/ClientKeyExchange/CertificateVerify/CCS/
Finished flight of messages". That's a bit of an over-simplification.
What actually happens is the peer is waiting for some messages that it
doesn't receive within its timeout (either because they are lost or
delayed), and so it retransmits its own last flight of messages. This
is why we see the second set of ServerHello (etc) messages from the
server.

The client application usually then notices that the socket has become
readable and calls OpenSSL to process that data. OpenSSL reads it,
realises that they are retransmits of messages that it has already
processed and drops them. It also checks its own timer to see if the
client needs to retransmit any messages. If the client timer hasn't
expired yet then it does nothing. This could be why no retransmit
happens immediately after the second set of ServerHello messages, i.e.
the client timer hasn't expired yet.

Normally the server would continue to retransmit periodically, which
would cause the client application to try and process the retransmits
again, and eventually the client timer will have expired and it will
retransmit its last set of messages again. However in this case it
seems the server only retransmits once and then doesn't try again.
Perhaps BoringSSL has a different retransmit policy to us - I'm not
sure.

This means though that, unless something else happens, no further data
will be received by the client until it retransmits its last set of
messages...but with no further data being received the application
never calls OpenSSL again to cause it to check its timer and make the
retransmits happen! Therefore the client is sat there waiting for the
server to send it something...and the server is also sat there waiting
for something to happen.
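
To make that stall concrete, here is a toy model of it in C. It is only
a sketch of the theory above: none of the names (toy_peer,
toy_check_timer, toy_run) are OpenSSL APIs, and the tick values and the
doubling back-off are made up for illustration. The point is that the
client's timer is only examined when the application calls into the
library, so if that only happens when a datagram arrives, a single
early arrival is not enough to trigger a retransmit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not OpenSSL APIs. */
struct toy_peer {
    int expires_at;   /* tick at which the retransmit timer expires */
    int interval;     /* current timer interval, doubled per retransmit */
    int retransmits;  /* flights retransmitted so far */
};

/* The timer is only looked at when the application calls in here. */
static void toy_check_timer(struct toy_peer *p, int now)
{
    assert(p != NULL);
    if (now < p->expires_at)
        return;                       /* not expired yet: do nothing */
    p->retransmits++;                 /* expired: resend the last flight */
    p->interval *= 2;                 /* back off before the next attempt */
    p->expires_at = now + p->interval;
}

/*
 * Run the client for `ticks` ticks. `arrivals` lists the ticks at which
 * a datagram from the server makes the socket readable. If `poll_timer`
 * is false the application only calls the library on those ticks; if it
 * is true the application also checks the timer on every tick.
 * Returns the number of retransmits the client performed.
 */
static int toy_run(const int *arrivals, size_t n_arrivals, int ticks,
                   bool poll_timer)
{
    struct toy_peer client = { .expires_at = 10, .interval = 10,
                               .retransmits = 0 };

    for (int now = 0; now < ticks; now++) {
        bool readable = false;

        for (size_t i = 0; i < n_arrivals; i++)
            if (arrivals[i] == now)
                readable = true;

        if (readable || poll_timer)
            toy_check_timer(&client, now);
    }
    return client.retransmits;
}
```

With one arrival at tick 5 (the server's single retransmit, before the
client timer expires at tick 10) and poll_timer false, toy_run()
returns 0: nothing ever prompts the client to look at its timer again.
With poll_timer true the same scenario gives a non-zero count.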
There is an OpenSSL API which is intended to resolve this issue:

DTLSv1_handle_timeout()

The application is expected to call this periodically during the
handshake if no other data has been sent or received. This causes
OpenSSL to check its timer and do any retransmits if necessary. If
Licode doesn't call this, then it's plausible that this is the cause of
the issue.

Unfortunately there is at least one OpenSSL bug here - in the
documentation: DTLSv1_handle_timeout() is undocumented :-(

Matt