On 25/09/2019 18:41, Ian Sinclair wrote: > Thanks for the detailed investigation. I don't know if we have a BIO callback or > modified any BIO code. I'll have to dig into our version of Asterisk to see if I > can tell. > > The version confusion is mine. We really are running 1.0.1e 58 from Centos6. I > got crossed up because when I checked the documentation it only goes back to > 1.0.2, so that got into my head. > > If we have to rebuild the code, even just to debug the external cause (I expect > there is something funny being fed to openSSL) is it going to be compatible for > me to use the 1.0.2 source code or are there a bunch of related packages that I > would need to upgrade in parallel? >From an OpenSSL perspective 1.0.2 should be a drop in replacement for 1.0.1. However if you are using the system OpenSSL package, then upgrading that could have other impacts - but that's not something I can really advise on. Matt > > Thanks. > -------------------------------------------------------------------------------- > *From:* Matt Caswell <matt@xxxxxxxxxxx> > *Sent:* September 25, 2019 9:49 AM > *To:* Ian Sinclair <ian.sinclair@xxxxxxxxxxxxx>; openssl-users@xxxxxxxxxxx > <openssl-users@xxxxxxxxxxx> > *Subject:* Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == > (unsigned int)ret); > > > > On 24/09/2019 14:11, Ian Sinclair wrote: >> I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On >> one site we're getting a crash from dtls1_do_write and as far as I can tell it's >> from the assertion coded: >> >> /* bad if this assert fails, only part of the handshake >> * message got sent. but why would this happen? */ >> OPENSSL_assert(len == (unsigned int)ret); >> >> My question is the same as some previous author - why would this happen? > > I've taken a look at the code in this area to try and figure out a reason. > > I note that the code before this looks like this: > > ret = dtls1_write_bytes(s, type, &s->init_buf->data[s->init_off], > len); > if (ret < 0) { > ....omitted... > } else { > > /* > * bad if this assert fails, only part of the handshake message > * got sent. but why would this happen? > */ > OPENSSL_assert(len == (unsigned int)ret); > > Most significantly I notice that this doesn't seem to handle the ret == 0 case - > which is normally used to indicate an error. So looking at dtls1_write_bytes we > should check whether there are any cases where it could return 0. Well that > function doesn't actually do very much except call do_dtls1_write() and return > the result of that. > > There are a few possible points where that function could return a value: > > 1) Returns the result of ssl3_write_pending() > 2) Returns the result of ssl_dispatch_alert(). Following the logic of that we > find that it ends up returning the result of a recursive do_dtls1_write() call - > so we can ignore this one. > 3) If len == 0 and !create_empty_fragment then we return 0. If that is the case > then len == 0 and ret == 0 so the assertion would not be hit, so we can ignore this. > 4) if create_empty_fragment is true we return wr->length. But there are no calls > of do_dtls1_write where create_empty_fragment is true anywhere in the code. This > is actually dead code (we should remove this). We can ignore this. > 5) On error we return -1, which we can ignore. > > So it seems do_dtls1_write() will only return 0 if ssl3_write_pending() does. > > Looking at that function there are 3 possible points where it can return: > > 1) On error it returns -1, so we can ignore that. > 2) In some circumstances it can return s->wpend_ret. wpend_return is set in > do_dtls1_write to the value of len, so the only way it could be 0 is if len is, > which would not cause the assertion to fail so we can ignore this. > 3) In other circumstances it returns the result of a BIO_write() call > > So the only way I can see for a 0 return to come back is if BIO_write() returns > that. Looking at the implementation of BIO_write() it will return 0 if: > > - the BIO is NULL. But there is a check for this in ssl3_write_pending() so we > know this will never be the case. > - If a BIO callback has been set (via BIO_set_callback()) then it will return > the result from that > - Otherwise the return value is the result of a write call from the underlying > BIO implementation. > > In DTLS the underlying BIO implementation is usually a BIO_s_datagram(). That > write function will return the value from a system call to send or sendto. The > man pages indicate this returns -1 on error, or otherwise the number of bytes > sent - so I don't see how we could ever get 0 here. > > So, the summary of all of that is, I think we could get a situation where the > assertion is triggered if: > > 1) You are using BIO_set_callback() and the callback you set returns 0 > 2) You are using some source/sink BIO other than BIO_s_datagram() and it > sometimes returns 0 from its write call. > 3) You are using a filter BIO and that returns 0 during a write call > > Other possibilities that spring to mind are if you are using a custom BIO that > doesn't behave well and returns a positive value from its write call that is not > equal to the number of bytes it was asked to write. > >> >> Is there any meaningful way I can debug this? Some flag I can set that will show >> the DTLS packets to try to find a cause? Some way to get rid of the assertion so >> that the failure doesn't take down the whole system, because currently the >> assertion causes a reboot? > > In later releases I notice that this "hard" assertion has been converted to a > soft assertion - i.e. it only crashes in a debug build. Otherwise it just > returns -1 to indicate an error. So the equivalent of changing the code to look > like this in a production build: > > /* > * bad if this assert fails, only part of the handshake message > * got sent. but why would this happen? > */ > if (len != (unsigned int)ret) > return -1; > > >> It's happening on an end customer site so building a >> debug load isn't particularly viable, but if that's the only option tell me how. >> >> Is this a known problem that is only fixed as a non-security fix in a later >> release? > > It's not a know problem as far as I can remember. But on the other hand there > have been a lot of DTLS bug fixes that have gone in between 1.0.2 and 1.1.1. > > Note that 1.0.2 is approaching EOL (at the end of this year), so is only > receiving security fixes. Since any fix we would apply doesn't sound like a > security fix it wouldn't get backported to 1.0.2. > >> We are current for the release, I believe, with v1.0.1e 58.el6_10. > > I'm slightly confused here you're talking about 1.0.1e but above you said 1.02 > above. > > Matt