Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

Matt Caswell <matt@xxxxxxxxxxx> · Thu, 26 Sep 2019 09:49:18 +0100



On 25/09/2019 18:41, Ian Sinclair wrote:
> Thanks for the detailed investigation. I don't know if we have a BIO callback or
> modified any BIO code. I'll have to dig into our version of Asterisk to see if I
> can tell.
> 
> The version confusion is mine. We really are running 1.0.1e 58 from Centos6. I
> got crossed up because when I checked the documentation it only goes back to
> 1.0.2, so that got into my head.
> 
> If we have to rebuild the code, even just to debug the external cause (I expect
> there is something funny being fed to openSSL) is it going to be compatible for
> me to use the 1.0.2 source code or are there a bunch of related packages that I
> would need to upgrade in parallel? 

>From an OpenSSL perspective 1.0.2 should be a drop in replacement for 1.0.1.
However if you are using the system OpenSSL package, then upgrading that could
have other impacts - but that's not something I can really advise on.

Matt


> 
> Thanks.
> --------------------------------------------------------------------------------
> *From:* Matt Caswell <matt@xxxxxxxxxxx>
> *Sent:* September 25, 2019 9:49 AM
> *To:* Ian Sinclair <ian.sinclair@xxxxxxxxxxxxx>; openssl-users@xxxxxxxxxxx
> <openssl-users@xxxxxxxxxxx>
> *Subject:* Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len ==
> (unsigned int)ret);
>  
> 
> 
> On 24/09/2019 14:11, Ian Sinclair wrote:
>> I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On
>> one site we're getting a crash from dtls1_do_write and as far as I can tell it's
>> from the assertion coded:
>> 
>>   /* bad if this assert fails, only part of the handshake
>>    * message got sent.  but why would this happen? */
>>   OPENSSL_assert(len == (unsigned int)ret);
>> 
>> My question is the same as some previous author - why would this happen?
> 
> I've taken a look at the code in this area to try and figure out a reason.
> 
> I note that the code before this looks like this:
> 
>         ret = dtls1_write_bytes(s, type, &s->init_buf->data[s->init_off],
>                                 len);
>         if (ret < 0) {
>                 ....omitted...
>         } else {
> 
>             /*
>              * bad if this assert fails, only part of the handshake message
>              * got sent.  but why would this happen?
>              */
>             OPENSSL_assert(len == (unsigned int)ret);
> 
> Most significantly I notice that this doesn't seem to handle the ret == 0 case -
> which is normally used to indicate an error. So looking at dtls1_write_bytes we
> should check whether there are any cases where it could return 0. Well that
> function doesn't actually do very much except call do_dtls1_write() and return
> the result of that.
> 
> There are a few possible points where that function could return a value:
> 
> 1) Returns the result of ssl3_write_pending()
> 2) Returns the result of ssl_dispatch_alert(). Following the logic of that we
> find that it ends up returning the result of a recursive do_dtls1_write() call -
> so we can ignore this one.
> 3) If len == 0 and !create_empty_fragment then we return 0. If that is the case
> then len == 0 and ret == 0 so the assertion would not be hit, so we can ignore this.
> 4) if create_empty_fragment is true we return wr->length. But there are no calls
> of do_dtls1_write where create_empty_fragment is true anywhere in the code. This
> is actually dead code (we should remove this). We can ignore this.
> 5) On error we return -1, which we can ignore.
> 
> So it seems do_dtls1_write() will only return 0 if ssl3_write_pending() does.
> 
> Looking at that function there are 3 possible points where it can return:
> 
> 1) On error it returns -1, so we can ignore that.
> 2) In some circumstances it can return s->wpend_ret. wpend_return is set in
> do_dtls1_write to the value of len, so the only way it could be 0 is if len is,
> which would not cause the assertion to fail so we can ignore this.
> 3) In other circumstances it returns the result of a BIO_write() call
> 
> So the only way I can see for a 0 return to come back is if BIO_write() returns
> that. Looking at the implementation of BIO_write() it will return 0 if:
> 
> - the BIO is NULL. But there is a check for this in ssl3_write_pending() so we
> know this will never be the case.
> - If a BIO callback has been set (via BIO_set_callback()) then it will return
> the result from that
> - Otherwise the return value is the result of a write call from the underlying
> BIO implementation.
> 
> In DTLS the underlying BIO implementation is usually a BIO_s_datagram(). That
> write function will return the value from a system call to send or sendto. The
> man pages indicate this returns -1 on error, or otherwise the number of bytes
> sent - so I don't see how we could ever get 0 here.
> 
> So, the summary of all of that is, I think we could get a situation where the
> assertion is triggered if:
> 
> 1) You are using BIO_set_callback() and the callback you set returns 0
> 2) You are using some source/sink BIO other than BIO_s_datagram() and it
> sometimes returns 0 from its write call.
> 3) You are using a filter BIO and that returns 0 during a write call
> 
> Other possibilities that spring to mind are if you are using a custom BIO that
> doesn't behave well and returns a positive value from its write call that is not
> equal to the number of bytes it was asked to write.
> 
>> 
>> Is there any meaningful way I can debug this? Some flag I can set that will show
>> the DTLS packets to try to find a cause? Some way to get rid of the assertion so
>> that the failure doesn't take down the whole system, because currently the
>> assertion causes a reboot?
> 
> In later releases I notice that this "hard" assertion has been converted to a
> soft assertion - i.e. it only crashes in a debug build. Otherwise it just
> returns -1 to indicate an error. So the equivalent of changing the code to look
> like this in a production build:
> 
>             /*
>              * bad if this assert fails, only part of the handshake message
>              * got sent.  but why would this happen?
>              */
>             if (len != (unsigned int)ret)
>                 return -1;
> 
> 
>> It's happening on an end customer site so building a
>> debug load isn't particularly viable, but if that's the only option tell me how.
>> 
>> Is this a known problem that is only fixed as a non-security fix in a later
>> release?
> 
> It's not a know problem as far as I can remember. But on the other hand there
> have been a lot of DTLS bug fixes that have gone in between 1.0.2 and 1.1.1.
> 
> Note that 1.0.2 is approaching EOL (at the end of this year), so is only
> receiving security fixes. Since any fix we would apply doesn't sound like a
> security fix it wouldn't get backported to 1.0.2.
> 
>> We are current for the release, I believe, with v1.0.1e 58.el6_10.
> 
> I'm slightly confused here you're talking about 1.0.1e but above you said 1.02
> above.
> 
> Matt