Re: patch: crash on using already destroyed ssl socket

George Joseph <gjoseph@xxxxxxxxxx> · Thu, 13 Oct 2016 08:25:31 -0600

On Wed, Oct 12, 2016 at 10:12 PM, Riza Sulistyo <riza@xxxxxxxxx> wrote:
Hi Alexei,
Please find the reply, inline.

Best Regards,

Riza

On Thu, Oct 13, 2016 at 12:23 AM, Alexei Gradinari <alex2grad@xxxxxxxxx> wrote:

Hello Riza,

1. Please, see the next OpenSSL calls.

BIO_pending(ssock->ossl_wbio)

SSL_do_handshake(ssock->ossl_ssl)

SSL_read(ssock->ossl_ssl, data_, size_)

SSL_write(ssock->ossl_ssl, data, (int)size)

All of these calls could use "SSL object when it has encountered a fatal error" or

the SSL object was destroyed.

My patch added this checking, but you removed it.

Why?

- The checks will reduce the potential of of SSL object reuse, however it's not bullet proof.
  e.g:
   1. thread 1: fatal error occurs in on_read(), but before reset_ssl_sock_state() is called and ssock->last_err is updated, context switch
   2. thread 2: ssock_send() executed, call BIO_pending() etc 

- write_mutex actually is to protect write operation, as for the fatal error reported (from the call stack), 
  it happened on read operation which is not protected by write_mutex (write_mutex will be diacquire when destroying SSL state), 
  so, adding these checks around write_mutex would be less effective.

- based on our test, the crash didn't happen again. This was confirmed by George and another of our user with the same issue. 

Actually I'm going to defer to Alexei and Ross on this.  

  If you have a different result, please forward them to us.

- lastly (although minor), the fatal error scenario + reset_ssl_sock_state() happened on read operation but the SSL object being accessed was for write (e.g: write BIO).
  However we are not certain if there's a correlation between the BIO write and the read operation.

2. The destroy_ssl doesn't destroy write_mutex.

The write_mutex is destroyed in pj_ssl_sock_close.

My patch does nothing with this, so this potential race existed before.

we were not saying that the issue was introduce by your patch.

The ssl should be destroyed as soon as possible to release a TCP socket/port.

To fix potential race issue with write_mutex the group locking could use.

So I combined my patch with your group locking patch.

Attached a new patch. 

Regards,

Alexei

Wednesday, October 12, 2016, 8:36:13 AM, you wrote:

Hi Alexei,

When doing the review your patch, we identified another potential race issue which might occur.

e.g: write_mutex get destroyed before send() is called.

We feel that using group lock will solve these issues.

As you can see from the patch, the reset_ssl_sock_state() will not destroy the ssl. 

We have moved all the code of destroying ssl and write_mutex to the ssl_on_destroy()

which is called when the group lock is no longer held or reference count reach 0.

As for your concern regarding the call to BIO_pending() on a closed socket, it shouldn't be 

a problem since we don't use OpenSSL for the socket operation. 

Or, maybe I'm missing something here?

Best Regards,

Riza

On Tue, Oct 11, 2016 at 10:44 PM, Alexei Gradinari <alex2grad@xxxxxxxxx> wrote:

Hello Riza,

Could you, please, explain why did you remove all checking on openssl calls from my patch?

Your patch didn't fix race condition.

Imagine

one thread calls reset_ssl_sock_state

while another thread calls BIO_pending(ssock->ossl_wbio) on already closed socket.

Pleas, look at commit 189e20c68c2399782034fec9f65d6b9ee88aac3d

https://mta.openssl.org/pipermail/openssl-commits/2015-May/001035.html 

   Reusing an SSL object when it has encountered a fatal error can

   have bad consequences. This is a bug in application code not libssl

   but libssl should be more forgiving and not crash.

P.S.

I don't understand why my patch isn't acceptable for you.

Could you, please, explain.

Regards,

Alexei

Sunday, October 9, 2016, 7:51:54 PM, you wrote:

Hi Ross/Alexei,

We have made some modification to the patch in order to incorporate the group lock.

Please find the patch on the attachment, could you help us in testing it? 

Best Regards,

Riza

On Fri, Oct 7, 2016 at 4:47 PM, Ross Beer <ross.beer@xxxxxxxxxxx> wrote:

Hi Ming,

Sorry to chase, however is this still undergoing review?

Kind regards,

Ross

From: pjsip <pjsip-bounces@xxxxxxxxxxxxxxx> on behalf of Ming <ming@xxxxxxxxx>

Sent: 03 October 2016 12:34

To: pjsip list

Subject: Re:  patch: crash on using already destroyed ssl socket 

Hi Ross,

Yes, the patch is undergoing some changes and currently under

(hopefully) final review.

Regards,

Ming

On Mon, Oct 3, 2016 at 5:23 PM, Ross Beer <ross.beer@xxxxxxxxxxx> wrote:

> Hi PJSIP team,

>

>

> Could this patch be considered for inclusion in the project?

>

>

>  This issue is causing segfaults a least once a day.

>

>

> Kind regards,

>

>

> Ross

>

>

>

> ________________________________

> From: pjsip <pjsip-bounces@xxxxxxxxxxxxxxx> on behalf of Alexei Gradinari

> <alex2grad@xxxxxxxxx>

> Sent: 21 September 2016 16:38

> To: pjsip list

> Subject:  patch: crash on using already destroyed ssl socket

>

> Hello,

>

> On heavy loaded system with TLS,

> one thread could destroy the ssl socket on SSL_ERROR_SYSCALL

> while another thread still uses this socket which

> was already freed, so we get segfault.

> Attached 2 backtraces.

>

> To avoid race condition need to lock the socket before destroying it.

>

> The attached patch adds the socket lock on destroying it

> and adds a checking on all openssl calls that the socket wasn't destroyed.

>

> Regards,

> Alexei

>

> _______________________________________________

> Visit our blog: http://blog.pjsip.org

>

> pjsip mailing list

> pjsip@xxxxxxxxxxxxxxx

> http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

pjsip Info Page

lists.pjsip.org

This is the mailing list to discuss pjsip, the open source sip stack. We also have answers to frequently asked questions. To see the collection of prior postings to ...

>

_______________________________________________

Visit our blog: http://blog.pjsip.org

pjsip mailing list

pjsip@xxxxxxxxxxxxxxx

http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org

-- 
George Joseph
Digium, Inc. | Software Developer
445 Jan Davis Drive NW - Huntsville, AL 35806 - US
Check us out at: www.digium.com & www.asterisk.org

_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org