Re: patch: crash on using already destroyed ssl socket

Alexei Gradinari <alex2grad@xxxxxxxxx> · Mon, 17 Oct 2016 14:16:19 -0400

Title: Re: [pjsip] patch: crash on using already destroyed ssl socket

Hello Nanang,

I agree that the group lock patch fixes the original issue

and I can confirm that with this patch there isn't a crash.

But additional check adds extra avoidance if something goes wrong.

 if (ssock->ssl_state == SSL_STATE_NULL || ssock->last_err == SSL_ERROR_SYSCALL || !ssock->ossl_wbio)

 if (ssock->ssl_state == SSL_STATE_NULL || ssock->last_err == SSL_ERROR_SYSCALL || !ssock->ossl_ssl)

May be add this check if PJ_ENABLE_EXTRA_CHECK.

About 'fatal error'.

Looking at opessnl s_client.c/s_server.c I found that the next errors brought to SSL_shutdown

 SSL_ERROR_ZERO_RETURN

 SSL_ERROR_SYSCALL

 SSL_ERROR_SSL

So need to check all of these errors.

I couldn't reproduce the crash on old openssl version because I could only trigger SSL_ERROR_SYSCALL error,

but I think the crash could only happen on SSL_ERROR_ZERO_RETURN.

What do you think?

Regards,

Alexei

Monday, October 17, 2016, 8:55:17 AM, you wrote:

Hi Alexei,

Please don't misunderstand, sorry for not being so informative about our hesitation to include some parts of the patch.

The main and original issue is the crash, which from the call stack trace it seems to be caused by two threads accessing write BIO at the same time (one for shutting down connection, and the other for sending some app data). The issue in above scenario should have been addressed by https://trac.pjsip.org/repos/changeset/5459 via group lock, i.e: SSL shutdown only occurs after app finishes with the SSL socket by explicitly invoking pj_ssl_sock_close(), instead of after an SSL error in read operation whose timing is not controllable by application, so now both SSL shutdown and sending app data timings are controlled by application. We prefer group lock solution because it could also solve some other issues such as possibility of write_mutex is prematurely destroyed by application from on_data_read() callback while other thread is sending data.

And then you brought up the issue of "reusing SSL object after a fatal error", this is a new/different issue to us as it is not the cause of the crash in the original report. The base of the issue is this commit message:

--

   Reusing an SSL object when it has encountered a fatal error can

   have bad consequences. This is a bug in application code not libssl

   but libssl should be more forgiving and not crash.

--

which unfortunately doesn't really provide a clear definition of 'fatal error', we've tried to search around but unfortunatly cannot find much about it. The latest patch seems to handle SSL_ERROR_SYSCALL case well (even we are not sure whether it is a fatal error and causes crash), but how about other fatal errors. So for now, perhaps it is better to do some more tests first, especially with older OpenSSL version, i.e: before the Apr 23, 2015 commit.

BR,

nanang

On Fri, Oct 14, 2016 at 10:51 PM, Alexei Gradinari <alex2grad@xxxxxxxxx> wrote:

Hello Teluu,

What's wrong with my last patch?

Could you, please, review it?

I don't understand why you are so against these changes?

Regards,

Alexei

Friday, October 14, 2016, 4:48:14 AM, you wrote:

Hi Alexei,

We appreciate if you could test it out.

We will gladly review this again if your test show any issue.

Best Regards,

Riza

On Fri, Oct 14, 2016 at 12:01 AM, Alexei Gradinari <alex2grad@xxxxxxxxx> wrote:

Hello Riza,

1. You are right... the last_err was set out of locking the write_mutex.

Attached modified patch which set the last_err inside locking the write_mutex,

so it eliminates the SSL object reuse.

2. The read operation (SSL_read) is protected by write_mutex.

Are there other read operations which are not protected by write_mutex?

I concern write_mutex as SSL mutex which should protect the SSL objects.

3. About confirmation that the crash didn't happened with your patch...

If the crash hasn't happened in 2-3 days after you published this patch

it doesn't mean that the crash couldn't happen again.

If I see a code which could lead to the potential crash I'd like to write a code which prevent crashes.

Reusing the SSL object when it has encountered a fatal error could crash

on system with openssl version without

the commit 189e20c68c2399782034fec9f65d6b9ee88aac3d dated Apr 23, 2015.

If you insist I can setup a test system with openssl without this commit

to prove that crash could happen.

Regards,

Alexei

Thursday, October 13, 2016, 12:12:10 AM, you wrote:

Hi Alexei,

Please find the reply, inline.

Best Regards,

Riza

On Thu, Oct 13, 2016 at 12:23 AM, Alexei Gradinari <

alex2grad@xxxxxxxxx> wrote:

Hello Riza,

1. Please, see the next OpenSSL calls.

BIO_pending(ssock->ossl_wbio)

SSL_do_handshake(ssock->ossl_ssl)

SSL_read(ssock->ossl_ssl, data_, size_)

SSL_write(ssock->ossl_ssl, data, (int)size)

All of these calls could use "SSL object when it has encountered a fatal error" or

the SSL object was destroyed.

My patch added this checking, but you removed it.

Why?

- The checks will reduce the potential of of SSL object reuse, however it's not bullet proof.

e.g:

 1. thread 1: fatal error occurs in on_read(), but before reset_ssl_sock_state() is called and ssock->last_err is updated, context switch

 2. thread 2: ssock_send() executed, call BIO_pending() etc 

- write_mutex actually is to protect write operation, as for the fatal error reported (from the call stack), 

it happened on read operation which is not protected by write_mutex (write_mutex will be diacquire when destroying SSL state), 

so, adding these checks around write_mutex would be less effective.

- based on our test, the crash didn't happen again. This was confirmed by George and another of our user with the same issue. 

If you have a different result, please forward them to us.

- lastly (although minor), the fatal error scenario + reset_ssl_sock_state() happened on read operation but the SSL object being accessed was for write (e.g: write BIO).

However we are not certain if there's a correlation between the BIO write and the read operation.

2. The destroy_ssl doesn't destroy write_mutex.

The write_mutex is destroyed in pj_ssl_sock_close.

My patch does nothing with this, so this potential race existed before.

we were not saying that the issue was introduce by your patch.

The ssl should be destroyed as soon as possible to release a TCP socket/port.

To fix potential race issue with write_mutex the group locking could use.

So I combined my patch with your group locking patch.

Attached a new patch. 

-- 

Best regards,

Alexei                            mailto:alex2grad@xxxxxxxxx

-- 

_______________________________________________
Visit our blog: http://blog.pjsip.org

pjsip mailing list
pjsip@xxxxxxxxxxxxxxx
http://lists.pjsip.org/mailman/listinfo/pjsip_lists.pjsip.org