Re: SSL_ERROR_WANT_TIME: Pause SSL_connect to fetch intermediate certificates

Alex Rousskov <rousskov@xxxxxxxxxxxxxxxxxxxxxxx> · Wed, 19 Aug 2020 15:35:54 -0400

On 8/19/20 5:29 AM, Matt Caswell wrote:

> We should really have a proper callback for this purpose. PRs welcome!
> (Doesn't help you right now though).

Thank you for a prompt, thoughtful, and useful response. I believe that
we are on the same page as far as async API overall intentions, and I am
also very glad to hear that the OpenSSL team may welcome an addition of
a proper callback to address Squid's use case. I know Squid needs are
not unique. I do not yet know whether I can contribute (or facilitate
contribution of) such an enhancement, but this green light is meaningful
progress already!

>> "Somewhat counter-intuitively, the OpenSSL async API is meant for
>> activities that can work correctly _without_ becoming asynchronous (i.e.
>> without being paused to temporary give way to other activities)"

> I have no idea what you mean by this. 

Sorry for not detailing this accusation. I was worried that my email was
already too long/verbose... I will detail it below.

> The whole point of ASYNC_pause_job() is to temporarily give way to
> other activities.

Yes, giving way to other activities is the whole point of the async API
optimization. Unfortunately, being only an optimization, the API is not
enough when the callback MUST "give way to other activities". AFAICT,
OpenSSL does not guarantee that ASYNC_pause_job() in a callback will
actually "give way to other activities" because OpenSSL does guarantee
that some engine or OpenSSL native code does not hold a
ASYNC_block_pause() "lock" while calling the callback.

The engine code that async API supports well may look like this[1]:

   myengine()
   {
       while (!something_happened())
           ASYNC_pause_job(); // application MAY get control here

       ... use something ...
   }

The callback code that async API does not support looks like this:

    mycallback()
    {
        if (!something_happened())
            ASYNC_pause_job(); // application MUST get control here

        assert(something_happened());
        ... use something ...
    }

Please note that replacing "if" with "while" in mycallback() would make
the compiled code identical with myengine() but would not solve the
problem: Instead of the failed assertion, the callback would get into an
infinite loop...

The callback _relies_ on the application making progress (e.g., fetching
the missing intermediate certificates or declaring a fetch failure
before resuming SSL_connect()). This callback cannot work correctly
without the application actually getting control. That is why the
pausing call comments are different: MAY vs. MUST.

Does this clarify what I meant? Do you agree that OpenSSL async API is
not suitable for callbacks that _require_ ASYNC_pause_job() to return
control to the application?

[1] This myengine() example is inspired by your explanation at
https://mta.openssl.org/pipermail/openssl-dev/2015-October/003031.html

> ASYNC_block_pause() ... does not appear in the library code

True, but it did appear there in the past, right? I am looking at commit
625146d as an example. Those calls were removed more than a year later
in 75e2c87 AFAICT, but I see no guarantee that they will not reappear again.

And even if OpenSSL now has a policy against using ASYNC_block_pause()
internally, or a policy against holding an ASYNC_block_pause() "lock"
while calling any callback, some custom engine might do that at the
"wrong" for the above mycallback() moment, right?

If you think that fears about something inside OpenSSL/engines
preventing our callback from returning control to the application are
unfounded, then using async API may be the best long-term solution for
Squid. Short-term, it does not work "as is" because OpenSSL STACKSIZE
appears to be too small (leading to weird crashes that disappear if we
increase STACKSIZE from 32768 to 524288 bytes), but perhaps we can
somehow hack around that.

> One possibility that springs to mind (which is also an ugly hack) is to
> defer the validation of the certificates. So, you have a verify callback
> that always says "ok". But any further reads on the underlying BIO
> always return with "retry" until such time as any intermediate
> certificates have been fetched and the chain has been verified "for
> real". The main problem I can see with this approach is there is no easy
> way to send the right alert back to the server in the event of failure.

We were also concerned that X509_verify_cert() is not enough to fully
mimic the existing OpenSSL certificate validation procedure because the
internal OpenSSL ssl_verify_cert_chain() does not just call
X509_verify_cert(). It also does some DANE-related manipulations, for
example. Are those fears unfounded? In other words, is calling
X509_verify_cert() directly always enough to make the right certificate
validation decision?

Thanks a lot,

Alex.