Re: possible self-deadlock in idle signal handler

baconm@xxxxxxxxxxxxx · Sat, 28 Mar 2009 23:27:06 -0400

Thanks for the link.  That certainly would cover the general case of 
what I'm seeing.

This looks like much more than a "well, let me fiddle with it a bit and
submit a patch" job.  Has anyone with the requisite design perspective
taken a crack at solving it?  I'm assuming that running idled doesn't
ameliorate the problem.

-Michael

Quoting Wesley Craig <wes@xxxxxxxxx>:

> See here:
>
> 	https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=3100
>
> The solution is to rewrite the signal handler to do much less.
>
> :wes
>
> On 28 Mar 2009, at 09:37, Michael Bacon wrote:
>> We're experiencing some problems, particularly with a small number of
>> users, which manifest themselves in the dreaded "one deadlocked,
>> hundreds waiting" process logjam.  The keystone process appears to be
>> an imapd deadlocked on itself in this manner (this is Solaris 9):
>>
>> -> pstack 19090
>> 19090:  imapd
>>   febc5994 lwp_park (0, 0, 0)
>>   febc206c slow_lock (fecc05a8, feba0000, 0, fecbc000, 14, 0) + 58
>>   fec46e70 malloc   (c, 0, 13d668, 13d66c, 28cc, 13d790) + 18
>>   00078ac0 xmalloc  (c, 13d790, 0, 0, 0, 0) + 4
>>   00074a64 lock_or_refresh (13d660, 1364b4, 107400, 0, 0, 0) + 10c
>>   00074d50 myfetch  (13d660, 1bbe58, 10, ffbfb25c, ffbfb254, 1364b4) + 44
>>   00060d74 seen_readit (1364a0, ffbfb2ec, ffbfb2e8, 1252bc, ffbfb2e4, 1) + 60
>>   0003d0c4 index_checkseen (123a00, 0, 0, 603, 1e5a4c, 87fd0) + 4c
>>   0003e298 index_check (123a00, 0, 1, 125000, ffbfc370, 125000) + 234
>>   0002c574 idle_update (3, 0, 0, 0, 0, 0) + 24
>>   0005f7cc idle_handler (e, 0, ffbfcb20, 0, 0, 0) + 5c
>>   febc5bac __sighndlr (e, 0, ffbfcb20, 5f770, 0, 0) + c
>>   febbf804 call_user_handler (e, 0, ffbfcb20, 0, 0, 0) + 234
>>   febbf9b4 sigacthandler (e, 0, ffbfcb20, 8, 1bd7c0, 0) + 64
>>   --- called from signal handler with signal 14 (SIGALRM) ---
>>   fec470d4 _malloc_unlocked (64, 0, 0, fecbc000, 0, 0) + 240
>>   fec46e78 malloc   (64, ff0a07d0, a3, 1c4d0d, db, 6d) + 20
>>   fefc5820 default_malloc_ex (64, ff0b17b0, ca, ca, 0, ffe43088) + 20
>>   fefc61e4 CRYPTO_malloc (0, ff0b17b0, ca, 1bcff0, 1bcf78, 1bcf78) + 84
>>   ff036efc EVP_DigestInit_ex (ffbfd150, ff0dfbb0, 0, fffffff8, 0, ffbfd1fd) + 13c
>>   fefdabec HMAC_Init_ex (ffbfd13c, ffbfd150, ffbfd048, ff0dfbb0, 0, 0) + cc
>>   ff160b70 tls1_mac (1bea88, ffbfd288, 0, 20, 0, 1) + 90
>>   ff15cfa4 ssl3_read_bytes (1bea88, 17, ffbfd288, 8c, 1c4d03, 0) + 524
>>   ff15a9c4 ssl3_read (1bea88, 13aef0, 1000, 0, 378, 0) + 44
>>   ff16a30c SSL_read (0, 13aef0, 1000, 0, ffbfd5bc, ffbfd5b1) + 6c
>>   0006bd5c prot_fill (13ae78, 0, 0, 0, ffbfd5bc, ffbfd428) + ec
>>   0005e564 getword  (13ae78, 125108, 1, 1a9e0, 2c8dc, 125000) + ac
>>   0002c8f0 cmd_idle (13d358, 7dc00, 0, 0, 730061, 0) + 2e8
>>   0002ea6c cmdloop  (0, 1360d8, 8bc60, 8bc60, 123c00, 125000) + df0
>>   00030d34 service_main (123c00, 132080, ffbffc2c, 0, 1aa50, 11a800) + 180
>>   0001aaf8 main     (ffbff2b4, 7c000, fa, 27667, 2602e4, 49c71400) + 640
>>   0001a2ec _start   (0, 0, 0, 0, 0, 0) + 5c
>>
>> From looking online, the problem appears to be that the SSL stack was
>> in the middle of a malloc() call when the SIGALRM went off, causing
>> the process to try to open the seen file, which resulted in another
>> malloc().  The second malloc() requests the process's malloc mutex
>> (part of Solaris's thread internals), but that mutex is already held
>> by the first, interrupted call.  The lock attempt can therefore never
>> return, and the process is permanently hung while still holding the
>> lock for the mailbox.
>>
>> Would anyone happen to have any tips on getting out from under this?
>

----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html