On Fri, Dec 11, 2015 at 7:08 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Nikolay.
>
> On Fri, Dec 11, 2015 at 05:57:22PM +0200, Nikolay Borisov wrote:
>> So I had a server with the patch just crash on me.
>>
>> Here is what the queue looks like:
>>
>> crash> struct workqueue_struct 0xffff8802420a4a00
>> struct workqueue_struct {
>>   pwqs = {
>>     next = 0xffff8802420a4c00,
>>     prev = 0xffff8802420a4a00
>
> Hmmm... pwq list is already corrupt. ->prev is terminated but ->next
> isn't.
>
>>   },
>>   list = {
>>     next = 0xffff880351f9b210,
>>     prev = 0xdead000000200200
>
> Followed by 0xdead000000200200, which is likely from
> CONFIG_ILLEGAL_POINTER_VALUE.
>
> ...
>>   name = "dm-thin\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>>   rcu = {
>>     next = 0xffff8802531c4c20,
>>     func = 0xffffffff810692e0 <rcu_free_wq>
>
> and call_rcu_sched() has already been called. The workqueue has
> already been destroyed.
>
>>   },
>>   flags = 131082,
>>   cpu_pwqs = 0x0,
>>   numa_pwq_tbl = 0xffff8802420a4b10
>> }
>>
>> crash> rd 0xffff8802420a4b10 2  (the machine has 2 NUMA nodes, hence
>> the '2' argument)
>> ffff8802420a4b10:  0000000000000000 0000000000000000  ................
>>
>> At the same time, searching for 0xffff8802420a4a00 in the debug output
>> shows nothing. IOW, it seems the numa_pwq_tbl is never installed for
>> this workqueue:
>>
>> [root@smallvault8 ~]# grep 0xffff8802420a4a00 /var/log/messages
>>
>> Also, dumping all the logs from the dmesg contained in the vmcore
>> image, I find nothing. And when I do the following correlation:
>>
>> [root@smallvault8 ~]# grep \(null\) wq.log | wc -l
>> 1940
>> [root@smallvault8 ~]# wc -l wq.log
>> 1940 wq.log
>>
>> it seems the only thing happening is the numa_pwq_tbl being set at
>> workqueue creation, i.e. it is never re-assigned afterwards. So at
>> this point it looks as though there is a situation where the wq attrs
>> are not being applied at all.
>
> Hmmm... no idea why it didn't show up in the debug log, but the only
> way a workqueue could get into the above state is if it was either
> explicitly destroyed or the pwq refcounting got messed up; in both
> cases it should have shown up in the log.
>
> cc'ing dm people. Is there any chance dm-thin could be using the
> workqueue after destroying it?

In __pool_destroy() in dm-thin.c I don't see a call to
cancel_delayed_work() before the workqueue is destroyed. Is it
possible that this is the cause? (A sketch of the pattern I mean is
below, after the quote.)

> Thanks.
>
> --
> tejun
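
Concretely, something along these lines is what I'd expect. This is a
sketch only, based on my reading of the struct pool fields in
dm-thin.c; the names wq and waker are assumptions on my part, and
every work item that can requeue itself on pool->wq would need the
same treatment:

static void __pool_destroy(struct pool *pool)
{
	/* ... teardown that stops new work from being queued ... */

	/*
	 * cancel_delayed_work_sync() removes a pending timer and also
	 * waits for an already-running callback to finish, so nothing
	 * can queue onto pool->wq via this work item afterwards.
	 */
	cancel_delayed_work_sync(&pool->waker);

	/* Drains any remaining work items, then frees the workqueue. */
	destroy_workqueue(pool->wq);

	/* ... free the rest of the pool ... */
}

Without the cancel, a delayed work item that fires after
destroy_workqueue() would queue onto (and run code against) freed
workqueue memory, which would fit the corrupted state in the dump.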
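
As an aside, for anyone following the crash output above: the
0xdead000000200200 in list.prev is LIST_POISON2, which list_del()
stores into an entry as it is unlinked. So the wq had already been
removed from the global workqueues list when the dump was taken.
Roughly (the exact definitions vary slightly between kernel versions):

/* include/linux/poison.h; CONFIG_ILLEGAL_POINTER_VALUE is
 * 0xdead000000000000 on 64-bit x86 */
#define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)

/* include/linux/list.h: the deleted entry is poisoned, so finding
 * prev == 0xdead000000200200 (LIST_POISON2) in the dump means
 * list_del() had already run on this entry. */
static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}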