Re: possible writeback race/bch_data_insert_keys fails

Coly Li <colyli@xxxxxxx> · Fri, 23 Feb 2018 23:27:01 +0800

Hi Eric,

On 23/02/2018 9:35 PM, Eric Tork wrote:
> Yes, I can help with testing once there is a patch to try.  
> 

This will help a lot, thanks in advance.

>   To clarify, my setup is using backing device sdh1 and a cache set to create bcache1, which is then sent through a LUKS layer and then becomes a new cache set for backing device secondleveltest, with an LVM LV, which then becomes bcache2.  So, the stacking is a bit unique.  So, both bcache devices are not truly using the same cache set - bcache1 is being used to form a cache set that is used in bcache2.

I see. Thanks for the information.

Coly Li

> 
> ----- Original Message -----
> From: "Coly Li" <colyli@xxxxxxx>
> To: "Eric A Tork" <etork@xxxxxxxxxxxxxx>, linux-bcache@xxxxxxxxxxxxxxx
> Sent: Thursday, February 22, 2018 8:33:40 PM
> Subject: Re: possible writeback race/bch_data_insert_keys fails
> 
> On 23/02/2018 8:04 AM, Eric A Tork wrote:
>>
>>
>>   Hello,  I am hitting a lock issue with bcache while doing some
>> testing, and only a reboot brings the system back after encountering
>> this issue.  
>>
>> Here is my lsblk:
>>
>>
>> sdh
>> sdh1                                 zfs_member
>>   bcache1                            crypto_LUKS                     
>>     loopcrypto1                      bcache                          
>>       bcache2                        LVM2_member                     
>>         secondleveltest-fullzfsnfs   zfs_member        thirdlevelzfs 
>>
>>
>>
>>   There appears to be a race happening: the system performs normally,
>> and then all activity to the bcache devices hits 100% and no more I/O
>> happens.
>>
>> This is with two stacked bcache devices (LUKS in between) with writeback
>> turned on.  It will do the same if set to writethrough as well.  
>>
>> [root@centos-7 log]# uname -a
>> Linux centos-7.1-test.talentbankonline.com 4.15.4-1.el7.elrepo.x86_64 #1
>> SMP Sat Feb 17 13:35:20 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>>   And here is the kernel trace when the system stalls:
>>
> 
> Hi Eric,
> 
> At first glance, the race very probably comes from a global work queue,
> bcache_wq. Although there are 2 different bcache devices stacked, they
> share the single work queue bcache_wq in request.c.
> 
> I guess the bcache code was not originally designed to be stacked on
> itself, which is why you hit this bug.
> 
> I guess the stacked bcache devices may also share the same cache set, so
> the fix might be to change bcache_wq into a per-bcache-device queue.
> 
> Could you please help with testing once I have a patch for your issue?
> 
> Thanks in advance.
> 
> Coly Li
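
To illustrate the suspicion above, here is a toy model in Python of how a
single shared work queue could stall stacked devices while per-device
queues would not. It is only a sketch under stated assumptions, not bcache
code and not a confirmed diagnosis: the WorkQueue and Work classes, the
fixed max_active slot limit, and the rule that each item on the upper
device waits for one item on the lower device are all hypothetical
simplifications.

```python
from collections import deque


class WorkQueue:
    """Toy work queue with a fixed number of worker slots (not the kernel API)."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.pending = deque()
        self.running = []


class Work:
    """A unit of work; an upper-device item spawns one lower-device item."""

    def __init__(self, device, lower_device=None):
        self.device = device
        self.lower_device = lower_device
        self.child = None
        self.done = False


def simulate(queue_for, upper, lower, n_items):
    """Return True if all work completes, False if the model stalls.

    queue_for maps a device name to the WorkQueue that serves it.
    """
    works = [Work(upper, lower_device=lower) for _ in range(n_items)]
    for w in works:
        queue_for[upper].pending.append(w)
    # De-duplicate queues so a shared queue is only visited once per round.
    queues = list({id(q): q for q in queue_for.values()}.values())

    while not all(w.done for w in works):
        progress = False
        for q in queues:
            # Start pending items while worker slots are free.
            while q.pending and len(q.running) < q.max_active:
                w = q.pending.popleft()
                q.running.append(w)
                progress = True
                if w.lower_device is not None:
                    # The upper item issues work to the device below it.
                    w.child = Work(w.lower_device)
                    queue_for[w.lower_device].pending.append(w.child)
            # Finish running items whose lower-level work (if any) is done.
            for w in list(q.running):
                if w.child is None or w.child.done:
                    w.done = True
                    q.running.remove(w)
                    progress = True
        if not progress:
            return False  # every slot is held by an item that is waiting
    return True


def shared_queue_run(n_items=2, max_active=2):
    """Both devices use one queue, as assumed for the global bcache_wq."""
    shared = WorkQueue(max_active)
    return simulate({"bcache2": shared, "bcache1": shared},
                    "bcache2", "bcache1", n_items)


def per_device_run(n_items=2, max_active=2):
    """Each device gets its own queue, as in the proposed fix."""
    queues = {"bcache2": WorkQueue(max_active),
              "bcache1": WorkQueue(max_active)}
    return simulate(queues, "bcache2", "bcache1", n_items)
```

In this model the shared queue only stalls once the upper device's items
occupy every slot, which would be consistent with a system that runs
normally for a while and then stops doing I/O. The real kernel workqueue
behaves differently in many respects, so this shows the general hazard
only.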

[snipped]
