Re: [PATCH] bcache: fix deadlock in bcache_allocator()

Coly Li <colyli@xxxxxxx> · Wed, 10 Jul 2019 23:57:25 +0800

On 2019/7/10 11:46 PM, Andrea Righi wrote:
> On Wed, Jul 10, 2019 at 11:11:37PM +0800, Coly Li wrote:
> >> On 2019/7/10 5:31 PM, Andrea Righi wrote:
>>> bcache_allocator() can call the following:
>>>
>>>  bch_allocator_thread()
>>>   -> bch_prio_write()
>>>      -> bch_bucket_alloc()
>>>         -> wait on &ca->set->bucket_wait
>>>
>>> But the wake up event on bucket_wait is supposed to come from
>>> bch_allocator_thread() itself => deadlock:
>>>
>>>  [ 242.888435] INFO: task bcache_allocato:9015 blocked for more than 120 seconds.
>>>  [ 242.893786] Not tainted 4.20.0-042000rc3-generic #201811182231
>>>  [ 242.896669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>  [ 242.900428] bcache_allocato D 0 9015 2 0x80000000
>>>  [ 242.900434] Call Trace:
>>>  [ 242.900448] __schedule+0x2a2/0x880
>>>  [ 242.900455] ? __schedule+0x2aa/0x880
>>>  [ 242.900462] schedule+0x2c/0x80
>>>  [ 242.900480] bch_bucket_alloc+0x19d/0x380 [bcache]
>>>  [ 242.900503] ? wait_woken+0x80/0x80
>>>  [ 242.900519] bch_prio_write+0x190/0x340 [bcache]
>>>  [ 242.900530] bch_allocator_thread+0x482/0xd10 [bcache]
>>>  [ 242.900535] kthread+0x120/0x140
>>>  [ 242.900546] ? bch_invalidate_one_bucket+0x80/0x80 [bcache]
>>>  [ 242.900549] ? kthread_park+0x90/0x90
>>>  [ 242.900554] ret_from_fork+0x35/0x40
>>>
>>> Fix by making the call to bch_prio_write() non-blocking, so that
>>> bch_allocator_thread() never waits on itself.
>>>
>>> Moreover, make sure to wake up the garbage collector thread when
>>> bch_prio_write() is failing to allocate buckets.
>>>
>>> BugLink: https://bugs.launchpad.net/bugs/1784665
>>> Signed-off-by: Andrea Righi <andrea.righi@xxxxxxxxxxxxx>
>>
>> Hi Andrea,
>>
> 
> Hi Coly,
> 

Hi Andrea,

>> From the BugLink, it seems several critical bcache fixes are missing.
>> Could you please try the current 5.3-rc kernel and check whether the
>> problem still exists?
> 
> Sure, I'll do a test with the latest 5.3-rc kernel. I just wanna mention
> that I've been able to reproduce this problem after backporting all the
> fixes (even those from linux-next), but I agree that testing 5.3-rc is a
> better idea (I may have introduced bugs while backporting stuff).
> 

Did you also backport the patches that were just merged into 5.3-rc?
Some of them are fixes for deadlock problems.

>>
>> The patch itself looks good, except that I am not sure whether
>> invoking garbage collection is the proper method. Because
>> bch_prio_write() is called right after garbage collection finishes,
>> jumping back to retry_invalidate: again may just hide a long wait
>> that is not caused by a lack of space.
> 
> Honestly I was thinking the same, but if I don't call the garbage
> collector bch_allocator_thread() gets stuck forever (or for a very very
> long time) in the retry_invalidate loop...
> 
>>
>> Could you please give me some hints on how to reproduce this hung-task
>> timeout? If I am lucky enough to reproduce the problem on a 5.3-rc
>> kernel, it will help a lot in understanding exactly what problem your
>> patch fixes.
> 
> Fortunately I have a reproducer, here's the script that I'm using:
> 

Great! Let me try this script, thank you very much :-)

Coly Li

> ---
> #!/bin/bash -x
> 
> BACKING=/sys/class/block/bcache0
> CACHE=/sys/fs/bcache/*-*-*
> while true; do
>     echo "1" | tee ${BACKING}/bcache/stop
>     echo "1" | tee ${CACHE}/stop
>     udevadm settle
>     [ ! -e "${BACKING}" -a ! -e "${CACHE}" ] && break
>     sleep 1
> done
> wipefs --all --force /dev/vdc2
> wipefs --all --force /dev/vdc1
> wipefs --all --force /dev/vdc
> wipefs --all --force /dev/vdd
> blockdev --rereadpt /dev/vdc
> blockdev --rereadpt /dev/vdd
> udevadm settle
> 
> # create ext4 fs over bcache
> parted /dev/vdc --script mklabel msdos || exit 1
> udevadm settle --exit-if-exists=/dev/vdc
> parted /dev/vdc --script mkpart primary 2048s 2047999s || exit 1
> udevadm settle --exit-if-exists=/dev/vdc1
> parted /dev/vdc --script mkpart primary 2048000s 20922367s || exit 1
> udevadm settle --exit-if-exists=/dev/vdc2
> make-bcache -C /dev/vdd || exit 1
> while true; do
>     udevadm settle
>     CSET=`ls /sys/fs/bcache | grep -- -`
>     [ -n "$CSET" ] && break;
>     sleep 1
> done
> make-bcache -B /dev/vdc2 || exit 1
> while true; do
>     udevadm settle
>     [ -e "${BACKING}" ] && break
>     sleep 1;
> done
> echo $CSET | tee ${BACKING}/bcache/attach
> udevadm settle --exit-if-exists=/dev/bcache0
> bcache-super-show /dev/vdc2
> udevadm settle
> mkfs.ext4 -F -L boot-fs -U e9f00d20-95a0-11e8-82a2-525400123401 /dev/vdc1
> udevadm settle
> mkfs.ext4 -F -L root-fs -U e9f00d21-95a0-11e8-82a2-525400123401 /dev/bcache0 || exit 1
> blkid
> ---
> 
> I just run this as root in a busy loop (something like
> `while :; do ./test.sh; done`) on a kvm instance with two extra disks
> (in addition to the root disk).
> 
> The extra disks are created as follows:
> 
>  qemu-img create -f qcow2 disk1.qcow 10G
>  qemu-img create -f qcow2 disk2.qcow 2G
> 
> I'm using these particular sizes, but I think the same problem can
> also be reproduced with different sizes.
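
For reference, a minimal sketch of how the two images might be attached
to the guest. The QEMU binary, the memory size, the virtio bus and the
root.img name are all assumptions and are not taken from the setup
described above. The helper only prints the command line, it does not
start a VM.

```shell
#!/bin/bash
# Sketch only: print (not run) a QEMU command line that attaches the given
# qcow2 images as extra virtio disks. root.img, -m 2048 and if=virtio are
# assumptions, not details from the original setup.
qemu_cmdline() {
    local root=$1; shift
    local cmd="qemu-system-x86_64 -enable-kvm -m 2048"
    cmd+=" -drive file=${root},if=virtio"
    local img
    for img in "$@"; do
        cmd+=" -drive file=${img},format=qcow2,if=virtio"
    done
    echo "$cmd"
}

qemu_cmdline root.img disk1.qcow disk2.qcow
```

The device names seen in the guest depend on the order and number of
attached disks, so /dev/vdc and /dev/vdd in the script may need to be
adjusted.
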
> 
> Thanks,
> -Andrea
> 
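
A possible way to drive the busy loop so that it stops once the hung
task is reported, as a rough sketch. The helper names, the LOG_CMD
override and the default run limit are hypothetical. The hung-task
message only appears after the timeout (120 seconds in the trace above),
and the reproducer itself may block before that, so this is only a
convenience and not a reliable detector.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the kernel log on stdin contains a
# hung-task report for the bcache allocator thread.
hung_task_seen() {
    grep -q 'task bcache_allocato.*blocked for more than'
}

# Hypothetical driver: rerun the reproducer until a hung task shows up.
# $1 = reproducer command, $2 = maximum number of runs (assumed default 1000).
# LOG_CMD is an assumed override for the log source; it defaults to dmesg.
run_until_hang() {
    local script=$1 max=${2:-1000} i
    for ((i = 1; i <= max; i++)); do
        "$script"
        if ${LOG_CMD:-dmesg} | hung_task_seen; then
            echo "hung task reported after $i run(s)"
            return 0
        fi
    done
    echo "no hung task after $max run(s)"
    return 1
}
```

Usage would be something like `run_until_hang ./test.sh` as root inside
the guest.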

-- 

Coly Li