Re: Disappearing device during device plugging causes io errors.

Nikolay Borisov <n.borisov@xxxxxxxxxxxxxx> · Wed, 6 Jan 2016 16:41:37 +0200

On 01/06/2016 04:31 PM, Ming Lei wrote:
> On Wed, Jan 6, 2016 at 7:05 PM, Nikolay Borisov
> <n.borisov@xxxxxxxxxxxxxx> wrote:
>>
>>
>> On 01/05/2016 03:34 AM, Ming Lei wrote:
>>> On Mon, Jan 4, 2016 at 11:56 PM, Nikolay Borisov
>>> <n.borisov@xxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>> On 01/04/2016 05:44 PM, Ming Lei wrote:
>>>>> On Mon, Jan 4, 2016 at 11:31 PM, Nikolay Borisov
>>>>> <n.borisov@xxxxxxxxxxxxxx> wrote:
>>>>>> Hi Ming,
>>>>>>
>>>>>> On 01/04/2016 05:23 PM, Ming Lei wrote:
>>>>>>> On Mon, Jan 4, 2016 at 4:21 PM, Nikolay Borisov
>>>>>>> <n.borisov@xxxxxxxxxxxxxx> wrote:
>>>>>>>> Hello block people ,
>>>>>>>>
>>>>>>>> I'm running some experiments using the attached init_vg.txt script. And
>>>>>>>> at the same time I have the following systemtap script active:
>>>>>>>>
>>>>>>>> probe kernel.statement("loop_clr_fd@drivers/block/loop.c:896") {
>>>>>>>>         printf("Unbound device %s\n", kernel_string($lo->lo_disk->disk_name));
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> probe kernel.statement("loop_set_fd@drivers/block/loop.c:780") {
>>>>>>>>         printf("Bound device: %s\n", kernel_string($lo->lo_disk->disk_name));
>>>>>>>>         //print_backtrace();
>>>>>>>> }
>>>>>>>>
>>>>>>>> probe kernel.statement("__blk_mq_run_hw_queue@block/blk-mq.c:814") {
>>>>>>>>         printf("error in blk_mq_run_hq_queue for dev %s\n", kernel_string($bd->rq->rq_disk->disk_name));
>>>>>>>>         print_backtrace();
>>>>>>>>         print("----------------------------------\n");
>>>>>>>> }
>>>>>>>>
>>>>>>>> Which produces the following output from time to time:
>>>>>>>>
>>>>>>>> Unbound device loop3
>>>>>>>> error in blk_mq_run_hq_queue for dev loop3
>>>>>>>>  0xffffffff8134ef6b : __blk_mq_run_hw_queue+0x29b/0x380 [kernel]
>>>>>>>>  0xffffffff8134f10a : blk_mq_run_hw_queue+0x6a/0x80 [kernel]
>>>>>>>>  0xffffffff8134faeb : blk_mq_insert_requests+0xdb/0x120 [kernel]
>>>>>>>>  0xffffffff8134fc54 : blk_mq_flush_plug_list+0x124/0x140 [kernel]
>>>>>>>>  0xffffffff81346886 : blk_flush_plug_list+0xc6/0x1f0 [kernel]
>>>>>>>>  0xffffffff813469e4 : blk_finish_plug+0x34/0x50 [kernel]
>>>>>>>>  0xffffffff811de687 : do_blockdev_direct_IO+0x757/0xbf0 [kernel]
>>>>>>>>  0xffffffff811deb63 : __blockdev_direct_IO+0x43/0x50 [kernel]
>>>>>>>>  0xffffffff811da8b8 : blkdev_direct_IO+0x58/0x80 [kernel]
>>>>>>>>  0xffffffff8112b73f : generic_file_read_iter+0x13f/0x150 [kernel]
>>>>>>>>  0xffffffff811d9fd7 : blkdev_read_iter+0x37/0x40 [kernel]
>>>>>>>>  0xffffffff811a1d13 : __vfs_read+0xd3/0xf0 [kernel]
>>>>>>>>  0xffffffff811a1ea7 : vfs_read+0x97/0xe0 [kernel]
>>>>>>>>  0xffffffff811a287a : sys_read+0x5a/0xc0 [kernel]
>>>>>>>>  0xffffffff8162102e : entry_SYSCALL_64_fastpath+0x12/0x71 [kernel]
>>>>>>>> ----------------------------------
>>>>>>>> Bound device: loop3
>>>>>>>>
>>>>>>>> At the same time I get the following output in dmesg:
>>>>>>>> blk-mq: bad return on queue: -5 <-- This -EIO code is returned from loop_queue_rq
>>>>>>>> blk_update_request: I/O error, dev loop3, sector 0
>>>>>>>>
>>>>>>>> To me this means it's possible that device disabling races with
>>>>>>>> pending IO plugs for this device. I wonder whether it would be possible
>>>>>>>> to flush any plugs for a particular device before disabling its
>>>>>>>> multiqueue? Or maybe delay the plug flushing until we know the device
>>>>>>>
>>>>>>> Yes, you should detach the loop block only after all pending I/Os to the
>>>>>>> current loop block have completed. For example, umount and lvremove should
>>>>>>> be run before deleting the loop in your test case, and these paths are
>>>>>>> totally controlled by user space.
>>>>>>>
>>>>>>>> is actually active. Though I can see a problem with the latter approach
>>>>>>>> since this would mean it's possible to have the following scenario:
>>>>>>>>
>>>>>>>> 1. Device is attached to system and writes are going normally
>>>>>>>> 2. A process plugs the device and starts queuing IO on the plug
>>>>>>>> 3. The device is detached from the system
>>>>>>>> 4. Plug flushing code detects (3) and waits until device is re-attached
>>>>>>>> 5. Device is reattached
>>>>>>>> 6. Plug from (4) is flushed.
>>>>>>>>
>>>>>>>> However, the device attached in (5) might not be the same device as in
>>>>>>>> (1) and this would mean that (6) would be writing potentially random
>>>>>>>> data WRT the device attached in (5).
>>>>>>>
>>>>>>> It is the user's responsibility to complete all pending I/O to the current
>>>>>>> (old) loop before the new loop is attached again, because both paths are
>>>>>>> ultimately driven from user space.  And these I/Os will be completed as
>>>>>>> -EIO and won't reach the backing file at all, so how can the above case
>>>>>>> happen?
>>>>>>
>>>>>> It can't happen, I was just thinking out loud. As I have pointed out -
>>>>>> this seems a rather bogus scenario.
>>>>>
>>>>> OK, so there isn't real problem in your report.
>>>>
>>>> I just want to know (account) for all IO and just seeing some random IO
>>>> errors was putting me off.
>>>
>>> No, it is definitely not a random IO error; all IO will fail after
>>> the loop is detached.
>>>
>>>>
>>>>>>>> Essentially is it normal to have IO fail in such situations?
>>>>>>>
>>>>>>>     cat init_vg.txt
>>>>>>>     ...
>>>>>>>     loopdev=$(losetup -f --show ${file})
>>>>>>>     pvcreate --metadatasize 1M ${loopdev}
>>>>>>>     vgcreate ${group} -s 1MiB ${loopdev}
>>>>>>>     ...
>>>>>>>     umount $mntpath
>>>>>>>     vgchange -Kan $group
>>>>>>>     losetup -d $loopdev
>>>>>>>
>>>>>>> As for your test case above, it is normal for IO to fail after
>>>>>>> the loop block is deleted, and you should have removed the volume
>>>>>>> group first before deleting the loop block.
>>>>>>
>>>>>> But in this case the filesystem (which is on the volume group, which is
>>>>>> on the loop device) is unmounted, then the volume group is deactivated,
>>>>>
>>>>> As I mentioned, you should have run lvremove before attaching/disabling
>>>>> the loop.
>>>>
>>>> But lvremove would delete my volumes, whereas I do not want to delete
>>>> them, rather just disable them (what lvchange -Kan is supposed to do)
>>>
>>> OK, that looks fine.
>>>
>>>> and then remove the loop device so that I can, for example, transfer the
>>>> VG by just moving the single loopback image. I will run more tests to
>>>> see from which process does the failure come.
>>>>
>>>>>
>>>>>> which, at this point, should stop all IO and finally the loop device is
>>>>>> nuked, yet I can still see IO in flight. Based on this it seems that
>>>>>> vgchange might not be flushing everything. I mostly see the failures
>>>>>> occur with reads.
>>>>>
>>>>> The read may be from reading partition table, and loop block just
>>>>> returns -EIO in this situation, so what is wrong with this way?
>>>>
>>>> Will have to check this.
>>
>> Modifying the stap script to show the process which was generating the
>> failure showed that it's mainly lvchange and sometimes (in the beginning
>> of the test) the vgcreate command. This, coupled with the fact that the
>> failures happen during DIO and thus bypassing the filesystem could
>> really indicate that what you are saying (reading part table or
>> otherwise metadata from the volume) might be true. However, see my
>> concerns below.
>>
>>>
>>> OK.
>>>
>>> I still can't see any problem from your report up to now.  If you think
>>> it is a real problem, please provide the observable effect from user view
>>> explicitly.
>>
>> So when I run the test just once everything works as expected - all the
>> commands in the test are synchronous so it is not expected to have
>> lingering IO while the loop device is being removed, since this is done
>> after the filesystem is unmounted and lvchange has finished executing.
>> However, when I run multiple instances of the test case e.g.
>>
>> for i in {1..6}; do ./init_vg.sh > /dev/null & done
>>
>> where the number of instances is chosen such that it is equal to the
>> number of loopback devices on the system, I start to see the aforementioned
>> IO failures. And they are always random wrt when they are happening or
> 
> Now that you run this test concurrently, you can no longer make sure that
> all pending I/O from all tasks has completed before the loop is detached in
> one of the tasks, so this issue is observed.

This is true, but a loop device is used per test, i.e. the test acquires
a free loop device and keeps using it until it is finished with it. This
should mean that once a loop device is unbound, all IO for it has
completed, and only then does it become visible to other test instances.
Until then the loopback device is 'private' to this test instance, no?

> 
> As I mentioned, it is user space's responsibility to avoid the race. Returning
> -EIO for a detached loop has been there for a long time, and I don't think it
> is an issue in reality.

I'm not claiming that this is wrong. What really puzzles me is why I
observe the failures only when I run the test concurrently, since the
loopback devices are private to every test instance and hence
cross-talk shouldn't be occurring.
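
One way to check for cross-talk would be to restrict every LVM command in
a test instance to that instance's own loop device, and see whether the
errors go away. Below is a minimal sketch, assuming that the reads come
from LVM commands in one instance scanning another instance's loop device
(this is only a guess, not something the stap output proves). The helper
name lvm_only_filter is hypothetical and not part of init_vg.txt; it
relies on the --config option of the LVM tools and the devices/filter
setting from lvm.conf.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original init_vg script):
# print an LVM config string that accepts only the device given as $1
# and rejects every other block device during scanning.
lvm_only_filter() {
    printf 'devices { filter = [ "a|^%s$|", "r|.*|" ] }' "$1"
}

# Intended use inside the test script (needs root, so left commented out):
#   cfg=$(lvm_only_filter "$loopdev")
#   pvcreate --config "$cfg" --metadatasize 1M "$loopdev"
#   vgcreate --config "$cfg" "$group" -s 1MiB "$loopdev"
#   vgchange --config "$cfg" -Kan "$group"
```

If the IO errors disappear with the filter in place, that would suggest
the failing reads are issued against loop devices owned by other
instances; if they persist, the assumption above is wrong.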

> 
>> for which particular loopback device. And given the structure of the
>> test case - always generating unique names and each instance working
>> with its own dedicated loopback device I find it odd that I see the IO
>> failure with multiple tests and not when running 1 instance.
>>
>> Regards,
>> Nikolay
>>
> 
> 
> 