Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd

Markus Kienast <elias1884@xxxxxxxxx> · Tue, 24 Nov 2015 13:46:18 +0100

Unfortunately I have rebooted the server, as I needed the services back online.
I did try mapping and unmapping again after reboot and did not see the
problem anymore.

However, I will search through my logs and send you everything from
Feb 3 - Feb 5.

And if I see the issue again, I will follow all the debug steps
described in this thread and post the results here.
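
For my own reference, the checks asked for in this thread could be
gathered with something like the sketch below. It is only a sketch: the
function names (filter_d_state, filter_jbd2, collect_rbd_debug) are made
up here, the paths are the ones quoted further down, and I have not run
it against a device that is actually stuck.

```shell
#!/bin/sh
# Sketch only: collects the information requested in this thread for one
# mapped rbd device. All function names are hypothetical.

# Reads "PID STAT COMMAND" lines on stdin and prints the processes in
# uninterruptible sleep, i.e. whose STAT field starts with "D".
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Reads "PID STAT COMMAND" lines on stdin and prints the jbd2 journal
# threads that belong to the given device (e.g. rbd0), including its
# partitions (e.g. rbd0p1).
filter_jbd2() {
    awk -v dev="$1" '$3 ~ ("^\\[?jbd2/" dev "(p[0-9]+)?-")'
}

# Usage: collect_rbd_debug 0     (for /dev/rbd0; run as root)
collect_rbd_debug() {
    id="$1"
    dev="rbd$id"

    echo "== kernel =="
    uname -a

    echo "== holders of /dev/$dev =="
    ls "/sys/block/$dev/holders/"

    echo "== processes using /dev/$dev =="
    fuser -amv "/dev/$dev"

    echo "== kernel client id =="
    cat "/sys/bus/rbd/devices/$id/client_id"

    echo "== jbd2 threads for $dev =="
    ps -eo pid,stat,args | filter_jbd2 "$dev"

    echo "== processes in D state =="
    ps -eo pid,stat,args | filter_d_state

    echo "== outstanding OSD requests =="
    for f in /sys/kernel/debug/ceph/*/osdc; do
        echo "-- $f"
        # One of the reads hung in this thread, so bound each read.
        # Assumption: timeout may not help if the read blocks
        # uninterruptibly inside the kernel.
        timeout 5 cat "$f" || echo "(read failed or timed out)"
    done

    echo "== last kernel messages =="
    dmesg | tail -n 50
}
```

Running "collect_rbd_debug 0 > rbd0-debug.txt 2>&1" right after a failed
"rbd unmap /dev/rbd0" should give one file that can be attached to a reply.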

In the meantime, I have upgraded to the next minor revision from your
dragonfly-debian archives, so that may be why I do not see the problem
anymore.

Many thanks for your help!

Regards,
Markus

On Tue, Nov 24, 2015 at 12:51 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Tue, Nov 24, 2015 at 12:49 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> On Tue, Nov 24, 2015 at 12:12 AM, Markus Kienast <elias1884@xxxxxxxxx> wrote:
>>> Kernel Version
>>> elias@paris3:~$ uname -a
>>> Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13
>>> 16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Output of dmesg and /var/log/dmesg attached.
>>> It does not show much, except that one mon is down.
>>> The mon is down for hardware reasons.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>>>
>>>> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <mark@xxxxxxxxxxxxx> wrote:
>>>> > I am having the same issue here.
>>>>
>>>> Which kernel are you running?  Could you attach your dmesg?
>>>>
>>>> >
>>>> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0
>>>> > rbd: failed to remove rbd device: (16) Device or resource busy
>>>> > rbd: remove failed: (16) Device or resource busy
>>>> >
>>>> > root@paris3:/etc/neutron# rbd info -p volumes
>>>> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
>>>> > 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
>>>> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
>>>> > c=0x17734e0).fault
>>>> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
>>>> > size 500 GB in 128000 objects
>>>> > order 22 (4096 kB objects)
>>>> > block_name_prefix: rbd_data.1b6d9e2aaa998b
>>>> > format: 2
>>>> > features: layering
>>>> > root@paris3:/etc/neutron# rados -p volumes listwatchers
>>>> > rbd_header.1b6d9e2aaa998b
>>>> > 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
>>>> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x9cf4f0).fault
>>>>
>>>> Did you root cause these faults?
>>>
>>> Hardware failure caused these faults.
>>>
>>>>
>>>> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
>>>> >
>>>> > root@paris3:/etc/neutron# ps ax | grep rbd
>>>> >  7814 ?        S      0:00 [jbd2/rbd0-8]
>>>>
>>>> Was there an ext filesystem involved?  How was it umounted - do you
>>>> have a "umount <mountpoint>" process stuck in D state?
>>>
>>> Yes, all these RBDs are formatted with ext4. I am regularly using them
>>> with openstack and have never had any problems.
>>> I did "umount <mountpoint>" and the umount process did actually
>>> finish just fine.
>>> How can I check whether it is stuck in "D" state?
>>>
>>>>
>>>> > 11003 ?        S      0:00 [jbd2/rbd1-8]
>>>> > 14042 ?        S      0:00 [jbd2/rbd2p1-8]
>>>> > 24228 ?        S      0:00 [jbd2/rbd3-8]
>>>> >
>>>> > root@paris3:/etc/neutron# ceph --version
>>>> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
>>>> >
>>>> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
>>>> > returns nothing
>>>> >
>>>> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0
>>>> >                      USER        PID ACCESS COMMAND
>>>> > /dev/rbd0:
>>>>
>>>> What's the output of "cat /sys/bus/rbd/devices/0/client_id"?
>>>
>>> root@paris3:~# cat /sys/bus/rbd/devices/0/client_id
>>> client8471177
>>>
>>>>
>>>> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?
>>>
>>> root@paris3:~# ls -l /sys/kernel/debug/ceph/
>>> total 0
>>> drwxr-xr-x 2 root root 0 Feb  4  2015
>>> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711
>>> drwxr-xr-x 2 root root 0 Nov 23 11:41
>>> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177
>>>
>>> root@paris3:~# cat
>>> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc
>>> has no output
>>
>> This means there are no outstanding/hung rbd I/Os.  According to you,
>> umount completed successfully, and yet there is a jbd2/rbd0-8 kthread
>> hanging around, keeping /dev/rbd0 open and holding a ref to it.
>> A quick search produced two similar reports:
>>
>> [1] https://ask.fedoraproject.org/en/question/7572/how-to-stop-kernel-ext4-journaling-thread/
>> [2] http://lists.openwall.net/linux-ext4/2015/10/24/11
>>
>> The only difference as far as I can tell is those people noticed the jbd2
>> thread because they wanted to run fsck, while you ran into it because
>> you tried to do "rbd unmap".  Neither mentions rbd.
>>
>> Look at [2], did you at any point see any similar errors in dmesg?
>>
>>>
>>> root@paris3:~# cat
>>> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
>>> hangs with no output
>>
>> It shouldn't hang, so it could be unrelated.  Given the "Feb  4  2015"
>
> It should read "so it could be related", of course.
>
> Thanks,
>
>                 Ilya