Re: [CEPH-DEVEL] [ceph-users] occasional failure to unmap rbd

Ilya Dryomov <idryomov@xxxxxxxxx> · Tue, 24 Nov 2015 12:49:04 +0100

On Tue, Nov 24, 2015 at 12:12 AM, Markus Kienast <elias1884@xxxxxxxxx> wrote:
> Kernel Version
> elias@paris3:~$ uname -a
> Linux paris3.sfe.tv 3.16.0-28-generic #38-Ubuntu SMP Sat Dec 13
> 16:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> Output of dmesg and /var/log/dmesg attached.
> But does not show much except for one mon being down.
> The mon is down for hardware reasons.
>
>
>
> On Mon, Nov 23, 2015 at 11:26 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>>
>> On Mon, Nov 23, 2015 at 11:03 PM, Markus Kienast <mark@xxxxxxxxxxxxx> wrote:
>> > I am having the same issue here.
>>
>> Which kernel are you running?  Could you attach your dmesg?
>>
>> >
>> > root@paris3:/etc/neutron# rbd unmap /dev/rbd0
>> > rbd: failed to remove rbd device: (16) Device or resource busy
>> > rbd: remove failed: (16) Device or resource busy
>> >
>> > root@paris3:/etc/neutron# rbd info -p volumes
>> > volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2
>> > 2015-11-23 22:42:06.842697 7f2d57e49700  0 -- :/2760503703 >>
>> > 10.90.90.4:6789/0 pipe(0x1773250 sd=3 :0 s=1 pgs=0 cs=0 l=1
>> > c=0x17734e0).fault
>> > rbd image 'volume-f3ab6892-f35e-4b98-8832-efbaaa2f4ca2':
>> > size 500 GB in 128000 objects
>> > order 22 (4096 kB objects)
>> > block_name_prefix: rbd_data.1b6d9e2aaa998b
>> > format: 2
>> > features: layering
>> > root@paris3:/etc/neutron# rados -p volumes listwatchers
>> > rbd_header.1b6d9e2aaa998b
>> > 2015-11-23 22:42:58.546723 7fec94fec700  0 -- :/2519796249 >>
>> > 10.90.90.4:6789/0 pipe(0x9cf260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x9cf4f0).fault
>>
>> Did you root cause these faults?
>
> Hardware failure caused these faults.
>
>>
>> > watcher=10.90.90.3:0/3293327848 client.8471177 cookie=1
>> >
>> > root@paris3:/etc/neutron# ps ax | grep rbd
>> >  7814 ?        S      0:00 [jbd2/rbd0-8]
>>
>> Was there an ext filesystem involved?  How was it umounted - do you
>> have a "umount <mountpoint>" process stuck in D state?
>
> Yes, all these RBDs are formatted with ext4. I am regularly using them
> with openstack and have never had any problems.
> I did "unmount <mountpoint>" and the unmount process did actually
> finish just fine.
> Where can I look up, if it is stuck in "D" state?
>
>>
>> > 11003 ?        S      0:00 [jbd2/rbd1-8]
>> > 14042 ?        S      0:00 [jbd2/rbd2p1-8]
>> > 24228 ?        S      0:00 [jbd2/rbd3-8]
>> >
>> > root@paris3:/etc/neutron# ceph --version
>> > ceph version 0.80.11 (8424145d49264624a3b0a204aedb127835161070)
>> >
>> > root@paris3:/etc/neutron# ls /sys/block/rbd0/holders/
>> > returns nothing
>> >
>> > root@paris3:/etc/neutron# fuser -amv /dev/rbd0
>> >                      USER        PID ACCESS COMMAND
>> > /dev/rbd0:
>>
>> What's the output of "cat /sys/bus/rbd/devices/0/client_id"?
>
> root@paris3:~# cat /sys/bus/rbd/devices/0/client_id
> client8471177
>
>>
>> What's the output of "sudo cat /sys/kernel/debug/ceph/*/osdc"?
>
> root@paris3:~# ls -l /sys/kernel/debug/ceph/
> total 0
> drwxr-xr-x 2 root root 0 Feb  4  2015
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711
> drwxr-xr-x 2 root root 0 Nov 23 11:41
> 32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177
>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client8471177/osdc
> has no output

This means there are no outstanding/hung rbd I/Os.  According to you,
umount completed successfully, and yet there is a jbd2/rbd0-8 kthread
hanging around, keeping /dev/rbd0 open and holding a ref to it.
A quick search produced two similar reports:

[1] https://ask.fedoraproject.org/en/question/7572/how-to-stop-kernel-ext4-journaling-thread/
[2] http://lists.openwall.net/linux-ext4/2015/10/24/11

The only difference, as far as I can tell, is that those people noticed
the jbd2 thread because they wanted to run fsck, while you ran into it
because you tried to do "rbd unmap".  Neither report mentions rbd.
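
As for where to look for a task stuck in D state, and for the leftover
jbd2 kthreads: below is a rough sketch, not a polished tool.  The helper
names are made up for illustration, and it assumes a procps-style ps that
accepts "-eo pid,stat,comm".  A STAT value starting with "D" means
uninterruptible sleep.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any package.
# Both helpers read "PID STAT COMM" lines on stdin, as printed by
#   ps -eo pid,stat,comm
# The ps header line is skipped naturally: it matches neither pattern.

# Print "PID COMM" for jbd2 journal kthreads attached to rbd devices.
filter_rbd_jbd2() {
    awk '$3 ~ /^jbd2\/rbd/ { print $1, $3 }'
}

# Print "PID COMM" for tasks in uninterruptible sleep (STAT begins with D).
# Only the first word of COMM is printed if the name contains spaces.
filter_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage:
#   ps -eo pid,stat,comm | filter_rbd_jbd2
#   ps -eo pid,stat,comm | filter_d_state
```

If the first command still lists jbd2/rbdN-8 after the filesystem has been
unmounted, that matches the situation described above.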

Look at [2], did you at any point see any similar errors in dmesg?

>
> root@paris3:~# cat
> /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
> hangs with no output

It shouldn't hang, so it could be unrelated.  Given the "Feb  4  2015"
timestamp, I'm going to assume you haven't rebooted this box in a long
time?  If so, do you remember what happened around that date?

Do you keep syslog archives?  I'd be interested in seeing everything
you have for Feb 3 - Feb 5.

To try to figure out where it's hanging, can you do

# cat /sys/kernel/debug/ceph/32ba3117-e320-49fc-aabd-f100d5a7e94b.client7663711/osdc
< it'll hang, grab its PID from ps output >
# cat /proc/$PID/stack
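
If juggling two terminals is inconvenient, the same steps can be sketched
as a small helper.  This is a variation on the procedure above, not
something I have run on your box: dump_stack_if_stuck is a made-up name,
the 5 second default is arbitrary, and reading /proc/$PID/stack needs
root.

```shell
#!/bin/sh
# Sketch only: dump_stack_if_stuck is a hypothetical helper name.
# Reads FILE in the background.  If the reader has not finished after
# TIMEOUT seconds, prints its PID, its state from /proc/PID/status and
# its kernel stack, then returns 1.  Returns 0 if the read finished.
dump_stack_if_stuck() {
    file=$1
    timeout=${2:-5}

    cat "$file" >/dev/null 2>&1 &
    stuck_pid=$!
    sleep "$timeout"

    # An empty state means the reader is gone; Z means it exited and is
    # merely waiting to be reaped.  Anything else is still running.
    state=$(awk '/^State:/ { print $2 }' "/proc/$stuck_pid/status" 2>/dev/null)
    if [ -n "$state" ] && [ "$state" != "Z" ]; then
        echo "reader still running: pid=$stuck_pid state=$state"
        cat "/proc/$stuck_pid/stack" 2>/dev/null ||
            echo "(could not read /proc/$stuck_pid/stack; are you root?)"
        return 1
    fi

    wait "$stuck_pid" 2>/dev/null
    echo "read finished without hanging"
    return 0
}

# Usage (as root), with the osdc path from above:
#   dump_stack_if_stuck /sys/kernel/debug/ceph/<fsid>.client<id>/osdc
```

A state of "D" in the output would mean the reader is in uninterruptible
sleep, and the stack that follows shows where in the kernel it is stuck.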

Thanks,

                Ilya