Re: rbd unmap fails with "Device or resource busy"

Hi Ilya,

On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>>> What can make a "rbd unmap" fail, assuming the device is not mounted and not (obviously) open by any other processes?
>>
>> E.g. maybe there's some way of using ebpf or similar to look at the 'rbd_dev->open_count' in the live kernel?
>>
>> And/or maybe there's some way, again using ebpf or similar, to record sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release() to try to identify something that's opening the device and not releasing it?
>
> Attaching kprobes to rbd_open() and rbd_release() is probably the fastest option. I don't think you even need a stack trace, PID and comm (process name) should do. I would start with something like:
>
> # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid, comm) } kprobe:rbd_release { printf("release pid %d comm %s\n", pid, comm) }'
>
> Fetching the actual rbd_dev->open_count value is more involved but also doable.

Excellent! Thanks!

tl;dr there's something other than the open_count causing the unmap failures - or something's elevating and decrementing open_count without going through rbd_open and rbd_release. Or perhaps there's some situation whereby bpftrace "misses" recording calls to rbd_open and rbd_release.

FYI, the production process is:

- create snapshot of rbd
- map
- mount with ro,norecovery,nouuid (the original live fs is still mounted)
- export via NFS
- mount on Windows NFS client
- process on Windows
- remove Windows NFS mount
- unexport from NFS
- unmount
- unmap

(I haven't mentioned the NFS export previously because I thought the issue was replicable without it - but that might simply have been due to the 'pvs' issue which has been resolved.)

I now have a script that mimics the above production sequence in a loop and left it running all night. Out of 288 iterations there were 13 instances where the unmap failed for some time (i.e. in all cases it eventually succeeded, unlike the 51 rbd devices I can't seem to unmap at all without using --force). In the failing cases the unmap was retried at 1 second intervals. The shortest time taken to eventually unmap was 521 seconds, the longest was 793 seconds.

Note, in the below I'm using "successful" for the tests where the first unmap succeeded, and "failed" for the tests where the first unmap failed, although in all cases the unmap eventually succeeded.

I ended up with a bpftrace script (see below) that logs the timestamp, open or release (O/R), pid, device name, open_count (at entry to the function), and process name.

A successful iteration of that process mostly looks like this:

Timestamp     O/R Pid    Device Count Process
18:21:18.235870 O 3269426 rbd29 0 mapper
18:21:20.088873 R 3269426 rbd29 1 mapper
18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
18:21:20.105281 O 3269457 rbd29 1 blkid
18:21:31.858621 R 3269457 rbd29 2 blkid
18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
18:21:31.882235 O 3269475 rbd29 0 mount
18:21:38.241808 R 3269475 rbd29 1 mount
18:21:38.242174 O 3269475 rbd29 0 mount
18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
18:23:55.564512 R 3270060 rbd29 1 umount

Or occasionally it looks like this, with the "rpc.mountd" open/release entries not appearing at all:

18:35:49.539224 O 3277664 rbd29 0 mapper
18:35:50.515777 R 3277664 rbd29 1 mapper
18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
18:35:50.531978 O 3277694 rbd29 1 blkid
18:35:57.361799 R 3277694 rbd29 2 blkid
18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
18:35:57.384316 O 3277713 rbd29 0 mount
18:36:01.234337 R 3277713 rbd29 1 mount
18:36:01.234849 O 3277713 rbd29 0 mount
18:37:21.304270 R 3289527 rbd29 1 umount

Of the 288 iterations, only 20 didn't include the rpc.mountd lines.

An unsuccessful iteration looks like this:

18:37:31.885408 O 3294108 rbd29 0 mapper
18:37:33.181607 R 3294108 rbd29 1 mapper
18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
18:37:33.197982 O 3294691 rbd29 1 blkid
18:37:42.712870 R 3294691 rbd29 2 blkid
18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
18:37:42.738469 O 3298073 rbd29 0 mount
18:37:49.339012 R 3298073 rbd29 1 mount
18:37:49.339352 O 3298073 rbd29 0 mount
18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
18:53:56.054685 R 3313923 rbd29 1 init

According to my script log, the first unmap attempt was at 18:39:42, i.e. 42 seconds after rpc.mountd released the device. At that point the open_count was (or should have been?) 1 again, allowing the unmap to succeed - but it didn't. The unmap was retried every second until it eventually succeeded at 18:53:56, the same time as the mysterious "init" process ran. Also note there is NO "umount" process in there, so I don't know whether the process name recorded by bpftrace is simply incorrect (but how would that happen??) or what else could be going on.

All 13 of the failed iterations recorded that weird "init" instead of "umount".

12 of the 13 failed iterations included rpc.mountd in the trace, but one didn't (it went straight from mount to init/umount, like the second successful example above) - roughly the same proportion as in the successful iterations.
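
To try to pin down where that final release is actually coming from (and whether the comm is just misleading), the next thing I plan to try is dumping a kernel stack from the release probe - just a sketch, reusing the same elided struct definitions as the script below:

--------------------------------------------------------------------
kprobe:rbd_release {
  $disk = (struct gendisk *)arg0;
  $rbd_dev = (struct rbd_device *)($disk->private_data);

  printf("%s R %d %s %lu %s\n",
    strftime("%T.%f", nsecs), pid, $rbd_dev->name,
    $rbd_dev->open_count, comm
  );
  // also dump the kernel stack, to see which code path issued the
  // release (directly from the umount, or some deferred/async path)
  printf("%s\n", kstack);
}
--------------------------------------------------------------------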

So it seems there's something other than the open_count causing the unmap failures - or something's elevating and decrementing open_count without going through rbd_open and rbd_release. Or perhaps there's some situation whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
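
On that last point: bpftrace's printf output goes through a perf buffer and can drop events under load (bpftrace warns about lost events when that happens), whereas map counters are updated in-kernel and aren't subject to those drops. So as a cross-check for missed calls I'm considering adding simple per-device counters alongside the log - again only a sketch, using the same elided struct definitions:

--------------------------------------------------------------------
// per-device open/release counters, printed automatically on exit;
// a mismatch between these and the printf log would point at dropped
// printf events rather than calls bypassing rbd_open()/rbd_release()
kprobe:rbd_open {
  $bdev = (struct block_device *)arg0;
  $rbd_dev = (struct rbd_device *)($bdev->bd_disk->private_data);
  @opens[str($rbd_dev->name)] = count();
}
kprobe:rbd_release {
  $disk = (struct gendisk *)arg0;
  $rbd_dev = (struct rbd_device *)($disk->private_data);
  @releases[str($rbd_dev->name)] = count();
}
--------------------------------------------------------------------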


The full bpftrace script that produced the logs above looks like this:
--------------------------------------------------------------------
//
// bunches of defines and structure definitions extracted from
// drivers/block/rbd.c elided here...
//
kprobe:rbd_open {
  // arg0 is the struct block_device being opened
  $bdev = (struct block_device *)arg0;
  $rbd_dev = (struct rbd_device *)($bdev->bd_disk->private_data);

  printf("%s O %d %s %lu %s\n",
    strftime("%T.%f", nsecs), pid, $rbd_dev->name,
    $rbd_dev->open_count, comm
  );
}
kprobe:rbd_release {
  // arg0 is the struct gendisk being released
  $disk = (struct gendisk *)arg0;
  $rbd_dev = (struct rbd_device *)($disk->private_data);

  printf("%s R %d %s %lu %s\n",
    strftime("%T.%f", nsecs), pid, $rbd_dev->name,
    $rbd_dev->open_count, comm
  );
}
--------------------------------------------------------------------
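
It might also be worth timestamping the unmap attempts from the kernel side so they can be lined up against the open/release log. If I'm reading drivers/block/rbd.c right, the sysfs remove handler is do_rbd_remove() (an assumption worth confirming first, e.g. with: bpftrace -l 'kprobe:*rbd*remove*'), in which case something like this should log each attempt and its result:

--------------------------------------------------------------------
// log each rbd remove (unmap) attempt and its return value:
// positive = success (bytes consumed), negative = errno (-16 = EBUSY)
// NB: do_rbd_remove() is assumed from drivers/block/rbd.c; confirm the
// symbol is actually traceable on this kernel before relying on it
kretprobe:do_rbd_remove {
  printf("%s remove attempt pid %d comm %s retval %d\n",
    strftime("%T.%f", nsecs), pid, comm, (int32)retval);
}
--------------------------------------------------------------------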


Cheers,

Chris


