Hi Ilya,
On Mon, Sep 19, 2022 at 12:14:06PM +0200, Ilya Dryomov wrote:
> On Mon, Sep 19, 2022 at 9:43 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>>> What can make a "rbd unmap" fail, assuming the device is not
>>> mounted and not (obviously) open by any other processes?
>>
>> E.g. maybe there's some way of using ebpf or similar to look at the
>> 'rbd_dev->open_count' in the live kernel?
>>
>> And/or maybe there's some way, again using ebpf or similar, to record
>> sufficient info (e.g. a stack trace?) from rbd_open() and
>> rbd_release() to try to identify something that's opening the device
>> and not releasing it?
>
> Attaching kprobes to rbd_open() and rbd_release() is probably the
> fastest option. I don't think you even need a stack trace, PID and
> comm (process name) should do. I would start with something like:
>
> # bpftrace -e 'kprobe:rbd_open { printf("open pid %d comm %s\n", pid, comm) }
>     kprobe:rbd_release { printf("release pid %d comm %s\n", pid, comm) }'
>
> Fetching the actual rbd_dev->open_count value is more involved but
> also doable.
Excellent! Thanks!
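(As an aside, re looking at open_count on demand: one approach that seems
workable is to stash the rbd_dev pointer in a map at open time and poll it
from an interval probe. An untested sketch, using the same struct
definitions as the script at the end of this mail, and only tracking the
most recently opened device:)
--------------------------------------------------------------------
// same defines and struct definitions from drivers/block/rbd.c as in
// the script at the end of this mail...
kprobe:rbd_open {
	// stash the most recently opened rbd_dev (a real version might
	// keep a map keyed per device instead)
	$bdev = (struct block_device *)arg0;
	@rbd_dev = (uint64)$bdev->bd_disk->private_data;
}
interval:s:1 {
	if (@rbd_dev != 0) {
		$p = @rbd_dev;
		$rbd_dev = (struct rbd_device *)$p;
		printf("%s %s open_count %lu\n",
			strftime("%T", nsecs), $rbd_dev->name,
			$rbd_dev->open_count);
	}
}
--------------------------------------------------------------------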
tl;dr there's something other than the open_count causing the unmap
failures - or something's elevating and decrementing open_count without
going through rbd_open and rbd_release. Or perhaps there's some situation
whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
FYI, the production process is (a rough shell sketch follows below):
- create snapshot of rbd
- map
- mount with ro,norecovery,nouuid (the original live fs is still mounted)
- export via NFS
- mount on Windows NFS client
- process on Windows
- remove Windows NFS mount
- unexport from NFS
- unmount
- unmap
(I haven't mentioned the NFS export previously because I thought the
issue was replicable without it - but that apparent replication may
simply have been due to the 'pvs' issue, which has since been resolved.)
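For concreteness, each iteration boils down to something like this sketch
(pool, image, mount point and client names are illustrative, not the real
ones):
--------------------------------------------------------------------
#!/bin/bash
set -e
# illustrative names only
pool=rbd img=myfs snap=test
mnt=/mnt/snap client=winhost

rbd snap create $pool/$img@$snap
dev=$(rbd map $pool/$img@$snap)
# the original live fs is still mounted, hence nouuid
mount -o ro,norecovery,nouuid $dev $mnt
exportfs -o ro $client:$mnt
# ... Windows NFS client mounts, processes, unmounts ...
exportfs -u $client:$mnt
umount $mnt
rbd unmap $dev
--------------------------------------------------------------------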
I now have a script that mimics the above production sequence in a loop
and left it running all night. Out of 288 iterations there were 13 in
which the unmap failed for some time (i.e. in all cases it eventually
succeeded, unlike the 51 rbd devices I can't seem to unmap at all
without using --force). In the failing cases the unmap was retried at
1-second intervals. The shortest time to an eventual successful unmap
was 521 seconds, the longest was 793 seconds.
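(The unmap retry in the test script is essentially this sketch:)
--------------------------------------------------------------------
# retry the unmap at 1-second intervals, recording how long it took
start=$SECONDS
until rbd unmap "$dev" 2>/dev/null; do
	sleep 1
done
echo "unmap of $dev took $((SECONDS - start))s"
--------------------------------------------------------------------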
Note: below I use "successful" for iterations where the first unmap
succeeded, and "failed" for iterations where the first unmap failed,
although in all cases the unmap eventually succeeded.
I ended up with a bpftrace script (see below) that logs the timestamp,
open or release (O/R), pid, device name, open_count (at entry to the
function), and process name.
A successful iteration of the process typically looks like this:
Timestamp O/R Pid Device Count Process
18:21:18.235870 O 3269426 rbd29 0 mapper
18:21:20.088873 R 3269426 rbd29 1 mapper
18:21:20.089346 O 3269447 rbd29 0 systemd-udevd
18:21:20.105281 O 3269457 rbd29 1 blkid
18:21:31.858621 R 3269457 rbd29 2 blkid
18:21:31.861762 R 3269447 rbd29 1 systemd-udevd
18:21:31.882235 O 3269475 rbd29 0 mount
18:21:38.241808 R 3269475 rbd29 1 mount
18:21:38.242174 O 3269475 rbd29 0 mount
18:22:49.646608 O 2364320 rbd29 1 rpc.mountd
18:22:58.715634 R 2364320 rbd29 2 rpc.mountd
18:23:55.564512 R 3270060 rbd29 1 umount
Occasionally it instead looks like this, with no "rpc.mountd" entries:
18:35:49.539224 O 3277664 rbd29 0 mapper
18:35:50.515777 R 3277664 rbd29 1 mapper
18:35:50.516224 O 3277685 rbd29 0 systemd-udevd
18:35:50.531978 O 3277694 rbd29 1 blkid
18:35:57.361799 R 3277694 rbd29 2 blkid
18:35:57.365263 R 3277685 rbd29 1 systemd-udevd
18:35:57.384316 O 3277713 rbd29 0 mount
18:36:01.234337 R 3277713 rbd29 1 mount
18:36:01.234849 O 3277713 rbd29 0 mount
18:37:21.304270 R 3289527 rbd29 1 umount
Of the 288 iterations, only 20 didn't include the rpc.mountd lines.
A failed iteration looks like this:
18:37:31.885408 O 3294108 rbd29 0 mapper
18:37:33.181607 R 3294108 rbd29 1 mapper
18:37:33.182086 O 3294175 rbd29 0 systemd-udevd
18:37:33.197982 O 3294691 rbd29 1 blkid
18:37:42.712870 R 3294691 rbd29 2 blkid
18:37:42.716296 R 3294175 rbd29 1 systemd-udevd
18:37:42.738469 O 3298073 rbd29 0 mount
18:37:49.339012 R 3298073 rbd29 1 mount
18:37:49.339352 O 3298073 rbd29 0 mount
18:38:51.390166 O 2364320 rbd29 1 rpc.mountd
18:39:00.989050 R 2364320 rbd29 2 rpc.mountd
18:53:56.054685 R 3313923 rbd29 1 init
According to my script log, the first unmap attempt was at 18:39:42,
i.e. 42 seconds after rpc.mountd released the device. At that point the
open_count was (or should have been?) 1 again, allowing the unmap to
succeed - but it didn't. The unmap was retried every second until it
eventually succeeded at 18:53:56, the same time the mysterious "init"
process ran. Note also that there is NO "umount" process in this trace,
so I don't know whether the process name recorded by bpftrace is simply
incorrect (but how would that happen??) or what else could be going on.
All 13 of the failed iterations recorded that weird "init" instead of
"umount".
12 of the 13 failed iterations included rpc.mountd in the trace, but
one didn't (it went straight from mount to init/umount, like the second
successful example above) - i.e. around the same proportion as in the
successful iterations.
So it seems there's something other than the open_count causing the unmap
failures - or something's elevating and decrementing open_count without
going through rbd_open and rbd_release. Or perhaps there's some situation
whereby bpftrace "misses" recording calls to rbd_open and rbd_release.
The bpftrace script looks like this:
--------------------------------------------------------------------
//
// bunches of defines and structure definitions extracted from
// drivers/block/rbd.c elided here...
//
kprobe:rbd_open {
	// rbd_open(struct block_device *bdev, fmode_t mode)
	$bdev = (struct block_device *)arg0;
	$rbd_dev = (struct rbd_device *)($bdev->bd_disk->private_data);
	printf("%s O %d %s %lu %s\n",
		strftime("%T.%f", nsecs), pid, $rbd_dev->name,
		$rbd_dev->open_count, comm
	);
}
kprobe:rbd_release {
	// rbd_release(struct gendisk *disk, fmode_t mode)
	$disk = (struct gendisk *)arg0;
	$rbd_dev = (struct rbd_device *)($disk->private_data);
	printf("%s R %d %s %lu %s\n",
		strftime("%T.%f", nsecs), pid, $rbd_dev->name,
		$rbd_dev->open_count, comm
	);
}
--------------------------------------------------------------------
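For the next run I might extend the release probe with a kernel stack
trace, to see where that mysterious final release is actually coming
from, e.g.:
--------------------------------------------------------------------
kprobe:rbd_release {
	$disk = (struct gendisk *)arg0;
	$rbd_dev = (struct rbd_device *)($disk->private_data);
	// kstack should show whether the final release comes from a
	// umount path or from somewhere else entirely
	printf("%s R %d %s %lu %s\n%s\n",
		strftime("%T.%f", nsecs), pid, $rbd_dev->name,
		$rbd_dev->open_count, comm, kstack
	);
}
--------------------------------------------------------------------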
Cheers,
Chris