Re: rbd unmap fails with "Device or resource busy"


 



On Thu, Sep 15, 2022 at 06:29:20PM +1000, Chris Dunlop wrote:
On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
What can make a "rbd unmap" fail, assuming the device is not mounted and not (obviously) open by any other processes?

linux-5.15.58
ceph-16.2.9

I have multiple XFS-on-rbd filesystems, and I frequently create an rbd snapshot, map it, mount the snapshot read-only, perform some work on the fs, then unmount and unmap. The unmap regularly fails (about 1 time in 10) like:
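For reference, the per-snapshot cycle is essentially the following (pool, image, and mountpoint names here are placeholders):

```shell
rbd snap create pool/img@snap
dev=$(sudo rbd map pool/img@snap)   # snapshots map read-only
# XFS needs nouuid (the snapshot shares the origin fs UUID) and
# norecovery (a read-only device can't replay a dirty log)
sudo mount -o ro,norecovery,nouuid "$dev" /mnt/snap
# ... read-only work on the fs ...
sudo umount /mnt/snap
sudo rbd unmap "$dev"
```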

$ sudo rbd unmap /dev/rbd29
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

tl;dr problem solved: there WAS a process holding the rbd device open.

Sigh. It turns out the problem is NOT solved.

I've stopped 'pvs' from scanning the rbd devices. This was sufficient to allow my minimal test script to work without unmap failures, but my full production process is still suffering from the unmap failures.

I now have 51 rbd devices which I haven't been able to unmap for the last three days (in contrast to my earlier statement where I said I'd always been able to unmap eventually, generally after 30 minutes or so). That's out of maybe 80-90 mapped rbds over that time.

I've no idea why the unmap failures are so common this time, and why, this time, I haven't been able to unmap them in 3 days.

I had been trying an unmap of one specific rbd (randomly selected) every second for 3 hours whilst simultaneously, in a tight loop, looking for any other processes that have the device open. The unmaps continued to fail and I haven't caught any other process with the device open.

I also tried a back-off strategy, linearly increasing a sleep between unmap attempts. By the time the sleep was up to 4 hours I gave up, with unmaps of that device still failing. Unmap attempts at random times since then, on that particular device and all the other of the 51 un-unmappable devices, continue to fail.

I'm sure I can unmap the devices using '--force' but at this point I'd rather try to work out WHY the unmap is failing: it seems to be pointing to /something/ going wrong, somewhere. Given no user processes can be seen to have the device open, it seems that "something" might be in the kernel somewhere.
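For completeness, these are the non-invasive checks I know of for the kernel's view of a stuck device (using rbd29 as the example):

```shell
ls /sys/block/rbd29/holders        # dm/LVM stacking would show up here
sudo lsof /dev/rbd29               # userspace fds, current mount namespace only
sudo fuser -v /dev/rbd29
# a mount lingering in another mount namespace (a container, or a systemd
# unit with private mounts) also holds the device open but is invisible
# to lsof/fuser in the host namespace:
grep rbd29 /proc/*/mountinfo 2>/dev/null
```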

I'm trying to put together a test using a cut down version of the production process to see if I can make the unmap failures happen a little more repeatably.

I'm open to suggestions as to what I can look at.

E.g. maybe there's some way of using ebpf or similar to look at the 'rbd_dev->open_count' in the live kernel?
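One way to read that without writing any tracing code might be drgn, the live-kernel debugger. A sketch only, run as "sudo drgn script.py": it assumes debuginfo for the rbd module is available, and the symbol and field names (rbd_dev_list, node, dev_id, open_count) are taken from drivers/block/rbd.c in 5.15, so they may differ on other kernels:

```python
# drgn provides `prog` as a global when run via the drgn CLI
from drgn.helpers.linux.list import list_for_each_entry

# rbd.c keeps every mapped device on a global rbd_dev_list
for rbd_dev in list_for_each_entry(
        "struct rbd_device", prog["rbd_dev_list"].address_of_(), "node"):
    print(f"rbd{int(rbd_dev.dev_id)}: open_count={int(rbd_dev.open_count)}")
```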

And/or maybe there's some way, again using ebpf or similar, to record sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release() to try to identify something that's opening the device and not releasing it?
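bpftrace should be able to do this with kprobes, assuming rbd_open and rbd_release appear in /proc/kallsyms (they're static, but as block_device_operations callbacks they shouldn't be inlined). A sketch:

```shell
sudo bpftrace -e '
kprobe:rbd_open    { printf("OPEN  pid=%d comm=%s\n%s\n", pid, comm, kstack); }
kprobe:rbd_release { printf("CLOSE pid=%d comm=%s\n%s\n", pid, comm, kstack); }'
```

An OPEN with no matching CLOSE, plus its stack trace, would identify whatever is holding the device.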

If anyone knows how that could be done that would be great, otherwise it's going to take me a bit of time to try to work out how that might be done.

Chris



