Re: rbd unmap fails with "Device or resource busy"

Chris Dunlop <chris@xxxxxxxxxxxx> · Mon, 19 Sep 2022 17:43:21 +1000

On Thu, Sep 15, 2022 at 06:29:20PM +1000, Chris Dunlop wrote:
> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>> What can make a "rbd unmap" fail, assuming the device is not mounted
>> and not (obviously) open by any other processes?
>>
>> linux-5.15.58
>> ceph-16.2.9
>>
>> I have multiple XFS on rbd filesystems, and often create rbd
>> snapshots, map and read-only mount the snapshot, perform some work on
>> the fs, then unmount and unmap. The unmap regularly (about 1 in 10
>> times) fails like:
>>
>> $ sudo rbd unmap /dev/rbd29
>> rbd: sysfs write failed
>> rbd: unmap failed: (16) Device or resource busy
>
> tl;dr problem solved: there WAS a process holding the rbd device open.

Sigh. It turns out the problem is NOT solved.

I've stopped 'pvs' from scanning the rbd devices. This was sufficient to 
allow my minimal test script to work without unmap failures, but my full 
production process is still suffering from the unmap failures.

I now have 51 rbd devices which I haven't been able to unmap for the 
last three days (in contrast to my earlier statement where I said I'd 
always been able to unmap eventually, generally after 30 minutes or so).  
That's out of maybe 80-90 mapped rbds over that time.

I've no idea why the unmap failures are so common this time, and why, 
this time, I haven't been able to unmap them in 3 days.

I had been trying an unmap of one specific rbd (randomly selected) every 
second for 3 hours whilst simultaneously, in a tight loop, looking for 
any other processes that have the device open. The unmaps continued to 
fail and I haven't caught any other process with the device open.
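
For reference, a check of the kind described above (looking for processes that have the device open) can be sketched by walking /proc/<pid>/fd and comparing each descriptor against the device node. This is a minimal sketch, not the script actually used: the function names are made up for illustration, it only sees processes visible in the current PID namespace (and only those whose fd directory is readable, so it wants root), and it cannot see opens made from inside the kernel.

```python
import os
import stat


def _identity(st):
    """Key that is equal for two stat results naming the same object."""
    if stat.S_ISBLK(st.st_mode):
        return ("blk", st.st_rdev)        # block device: compare device numbers
    return ("ino", st.st_dev, st.st_ino)  # anything else: compare inode


def find_openers(path, proc="/proc"):
    """Return sorted [(pid, fd)] for every visible descriptor open on `path`."""
    target = _identity(os.stat(path))
    hits = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we are not allowed to look
        for fd in fds:
            try:
                st = os.stat(os.path.join(fd_dir, fd))  # follows the fd link
            except OSError:
                continue  # descriptor closed while we were scanning
            if _identity(st) == target:
                hits.append((int(entry), int(fd)))
    return sorted(hits)
```

Calling find_openers("/dev/rbd29") in a loop gives a list of (pid, fd) pairs; an empty list on every pass only shows that nothing was caught, since a short-lived open can fall between two scans.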

I also tried a back-off strategy by linearly increasing a sleep between
unmap attempts. By the time the sleep was up to 4 hours I gave up, with
unmaps of that device still failing. Unmap attempts at random times
since then on that particular device and on all the others of the 51
un-unmappable devices continue to fail.
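
A linear back-off of the kind described above can be sketched as follows. The step size and the function name are assumptions for illustration (the post only says the sleep grew linearly and was abandoned at about 4 hours); the `run` and `sleep` parameters exist only so the loop can be exercised without a real rbd device.

```python
import subprocess
import time


def unmap_with_backoff(device, step_s=60.0, max_sleep_s=4 * 3600.0,
                       run=subprocess.run, sleep=time.sleep):
    """Retry `rbd unmap`, sleeping step_s, 2*step_s, ... between attempts.

    Returns the number of attempts on success, or None once the next
    sleep would exceed max_sleep_s.
    """
    delay = 0.0
    attempts = 0
    while True:
        attempts += 1
        result = run(["rbd", "unmap", device], capture_output=True, text=True)
        if result.returncode == 0:
            return attempts
        delay += step_s            # linear growth, not exponential
        if delay > max_sleep_s:
            return None            # give up, device still busy
        sleep(delay)
```

With the defaults this makes the first retry after one minute and gives up once the gap between attempts would pass four hours.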

I'm sure I can unmap the devices using '--force' but at this point I'd 
rather try to work out WHY the unmap is failing: it seems to be pointing 
to /something/ going wrong, somewhere. Given no user processes can be 
seen to have the device open, it seems that "something" might be in the 
kernel somewhere.
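
Two holders that a scan of open file descriptors will not show are a mount of the device in some other mount namespace, and a kernel-level holder listed under /sys/block/<dev>/holders (device-mapper, for example). Whether either applies here is not established in this thread; the sketch below only shows how one might look. The function names are hypothetical, and the major:minor pair has to be supplied by the caller (for instance from `ls -l /dev/rbd29`).

```python
import os


def mount_users(major, minor, proc="/proc"):
    """Map pid -> [mount points] for every process whose mount namespace
    still has the block device major:minor mounted."""
    want = f"{major}:{minor}"
    found = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "mountinfo")) as fh:
                lines = fh.readlines()
        except OSError:
            continue  # process exited, or not readable
        for line in lines:
            fields = line.split()
            # mountinfo: id parent major:minor root mountpoint options ...
            if len(fields) > 4 and fields[2] == want:
                found.setdefault(int(entry), []).append(fields[4])
    return found


def sysfs_holders(name, sys_block="/sys/block"):
    """Return the kernel-level holders of block device `name` (e.g. 'rbd29')."""
    try:
        return sorted(os.listdir(os.path.join(sys_block, name, "holders")))
    except OSError:
        return []  # no such device, or no holders directory
```

An empty result from both functions rules out only these two causes.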

I'm trying to put together a test using a cut-down version of the
production process to see if I can make the unmap failures happen a
little more repeatably.

I'm open to suggestions as to what I can look at.

E.g. maybe there's some way of using ebpf or similar to look at the 
'rbd_dev->open_count' in the live kernel?

And/or maybe there's some way, again using ebpf or similar, to record 
sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release() 
to try to identify something that's opening the device and not releasing 
it?
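
One way the tracing idea above might be approached is with kprobes on rbd_open() and rbd_release() via the bcc Python bindings, logging the device number, the calling task and a kernel stack for each call. This is an untested sketch under several assumptions: bcc and kernel headers are installed, both functions appear in /proc/kallsyms and are not inlined, and their first arguments are a `struct block_device *` and a `struct gendisk *` respectively, as in the 5.15 sources (later kernels changed these signatures). It does not read rbd_dev->open_count directly; unmatched opens have to be inferred from the log.

```python
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

struct event_t {
    u32 pid;
    u32 dev;        /* kernel-internal dev_t: major << 20 | minor */
    int kstack;
    int is_open;
    char comm[16];
};
BPF_PERF_OUTPUT(events);
BPF_STACK_TRACE(stacks, 4096);

static inline int emit(struct pt_regs *ctx, u32 dev, int is_open)
{
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.dev = dev;
    ev.kstack = stacks.get_stackid(ctx, 0);   /* kernel stack of the caller */
    ev.is_open = is_open;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}

/* Assumed 5.15 signature: rbd_open(struct block_device *bdev, fmode_t mode) */
int on_open(struct pt_regs *ctx, struct block_device *bdev)
{
    u32 dev = bdev->bd_dev;
    return emit(ctx, dev, 1);
}

/* Assumed 5.15 signature: rbd_release(struct gendisk *disk, fmode_t mode) */
int on_release(struct pt_regs *ctx, struct gendisk *disk)
{
    u32 major = disk->major;
    u32 minor = disk->first_minor;
    return emit(ctx, (major << 20) | minor, 0);
}
"""

PROBES = {"rbd_open": "on_open", "rbd_release": "on_release"}


def split_dev(dev):
    """Split a kernel-internal dev_t into (major, minor)."""
    return dev >> 20, dev & 0xFFFFF


def format_event(is_open, dev, pid, comm):
    major, minor = split_dev(dev)
    kind = "open" if is_open else "release"
    return f"{kind:7} dev={major}:{minor} pid={pid} comm={comm}"


def main():
    from bcc import BPF  # imported here so the helpers work without bcc

    b = BPF(text=BPF_PROGRAM)
    for symbol, fn_name in PROBES.items():
        b.attach_kprobe(event=symbol, fn_name=fn_name)
    stacks = b["stacks"]

    def handle(cpu, data, size):
        ev = b["events"].event(data)
        comm = ev.comm.decode(errors="replace")
        print(format_event(ev.is_open, ev.dev, ev.pid, comm))
        if ev.kstack >= 0:  # negative means the stack could not be captured
            for addr in stacks.walk(ev.kstack):
                print("    " + b.ksym(addr).decode(errors="replace"))

    b["events"].open_perf_buffer(handle)
    while True:
        b.perf_buffer_poll()
```

Run main() as root while the production process is mapping and unmapping. An "open" line for a device with no later "release" line is the candidate to look at, and its stack should show whether the open came through a syscall or from inside the kernel.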

If anyone knows how that could be done that would be great, otherwise 
it's going to take me a bit of time to try to work out how that might be 
done.

Chris