On Thu, Sep 15, 2022 at 06:29:20PM +1000, Chris Dunlop wrote:
> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>> What can make a "rbd unmap" fail, assuming the device is not mounted
>> and not (obviously) open by any other processes?
>>
>> linux-5.15.58
>> ceph-16.2.9
>>
>> I have multiple XFS on rbd filesystems, and often create rbd
>> snapshots, map and read-only mount the snapshot, perform some work on
>> the fs, then unmount and unmap. The unmap regularly (about 1 in 10
>> times) fails like:
>>
>> $ sudo rbd unmap /dev/rbd29
>> rbd: sysfs write failed
>> rbd: unmap failed: (16) Device or resource busy
>
> tl;dr problem solved: there WAS a process holding the rbd device open.
Sigh. It turns out the problem is NOT solved.
I've stopped 'pvs' from scanning the rbd devices. This was sufficient to
allow my minimal test script to work without unmap failures, but my full
production process is still suffering from the unmap failures.
I now have 51 rbd devices which I haven't been able to unmap for the
last three days (in contrast to my earlier statement where I said I'd
always been able to unmap eventually, generally after 30 minutes or so).
That's out of maybe 80-90 mapped rbds over that time.
I've no idea why the unmap failures are so common this time, and why,
this time, I haven't been able to unmap them in 3 days.
I had been trying an unmap of one specific rbd (randomly selected) every
second for 3 hours whilst simultaneously, in a tight loop, looking for
any other processes that have the device open. The unmaps continued to
fail and I haven't caught any other process with the device open.
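
For reference, a minimal sketch of that kind of "who has the device open" check, assuming a Linux /proc filesystem. The function name find_holders is hypothetical and this is not the script used above. It only sees userspace file descriptors, so an open held inside the kernel will not show up here.

```python
import os
import stat

def find_holders(path, proc_root="/proc"):
    """Return sorted (pid, fd) pairs for processes that have `path` open."""
    target = os.stat(path)
    target_is_blk = stat.S_ISBLK(target.st_mode)
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                st = os.stat(os.path.join(fd_dir, fd))  # follows the fd link
            except OSError:
                continue  # fd closed while we were looking
            same_inode = (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino)
            # A block device can be opened through a different device node,
            # so also compare the device number itself.
            same_blkdev = (target_is_blk and stat.S_ISBLK(st.st_mode)
                           and st.st_rdev == target.st_rdev)
            if same_inode or same_blkdev:
                holders.append((int(entry), int(fd)))
    return sorted(holders)
```

Run as root, find_holders("/dev/rbd29") would list every process visible in /proc; run unprivileged it silently skips processes it cannot inspect.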
I also tried a back-off strategy by linearly increasing a sleep between
unmap attempts. By the time the sleep was up to 4 hours I gave up, with
unmaps of that device still failing. Unmap attempts at random times
since then on that particular device and all the others of the 51
un-unmappable devices continue to fail.
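
A linear back-off of this kind can be sketched as below. The step size, the 4 hour cap as a default, and the function names are assumptions for illustration only; the only external command used is "rbd unmap", which normally needs root.

```python
import subprocess
import time

def linear_backoff(step, max_delay):
    """Yield delays step, 2*step, ... up to and including max_delay."""
    delay = step
    while delay <= max_delay:
        yield delay
        delay += step

def rbd_unmap(device):
    """Run 'rbd unmap DEVICE'; return True if it succeeded."""
    return subprocess.run(["rbd", "unmap", device]).returncode == 0

def unmap_with_backoff(device, step=60, max_delay=4 * 3600,
                       attempt=rbd_unmap, sleep=time.sleep):
    """Return the number of attempts on success, or None after giving up."""
    tries = 1
    if attempt(device):
        return tries
    for delay in linear_backoff(step, max_delay):
        sleep(delay)  # wait a little longer before each retry
        tries += 1
        if attempt(device):
            return tries
    return None
```

The attempt and sleep parameters exist only so the loop can be exercised without touching a real device.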
I'm sure I can unmap the devices using '--force' but at this point I'd
rather try to work out WHY the unmap is failing: it seems to be pointing
to /something/ going wrong, somewhere. Given no user processes can be
seen to have the device open, it seems that "something" might be in the
kernel somewhere.
I'm trying to put together a test using a cut down version of the
production process to see if I can make the unmap failures happen a
little more repeatably.
I'm open to suggestions as to what I can look at.
E.g. maybe there's some way of using eBPF or similar to look at the
'rbd_dev->open_count' in the live kernel?
And/or maybe there's some way, again using eBPF or similar, to record
sufficient info (e.g. a stack trace?) from rbd_open() and rbd_release()
to try to identify something that's opening the device and not releasing
it?
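
One possible starting point, offered as an untested sketch rather than a known-good recipe: a small Python wrapper around bpftrace that attaches kprobes to rbd_open() and rbd_release() and logs the calling task and kernel stack. It assumes bpftrace is installed, that both functions appear as probe-able symbols in /proc/kallsyms with the rbd module loaded, and that it is run as root. The helper names are hypothetical. It does not distinguish between rbd devices, and a release may run in a different task context from the matching open, so the per-task counts are only a hint.

```python
import subprocess

def rbd_trace_program():
    """Build a bpftrace program that logs callers of rbd_open/rbd_release."""
    return r'''
kprobe:rbd_open
{
    @opens[comm, pid] = count();
    time("%H:%M:%S ");
    printf("rbd_open    comm=%s pid=%d\n", comm, pid);
    printf("%s\n", kstack);
}

kprobe:rbd_release
{
    @releases[comm, pid] = count();
    time("%H:%M:%S ");
    printf("rbd_release comm=%s pid=%d\n", comm, pid);
    printf("%s\n", kstack);
}
'''

def run_trace():
    """Run the program under bpftrace; the maps are printed when it exits."""
    return subprocess.call(["bpftrace", "-e", rbd_trace_program()])
```

Comparing the @opens and @releases maps after a map/mount/unmount cycle might show a task that opens the device more often than it releases it.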
If anyone knows how that could be done that would be great, otherwise
it's going to take me a bit of time to try to work out how that might be
done.
Chris