Hi Ilya,
On Tue, Sep 13, 2022 at 01:43:16PM +0200, Ilya Dryomov wrote:
> On Tue, Sep 13, 2022 at 3:44 AM Chris Dunlop <chris@xxxxxxxxxxxx> wrote:
>> What can make a "rbd unmap" fail, assuming the device is not mounted
>> and not (obviously) open by any other processes?
>>
>> linux-5.15.58
>> ceph-16.2.9
>>
>> I have multiple XFS on rbd filesystems, and often create rbd snapshots,
>> map and read-only mount the snapshot, perform some work on the fs, then
>> unmount and unmap. The unmap regularly (about 1 in 10 times) fails
>> like:
>>
>> $ sudo rbd unmap /dev/rbd29
>> rbd: sysfs write failed
>> rbd: unmap failed: (16) Device or resource busy
>>
>> I've double-checked the device is no longer mounted and, using "lsof"
>> etc., that nothing has the device open.
> One thing that "lsof" is oblivious to is multipath, see
> https://tracker.ceph.com/issues/12763.
The server is not using multipath - e.g. there's no multipathd, and:
$ find /dev/mapper/ -name '*mpath*'
...finds nothing.
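For completeness, these are the sorts of checks I mean - a sketch with an illustrative device name and a small helper of my own - covering a few places a stray reference can hide beyond what "lsof" reports:

```shell
#!/bin/bash
# Sketch: places a reference to a block device can lurk that "lsof" misses.
# The device name is illustrative; holders_dir is a helper of mine.

# Map a device node to its sysfs "holders" directory (pure string logic).
holders_dir() { printf '/sys/block/%s/holders\n' "${1#/dev/}"; }

dev=/dev/rbd29   # substitute the stuck device
if [ -e "$dev" ]; then
    fuser -v "$dev"              # processes with the device open
    ls "$(holders_dir "$dev")"   # stacked dm/md consumers of the device
    losetup -j "$dev"            # loop devices backed by the device
fi
```

An empty holders directory and no fuser/losetup output would point at a kernel-side reference rather than a userspace one.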
>> I've found that waiting "a while", e.g. 5-30 minutes, will usually
>> allow the "busy" device to be unmapped without the -f flag.
> A "Device or resource busy" error from "rbd unmap" clearly indicates
> that the block device is still open by something. In this case -- you
> are mounting a block-level snapshot of an XFS filesystem whose "HEAD"
> is already mounted -- perhaps it could be some background XFS worker
> thread? I'm not sure if the "nouuid" mount option solves all issues there.
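For reference, a snapshot map-and-mount along those lines might look like this - a sketch with illustrative names, not my exact invocation. "nouuid" sidesteps the duplicate-UUID clash with the mounted HEAD, and "norecovery" skips XFS log replay on a read-only snapshot:

```shell
#!/bin/bash
# Sketch: map an rbd snapshot and mount it read-only.
# Pool/image/snapshot names and the mount point are illustrative.

# Mount options: nouuid allows mounting alongside the origin fs (which has
# the same UUID); norecovery avoids log replay on a read-only snapshot.
snap_opts() { echo ro,norecovery,nouuid; }

mount_snapshot() {
    # $1 = pool/image@snap, $2 = mount point
    local dev
    dev=$(sudo rbd map "$1") || return 1
    sudo mount -t xfs -o "$(snap_opts)" "$dev" "$2" || return 1
    echo "$dev"   # caller unmounts and unmaps when done
}

# Usage (requires a live cluster):
#   dev=$(mount_snapshot pool/name@snapname /mnt/snap)
#   ... work on the snapshot ...
#   sudo umount /mnt/snap && sudo rbd unmap "$dev"
```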
Good suggestion, I should have considered that first. I've now tried it
without the mount at all, i.e. with no XFS or other filesystem:
------------------------------------------------------------------------------
#!/bin/bash
set -e
rbdname=pool/name
for ((i=0; ++i<=50; )); do
  dev=$(rbd map "${rbdname}")
  ts "${i}: ${dev}"
  dd if="${dev}" of=/dev/null bs=1G count=1
  for ((j=0; ++j; )); do
    rbd unmap "${dev}" && break
    sleep 1m
  done
  (( j > 1 )) && echo "$j minutes to unmap"
done
------------------------------------------------------------------------------
This failed at about the same rate, i.e. around 1 in 10. This time it took
only 2 minutes each time to unmap successfully after the initial unmap
failed - I'm not sure if that's due to the test change (no mount involved)
or to how otherwise busy the machine is.
The upshot is, it definitely looks like something related to the underlying
rbd device itself, rather than any filesystem, is preventing the unmap.
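For what it's worth, the inner unmap loop in the test generalises to a small bounded retry helper (the naming is mine); a capped attempt budget avoids looping forever if the unmap never succeeds:

```shell
#!/bin/bash
# Sketch: retry a command until it succeeds or the attempt budget runs out.
# usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max=$1 delay=$2 n
    shift 2
    for ((n = 1; n <= max; n++)); do
        "$@" && return 0               # success: stop retrying
        (( n < max )) && sleep "$delay"  # back off before the next attempt
    done
    return 1                           # budget exhausted
}

# e.g. retry the unmap once a minute for up to 30 minutes:
#   retry 30 60 rbd unmap "$dev" || echo "still busy after 30 minutes" >&2
```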
> Have you encountered this error in other scenarios, i.e. without
> mounting snapshots this way or with ext4 instead of XFS?
I've seen the same issue after unmounting r/w filesystems, but I don't do
that nearly as often so it hasn't been a pain point. However, per the test
above, the issue is unrelated to the mount.
Cheers,
Chris