EBLOCKLISTED error after rbd map was interrupted by fatal signal

Aleksandr Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> · Wed, 22 Feb 2023 14:04:23 +0100

Hi folks,

Recently we ran into a problem [1] with the kernel ceph client/rbd.

Writing to /sys/bus/rbd/add_single_major can in some cases take a long
time, so on the userspace side we had a timeout and sent a fatal signal
to the rbd map process to interrupt it.
That works perfectly well, but afterwards it is impossible to perform
rbd map again because we always get the EBLOCKLISTED error.
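
For context, the userspace side behaves roughly like the sketch below.
This is only an illustration of the "timeout, then fatal signal" pattern,
not the actual LXD code: run_with_timeout() and its return values are
hypothetical names made up for this mail.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption, not LXD's code): run argv[] as a
 * child process, poll for its exit every 10 ms, and send SIGKILL once
 * timeout_ms has elapsed.
 * Returns the child's exit status (0..255), -2 if the child was
 * terminated by a signal (including our SIGKILL), or -1 on error.
 */
int run_with_timeout(char *const argv[], long timeout_ms)
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    long waited_ms = 0;
    int status = 0;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            break; /* child finished on its own */
        if (r < 0 && errno != EINTR)
            return -1;

        if (waited_ms >= timeout_ms) {
            /* Timeout: deliver the fatal signal, then reap the child. */
            kill(pid, SIGKILL);
            while ((r = waitpid(pid, &status, 0)) < 0 && errno == EINTR)
                ;
            if (r < 0)
                return -1;
            break;
        }

        nanosleep(&step, NULL);
        waited_ms += 10;
    }

    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -2;
    return -1;
}
```

In our case the child would be the rbd map command, and the SIGKILL is
what interrupts the killable wait on the kernel side shown below.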

We've done some brief analysis of the kernel side.

Kernelside call stack:
sysfs_write [/sys/bus/rbd/add_single_major]
add_single_major_store
do_rbd_add
rbd_add_acquire_lock
rbd_acquire_lock
rbd_try_acquire_lock <- EBLOCKLISTED comes from here on the 2nd and
subsequent attempts

Most probably this is the place where it was interrupted by the signal:
static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
{
...

        rbd_assert(!rbd_is_lock_owner(rbd_dev));
        queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
        ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
                    ceph_timeout_jiffies(rbd_dev->opts->lock_timeout)); /* <=== signal arrives */

As far as I understand, we were getting the EBLOCKLISTED errno because
ceph_monc_blocklist_add() had successfully sent the "osd blocklist add"
command to the ceph monitor.
We then removed the client from the blocklist [2], but we still weren't
able to perform the rbd map. It looks like some extra state is saved on
the kernel client side and keeps blocking us.

What do you think about it?

Links:
[1] https://github.com/lxc/lxd/pull/11213
[2] https://docs.ceph.com/en/quincy/cephfs/eviction/#advanced-un-blocklisting-a-client

Kind regards,
Alex