Re: EBLOCKLISTED error after rbd map was interrupted by fatal signal

On Wed, Feb 22, 2023 at 2:38 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 1:17 PM Aleksandr Mikhalitsyn
> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> >
> > Hi folks,
> >
> > Recently we've met a problem [1] with the kernel ceph client/rbd.
> >
> > Writing to /sys/bus/rbd/add_single_major can in some cases take a lot
> > of time, so on the userspace side we had a timeout and sent a fatal
> > signal to the "rbd map" process to interrupt it.
> > That worked perfectly well, but afterwards it's impossible to perform
> > "rbd map" again because we always get an EBLOCKLISTED error.
>
> Hi Aleksandr,

Hi Ilya!

Thanks a lot for such a fast reply.

>
> I'm not sure if there is a causal relationship between "rbd map"
> getting sent a fatal signal by LXC and these EBLOCKLISTED errors.  Are
> you saying that that was confirmed to be the root cause, meaning that
> no such errors were observed after [1] got merged?

AFAIK it wasn't strictly confirmed, but after [1] was merged we haven't
seen any issues with rbd.
I think Stephane will correct me if I'm wrong.

I also can't be fully sure that there is a strict causal relationship
between the EBLOCKLISTED error and the fatal signal.
After I got the report from the LXD folks I tried to analyse the kernel
code and find the places where EBLOCKLISTED
(ESHUTDOWN|EBLOCKLISTED|EBLACKLISTED) can be returned to userspace.
I was surprised to find that there is no place in the kernel ceph/rbd
client where we generate this error ourselves; it can only be received
from the ceph monitor as a reply to a kernel client request.
But we have a lot of checks like this:
if (rc == -EBLOCKLISTED)
        fsc->blocklisted = true;
so if we receive this error once, it gets saved in struct
ceph_fs_client without any chance to clear it.
Maybe this is the reason why all "rbd map" attempts are failing?
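
To illustrate the pattern I mean, here is a tiny userspace sketch (not
the real kernel code paths -- the struct and function names are made up,
it just shows how a once-set cached error flag would make every later
attempt fail locally):

/* Illustrative only: a "blocklisted" flag that is set once and never
 * cleared, similar in spirit to fsc->blocklisted. */
#include <stdio.h>
#include <stdbool.h>
#include <errno.h>

#define EBLOCKLISTED ESHUTDOWN          /* the kernel aliases it this way */

struct client_state {
        bool blocklisted;               /* set once, never cleared */
};

/* Pretend this sends a request to the cluster and returns its reply. */
static int send_request(struct client_state *c, int simulated_reply)
{
        if (c->blocklisted)
                return -EBLOCKLISTED;   /* later attempts fail locally */
        if (simulated_reply == -EBLOCKLISTED)
                c->blocklisted = true;  /* remember the error forever */
        return simulated_reply;
}

int main(void)
{
        struct client_state c = { .blocklisted = false };

        /* First attempt is rejected by the cluster... */
        printf("attempt 1: %d\n", send_request(&c, -EBLOCKLISTED));
        /* ...and every following attempt fails even if the blocklist
         * entry has already been removed on the Ceph side. */
        printf("attempt 2: %d\n", send_request(&c, 0));
        return 0;
}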

>
> >
> > We've done some brief analysis of the kernel side.
> >
> > Kernelside call stack:
> > sysfs_write [/sys/bus/rbd/add_single_major]
> > add_single_major_store
> > do_rbd_add
> > rbd_add_acquire_lock
> > rbd_acquire_lock
> > rbd_try_acquire_lock <- EBLOCKLISTED comes from there for 2nd and
> > further attempts
> >
> > Most probably the place at which it was interrupted by a signal:
> > static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
> > {
> > ...
> >
> >         rbd_assert(!rbd_is_lock_owner(rbd_dev));
> >         queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
> >         ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
> >         ceph_timeout_jiffies(rbd_dev->opts->lock_timeout)); <=== signal arrives
> >
> > As far as I understand, we had been receiving the EBLOCKLISTED errno
> > because ceph_monc_blocklist_add()
> > sent the "osd blocklist add" command to the ceph monitor successfully.
>
> RBD doesn't use ceph_monc_blocklist_add() to blocklist itself.  It's
> there to blocklist some _other_ RBD client that happens to be holding
> the lock and isn't responding to this RBD client's requests to release
> it.

Got it. Thanks for clarifying this.

>
> > We had removed the client from blocklist [2].
>
> This is very dangerous and generally shouldn't ever be done.
> Blocklisting is Ceph's term for fencing.  Manually lifting the fence
> without fully understanding what is going on in the system is a fast
> ticket to data corruption.
>
> I see that [2] does say "Doing this may put data integrity at risk" but
> not nearly as strong as it should.  Also, it's for CephFS, not RBD.
>
> > But we still weren't able to perform the rbd map. It looks like some
> > extra state is saved on the kernel client side and blocks us.
>
> By default, all RBD mappings on the node share the same "RBD client"
> instance.  Once it's blocklisted, all existing mappings are
> affected.  Unfortunately, new mappings don't check for that and just
> attempt to reuse that instance as usual.
>
> This sharing can be disabled by passing "-o noshare" to "rbd map" but
> I would recommend cleaning up existing mappings instead.

So, we need to execute (on a client node):
$ rbd showmapped
and then
$ rbd unmap ...
for each mapping, correct?
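
For example (device and image names below are made up, just to be
explicit about the order):

$ rbd showmapped                    # list all current mappings
$ rbd unmap /dev/rbd0               # repeat for every listed device
$ rbd map mypool/myimage            # then map again as usual

And if cleaning up every mapping isn't possible, "rbd map -o noshare"
as you mentioned would at least give the new mapping its own client
instance.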

>
> Thanks,
>
>                 Ilya
>
> >
> > What do you think about it?
> >
> > Links:
> > [1] https://github.com/lxc/lxd/pull/11213
> > [2] https://docs.ceph.com/en/quincy/cephfs/eviction/#advanced-un-blocklisting-a-client
> >
> > Kind regards,
> > Alex


