On Wed, Feb 22, 2023 at 3:07 PM Aleksandr Mikhalitsyn
<aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 2:38 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> >
> > On Wed, Feb 22, 2023 at 1:17 PM Aleksandr Mikhalitsyn
> > <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi folks,
> > >
> > > Recently we've run into a problem [1] with the kernel ceph client/rbd.
> > >
> > > Writing to /sys/bus/rbd/add_single_major can in some cases take a lot
> > > of time, so on the userspace side we had a timeout and sent a fatal
> > > signal to the "rbd map" process to interrupt it. That worked perfectly
> > > well, but afterwards it was impossible to perform "rbd map" again
> > > because we always got an EBLOCKLISTED error.
> >
> > Hi Aleksandr,
>
> Hi Ilya!
>
> Thanks a lot for such a fast reply.
>
> > I'm not sure if there is a causal relationship between "rbd map"
> > getting sent a fatal signal by LXC and these EBLOCKLISTED errors. Are
> > you saying that that was confirmed to be the root cause, meaning that
> > no such errors were observed after [1] got merged?
>
> AFAIK, no. After [1] was merged we haven't seen any issues with rbd.
> I think Stephane will correct me if I'm wrong.
>
> I also can't be fully sure that there is a strict logical relationship
> between the EBLOCKLISTED error and the fatal signal.
> After I got a report from the LXD folks about this, I tried to analyse
> the kernel code and find the places where EBLOCKLISTED
> (ESHUTDOWN|EBLOCKLISTED|EBLACKLISTED) can be returned to userspace.
> I was surprised that there is no place in the kernel ceph/rbd client
> where we generate this error ourselves; it can only be received from
> a ceph monitor as a reply to a kernel client request.
> But we have a lot of checks like this:
>
>     if (rc == -EBLOCKLISTED)
>             fsc->blocklisted = true;
>
> so, if we receive this error once, it is saved in struct ceph_fs_client
> without any chance to clear it.

This is CephFS code, not RBD.

> Maybe this is the reason why all "rbd map" attempts are failing?..

As explained, "rbd map" attempts are failing because of RBD client
instance sharing (or rather the way it's implemented, in that "rbd map"
doesn't check whether the existing instance is blocklisted).

> > > We've done some brief analysis of the kernel side.
> > >
> > > Kernel-side call stack:
> > >
> > >     sysfs_write [/sys/bus/rbd/add_single_major]
> > >       add_single_major_store
> > >         do_rbd_add
> > >           rbd_add_acquire_lock
> > >             rbd_acquire_lock
> > >               rbd_try_acquire_lock <- EBLOCKLISTED comes from here for
> > >                                       the 2nd and further attempts
> > >
> > > Most probably the place at which it was interrupted by a signal:
> > >
> > >     static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
> > >     {
> > >     ...
> > >             rbd_assert(!rbd_is_lock_owner(rbd_dev));
> > >             queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
> > >             ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
> > >                     ceph_timeout_jiffies(rbd_dev->opts->lock_timeout)); <=== signal arrives
> > >
> > > As far as I understand, we had been receiving the EBLOCKLISTED errno
> > > because ceph_monc_blocklist_add() sent the "osd blocklist add" command
> > > to the ceph monitor successfully.
> >
> > RBD doesn't use ceph_monc_blocklist_add() to blocklist itself. It's
> > there to blocklist some _other_ RBD client that happens to be holding
> > the lock and isn't responding to this RBD client's requests to release
> > it.
>
> Got it. Thanks for clarifying this.
>
> > > We had removed the client from the blocklist [2].
> >
> > This is very dangerous and generally shouldn't ever be done.
> > Blocklisting is Ceph's term for fencing. Manually lifting the fence
> > without fully understanding what is going on in the system is a fast
> > ticket to data corruption.
> >
> > I see that [2] does say "Doing this may put data integrity at risk",
> > but not nearly as strongly as it should. Also, it's for CephFS, not RBD.
> >
> > > But we still weren't able to perform the rbd map. It looks like some
> > > extra state is saved on the kernel client side and blocks us.
> >
> > By default, all RBD mappings on the node share the same "RBD client"
> > instance. Once it's blocklisted, all existing mappings are affected.
> > Unfortunately, new mappings don't check for that and just attempt to
> > reuse that instance as usual.
> >
> > This sharing can be disabled by passing "-o noshare" to "rbd map", but
> > I would recommend cleaning up the existing mappings instead.
>
> So, we need to execute (on the client node):
>
>     $ rbd showmapped
>
> and then
>
>     $ rbd unmap ...
>
> for each mapping, correct?

More or less, but note that in case of a filesystem mounted on top of
any of these mappings, you would need to unmount it first.

Thanks,

                Ilya

> > Thanks,
> >
> >                 Ilya
> >
> > > What do you think about it?
> > >
> > > Links:
> > > [1] https://github.com/lxc/lxd/pull/11213
> > > [2] https://docs.ceph.com/en/quincy/cephfs/eviction/#advanced-un-blocklisting-a-client
> > >
> > > Kind regards,
> > > Alex
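Putting the advice in the thread together, the recovery on the affected
client node would look roughly like the sketch below. The device path,
mount point and pool/image names are placeholders, not taken from the
report; the actual devices come from the "rbd showmapped" output:

    $ rbd showmapped                  # list the existing mappings
    $ umount /mnt/rbd0                # unmount any filesystem sitting on
                                      # top of a mapping first
    $ rbd unmap /dev/rbd0             # repeat for every device listed by
                                      # "rbd showmapped"
    $ rbd map mypool/myimage          # once the old (blocklisted) client
                                      # instance has no users left, a new
                                      # map should create a fresh one

Alternatively, "rbd map -o noshare mypool/myimage" would create a
separate client instance without cleaning up, but as noted above the
recommended route is to unmap the existing mappings instead.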