Re: EBLOCKLISTED error after rbd map was interrupted by fatal signal

On Wed, Feb 22, 2023 at 6:41 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 3:07 PM Aleksandr Mikhalitsyn
> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Feb 22, 2023 at 2:38 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Feb 22, 2023 at 1:17 PM Aleksandr Mikhalitsyn
> > > <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi folks,
> > > >
> > > > Recently we ran into a problem [1] with the kernel ceph client/rbd.
> > > >
> > > > Writing to /sys/bus/rbd/add_single_major can in some cases take a long
> > > > time, so on the userspace side we had a timeout and sent a fatal signal
> > > > to the "rbd map" process to interrupt it.
> > > > That works perfectly well, but afterwards it's impossible to perform
> > > > "rbd map" again because we always get an EBLOCKLISTED error.
> > >
> > > Hi Aleksandr,
> >
> > Hi Ilya!
> >
> > Thanks a lot for such a fast reply.
> >
> > >
> > > I'm not sure if there is a causal relationship between "rbd map"
> > > getting sent a fatal signal by LXC and these EBLOCKLISTED errors.  Are
> > > you saying that that was confirmed to be the root cause, meaning that
> > > no such errors were observed after [1] got merged?
> >
> > AFAIK it wasn't formally confirmed, but after [1] was merged we haven't
> > seen any issues with rbd.
> > I think Stephane will correct me if I'm wrong.
> >
> > I also can't be fully sure that there is a strict causal relationship
> > between the EBLOCKLISTED error and the fatal signal.
> > After I got a report about this from the LXD folks, I tried to analyse
> > the kernel code and find the places where EBLOCKLISTED
> > (ESHUTDOWN|EBLOCKLISTED|EBLACKLISTED) can be returned to userspace.
> > I was surprised to find that there is no place in the kernel ceph/rbd
> > client where this error is generated; it can only be received from
> > a ceph monitor as a reply to a kernel client request.
> > But we have a lot of checks like this:
> >
> >     if (rc == -EBLOCKLISTED)
> >             fsc->blocklisted = true;
> >
> > so, once we receive this error, it is saved in struct ceph_fs_client
> > without any chance of being cleared.
>
> This is CephFS code, not RBD.

Ah, yep. :)

For RBD we save the errno in the wake_lock_waiters() function:

/*
 * Either image request state machine(s) or rbd_add_acquire_lock()
 * (i.e. "rbd map").
 */
static void wake_lock_waiters(struct rbd_device *rbd_dev, int result)
{
...
        if (!completion_done(&rbd_dev->acquire_wait)) {
                rbd_assert(list_empty(&rbd_dev->acquiring_list) &&
                           list_empty(&rbd_dev->running_list));
                rbd_dev->acquire_err = result;    <== HERE
But that by itself is not the problem.
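
For completeness, this is roughly how that stored error is consumed on
the "rbd map" side. A simplified sketch of rbd_add_acquire_lock() based
on my reading of drivers/block/rbd.c (details may differ between kernel
versions):

static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
{
        long ret;

        rbd_assert(!rbd_is_lock_owner(rbd_dev));
        queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
        ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
                        ceph_timeout_jiffies(rbd_dev->opts->lock_timeout));
        if (ret > 0) {
                /* lock_dwork completed and called wake_lock_waiters() */
                ret = rbd_dev->acquire_err;
        } else {
                /* fatal signal (-ERESTARTSYS) or timeout (ret == 0) */
                cancel_delayed_work_sync(&rbd_dev->lock_dwork);
                if (!ret)
                        ret = -ETIMEDOUT;
        }
        ...
}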

>
> > Maybe this is the reason why all "rbd map" attempts are failing?..
>
> As explained, "rbd map" attempts are failing because of RBD client
> instance sharing (or rather the way it's implemented in that "rbd map"
> doesn't check whether the existing instance is blocklisted).

Yes, I can see it:

static struct rbd_device *__rbd_dev_create(struct rbd_client *rbdc,
                                           struct rbd_spec *spec)
{
        struct rbd_device *rbd_dev;
...
        rbd_dev->rbd_client = rbdc;    <<=== comes from rbd_get_client()

/*
 * Get a ceph client with specific addr and configuration, if one does
 * not exist create it. Either way, ceph_opts is consumed by this
 * function.
 */
static struct rbd_client *rbd_get_client(struct ceph_options *ceph_opts)
That explains everything. Thank you!

>
> >
> > >
> > > >
> > > > We've done some brief analysis of the kernel side.
> > > >
> > > > Kernel-side call stack:
> > > > sysfs_write [/sys/bus/rbd/add_single_major]
> > > > add_single_major_store
> > > > do_rbd_add
> > > > rbd_add_acquire_lock
> > > > rbd_acquire_lock
> > > > rbd_try_acquire_lock <- EBLOCKLISTED comes from here on the 2nd
> > > > and subsequent attempts
> > > >
> > > > Most probably the place at which it was interrupted by a signal:
> > > >
> > > > static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
> > > > {
> > > > ...
> > > >         rbd_assert(!rbd_is_lock_owner(rbd_dev));
> > > >         queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
> > > >         ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
> > > >                 ceph_timeout_jiffies(rbd_dev->opts->lock_timeout)); <=== signal arrives
> > > >
> > > > As far as I understand, we were receiving the EBLOCKLISTED errno
> > > > because ceph_monc_blocklist_add() had successfully sent the
> > > > "osd blocklist add" command to the ceph monitor.
> > >
> > > RBD doesn't use ceph_monc_blocklist_add() to blocklist itself.  It's
> > > there to blocklist some _other_ RBD client that happens to be holding
> > > the lock and isn't responding to this RBD client's requests to release
> > > it.
> >
> > Got it. Thanks for clarifying this.
> >
> > >
> > > > We had removed the client from the blocklist [2].
> > >
> > > This is very dangerous and generally shouldn't ever be done.
> > > Blocklisting is Ceph's term for fencing.  Manually lifting the fence
> > > without fully understanding what is going on in the system is a fast
> > > ticket to data corruption.
> > >
> > > I see that [2] does say "Doing this may put data integrity at risk" but
> > > not nearly as strongly as it should.  Also, it's for CephFS, not RBD.
> > >
> > > > But we still weren't able to perform the rbd map. It looks like some
> > > > extra state is saved on the kernel client side and blocks us.
> > >
> > > By default, all RBD mappings on the node share the same "RBD client"
> > > instance.  Once it's blocklisted, all existing mappings are affected.
> > > Unfortunately, new mappings don't check for that and just attempt to
> > > reuse that instance as usual.
> > >
> > > This sharing can be disabled by passing "-o noshare" to "rbd map" but
> > > I would recommend cleaning up existing mappings instead.
> >
> > So, we need to execute (on a client node):
> > $ rbd showmapped
> > and then
> > $ rbd unmap ...
> > for each mapping, correct?
>
> More or less, but note that if a filesystem is mounted on top of any of
> these mappings, you would need to unmount it first.

Of course!
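
For anyone hitting this later, the cleanup pass would be along these
lines (device and mount point names are just examples):

$ rbd showmapped
$ umount /mnt/example    # if a filesystem is mounted on the device
$ rbd unmap /dev/rbd0    # repeat for each device listed by showmapped

Once everything is unmapped, the shared client instance should be
dropped, and a fresh "rbd map" will create a new, non-blocklisted one.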

>
> Thanks,
>
>                 Ilya

Thanks a lot for your help and explanations, Ilya!

Kind regards,
Alex

>
> >
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> > > >
> > > > What do you think about it?
> > > >
> > > > Links:
> > > > [1] https://github.com/lxc/lxd/pull/11213
> > > > [2] https://docs.ceph.com/en/quincy/cephfs/eviction/#advanced-un-blocklisting-a-client
> > > >
> > > > Kind regards,
> > > > Alex


