On Wed, Feb 22, 2023 at 6:41 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 3:07 PM Aleksandr Mikhalitsyn
> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Feb 22, 2023 at 2:38 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Feb 22, 2023 at 1:17 PM Aleksandr Mikhalitsyn
> > > <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi folks,
> > > >
> > > > Recently we've run into a problem [1] with the kernel ceph/rbd client.
> > > >
> > > > Writing to /sys/bus/rbd/add_single_major can in some cases take a lot
> > > > of time, so on the userspace side we had a timeout and sent a fatal
> > > > signal to the rbd map process to interrupt it.
> > > > And this was working perfectly well, but then it's impossible to
> > > > perform rbd map again because we always get an EBLOCKLISTED error.
> > >
> > > Hi Aleksandr,
> >
> > Hi Ilya!
> >
> > Thanks a lot for such a fast reply.
> >
> > >
> > > I'm not sure if there is a causal relationship between "rbd map"
> > > getting sent a fatal signal by LXC and these EBLOCKLISTED errors. Are
> > > you saying that that was confirmed to be the root cause, meaning that
> > > no such errors were observed after [1] got merged?
> >
> > AFAIK, no. After [1] was merged we haven't seen any issues with rbd.
> > I think Stephane will correct me if I'm wrong.
> >
> > I also can't be fully sure that there is a strict logical relationship
> > between the EBLOCKLISTED error and the fatal signal.
> > After I got a report from the LXD folks about this I tried to analyse
> > the kernel code and find the places where EBLOCKLISTED
> > (ESHUTDOWN|EBLOCKLISTED|EBLACKLISTED) can be returned to userspace.
> > I was surprised that there is no place in the kernel ceph/rbd client
> > where we generate this error ourselves; it can only be received from
> > the ceph monitor as a reply to a kernel client request.
> > But we have a lot of checks like this:
> >
> > if (rc == -EBLOCKLISTED)
> >         fsc->blocklisted = true;
> >
> > so, if we receive this error once it is saved in struct ceph_fs_client
> > without any chance to clear it.
>
> This is CephFS code, not RBD.

Ah, yep. :)

For RBD we save the errno in the wake_lock_waiters() function:

/*
 * Either image request state machine(s) or rbd_add_acquire_lock()
 * (i.e. "rbd map").
 */
static void wake_lock_waiters(struct rbd_device *rbd_dev, int result)
{
...
        if (!completion_done(&rbd_dev->acquire_wait)) {
                rbd_assert(list_empty(&rbd_dev->acquiring_list) &&
                           list_empty(&rbd_dev->running_list));
                rbd_dev->acquire_err = result;    <== HERE

But it's not a problem.

>
> > Maybe this is the reason why all "rbd map" attempts are failing?..
>
> As explained, "rbd map" attempts are failing because of RBD client
> instance sharing (or rather the way it's implemented in that "rbd map"
> doesn't check whether the existing instance is blocklisted).

Yes, I can see it:

static struct rbd_device *__rbd_dev_create(struct rbd_client *rbdc,
                                           struct rbd_spec *spec)
{
        struct rbd_device *rbd_dev;
...
        rbd_dev->rbd_client = rbdc;    <<=== comes from rbd_get_client()

/*
 * Get a ceph client with specific addr and configuration, if one does
 * not exist create it.  Either way, ceph_opts is consumed by this
 * function.
 */
static struct rbd_client *rbd_get_client(struct ceph_options *ceph_opts)

That explains everything. Thank you!

>
> >
> > >
> > > >
> > > > We've done some brief analysis of the kernel side.
> > > > Kernel-side call stack:
> > > >   sysfs_write [/sys/bus/rbd/add_single_major]
> > > >     add_single_major_store
> > > >       do_rbd_add
> > > >         rbd_add_acquire_lock
> > > >           rbd_acquire_lock
> > > >             rbd_try_acquire_lock <- EBLOCKLISTED comes from there
> > > >                                     for 2nd and further attempts
> > > >
> > > > Most probably the place at which it was interrupted by a signal:
> > > >
> > > > static int rbd_add_acquire_lock(struct rbd_device *rbd_dev)
> > > > {
> > > > ...
> > > >         rbd_assert(!rbd_is_lock_owner(rbd_dev));
> > > >         queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
> > > >         ret = wait_for_completion_killable_timeout(&rbd_dev->acquire_wait,
> > > >                 ceph_timeout_jiffies(rbd_dev->opts->lock_timeout));    <=== signal arrives
> > > >
> > > > As far as I understand, we had been receiving the EBLOCKLISTED errno
> > > > because ceph_monc_blocklist_add() successfully sent the
> > > > "osd blocklist add" command to the ceph monitor.
> > >
> > > RBD doesn't use ceph_monc_blocklist_add() to blocklist itself.  It's
> > > there to blocklist some _other_ RBD client that happens to be holding
> > > the lock and isn't responding to this RBD client's requests to release
> > > it.
> >
> > Got it. Thanks for clarifying this.
> >
> > >
> > > > We had removed the client from the blocklist [2].
> > >
> > > This is very dangerous and generally shouldn't ever be done.
> > > Blocklisting is Ceph's term for fencing.  Manually lifting the fence
> > > without fully understanding what is going on in the system is a fast
> > > ticket to data corruption.
> > >
> > > I see that [2] does say "Doing this may put data integrity at risk"
> > > but not nearly as strongly as it should.  Also, it's for CephFS, not
> > > RBD.
> > >
> > > > But we still weren't able to perform the rbd map. It looks like some
> > > > extra state is saved on the kernel client side and blocks us.
> > >
> > > By default, all RBD mappings on the node share the same "RBD client"
> > > instance.  Once it's blocklisted, all existing mappings are affected.
> > > Unfortunately, new mappings don't check for that and just attempt to
> > > reuse that instance as usual.
> > >
> > > This sharing can be disabled by passing "-o noshare" to "rbd map" but
> > > I would recommend cleaning up existing mappings instead.
> >
> > So, we need to execute (on a client node):
> > $ rbd showmapped
> > and then
> > $ rbd unmap ...
> > for each mapping, correct?
>
> More or less, but note that in case of a filesystem mounted on top of
> any of these mappings, you would need to unmount it first.

Of course!

>
> Thanks,
>
>                 Ilya

Thanks a lot for your help and explanations, Ilya!

Kind regards,
Alex

> >
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> > > >
> > > > What do you think about it?
> > > >
> > > > Links:
> > > > [1] https://github.com/lxc/lxd/pull/11213
> > > > [2] https://docs.ceph.com/en/quincy/cephfs/eviction/#advanced-un-blocklisting-a-client
> > > >
> > > > Kind regards,
> > > > Alex
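
P.S. For anyone else who runs into this, the cleanup Ilya suggests would
look roughly like the following on the affected client node. This is just
a sketch: /dev/rbd0, /mnt/volume and mypool/myimage are made-up example
names, substitute your own mountpoints, devices and images.

$ rbd showmapped            # list the existing mappings
$ umount /mnt/volume        # unmount whatever sits on top of each device
$ rbd unmap /dev/rbd0       # unmap every listed device
$ rbd map mypool/myimage    # a later map starts with a fresh client instance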