On Wed, Dec 7, 2022 at 2:31 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 07/12/2022 21:19, Xiubo Li wrote:
> >
> > On 07/12/2022 18:59, Ilya Dryomov wrote:
> >> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@xxxxxxxxxx> wrote:
> >>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>
> >>> When received corrupted snap trace we don't know what exactly has
> >>> happened in MDS side. And we shouldn't continue writing to OSD,
> >>> which may corrupt the snapshot contents.
> >>>
> >>> Just try to blocklist this client and If fails we need to crash the
> >>> client instead of leaving it writeable to OSDs.
> >>>
> >>> Cc: stable@xxxxxxxxxxxxxxx
> >>> URL: https://tracker.ceph.com/issues/57686
> >>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> >>> ---
> >>>
> >>> Thanks Aaron's feedback.
> >>>
> >>> V3:
> >>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>
> >>> V2:
> >>> - Switched to WARN() to taint the Linux kernel.
> >>>
> >>>  fs/ceph/mds_client.c |  3 ++-
> >>>  fs/ceph/mds_client.h |  1 +
> >>>  fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>  3 files changed, 28 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>> index cbbaf334b6b8..59094944af28 100644
> >>> --- a/fs/ceph/mds_client.c
> >>> +++ b/fs/ceph/mds_client.c
> >>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
> >>>          struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>
> >>>          pr_warn("mds%d closed our session\n", s->s_mds);
> >>> -        send_mds_reconnect(mdsc, s);
> >>> +        if (!mdsc->no_reconnect)
> >>> +                send_mds_reconnect(mdsc, s);
> >>>  }
> >>>
> >>>  static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> >>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>> --- a/fs/ceph/mds_client.h
> >>> +++ b/fs/ceph/mds_client.h
> >>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>          atomic_t                num_sessions;
> >>>          int                     max_sessions;      /* len of sessions array */
> >>>          int                     stopping;          /* true if shutting down */
> >>> +        int                     no_reconnect;      /* true if snap trace is corrupted */
> >>>
> >>>          atomic64_t              quotarealms_count; /* # realms with quota */
> >>>          /*
> >>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>> index c1c452afa84d..023852b7c527 100644
> >>> --- a/fs/ceph/snap.c
> >>> +++ b/fs/ceph/snap.c
> >>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>          struct ceph_snap_realm *realm;
> >>>          struct ceph_snap_realm *first_realm = NULL;
> >>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>> +        struct ceph_client *client = mdsc->fsc->client;
> >>>          int rebuild_snapcs;
> >>>          int err = -ENOMEM;
> >>> +        int ret;
> >>>          LIST_HEAD(dirty_realms);
> >>>
> >>>          lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>          if (first_realm)
> >>>                  ceph_put_snap_realm(mdsc, first_realm);
> >>>          pr_err("%s error %d\n", __func__, err);
> >>> +
> >>> +        /*
> >>> +         * When receiving a corrupted snap trace we don't know what
> >>> +         * exactly has happened in MDS side. And we shouldn't continue
> >>> +         * writing to OSD, which may corrupt the snapshot contents.
> >>> +         *
> >>> +         * Just try to blocklist this kclient and if it fails we need
> >>> +         * to crash the kclient instead of leaving it writeable.
> >> Hi Xiubo,
> >>
> >> I'm not sure I understand this "let's blocklist ourselves" concept.
> >> If the kernel client shouldn't continue writing to OSDs in this case,
> >> why not just stop issuing writes -- perhaps initiating some equivalent
> >> of a read-only remount like many local filesystems would do on I/O
> >> errors (e.g. errors=remount-ro mode)?
> >
> > The following patch seems working. Let me do more test to make sure
> > there is not further crash.
> >
> > diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> > index c1c452afa84d..cd487f8a4cb5 100644
> > --- a/fs/ceph/snap.c
> > +++ b/fs/ceph/snap.c
> > @@ -767,6 +767,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >          struct ceph_snap_realm *realm;
> >          struct ceph_snap_realm *first_realm = NULL;
> >          struct ceph_snap_realm *realm_to_rebuild = NULL;
> > +        struct super_block *sb = mdsc->fsc->sb;
> >          int rebuild_snapcs;
> >          int err = -ENOMEM;
> >          LIST_HEAD(dirty_realms);
> > @@ -885,6 +886,9 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >          if (first_realm)
> >                  ceph_put_snap_realm(mdsc, first_realm);
> >          pr_err("%s error %d\n", __func__, err);
> > +        pr_err("Remounting filesystem read-only\n");
> > +        sb->s_flags |= SB_RDONLY;
> > +
> >          return err;
> >  }
> >
> >
> For readonly approach is also my first thought it should be, but I was
> just not very sure whether it would be the best approach.
>
> Because by evicting the kclient we could prevent the buffer to be wrote
> to OSDs. But the readonly one seems won't ?

The read-only setting is more for the VFS and the user. Technically,
the kernel client could just stop issuing writes (i.e. OSD requests
containing a write op) and not set SB_RDONLY. That should cover any
buffered data as well.

By employing self-blocklisting, you are shifting the responsibility of
rejecting OSD requests to the OSDs. I'm saying that not issuing OSD
requests from a potentially busted client in the first place is
probably a better idea. At the very least you wouldn't need to BUG on
ceph_monc_blocklist_add() errors.

Thanks,

                Ilya
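
[For readers following the thread, here is a minimal, self-contained
sketch of the alternative being discussed: have the client refuse to
submit write requests locally once a corrupted snap trace has been
seen, rather than relying on the OSDs to reject a self-blocklisted
client. All struct, field and function names below are invented for
illustration only; they are not the real fs/ceph or libceph API.]

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins only -- not the real libceph types. */
struct toy_mds_client {
	bool snap_trace_corrupted;	/* set when the snap trace fails to decode */
};

struct toy_osd_request {
	bool is_write;			/* request contains a write op */
};

/*
 * "Stop issuing writes client-side": once corruption has been
 * detected, fail write submissions locally instead of sending them
 * to the OSDs and counting on the blocklist to bounce them.
 */
static int toy_submit_osd_request(struct toy_mds_client *mdsc,
				  struct toy_osd_request *req)
{
	if (req->is_write && mdsc->snap_trace_corrupted)
		return -EROFS;	/* rejected locally, nothing hits the wire;
				   the exact errno is a policy choice */

	/* ... normal submission path would go here ... */
	return 0;
}

int main(void)
{
	struct toy_mds_client mdsc = { .snap_trace_corrupted = true };
	struct toy_osd_request write_req = { .is_write = true };
	struct toy_osd_request read_req = { .is_write = false };

	printf("write: %d\n", toy_submit_osd_request(&mdsc, &write_req));	/* -EROFS */
	printf("read:  %d\n", toy_submit_osd_request(&mdsc, &read_req));	/* 0 */
	return 0;
}

[Setting SB_RDONLY, as in the follow-up patch quoted above, surfaces
the failure to the VFS and the user; a local check along these lines
is what would keep the client's own OSD requests off the wire.]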