On Thu, Dec 8, 2022 at 2:05 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 07/12/2022 22:28, Ilya Dryomov wrote:
> > On Wed, Dec 7, 2022 at 1:35 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >>
> >> On 07/12/2022 18:59, Ilya Dryomov wrote:
> >>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@xxxxxxxxxx> wrote:
> >>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>>
> >>>> When we receive a corrupted snap trace we don't know what exactly
> >>>> has happened on the MDS side. And we shouldn't continue writing to
> >>>> the OSDs, which may corrupt the snapshot contents.
> >>>>
> >>>> Just try to blocklist this client and if that fails we need to
> >>>> crash the client instead of leaving it writeable to OSDs.
> >>>>
> >>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>> URL: https://tracker.ceph.com/issues/57686
> >>>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>> ---
> >>>>
> >>>> Thanks for Aaron's feedback.
> >>>>
> >>>> V3:
> >>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>>
> >>>> V2:
> >>>> - Switched to WARN() to taint the Linux kernel.
> >>>>
> >>>>  fs/ceph/mds_client.c |  3 ++-
> >>>>  fs/ceph/mds_client.h |  1 +
> >>>>  fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>>  3 files changed, 28 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>>> index cbbaf334b6b8..59094944af28 100644
> >>>> --- a/fs/ceph/mds_client.c
> >>>> +++ b/fs/ceph/mds_client.c
> >>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
> >>>>          struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>>
> >>>>          pr_warn("mds%d closed our session\n", s->s_mds);
> >>>> -        send_mds_reconnect(mdsc, s);
> >>>> +        if (!mdsc->no_reconnect)
> >>>> +                send_mds_reconnect(mdsc, s);
> >>>>  }
> >>>>
> >>>>  static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> >>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>>> --- a/fs/ceph/mds_client.h
> >>>> +++ b/fs/ceph/mds_client.h
> >>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>>          atomic_t                num_sessions;
> >>>>          int                     max_sessions;  /* len of sessions array */
> >>>>          int                     stopping;      /* true if shutting down */
> >>>> +        int                     no_reconnect;  /* true if snap trace is corrupted */
> >>>>
> >>>>          atomic64_t              quotarealms_count; /* # realms with quota */
> >>>>          /*
> >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>> index c1c452afa84d..023852b7c527 100644
> >>>> --- a/fs/ceph/snap.c
> >>>> +++ b/fs/ceph/snap.c
> >>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>          struct ceph_snap_realm *realm;
> >>>>          struct ceph_snap_realm *first_realm = NULL;
> >>>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>> +        struct ceph_client *client = mdsc->fsc->client;
> >>>>          int rebuild_snapcs;
> >>>>          int err = -ENOMEM;
> >>>> +        int ret;
> >>>>          LIST_HEAD(dirty_realms);
> >>>>
> >>>>          lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>          if (first_realm)
> >>>>                  ceph_put_snap_realm(mdsc, first_realm);
> >>>>          pr_err("%s error %d\n", __func__, err);
> >>>> +
> >>>> +        /*
> >>>> +         * When receiving a corrupted snap trace we don't know what
> >>>> +         * exactly has happened on the MDS side. And we shouldn't continue
> >>>> +         * writing to the OSDs, which may corrupt the snapshot contents.
> >>>> +         *
> >>>> +         * Just try to blocklist this kclient and if it fails we need
> >>>> +         * to crash the kclient instead of leaving it writeable.
> >>> Hi Xiubo,
> >>>
> >>> I'm not sure I understand this "let's blocklist ourselves" concept.
> >>> If the kernel client shouldn't continue writing to OSDs in this case,
> >>> why not just stop issuing writes -- perhaps initiating some equivalent
> >>> of a read-only remount like many local filesystems would do on I/O
> >>> errors (e.g. errors=remount-ro mode)?
> >> I still haven't found how I could handle it this way from the ceph
> >> layer. I saw that they just mark the inodes with EIO when this happens.
> >>
> >>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> >>> in this case, making writes fail naturally -- on the client side,
> >>> without actually being sent to OSDs just to be nixed by the blocklist
> >>> hammer.
> >>>
> >>> But further, what makes a failure to decode a snap trace special?
> >> From the known tracker, the snapid was corrupted in one inode on the
> >> MDS, and then the snap trace built from that corrupted snapid is
> >> itself corrupted.
> >>
> >> And there may also be other cases.
> >>
> >>> AFAIK we don't do anything close to this for any other decoding
> >>> failure. Wouldn't the "when we receive a corrupted XYZ we don't know
> >>> what exactly has happened on the MDS side" argument apply to pretty
> >>> much all decoding failures?
> >> The snap trace is different from other cases. A corrupted snap trace
> >> will affect the whole snap realm hierarchy, which in the worst case
> >> will affect all the inodes in the mount.
> >>
> >> This is why I was trying to evict the mount to prevent further IOs.
> > I suspected as much and my other suggestion was to look at somehow
> > invalidating snap contexts/realms. Perhaps decode out-of-place and on
> > any error set a flag indicating that the snap context can't be trusted
> > anymore? The OSD client could then check whether this flag is set
> > before admitting the snap context blob into the request message and
> > return an error, effectively rejecting the write.
>
> The snap realms are organized as a tree-like hierarchy. When the snap
> trace is corrupted, maybe only one of the snap realms is affected, and
> maybe several or all of them. The problem is that when decoding the
> corrupted snap trace we can't know exactly which realms will be
> affected. If one realm is marked as invalid, all of its child realms
> should be affected too.

I realize that ;)  My suggestion (quoted above) was to look into
invalidating all snap realms, not a particular one:

    Or, perhaps, all in-memory snap contexts could somehow be invalidated
    in this case, making writes fail naturally -- on the client side,
    without actually being sent to OSDs just to be nixed by the blocklist
    hammer.

Thanks,

Ilya
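
For illustration, below is a minimal, self-contained userspace sketch of
the approach Ilya describes above: decode the snap trace out-of-place,
set a "can't be trusted" flag on any decode error, and reject writes on
the client side before anything is sent to the OSDs. This is only a
sketch of the idea under those assumptions; the names here (snap_state,
decode_snap_trace, submit_write) are hypothetical and are not the actual
fs/ceph or net/ceph interfaces.

#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct snap_state {
        atomic_bool corrupted;  /* set once a snap trace fails to decode */
        char snapc[64];         /* stand-in for the committed snap context */
};

/*
 * Pretend decoder: decode into a scratch buffer first and only commit
 * to the in-memory snap context if decoding fully succeeds.  On any
 * failure, poison the state instead of trusting a half-decoded trace.
 */
static int decode_snap_trace(struct snap_state *st, const char *blob)
{
        char scratch[64];

        if (strncmp(blob, "SNAP:", 5) != 0) {
                atomic_store(&st->corrupted, true);
                return -EIO;
        }
        snprintf(scratch, sizeof(scratch), "%s", blob + 5);
        snprintf(st->snapc, sizeof(st->snapc), "%s", scratch);  /* commit */
        return 0;
}

/* Write path: refuse to build a request from an untrusted snap context. */
static int submit_write(struct snap_state *st, const char *data)
{
        if (atomic_load(&st->corrupted))
                return -EIO;    /* fail on the client side, no OSD round trip */
        printf("write \"%s\" with snapc \"%s\"\n", data, st->snapc);
        return 0;
}

int main(void)
{
        struct snap_state st = { .snapc = "none" };

        atomic_init(&st.corrupted, false);

        decode_snap_trace(&st, "SNAP:seq=42");  /* good trace */
        submit_write(&st, "first write");       /* goes through */

        decode_snap_trace(&st, "garbage");      /* corrupted trace poisons state */
        if (submit_write(&st, "second write") == -EIO)
                fprintf(stderr, "write rejected: snap context is untrusted\n");

        return 0;
}

The point of the flag-and-reject shape is that writes fail with EIO
locally, with no blocklisting and no extra round trips to the OSDs.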