On Thu, Dec 8, 2022 at 2:05 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 07/12/2022 22:28, Ilya Dryomov wrote:
> > On Wed, Dec 7, 2022 at 1:35 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >>
> >> On 07/12/2022 18:59, Ilya Dryomov wrote:
> >>> On Tue, Dec 6, 2022 at 1:59 PM <xiubli@xxxxxxxxxx> wrote:
> >>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>>
> >>>> When we receive a corrupted snap trace we don't know what exactly
> >>>> has happened on the MDS side. And we shouldn't continue writing to
> >>>> the OSDs, which may corrupt the snapshot contents.
> >>>>
> >>>> Just try to blocklist this client and if that fails we need to
> >>>> crash the client instead of leaving it writeable to OSDs.
> >>>>
> >>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>> URL: https://tracker.ceph.com/issues/57686
> >>>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>> ---
> >>>>
> >>>> Thanks for Aaron's feedback.
> >>>>
> >>>> V3:
> >>>> - Fixed ERROR: spaces required around that ':' (ctx:VxW)
> >>>>
> >>>> V2:
> >>>> - Switched to WARN() to taint the Linux kernel.
> >>>>
> >>>>  fs/ceph/mds_client.c |  3 ++-
> >>>>  fs/ceph/mds_client.h |  1 +
> >>>>  fs/ceph/snap.c       | 25 +++++++++++++++++++++++++
> >>>>  3 files changed, 28 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> >>>> index cbbaf334b6b8..59094944af28 100644
> >>>> --- a/fs/ceph/mds_client.c
> >>>> +++ b/fs/ceph/mds_client.c
> >>>> @@ -5648,7 +5648,8 @@ static void mds_peer_reset(struct ceph_connection *con)
> >>>>          struct ceph_mds_client *mdsc = s->s_mdsc;
> >>>>
> >>>>          pr_warn("mds%d closed our session\n", s->s_mds);
> >>>> -        send_mds_reconnect(mdsc, s);
> >>>> +        if (!mdsc->no_reconnect)
> >>>> +                send_mds_reconnect(mdsc, s);
> >>>>  }
> >>>>
> >>>>  static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> >>>> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> >>>> index 728b7d72bf76..8e8f0447c0ad 100644
> >>>> --- a/fs/ceph/mds_client.h
> >>>> +++ b/fs/ceph/mds_client.h
> >>>> @@ -413,6 +413,7 @@ struct ceph_mds_client {
> >>>>          atomic_t                num_sessions;
> >>>>          int                     max_sessions;  /* len of sessions array */
> >>>>          int                     stopping;      /* true if shutting down */
> >>>> +        int                     no_reconnect;  /* true if snap trace is corrupted */
> >>>>
> >>>>          atomic64_t              quotarealms_count; /* # realms with quota */
> >>>>          /*
> >>>> diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
> >>>> index c1c452afa84d..023852b7c527 100644
> >>>> --- a/fs/ceph/snap.c
> >>>> +++ b/fs/ceph/snap.c
> >>>> @@ -767,8 +767,10 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>          struct ceph_snap_realm *realm;
> >>>>          struct ceph_snap_realm *first_realm = NULL;
> >>>>          struct ceph_snap_realm *realm_to_rebuild = NULL;
> >>>> +        struct ceph_client *client = mdsc->fsc->client;
> >>>>          int rebuild_snapcs;
> >>>>          int err = -ENOMEM;
> >>>> +        int ret;
> >>>>          LIST_HEAD(dirty_realms);
> >>>>
> >>>>          lockdep_assert_held_write(&mdsc->snap_rwsem);
> >>>> @@ -885,6 +887,29 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
> >>>>          if (first_realm)
> >>>>                  ceph_put_snap_realm(mdsc, first_realm);
> >>>>          pr_err("%s error %d\n", __func__, err);
> >>>> +
> >>>> +        /*
> >>>> +         * When receiving a corrupted snap trace we don't know what
> >>>> +         * exactly has happened on the MDS side. And we shouldn't continue
> >>>> +         * writing to the OSDs, which may corrupt the snapshot contents.
> >>>> +         *
> >>>> +         * Just try to blocklist this kclient and if it fails we need
> >>>> +         * to crash the kclient instead of leaving it writeable.
> >>> Hi Xiubo,
> >>>
> >>> I'm not sure I understand this "let's blocklist ourselves" concept.
> >>> If the kernel client shouldn't continue writing to OSDs in this case,
> >>> why not just stop issuing writes -- perhaps initiating some equivalent
> >>> of a read-only remount like many local filesystems would do on I/O
> >>> errors (e.g. errors=remount-ro mode)?
> >> I still haven't found how I could handle it this way from the ceph
> >> layer. I saw that they just mark the inodes with EIO when this happens.
> >>
> >>> Or, perhaps, all in-memory snap contexts could somehow be invalidated
> >>> in this case, making writes fail naturally -- on the client side,
> >>> without actually being sent to OSDs just to be nixed by the blocklist
> >>> hammer.
> >>>
> >>> But further, what makes a failure to decode a snap trace special?
> >> From the known tracker, the snapid was corrupted in one inode on the
> >> MDS, and then the snap trace built from that corrupted snapid is
> >> itself corrupted.
> >>
> >> And there may also be other cases.
> >>
> >>> AFAIK we don't do anything close to this for any other decoding
> >>> failure. Wouldn't the "when we receive a corrupted XYZ we don't know
> >>> what exactly has happened on the MDS side" argument apply to pretty
> >>> much all decoding failures?
> >> The snap trace is different from other cases. A corrupted snap trace
> >> will affect the whole snap realm hierarchy, which in the worst case
> >> will affect all the inodes in the mount.
> >>
> >> This is why I was trying to evict the mount to prevent further IOs.
> > I suspected as much and my other suggestion was to look at somehow
> > invalidating snap contexts/realms. Perhaps decode out-of-place and on
> > any error set a flag indicating that the snap context can't be trusted
> > anymore? The OSD client could then check whether this flag is set
> > before admitting the snap context blob into the request message and
> > return an error, effectively rejecting the write.
>
> The snap realms are organized as a tree-like hierarchy. When the snap
> trace is corrupted, maybe only one of the snap realms is affected, and
> maybe several or all of them. The problem is that when decoding the
> corrupted snap trace we can't know exactly which realms will be
> affected. If one realm is marked as invalid, all of its child realms
> should be affected too.

I realize that ;)  My suggestion (quoted above) was to look into
invalidating all snap realms, not a particular one:

    Or, perhaps, all in-memory snap contexts could somehow be invalidated
    in this case, making writes fail naturally -- on the client side,
    without actually being sent to OSDs just to be nixed by the blocklist
    hammer.

Thanks,

Ilya
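
For illustration, below is a minimal, self-contained userspace sketch of
the approach Ilya describes above: decode the snap trace out-of-place,
set a "can't be trusted" flag on any decode error, and reject writes on
the client side before anything is sent to the OSDs. This is only a
sketch of the idea under those assumptions; the names here (snap_state,
decode_snap_trace, submit_write) are hypothetical and are not the actual
fs/ceph or net/ceph interfaces.

#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct snap_state {
        atomic_bool corrupted;  /* set once a snap trace fails to decode */
        char snapc[64];         /* stand-in for the committed snap context */
};

/*
 * Pretend decoder: decode into a scratch buffer first and only commit
 * to the in-memory snap context if decoding fully succeeds.  On any
 * failure, poison the state instead of trusting a half-decoded trace.
 */
static int decode_snap_trace(struct snap_state *st, const char *blob)
{
        char scratch[64];

        if (strncmp(blob, "SNAP:", 5) != 0) {
                atomic_store(&st->corrupted, true);
                return -EIO;
        }
        snprintf(scratch, sizeof(scratch), "%s", blob + 5);
        snprintf(st->snapc, sizeof(st->snapc), "%s", scratch);  /* commit */
        return 0;
}

/* Write path: refuse to build a request from an untrusted snap context. */
static int submit_write(struct snap_state *st, const char *data)
{
        if (atomic_load(&st->corrupted))
                return -EIO;    /* fail on the client side, no OSD round trip */
        printf("write \"%s\" with snapc \"%s\"\n", data, st->snapc);
        return 0;
}

int main(void)
{
        struct snap_state st = { .snapc = "none" };

        atomic_init(&st.corrupted, false);

        decode_snap_trace(&st, "SNAP:seq=42");  /* good trace */
        submit_write(&st, "first write");       /* goes through */

        decode_snap_trace(&st, "garbage");      /* corrupted trace poisons state */
        if (submit_write(&st, "second write") == -EIO)
                fprintf(stderr, "write rejected: snap context is untrusted\n");

        return 0;
}

The point of the flag-and-reject shape is that writes fail with EIO
locally, with no blocklisting and no extra round trips to the OSDs.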