Re: [PATCH v2 1/7] ceph: fail the request when failing to decode dentry names

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 01 Mar 2022 09:13:13 -0500

On Tue, 2022-03-01 at 21:57 +0800, Xiubo Li wrote:
> On 3/1/22 9:20 PM, Jeff Layton wrote:
> > On Tue, 2022-03-01 at 19:30 +0800, xiubli@xxxxxxxxxx wrote:
> > > From: Xiubo Li <xiubli@xxxxxxxxxx>
> > > 
> > > ------------[ cut here ]------------
> > > kernel BUG at fs/ceph/dir.c:537!
> > > invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > > CPU: 16 PID: 21641 Comm: ls Tainted: G            E     5.17.0-rc2+ #92
> > > Hardware name: Red Hat RHEV Hypervisor, BIOS 1.11.0-2.el7 04/01/2014
> > > 
> > > The corresponding code in ceph_readdir() is:
> > > 
> > > 	BUG_ON(rde->offset < ctx->pos);
> > > 
> > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > > ---
> > >   fs/ceph/dir.c        | 13 +++++++------
> > >   fs/ceph/inode.c      |  5 +++--
> > >   fs/ceph/mds_client.c |  2 +-
> > >   3 files changed, 11 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > > index a449f4a07c07..6be0c1f793c2 100644
> > > --- a/fs/ceph/dir.c
> > > +++ b/fs/ceph/dir.c
> > > @@ -534,6 +534,13 @@ static int ceph_readdir(struct file *file, struct dir_context *ctx)
> > >   					    .ctext_len	= rde->altname_len };
> > >   		u32 olen = oname.len;
> > >   
> > > +		err = ceph_fname_to_usr(&fname, &tname, &oname, NULL);
> > > +		if (err) {
> > > +			pr_err("%s unable to decode %.*s, got %d\n", __func__,
> > > +			       rde->name_len, rde->name, err);
> > > +			goto out;
> > > +		}
> > > +
> > >   		BUG_ON(rde->offset < ctx->pos);
> > >   		BUG_ON(!rde->inode.in);
> > >   
> > > @@ -542,12 +549,6 @@ static int ceph_readdir(struct file *file, struct dir_context *ctx)
> > >   		     i, rinfo->dir_nr, ctx->pos,
> > >   		     rde->name_len, rde->name, &rde->inode.in);
> > >   
> > > -		err = ceph_fname_to_usr(&fname, &tname, &oname, NULL);
> > > -		if (err) {
> > > -			dout("Unable to decode %.*s. Skipping it.\n", rde->name_len, rde->name);
> > > -			continue;
> > > -		}
> > > -
> > >   		if (!dir_emit(ctx, oname.name, oname.len,
> > >   			      ceph_present_ino(inode->i_sb, le64_to_cpu(rde->inode.in->ino)),
> > >   			      le32_to_cpu(rde->inode.in->mode) >> 12)) {
> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > index 8b0832271fdf..2bc2f02b84e8 100644
> > > --- a/fs/ceph/inode.c
> > > +++ b/fs/ceph/inode.c
> > > @@ -1898,8 +1898,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
> > >   
> > >   		err = ceph_fname_to_usr(&fname, &tname, &oname, &is_nokey);
> > >   		if (err) {
> > > -			dout("Unable to decode %.*s. Skipping it.", rde->name_len, rde->name);
> > > -			continue;
> > > +			pr_err("%s unable to decode %.*s, got %d\n", __func__,
> > > +			       rde->name_len, rde->name, err);
> > > +			goto out;
> > >   		}
> > >   
> > 
> > Is this really an improvement?
> 
> Yeah, if we just continue without setting the rde->offset it will crash 
> in "BUG_ON(rde->offset < ctx->pos);" in ceph_readdir().
> 
> 

Ok.

> > Suppose I have one dentry with a corrupt
> > name. Do I want to fail a readdir request which might allow me to get at
> > other dentries in that directory that isn't corrupt?
> 
> It's a little hard to handle the code in ceph_readdir():
> 
>   503         /* search start position */
>   504         if (rinfo->dir_nr > 0) {
>   505                 int step, nr = rinfo->dir_nr;
>   506                 while (nr > 0) {
>   507                         step = nr >> 1;
>   508                         if (rinfo->dir_entries[i + step].offset < 
> ctx->pos) {
>   509                                 i +=  step + 1;
>   510                                 nr -= step + 1;
>   511                         } else {
>   512                                 nr = step;
>   513                         }
>   514                 }
>   515         }
> 
> In this case how to set the rde->offset ?
> 
> 

Yeah, that is the nasty part. I would probably just pretend that the
corrupt dentry doesn't exist. Offsets are set by ceph_make_fpos, AFAICT,
and I don't think skipping one should affect the position of the other.

OTOH, this is just a nice-to-have thing. If it's too nasty to deal with,
we can just return an error for now and aim to do better error handling
here later.

> > 
> > Maybe we should try to emit some placeholder there?
> > 
> > 
> > >   		dname.name = oname.name;
> > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > > index 914a6e68bb56..94b4c6508044 100644
> > > --- a/fs/ceph/mds_client.c
> > > +++ b/fs/ceph/mds_client.c
> > > @@ -3474,7 +3474,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
> > >   	if (err == 0) {
> > >   		if (result == 0 && (req->r_op == CEPH_MDS_OP_READDIR ||
> > >   				    req->r_op == CEPH_MDS_OP_LSSNAP))
> > > -			ceph_readdir_prepopulate(req, req->r_session);
> > > +			err = ceph_readdir_prepopulate(req, req->r_session);
> > >   	}
> > >   	current->journal_info = NULL;
> > >   	mutex_unlock(&req->r_fill_mutex);
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>