Re: CephFS meltdown fallout: mds assert failure, kernel oopses

On Thu, 2019-08-15 at 16:45 +0900, Hector Martin wrote:
> On 15/08/2019 03.40, Jeff Layton wrote:
> > On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> > > Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> > > Please take a look.
> > > 
> > 
> > (sorry for duplicate mail -- the other one ended up in moderation)
> > 
> > Thanks Ilya,
> > 
> > That function is pretty straightforward. We don't do a whole lot of
> > pointer chasing in there, so I'm a little unclear on where this would
> > have crashed. Right offhand, that kernel is probably missing
> > 1b52931ca9b5b87 (ceph: remove duplicated filelock ref increase), but
> > that seems unlikely to result in an oops.
> > 
> > Hector, if you have the debuginfo for this kernel installed on one of
> > these machines, could you run gdb against the ceph.ko module and then
> > do:
> > 
> >       gdb> list *(ceph_lock_message+0x212)
> > 
> > That may give me a better hint as to what went wrong.
> 
> This is what I get:
> 
> (gdb)  list *(ceph_lock_message+0x212)
> 0xd782 is in ceph_lock_message (/build/linux-hwe-B83fOS/linux-hwe-4.18.0/fs/ceph/locks.c:116).
> 111                     req->r_wait_for_completion = ceph_lock_wait_for_completion;
> 112
> 113             err = ceph_mdsc_do_request(mdsc, inode, req);
> 114
> 115             if (operation == CEPH_MDS_OP_GETFILELOCK) {
> 116                     fl->fl_pid = -le64_to_cpu(req->r_reply_info.filelock_reply->pid);
> 117                     if (CEPH_LOCK_SHARED == req->r_reply_info.filelock_reply->type)
> 118                             fl->fl_type = F_RDLCK;
> 119                     else if (CEPH_LOCK_EXCL == req->r_reply_info.filelock_reply->type)
> 120                             fl->fl_type = F_WRLCK;
> 
> Disasm:
> 
>     0x000000000000d77b <+523>:   mov    0x250(%rbx),%rdx
>     0x000000000000d782 <+530>:   mov    0x20(%rdx),%rdx
>     0x000000000000d786 <+534>:   neg    %edx
>     0x000000000000d788 <+536>:   mov    %edx,0x48(%r15)
> 
> That means req->r_reply_info.filelock_reply was NULL.
> 
> 

Many thanks, Hector. Would you mind opening a bug against the kernel
client at https://tracker.ceph.com? A tracker entry is easier to follow
than email, and it makes sure we keep track of this. Did you say that
this was reproducible?

Now...

Note that we don't check whether ceph_mdsc_do_request() returned success
before we start dereferencing req->r_reply_info.filelock_reply. I suspect
that function returned an error, and the pointer was left zeroed out
(NULL).
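
As a sanity check on the NULL reading of the disassembly, here is a
small userspace sketch of the reply layout. The struct below is an
assumption: it approximates struct ceph_filelock as five packed 64-bit
fields followed by a one-byte type, and the names sketch_filelock and
sketch_pid_offset are made up for illustration. If that layout holds,
->pid sits at offset 0x20, which would match the 0x20(%rdx) load in the
faulting instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace approximation of the lock reply layout. ASSUMPTION: five
 * 64-bit fields followed by a one-byte type, packed. The real kernel
 * definition may differ between versions.
 */
struct sketch_filelock {
	uint64_t start;
	uint64_t length;
	uint64_t client;
	uint64_t owner;
	uint64_t pid;
	uint8_t type;
} __attribute__((packed));

/* Compile-time check that pid lands on the displacement seen in the disassembly. */
static_assert(offsetof(struct sketch_filelock, pid) == 0x20,
	      "pid expected at offset 0x20");

/* Offset that a NULL-based access to ->pid would touch. */
static size_t sketch_pid_offset(void)
{
	return offsetof(struct sketch_filelock, pid);
}
```

With a NULL base pointer, reading ->pid in this layout would fault at
address 0x20, which is the kind of address you'd expect to see in the
oops.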

We probably just need to turn that if statement into:

	if (!err && operation == CEPH_MDS_OP_GETFILELOCK) {

I'll queue up a patch.
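
To illustrate the pattern outside the kernel, here is a rough userspace
sketch. Everything in it is hypothetical (mock_request, mock_do_request,
mock_getlk and the -EIO failure are stand-ins, not the real ceph client
code), and it assumes the request leaves the reply pointer NULL when it
fails. It is not the actual patch, just the shape of the check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the real kernel types. */
struct mock_filelock_reply {
	uint64_t pid;
	uint8_t type;
};

struct mock_request {
	/* Stays NULL unless a reply was attached. */
	struct mock_filelock_reply *filelock_reply;
};

/* Pretend request: on failure, return a negative errno and leave the reply NULL. */
static int mock_do_request(struct mock_request *req,
			   struct mock_filelock_reply *reply, int fail)
{
	assert(req != NULL);
	req->filelock_reply = NULL;
	if (fail)
		return -EIO;
	req->filelock_reply = reply;
	return 0;
}

/* Same shape as the proposed check: only read the reply when err == 0. */
static int mock_getlk(int fail, int *pid_out)
{
	struct mock_filelock_reply reply = { .pid = 1234, .type = 1 };
	struct mock_request req;
	int err = mock_do_request(&req, &reply, fail);

	if (!err)
		*pid_out = -(int)req.filelock_reply->pid;
	return err;
}
```

Calling mock_getlk(1, &pid) returns the error and leaves pid untouched.
Without the !err test, the same call would dereference the NULL reply.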

Thanks for the report!
-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
