On Thu, 2019-08-15 at 16:45 +0900, Hector Martin wrote:
> On 15/08/2019 03.40, Jeff Layton wrote:
> > On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> > > Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> > > Please take a look.
> > >
> >
> > (sorry for duplicate mail -- the other one ended up in moderation)
> >
> > Thanks Ilya,
> >
> > That function is pretty straightforward. We don't do a whole lot of
> > pointer chasing in there, so I'm a little unclear on where this would
> > have crashed. Right offhand, that kernel is probably missing
> > 1b52931ca9b5b87 (ceph: remove duplicated filelock ref increase), but
> > that seems unlikely to result in an oops.
> >
> > Hector, if you have the debuginfo for this kernel installed on one of
> > these machines, could you run gdb against the ceph.ko module and then
> > do:
> >
> >     gdb> list *(ceph_lock_message+0x212)
> >
> > That may give me a better hint as to what went wrong.
>
> This is what I get:
>
> (gdb) list *(ceph_lock_message+0x212)
> 0xd782 is in ceph_lock_message
> (/build/linux-hwe-B83fOS/linux-hwe-4.18.0/fs/ceph/locks.c:116).
> 111             req->r_wait_for_completion = ceph_lock_wait_for_completion;
> 112
> 113             err = ceph_mdsc_do_request(mdsc, inode, req);
> 114
> 115             if (operation == CEPH_MDS_OP_GETFILELOCK) {
> 116                     fl->fl_pid = -le64_to_cpu(req->r_reply_info.filelock_reply->pid);
> 117                     if (CEPH_LOCK_SHARED == req->r_reply_info.filelock_reply->type)
> 118                             fl->fl_type = F_RDLCK;
> 119                     else if (CEPH_LOCK_EXCL == req->r_reply_info.filelock_reply->type)
> 120                             fl->fl_type = F_WRLCK;
>
> Disasm:
>
>    0x000000000000d77b <+523>:   mov    0x250(%rbx),%rdx
>    0x000000000000d782 <+530>:   mov    0x20(%rdx),%rdx
>    0x000000000000d786 <+534>:   neg    %edx
>    0x000000000000d788 <+536>:   mov    %edx,0x48(%r15)
>
> That means req->r_reply_info.filelock_reply was NULL.
>

Many thanks, Hector. Would you mind opening a bug against the kernel
client at https://tracker.ceph.com ?
That's better than doing this via email and we'll want to make sure we
keep track of this. Did you say that this was reproducible?

Now... Note that we don't actually check whether ceph_mdsc_do_request
returned success before we start dereferencing there. I suspect that
function returned an error, and the pointer was left zeroed out.

Probably, we just need to turn that if statement into:

    if (!err && operation == CEPH_MDS_OP_GETFILELOCK) {

I'll queue up a patch. Thanks for the report!
-- 
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
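
For anyone following along, the suggested guard can be sketched in
userspace. This is only an illustration, not the kernel code or the
eventual patch: every mock_* name is hypothetical, the -EIO failure is
an assumed error path, and the struct layout is an assumption chosen so
that pid sits at offset 0x20, which is the offset the faulting
instruction (mov 0x20(%rdx),%rdx) reads from.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the filelock reply payload. The field order
 * is an assumption, picked so that pid lands at offset 0x20. */
struct mock_filelock {
    uint64_t start;
    uint64_t length;
    uint64_t client;
    uint64_t owner;
    uint64_t pid;
    uint8_t  type;
};

/* Compile-time check that the assumed layout matches the disassembly. */
static_assert(offsetof(struct mock_filelock, pid) == 0x20,
              "pid is expected at offset 0x20");

/* Hypothetical stand-in for the request: the reply pointer starts out
 * NULL and is only filled in when the request succeeds. */
struct mock_request {
    struct mock_filelock *filelock_reply;
};

enum { MOCK_OP_GETFILELOCK = 1 };

/* Hypothetical stand-in for ceph_mdsc_do_request(). On failure it
 * returns a negative errno and leaves filelock_reply untouched (NULL). */
static int mock_do_request(struct mock_request *req,
                           struct mock_filelock *reply, int fail)
{
    if (fail)
        return -EIO;
    req->filelock_reply = reply;
    return 0;
}

/* Mirrors the shape of the guarded check: the reply is only
 * dereferenced when the request reported success. */
int mock_lock_message(int operation, int fail, int64_t *pid_out)
{
    struct mock_filelock reply = { .pid = 1234, .type = 1 };
    struct mock_request req = { .filelock_reply = NULL };
    int err = mock_do_request(&req, &reply, fail);

    if (!err && operation == MOCK_OP_GETFILELOCK)
        *pid_out = -(int64_t)req.filelock_reply->pid;

    return err;
}
```

With the !err check in place, a failed request just returns the error
and leaves the caller's pid alone. Without it, the failing case would
read through the NULL filelock_reply at offset 0x20, which is the same
shape as the oops above.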