> On 3 Oct 2016, at 20:27, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Mon, Oct 3, 2016 at 1:19 PM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
>> Hello,
>>
>> I've been investigating the following crash with cephfs:
>>
>> [8734559.785146] general protection fault: 0000 [#1] SMP
>> [8734559.791921] ioatdma shpchp ipmi_devintf ipmi_si ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6 [last unloaded: stat_faker_4410clouder4]
>> [8734559.793307] CPU: 31 PID: 1917 Comm: rsync Tainted: G W O 4.4.10-clouder4 #1
>> [8734559.793686] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
>> [8734559.793920] task: ffff883f3defc4c0 ti: ffff8821ef2e0000 task.ti: ffff8821ef2e0000
>> [8734559.794306] RIP: 0010:[<ffffffff813134d0>] [<ffffffff813134d0>] lockref_get_not_dead+0x10/0xa0
>> [8734559.794754] RSP: 0018:ffff8821ef2e3c28 EFLAGS: 00010296
>> [8734559.794981] RAX: ffff881621afe000 RBX: 7261666153203689 RCX: 0000000000000007
>> [8734559.795364] RDX: 0000000000000189 RSI: ffff8821ef2e3c38 RDI: 7261666153203689
>> [8734559.795742] RBP: ffff8821ef2e3c68 R08: 0000000000000002 R09: 000000000000077d
>> [8734559.796130] R10: ffffea005886bf80 R11: ffff882cb6fe4e00 R12: 0000000000005c48
>> [8734559.796511] R13: ffff8821ef2e3ef8 R14: 0000000000000000 R15: ffff88015aabcdd8
>> [8734559.796892] FS: 00007fbed7c5e700(0000) GS:ffff881fffc60000(0000) knlGS:0000000000000000
>> [8734559.797276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [8734559.797507] CR2: ffffffffff600400 CR3: 000000085bd95000 CR4: 00000000001406e0
>> [8734559.797886] Stack:
>> [8734559.798110] ffff8821ef2e3c58 0000000000000000 ffff8821ef2e3c68 ffffffff8107b082
>> [8734559.798747] 0000000000000000 ffffea005886bf80 0000000000005c48 7261666153203631
>> [8734559.799388] ffff8821ef2e3d78 ffffffffa04ed6ae ffff8821ef2e3e58 ffff8821ef2e3d08
>> [8734559.800032] Call Trace:
>> [8734559.800260] [<ffffffff8107b082>] ? __might_sleep+0x52/0x90
>> [8734559.800496] [<ffffffffa04ed6ae>] __dcache_readdir+0x21e/0x480 [ceph]
>> [8734559.800727] [<ffffffff811ad482>] ? path_put+0x22/0x30
>> [8734559.800957] [<ffffffffa04f67b8>] ? __touch_cap+0x28/0x90 [ceph]
>> [8734559.801195] [<ffffffffa04f6965>] ? ceph_cap_string+0xe5/0x100 [ceph]
>> [8734559.801432] [<ffffffffa04f6bb1>] ? __ceph_caps_issued_mask+0x141/0x150 [ceph]
>> [8734559.801819] [<ffffffffa04ee23a>] ceph_readdir+0x6ea/0x7d0 [ceph]
>> [8734559.802060] [<ffffffff8115e56a>] ? __might_fault+0x3a/0x50
>> [8734559.802290] [<ffffffff811a87fa>] ? cp_new_stat+0x15a/0x180
>> [8734559.802521] [<ffffffff8107b082>] ? __might_sleep+0x52/0x90
>> [8734559.802751] [<ffffffff811b5b7e>] iterate_dir+0xae/0x130
>> [8734559.802981] [<ffffffff811b5d90>] SyS_getdents+0x90/0x110
>> [8734559.803216] [<ffffffff811b5ea0>] ? SyS_old_readdir+0x90/0x90
>> [8734559.803445] [<ffffffff81639617>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [8734559.803673] Code: e8 56 5e 32 00 ff 43 04 c6 03 00 65 ff 0d 5d 77 cf 7e eb d2 0f 1f 80 00 00 00 00 55 48 89 e5 53 48 8d 75 d0 48 83 ec 38 48 89 fb <48> 8b 17 48 8d 7d e0 89 55 e0 48 89 55 c0 8b 45 e0 89 45 d0 85
>> [8734559.808422] RIP [<ffffffff813134d0>] lockref_get_not_dead+0x10/0xa0
>> [8734559.808721] RSP <ffff8821ef2e3c28>
>>
>> So the faulting instruction is mov (%rdi),%rdx; looking at the register
>> dump, RDI clearly has a bogus value. I started backtracking from there
>> to acquire more context, e.g. get the state of the dir's ceph_inode_info
>> as well as the ceph_readdir_cache_control, and here is what I found:
>>
>> 1. The dentry representing the dir which is being passed to __dcache_readdir:
>> http://sprunge.us/bAQH - the filename is rather strange; searching among the
>> files in the ceph mount point, I couldn't find this file. Also, here is the
>> state of the ceph_inode_info: http://sprunge.us/AYRI
>>
>> crash> struct ceph_readdir_cache_control ffff8821ef2e3ce8
>> struct ceph_readdir_cache_control {
>>   page = 0xffffea005886bf80,
>>   dentries = 0xffff881621afe000,
>>   index = 2953
>> }
>>
>>
>> According to the state of the ceph_inode_info this means that
>> ceph_dir_is_complete_ordered would return true, and the second condition
>> should also be true since ptr_pos is held in r12 and the dir size is 26496.
>> So the dentry being passed should be entry 2953 % 512 = 393 of the
>> cache_ctl.dentries array. Unfortunately my crashdump excludes the page cache
>> pages, so I cannot see the contents of the dentries array.
>>
>> Could you provide any info on how to further debug this?
>
> Zheng will know more, but this may have been fixed by
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=af5e5eb574776cdf1b756a27cc437bff257e22fe

Yes, the crash should be fixed by the above commit.

> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3d714c33632ef6bfdfaacc74ae6ba297b4c5820
>
> in 4.6.
>
> Thanks,
>
> Ilya

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com