On Sun, Dec 20, 2015 at 6:38 PM, Eric Eastman <eric.eastman@xxxxxxxxxxxxxx> wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
>> <eric.eastman@xxxxxxxxxxxxxx> wrote:
>>>> Hi Yan Zheng, Eric Eastman
>>>>
>>>> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
>>>> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
>>>> handling fix").
>>>>
>>>> Related report & discussion was here:
>>>> https://lkml.org/lkml/2015/12/12/149
>>>>
>>>> I'm not sure the current reported issue of ceph was related to that though,
>>>> but at least try testing with an upgraded or patched kernel could verify it.
>>>> :)
>>>>
>>>> Thanks,
>
>>
>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards
>> Yan, Zheng
>
>
> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has run for over 36
> hours with no ERRORS or WARNINGS. My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>
> Before running this test I had to rebuild the Ceph File System as
> after the last logged errors on Friday using the 4.4rc4 kernel, the
> Ceph File system hung accessing the exported image file.
> After rebooting my iSCSI gateway using the Ceph File System, from / using
> command: strace du -a cephfs, the mount point, the hang happened on
> the newfstatat call on my image file:
>
> write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250 cephfs/ctdb/.ctdb.lock
> ) = 25
> close(5) = 0
> write(1, "0\tcephfs/ctdb\n", 140 cephfs/ctdb
> ) = 14
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
> fcntl(3, F_GETFD) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
> fcntl(3, F_GETFL) = 0x38800 (flags
> O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896,
> ...}, AT_SYMLINK_NOFOLLOW) = 0
> fcntl(3, F_DUPFD, 3) = 5
> fcntl(5, F_GETFD) = 0
> fcntl(5, F_SETFD, FD_CLOEXEC) = 0
> getdents(3, /* 8 entries */, 65536) = 288
> getdents(3, /* 0 entries */, 65536) = 0
> close(3) = 0
> newfstatat(5, "iscsi900g.img", ^C
> ^C^C^C
> ^Z
> I could not break out with a ^C, and had to background the process to
> get my prompt back. The process would not die so I had to hard reset
> the system.
>
> This same hang happened on 2 other kernel mounted systems using a 4.3.0 kernel.
>
> On a separate system, I fuse mounted the file system and a du -a
> cephfs hung at the same point. Once again I could not break out of the
> hang, and had to hard reset the system.
>
> Restarting the MDS and Monitors did not clear the issue.
> Taking a quick look at the dumpcache showed it was large
>
> # ceph mds tell 0 dumpcache /tmp/dump.txt
> ok
> # wc /tmp/dump.txt
> 370556 5002449 59211054 /tmp/dump.txt
> # tail /tmp/dump.txt
> [inode 10000259276 [...c4,head] ~mds0/stray0/10000259276/ auth v977593
> snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0
> rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
> [inode 1000020c1ba [...a6,head] ~mds0/stray0/1000020c1ba/ auth v742016
> snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0
> rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
> [inode 100000d0088 [...77,head] ~mds0/stray6/100000d0088/ auth v292336
> snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0
> rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

These are deleted files that haven't been trimmed yet...

>
> I tried one more thing:
>
> ceph daemon mds.0 flush journal
>
> and restarted the MDS. Accessing the file system still locked up, but
> a du -a cephfs did not even get to the iscsi900g.img file. As I was
> running on a broken rc kernel, with snapshots turned on

...and I think we have some known issues in the tracker about snap
trimming and snapshotted inodes. So this is not entirely surprising.
:/
-Greg

>, when this
> corruption happened, I decided to recreate the file system and
> restarted the ESXi iSCSI test.
>
> Regards,
> Eric
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
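
For anyone wanting to see how many untrimmed stray entries a dumpcache
file like the one quoted above holds, here is a rough sketch. It is not
part of Ceph. It assumes the dump is a text file in which stray inodes
show a path of the form ~mdsN/strayM/, as in the tail output, and it
simply counts lines containing such a path, so treat the numbers as an
approximation. The names count_strays and summarize are made up for
this illustration.

```python
import re
import sys
from collections import Counter

# Assumption: stray inodes appear with a "~mds<rank>/stray<index>/" path,
# as in the tail output quoted in this thread.
STRAY_RE = re.compile(r"~mds(\d+)/stray(\d+)/")


def count_strays(lines):
    """Count lines that mention a stray path, keyed by (mds rank, stray index)."""
    counts = Counter()
    for line in lines:
        match = STRAY_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


def summarize(path):
    """Read a dump file line by line so a large dump is never held in memory."""
    with open(path, errors="replace") as handle:
        return count_strays(handle)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python count_strays.py /tmp/dump.txt
    for (rank, index), total in sorted(summarize(sys.argv[1]).items()):
        print(f"mds{rank} stray{index}: {total}")
```

Comparing the totals before and after an MDS restart or journal flush
would show whether the stray directories are being trimmed at all.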