On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
> <eric.eastman@xxxxxxxxxxxxxx> wrote:
>>> Hi Yan Zheng, Eric Eastman
>>>
>>> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing
>>> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal
>>> handling fix").
>>>
>>> Related report & discussion was here:
>>> https://lkml.org/lkml/2015/12/12/149
>>>
>>> I'm not sure the current reported issue of ceph was related to that though,
>>> but at least try testing with an upgraded or patched kernel could verify it.
>>> :)
>>>
>>> Thanks,
>
> please try rc5 kernel without patches and DEBUG_VM=y
>
> Regards
> Yan, Zheng

The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has run for over 36
hours with no ERRORS or WARNINGS. My plan is to install the 4.4rc6
kernel from the Ubuntu kernel-ppa site once it is available, and rerun
the tests.

Before running this test I had to rebuild the Ceph File System because,
after the last logged errors on Friday using the 4.4rc4 kernel, the
Ceph File System hung when accessing the exported image file.
After rebooting my iSCSI gateway, which uses the Ceph File System, I ran
the following command from /, where cephfs is the mount point:

strace du -a cephfs

The hang happened on the newfstatat call on my image file:

write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250       cephfs/ctdb/.ctdb.lock
) = 25
close(5) = 0
write(1, "0\tcephfs/ctdb\n", 140       cephfs/ctdb
) = 14
newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
fcntl(3, F_GETFL) = 0x38800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896, ...}, AT_SYMLINK_NOFOLLOW) = 0
fcntl(3, F_DUPFD, 3) = 5
fcntl(5, F_GETFD) = 0
fcntl(5, F_SETFD, FD_CLOEXEC) = 0
getdents(3, /* 8 entries */, 65536) = 288
getdents(3, /* 0 entries */, 65536) = 0
close(3) = 0
newfstatat(5, "iscsi900g.img", ^C
^C^C^C
^Z

I could not break out with a ^C, and had to background the process to
get my prompt back. The process would not die, so I had to hard reset
the system.

This same hang happened on 2 other kernel-mounted systems using a 4.3.0
kernel. On a separate system, I FUSE-mounted the file system and a
du -a cephfs hung at the same point. Once again I could not break out
of the hang, and had to hard reset the system.

Restarting the MDS and Monitors did not clear the issue.
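For anyone trying to reproduce this without losing their shell, below is a
minimal sketch (not part of my test setup) of probing a path from a child
process with a timeout. The function name probe_path, the 10 second default,
and the example path in the usage note are assumptions for illustration only.
Note that a child blocked in an uninterruptible syscall, as seen above, may
stay around even after SIGKILL, so the sketch reports the hang and does not
wait for the child again.

```python
import subprocess
import sys


def probe_path(path, timeout_s=10.0):
    """Return 'ok', 'error', or 'hung' for an lstat of `path` (hypothetical helper).

    The lstat runs in a child process so that a blocked syscall cannot
    wedge the calling script or shell.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.lstat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL, but do not wait again: a process stuck in
        # uninterruptible sleep may never exit, and waiting would hang us too.
        child.kill()
        return "hung"
    # Exit status 0 means lstat succeeded; anything else means it raised.
    return "ok" if rc == 0 else "error"
```

Usage would be something like probe_path("cephfs/iscsi/iscsi900g.img"), run
from the directory above the mount point; a result of "hung" would match the
behaviour seen in the strace output.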
A quick look at the dumpcache showed it was large:

# ceph mds tell 0 dumpcache /tmp/dump.txt
ok
# wc /tmp/dump.txt
370556 5002449 59211054 /tmp/dump.txt
# tail /tmp/dump.txt
[inode 10000259276 [...c4,head] ~mds0/stray0/10000259276/ auth v977593 snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0 rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
[inode 1000020c1ba [...a6,head] ~mds0/stray0/1000020c1ba/ auth v742016 snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0 rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
[inode 100000d0088 [...77,head] ~mds0/stray6/100000d0088/ auth v292336 snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0 rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

I tried one more thing:

ceph daemon mds.0 flush journal

and restarted the MDS. Accessing the file system still locked up, but
this time a du -a cephfs did not even get to the iscsi900g.img file.

As I was running on a broken rc kernel, with snapshots turned on, when
this corruption happened, I decided to recreate the file system and
restart the ESXi iSCSI test.

Regards,
Eric