> On Wed, Jun 1, 2016 at 6:15 AM, James Webb &lt;jamesw@xxxxxxxxxxx&gt; wrote:
>> Dear ceph-users...
>>
>> My team runs an internal buildfarm using ceph as a backend storage platform. We've recently upgraded to Jewel and are having reliability issues that we need some help with.
>>
>> Our infrastructure is the following:
>> - We use CEPH/CEPHFS (10.2.1)
>> - We have 3 mons and 6 storage servers with a total of 36 OSDs (~4160 PGs).
>> - We use enterprise SSDs for everything, including journals.
>> - We have one main MDS and one standby MDS.
>> - We are using the ceph kernel client to mount cephfs.
>> - We have upgraded to Ubuntu 16.04 (4.4.0-22-generic kernel).
>> - We are using kernel NFS to serve NFS clients from a ceph mount (~32 nfs threads, 0 swappiness).
>> - These are physical machines with 8 cores & 32GB memory.
>>
>> On a regular basis, we lose all IO via ceph FS. We're still trying to isolate the issue, but it surfaces as an issue between the MDS and the ceph client.
>> We can't tell if our NFS server is overwhelming the MDS or if this is some unrelated issue. Tuning the NFS server has not solved our issues.
>> So far our only recovery has been to fail the MDS and then restart our NFS. Any help or advice on the CEPH side of things will be appreciated.
>> I'm pretty sure we're running with the default CEPH MDS configuration parameters.
>>
>>
>> Here are the relevant log entries.
>>
>> From my primary MDS server, I start seeing these entries pile up:
>>
>> 2016-05-31 14:34:07.091117 7f9f2eb87700 0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000004491 pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877480 seconds ago
>> 2016-05-31 14:34:07.091129 7f9f2eb87700 0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000005ddf pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877382 seconds ago
>> 2016-05-31 14:34:07.091133 7f9f2eb87700 0 log_channel(cluster) log [WRN] : client.4283066 isn't responding to mclientcaps(revoke), ino 10000000a2a pending pAsLsXsFsxcrwb issued pAsxLsXsxFsxcrwb, sent 63.877356 seconds ago
>>
>> From my NFS server, I see these entries from dmesg also start piling up:
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 0 expected 4294967296
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 1 expected 4294967296
>> [Tue May 31 14:33:09 2016] libceph: skipping mds0 X.X.X.195:6800 seq 2 expected 4294967296
>>
>
> 4294967296 is 0x100000000, so this looks like a sequence number overflow.
>
> In src/msg/Message.h:
>
> class Message {
> ...
>   unsigned get_seq() const { return header.seq; }
>   void set_seq(unsigned s) { header.seq = s; }
> ...
> }
>
> In src/msg/simple/Pipe.cc:
>
> class Pipe {
> ...
>   __u32 get_out_seq() { return out_seq; }
> ...
> }
>
> Is this a bug or intentional?

Hrm, I think this is a bug^Woversight.  Sage's commit 9731226228dd ("convert more types in ceph_fs.h to __le* notation") from early 2008 changed ceph_msg_header's seq from __u32 to __le64 and also changed dout()s in the kernel from %d to %lld, so the 32 -> 64 switch seems to have been intentional.  Message::get_seq()/set_seq(), however, remained unsigned...

The question is which side we fix now: changing the kernel client to wrap at 32 bits would be less of a hassle and easier in terms of backporting, but the problem is really in the userspace messenger.  Sage?
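To make the wrap concrete, here is a minimal standalone sketch (my own illustration, not Ceph code; the variable names are hypothetical) of the mismatch described above: a sender that tracks the sequence in 32 bits wraps to 0 after 2^32 messages, while a receiver that tracks it in 64 bits keeps counting, which matches the "seq 0 expected 4294967296" lines in the dmesg output.

    // Illustration only -- not the actual messenger fields.
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint32_t out_seq  = 0xffffffff; // sender keeps seq in 32 bits (like the userspace messenger)
        uint64_t expected = 0xffffffff; // receiver keeps seq in 64 bits (like the kernel client)

        out_seq  += 1;  // wraps around to 0
        expected += 1;  // becomes 4294967296 (0x100000000)

        // Receiver sees an incoming seq lower than what it expects and skips the message.
        if ((uint64_t)out_seq < expected)
            printf("skipping mds0 seq %u expected %llu\n",
                   (unsigned)out_seq, (unsigned long long)expected);
        return 0;
    }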
Thanks,
Ilya