On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson <kylehutson@xxxxxxx> wrote:
> Thank you, John!
>
> That was exactly the bug we were hitting. My Google-fu didn't lead me to
> this one.

Here is the bug report: http://tracker.ceph.com/issues/10449. It's a kernel
client bug that causes the session map to grow without bound. Which version
of the Linux kernel are you using?

Regards
Yan, Zheng

>
> On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>
>> On 15/04/2015 20:02, Kyle Hutson wrote:
>>>
>>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>>> pretty well.
>>>
>>> Then, about noon today, we had an mds crash. And then the failover mds
>>> crashed. And this cascaded through all 4 mds servers we have.
>>>
>>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>>> to be OK for a little while. ceph -w goes through 'replay' 'reconnect'
>>> 'rejoin' 'clientreplay' and 'active' but nearly immediately after getting to
>>> 'active', it crashes again.
>>>
>>> I have the mds log at
>>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>>
>>> For the possibly, but not necessarily, useful background info.
>>> - Yesterday we took our erasure coded pool and increased both pg_num and
>>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>>> but those seem to be continuing to clean themselves up.
>>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>>> filesystem to this filesystem.
>>> - Before we realized the mds crashes, we had just changed the size of our
>>> metadata pool from 2 to 4.
>>
>>
>> It looks like you're seeing http://tracker.ceph.com/issues/10449, which is
>> a situation where the SessionMap object becomes too big for the MDS to
>> save. The cause of it in that case was stuck requests from a misbehaving
>> client running a slightly older kernel.
>>
>> Assuming you're using the kernel client and having a similar problem, you
>> could try to work around this situation by forcibly unmounting the clients
>> while the MDS is offline, such that during clientreplay the MDS will remove
>> them from the SessionMap after timing out, and then next time it tries to
>> save the map it won't be oversized. If that works, you could then look into
>> getting newer kernels on the clients to avoid hitting the issue again -- the
>> #10449 ticket has some pointers about which kernel changes were relevant.
>>
>> Cheers,
>> John
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
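
For anyone following the thread, here is a minimal sketch of how one might
collect the two client-side facts discussed above: the running kernel release
and the CephFS kernel mounts that would need to be forcibly unmounted. It is
not taken from the thread or from the tracker ticket. It assumes that kernel
client mounts are listed in /proc/mounts with filesystem type "ceph" and that
ceph-fuse mounts show up under a FUSE type and are skipped. The function name
kernel_ceph_mount_points is hypothetical. The script only prints candidate
umount commands and does not run them.

```python
import platform


def kernel_ceph_mount_points(mounts_text):
    """Return mount points whose filesystem type is exactly 'ceph'.

    Assumption: the CephFS kernel client reports type 'ceph' in
    /proc/mounts; FUSE-based mounts use a different type and are ignored.
    """
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "ceph":
            mount_points.append(fields[1])
    return mount_points


def main():
    # Kernel release string of this client, e.g. for answering
    # "which kernel version are you using?"
    print("kernel release:", platform.release())

    try:
        with open("/proc/mounts") as f:
            mounts_text = f.read()
    except OSError:
        # Not a Linux host, or /proc is unavailable.
        mounts_text = ""

    for mount_point in kernel_ceph_mount_points(mounts_text):
        # Printed for review only; nothing is unmounted by this script.
        # Paths containing spaces appear escaped (\040) in /proc/mounts.
        print("umount -f", mount_point)


if __name__ == "__main__":
    main()
```

Run on each client, this prints the kernel release and one suggested command
per kernel CephFS mount. Whether a forced unmount is appropriate, and whether
it should happen while the MDS is offline as John describes, remains a
decision for the administrator.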