On Thu, Apr 16, 2015 at 9:48 AM, Adam Tygart <mozes@xxxxxxx> wrote:
> What is significantly smaller? We have 67 requests in the 16,400,000
> range and 250 in the 18,900,000 range.
>

That explains the crash. Could you help me debug this issue?

Send /sys/kernel/debug/ceph/*/mdsc to me.
Run "echo module ceph +p > /sys/kernel/debug/dynamic_debug/control" on the cephfs mount machine.
Restart the mds and wait until it crashes again.
Run "echo module ceph -p > /sys/kernel/debug/dynamic_debug/control" on the cephfs mount machine.
Send the kernel messages from the cephfs mount machine to me (they should be in /var/log/kern.log or /var/log/messages).

To recover from the crash, you can either force-reset the machine that contains the cephfs mount, or add "mds wipe sessions = 1" to the mds section of ceph.conf.

Regards
Yan, Zheng
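For reference, a minimal shell sketch of the collection steps above, run as root on the machine that holds the cephfs kernel mount; the output file names are placeholders, and the kernel log location varies by distribution:

    # Capture the kernel client's list of in-flight MDS requests.
    cat /sys/kernel/debug/ceph/*/mdsc > mdsc-before-restart.txt

    # Enable dynamic debug output for the ceph kernel module.
    echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control

    # ... restart the MDS on the server side and wait for it to crash again ...

    # Disable dynamic debug output.
    echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control

    # Collect the kernel messages for that window (dmesg, or the distro's
    # kernel log file such as /var/log/kern.log or /var/log/messages).
    dmesg > ceph-client-debug.txt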
> Thanks,
>
> Adam
>
> On Wed, Apr 15, 2015 at 8:38 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Thu, Apr 16, 2015 at 9:07 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>> We are using 3.18.6-gentoo. Based on that, I was hoping that the
>>> kernel bug referred to in the bug report would have been fixed.
>>>
>>
>> The bug was supposed to be fixed, but you hit it again. Could you
>> check whether the kernel client has any hung MDS requests? (Check
>> /sys/kernel/debug/ceph/*/mdsc on the machine that contains the cephfs
>> mount, and look for any request whose ID is significantly smaller than
>> the other requests' IDs.)
>>
>> Regards
>> Yan, Zheng
>>
>>> --
>>> Adam
>>>
>>> On Wed, Apr 15, 2015 at 8:02 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>> On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson <kylehutson@xxxxxxx> wrote:
>>>>> Thank you, John!
>>>>>
>>>>> That was exactly the bug we were hitting. My Google-fu didn't lead me to
>>>>> this one.
>>>>
>>>>
>>>> Here is the bug report: http://tracker.ceph.com/issues/10449. It's a
>>>> kernel client bug that causes the session map size to increase without
>>>> bound. Which version of the Linux kernel are you using?
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>
>>>>>
>>>>> On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 15/04/2015 20:02, Kyle Hutson wrote:
>>>>>>>
>>>>>>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>>>>>>> pretty well.
>>>>>>>
>>>>>>> Then, about noon today, we had an mds crash. And then the failover mds
>>>>>>> crashed. And this cascaded through all 4 mds servers we have.
>>>>>>>
>>>>>>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>>>>>>> to be OK for a little while. ceph -w goes through 'replay', 'reconnect',
>>>>>>> 'rejoin', 'clientreplay' and 'active', but nearly immediately after getting
>>>>>>> to 'active', it crashes again.
>>>>>>>
>>>>>>> I have the mds log at
>>>>>>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>>>>>>
>>>>>>> Some possibly, but not necessarily, useful background info:
>>>>>>> - Yesterday we took our erasure-coded pool and increased both pg_num and
>>>>>>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>>>>>>> but those seem to be continuing to clean themselves up.
>>>>>>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>>>>>>> filesystem to this filesystem.
>>>>>>> - Before we realized the mds crashes, we had just changed the size of our
>>>>>>> metadata pool from 2 to 4.
>>>>>>
>>>>>>
>>>>>> It looks like you're seeing http://tracker.ceph.com/issues/10449, which is
>>>>>> a situation where the SessionMap object becomes too big for the MDS to
>>>>>> save. The cause of it in that case was stuck requests from a misbehaving
>>>>>> client running a slightly older kernel.
>>>>>>
>>>>>> Assuming you're using the kernel client and having a similar problem, you
>>>>>> could try to work around this situation by forcibly unmounting the clients
>>>>>> while the MDS is offline, such that during clientreplay the MDS will remove
>>>>>> them from the SessionMap after timing out, and then the next time it tries
>>>>>> to save the map it won't be oversized. If that works, you could then look
>>>>>> into getting newer kernels on the clients to avoid hitting the issue again
>>>>>> -- the #10449 ticket has some pointers about which kernel changes were
>>>>>> relevant.
>>>>>>
>>>>>> Cheers,
>>>>>> John
>>>>>
>>>>>
>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
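As a rough illustration of the two client-side steps discussed in this thread -- spotting a stuck request in mdsc, and John Spray's suggested forced-unmount workaround -- here is a sketch. The mount point is a placeholder, and it assumes the request ID is the first field of each mdsc line (the numbers Adam quotes above):

    # In-flight MDS requests, lowest request ID first; a stuck request from
    # this bug shows up with an ID far below the others (~16.4M vs ~18.9M here).
    sort -n /sys/kernel/debug/ceph/*/mdsc | head

    # While the MDS is offline, forcibly unmount the misbehaving client so the
    # MDS drops its session during clientreplay and the SessionMap shrinks
    # enough to be written. /mnt/cephfs is a placeholder mount point.
    umount -f /mnt/cephfs
    # umount -l /mnt/cephfs   # lazy unmount as a fallback if -f blocks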