Why is the client frozen in the first place? Is this because it somehow
lost its connection to the mon (I have not found anything about this yet)?
How can I prevent this? Can I make the client reconnect in less than
15 minutes, to lessen the impact?

Best regards,

On 12/04/2018 07:41 PM, Gregory Farnum wrote:
> Yes, this is exactly it with the "reconnect denied".
> -Greg
>
> On Tue, Dec 4, 2018 at 3:00 AM NingLi <lining916740672@xxxxxxxxxx> wrote:
>
>> Hi, maybe this reference can help you:
>>
>> http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs
>>
>>> On Dec 4, 2018, at 18:55, ceph@xxxxxxxxxxxxxx wrote:
>>>
>>> Hi,
>>>
>>> I get some wild freezes using cephfs with the kernel driver.
>>> For instance:
>>> [Tue Dec 4 10:57:48 2018] libceph: mon1 10.5.0.88:6789 session lost,
>>> hunting for new mon
>>> [Tue Dec 4 10:57:48 2018] libceph: mon2 10.5.0.89:6789 session established
>>> [Tue Dec 4 10:58:20 2018] ceph: mds0 caps stale
>>> [..] server is now frozen, filesystem accesses are stuck
>>> [Tue Dec 4 11:13:02 2018] libceph: mds0 10.5.0.88:6804 socket closed
>>> (con state OPEN)
>>> [Tue Dec 4 11:13:03 2018] libceph: mds0 10.5.0.88:6804 connection reset
>>> [Tue Dec 4 11:13:03 2018] libceph: reset on mds0
>>> [Tue Dec 4 11:13:03 2018] ceph: mds0 closed our session
>>> [Tue Dec 4 11:13:03 2018] ceph: mds0 reconnect start
>>> [Tue Dec 4 11:13:04 2018] ceph: mds0 reconnect denied
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 000000003f1ae609 1099692263746
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 00000000ccd58b71 1099692263749
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 00000000da5acf8f 1099692263750
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 000000005ddc2fcf 1099692263751
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 00000000469a70f4 1099692263754
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 000000005c0038f9 1099692263757
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 00000000e7288aa2 1099692263758
>>> [Tue Dec 4 11:13:04 2018] ceph: dropping dirty+flushing Fw state for
>>> 00000000b431209a 1099692263759
>>> [Tue Dec 4 11:13:04 2018] libceph: mds0 10.5.0.88:6804 socket closed
>>> (con state NEGOTIATING)
>>> [Tue Dec 4 11:13:31 2018] libceph: osd12 10.5.0.89:6805 socket closed
>>> (con state OPEN)
>>> [Tue Dec 4 11:13:35 2018] libceph: osd17 10.5.0.89:6800 socket closed
>>> (con state OPEN)
>>> [Tue Dec 4 11:13:35 2018] libceph: osd9 10.5.0.88:6813 socket closed
>>> (con state OPEN)
>>> [Tue Dec 4 11:13:41 2018] libceph: osd0 10.5.0.87:6800 socket closed
>>> (con state OPEN)
>>>
>>> Kernel 4.17 is used; we got the same issue with 4.18.
>>> Ceph 13.2.1 is used.
>>> From what I understand, the kernel hangs itself for some reason (perhaps
>>> it simply cannot handle some wild event).
>>>
>>> Is there a fix for that?
>>>
>>> Secondly, it seems that the kernel client reconnects by itself after
>>> 15 minutes every time.
>>> Where is that tunable? Could I lower that variable, so that the hang
>>> has less impact?
>>>
>>> In ceph.log, I get "Health check failed: 1 MDSs report slow requests
>>> (MDS_SLOW_REQUEST)", but this is probably the consequence, not the cause.
>>>
>>> Any tip?
>>>
>>> Best regards,
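
PS, for anyone finding this in the archives: below is roughly what I am
using to inspect the MDS session state and to recover a stuck mount,
following the troubleshooting page NingLi linked. The daemon name
(mds.a), the mount point (/mnt/cephfs) and the mount credentials are just
examples from my own setup, and I am not at all certain that
session_autoclose is the tunable behind the 15-minute delay, so take the
last part as an untested guess rather than an answer.

# Inspect the session-related settings the MDS is actually running with
# (run on the MDS host; "mds.a" is just my daemon's name)
ceph daemon mds.a config show | grep -E 'session_(timeout|autoclose)'

# List the client sessions the MDS knows about and their state
ceph daemon mds.a session ls

# After a "reconnect denied" the client has been evicted and its dirty
# data is dropped; per the troubleshooting page above, the clean way back
# is a forced unmount + remount on the client (mount point and options
# are from my setup):
umount -f /mnt/cephfs
mount -t ceph 10.5.0.87:6789,10.5.0.88:6789,10.5.0.89:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret

# Untested guess at shortening the window before the MDS gives up on a
# stale client session -- on some releases this is a filesystem setting
# ("ceph fs set <fsname> session_autoclose <seconds>") rather than a
# plain MDS config option:
ceph tell mds.* injectargs '--mds_session_autoclose=300'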