Hi Ilya,
While trying to reproduce the issue, I've found that:
- it is relatively easy to reproduce 5-6 minute hangs just by killing the active MDS process (triggering failover) while writing a lot of data. That is an unacceptable timeout, but it is not the case described in http://tracker.ceph.com/issues/15255
- it is hard to reproduce the endless hang (I spent an hour trying without success)
One thing I've noticed while analysing the logs is that the "endless hang" was always accompanied by the following messages:
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session established
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session lost, hunting for new mon
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session established
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
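To check for the same pattern in a longer log, something like the following rough Python sketch might help. It is not part of the original report: the function names, the regular expression, and the default year are assumptions of mine, based only on the format of the kernel messages quoted above. It extracts the libceph monitor session events and computes the gap in seconds between consecutive "session lost" messages.

```python
import re
from datetime import datetime

# Assumed pattern, derived from the syslog lines quoted above:
# "<Mon> <day> <HH:MM:SS> <host> kernel: libceph: monN <addr> session lost|established"
MON_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+libceph:\s+"
    r"(?P<mon>mon\d+)\s+(?P<addr>\S+)\s+session\s+(?P<event>lost|established)"
)


def mon_session_events(lines, year=2017):
    """Return (timestamp, mon, event) tuples for libceph mon session lines.

    Syslog timestamps carry no year, so one has to be supplied (assumption).
    """
    events = []
    for line in lines:
        match = MON_EVENT.search(line)
        if match is None:
            continue  # not a mon session message, skip it
        ts = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
        events.append((ts, match.group("mon"), match.group("event")))
    return events


def lost_intervals(events):
    """Seconds between consecutive 'session lost' events."""
    lost = [ts for ts, _mon, event in events if event == "lost"]
    return [(later - earlier).total_seconds() for earlier, later in zip(lost, lost[1:])]
```

Feeding the saved kernel log to `mon_session_events()` (for example, the lines of a file opened with `open()`) and then passing the result to `lost_intervals()` gives the spacing of the session losses. A long run of near-identical gaps would match the picture above.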
On Thu, Jul 20, 2017 at 3:23 PM, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:
Looks like I have an issue similar to the one described in this bug: http://tracker.ceph.com/issues/15255
The writer (dd in my case) can be restarted, and then writing continues, but until the restart dd appears to be hung on write.
On July 20, 2017, at 16:12, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:
Hi,
Repeated the test using kernel 4.12.0. An OSD node crash seems to be handled fine now, but an MDS crash still leads to hung writes to CephFS. This time it was enough just to crash the first MDS: failover didn't happen. At the same time, a FUSE client was running on another client machine, with no problems.
Could you please post the exact steps for reproducing with 4.12 to that ticket? It sounds like something that should be prioritized. Thanks, Ilya
-- Dmitry Glushenok Jet Infosystems
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com