Re: How's cephfs going?

Дмитрий Глушенок <glush@xxxxxxxxxx> · Thu, 20 Jul 2017 19:35:22 +0300

Hi Ilya,

While trying to reproduce the issue I've found that:
- it is relatively easy to reproduce hangs of 5-6 minutes just by killing the active MDS process (triggering failover) while writing a lot of data. That is an unacceptable timeout, but it is not the case described in http://tracker.ceph.com/issues/15255
- it is hard to reproduce the endless hang (I've spent an hour on it without success)

One thing I've noticed while analysing the logs is that the "endless hang" was always accompanied by the following messages:
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session established
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session lost, hunting for new mon
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session established
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
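
To check other logs for the same pattern, here is a minimal Python sketch that extracts the libceph monitor session events and prints the gaps between consecutive "session lost" messages. The regular expression is only an assumption based on the lines quoted above, the function names are made up for illustration, and the year is passed in because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Pattern inferred from the kernel messages quoted above (an assumption,
# not an official libceph log format specification).
MON_EVENT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*libceph: "
    r"(?P<mon>mon\d+) \S+ session (?P<event>lost|established)"
)

SAMPLE_LOG = """\
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session established
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 session lost, hunting for new mon
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session established
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 session lost, hunting for new mon
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session lost, hunting for new mon
Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 session established
"""


def mon_session_events(lines, year=2017):
    """Return (timestamp, mon, event) tuples for libceph mon session lines."""
    events = []
    for line in lines:
        m = MON_EVENT.search(line)
        if m:
            # Syslog timestamps have no year, so the caller supplies one.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["mon"], m["event"]))
    return events


def lost_intervals(events):
    """Seconds between consecutive 'session lost' events."""
    lost = [ts for ts, _, event in events if event == "lost"]
    return [int((b - a).total_seconds()) for a, b in zip(lost, lost[1:])]


if __name__ == "__main__":
    events = mon_session_events(SAMPLE_LOG.splitlines())
    print(lost_intervals(events))  # prints [30, 30, 31, 30]
```

On the excerpt above, the monitor session is lost roughly every 30 seconds, which is the signature to look for in other logs.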

Bug http://tracker.ceph.com/issues/17664 describes this behaviour, and it was fixed in releases starting with v11.1.0 (I'm using 10.2.7). So the lost session somehow triggers client disconnection and fencing (as described at http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs).

Do you still think it should be posted to http://tracker.ceph.com/issues/15255 ?

On 20 July 2017, at 17:02, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

On Thu, Jul 20, 2017 at 3:23 PM, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:
It looks like I have an issue similar to the one described in this bug:
http://tracker.ceph.com/issues/15255
The writer (dd in my case) can be restarted and then writing continues, but
until it is restarted dd appears to hang on write.

On 20 July 2017, at 16:12, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:

Hi,

I repeated the test using kernel 4.12.0. An OSD node crash seems to be handled
fine now, but an MDS crash still leads to hung writes to CephFS. This time it
was enough just to crash the first MDS: failover didn't happen. At the same
time a FUSE client was running on another client, with no problems.

Could you please post the exact steps for reproducing with 4.12 to that
ticket?  It sounds like something that should be prioritized.

Thanks,

                Ilya

--
Dmitry Glushenok
Jet Infosystems
