No graceful handling of a maxed-out cluster network with noup/nodown set.

Hi,

I'm playing with our new Ceph cluster and it seems that Ceph does not gracefully handle a maxed-out cluster network.

I had some "flapping" nodes once every few minutes when pushing a lot of traffic to them, so I decided to set the noup and nodown flags as described in the docs:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
After setting these flags the setup actually breaks: the cluster starts complaining about slow requests and stops processing traffic entirely.
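For reference, the flags can be set and cleared with the standard ceph CLI. This is only a sketch of the commands involved; they need a running cluster and an admin keyring:

```shell
# Tell the monitors to ignore OSD up/down state changes
# (the combination that triggered the slow requests here).
ceph osd set noup
ceph osd set nodown

# Verify which flags are currently active.
ceph osd dump | grep flags

# Clear the flags again; traffic resumes after this.
ceph osd unset noup
ceph osd unset nodown
```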

ceph -w shows the following:
2013-11-20 08:02:20.031412 osd.4 [WRN] slow request 120.991605 seconds old, received at 2013-11-20 08:00:19.039748: osd_op(client.4650.0:46 benchmark_data_fqdn_hostname_9016_object45 [write 0~4194304] 3.a11ea1e6 e158) v4 currently waiting for subops from [17,26]

When I unset noup and nodown, things start working again.
So I'm inclined to just accept the flapping for now, since, apart from some short flapping entries in the Ceph logs, things actually do keep working.
(Also, this is rados bench; actual traffic may well be I/O-limited.)
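The traffic here came from rados bench. A minimal invocation looks roughly like the following (the pool name "testpool" is just an example):

```shell
# Write 4 MiB objects to pool "testpool" (example name) for 60 seconds,
# with the default 16 concurrent operations.
rados bench -p testpool 60 write

# Remove the benchmark objects afterwards.
rados -p testpool cleanup
```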

Suggestions?

Thx,
Robert
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux