No graceful handling of a maxed-out cluster network with noup/nodown set.

Hi,

I'm playing with our new Ceph cluster and it seems that Ceph does not gracefully handle a maxed-out cluster network.

I had some "flapping" nodes once every few minutes when pushing a lot of traffic to them, so I decided to set the noup and nodown flags as described in the docs:
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
After setting these flags the setup actually breaks: the cluster starts complaining about slow requests and stops processing traffic entirely.
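For reference, the flags can be set and cleared with the standard ceph CLI. This is only a sketch of the commands involved; they need a running cluster and an admin keyring:

```shell
# Tell the monitors to ignore OSD up/down state changes
# (the combination that triggered the slow requests here).
ceph osd set noup
ceph osd set nodown

# Verify which flags are currently active.
ceph osd dump | grep flags

# Clear the flags again; traffic resumes after this.
ceph osd unset noup
ceph osd unset nodown
```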

ceph -w shows the following:
2013-11-20 08:02:20.031412 osd.4 [WRN] slow request 120.991605 seconds old, received at 2013-11-20 08:00:19.039748: osd_op(client.4650.0:46 benchmark_data_fqdn_hostname_9016_object45 [write 0~4194304] 3.a11ea1e6 e158) v4 currently waiting for subops from [17,26]

When I unset noup and nodown, things start working again.
So I'm inclined to just accept the flapping for now, since, apart from some short flapping entries in the Ceph logs, things actually do keep working.
(Also, this is rados bench; actual traffic may well be I/O-limited.)
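The traffic here came from rados bench. A minimal invocation looks roughly like the following (the pool name "testpool" is just an example):

```shell
# Write 4 MiB objects to pool "testpool" (example name) for 60 seconds,
# with the default 16 concurrent operations.
rados bench -p testpool 60 write

# Remove the benchmark objects afterwards.
rados -p testpool cleanup
```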

Suggestions?

Thx,
Robert
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux