KVM guests with RBD disks become inaccessible approx. 3h after one OSD node fails

Hi list,
over the weekend one of our five OSD nodes failed (it hung with a kernel
panic). The cluster became degraded (12 of 60 OSDs down); in this case our
monitoring host sets the noout flag.
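
For reference, the monitoring host essentially just issues the standard
command when it sees the node go down (simplified sketch):

  # prevent the remaining OSDs from starting to backfill
  # while the failed node is being looked at
  ceph osd set noout

  # cleared again once the node is back
  ceph osd unset noout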

But around three hours later the KVM guests that use storage on the Ceph
cluster (and write to it) became inaccessible. After restarting the failed
Ceph node the cluster was healthy again, but the VMs had to be restarted
before they worked again.

In ceph.conf I have defined "osd_pool_default_min_size = 1", so I don't
understand why this happens.
Which parameter must be changed/set so that the KVM clients keep working
on an unhealthy cluster?
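
For reference, this is how the option is set in ceph.conf (shortened to
the relevant line):

  [global]
      ...
      osd_pool_default_min_size = 1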

The Ceph version is 0.72.2; pool replication is 2.
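
In case it matters: as far as I understand, size and min_size are stored
per pool and the config option above only sets the default for newly
created pools, so I would also check the actual values of the pool (the
pool name "rbd" is just an example):

  # show the current replication and min_size of the pool
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

  # lower min_size on the existing pool if it is still at the old default
  ceph osd pool set rbd min_size 1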


Thanks for a hint.

Udo


