Hi, On 07/12/13 19:57, Edwin Peer wrote:
Seconds of downtime is quite severe, especially for a planned shutdown or rejoin. I can understand that when an OSD simply disappears, some requests might still be directed to the now-gone node, but I see similar latency hiccups on scheduled shutdowns and rejoins too?
Have you tried reducing "osd recovery max active" and "osd max backfills" from their defaults? There are also options to lower the priority of recovery traffic.
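As a sketch, the options above could be set in the [osd] section of ceph.conf; the values here are illustrative starting points for throttling recovery, not recommendations:

```
[osd]
; limit concurrent backfill operations per OSD (default 10)
osd max backfills = 1
; limit concurrent recovery operations per OSD (default 5)
osd recovery max active = 1
; lower recovery priority relative to client ops (default 10)
osd recovery op priority = 5
```

A restart of the OSDs (or runtime injection) is needed for the changes to take effect.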
The defaults are a priority of 63 for client ops and 10 for recovery, with "osd recovery max active" at 5 and 2 I/O threads per OSD. I noticed that reducing "osd recovery max active" to 1 lowered the I/O latency penalty while recovery is active.
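If you want to try this without restarting anything, the setting can be injected into running OSDs; a sketch (the value 1 is just the figure mentioned above):

```
# push the new limit to all OSDs at runtime
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
```

Note that injected values do not survive an OSD restart unless they are also written to ceph.conf.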
Reweighting an OSD to 0.9 should be enough to let you see how your cluster performs under recovery.
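For example (the OSD id 12 is arbitrary), triggering a small rebalance and watching its effect on latency might look like:

```
# nudge one OSD's weight down to start a modest data migration
ceph osd reweight 12 0.9

# watch recovery progress and client I/O behaviour while it runs
ceph -w

# restore the original weight afterwards
ceph osd reweight 12 1.0
```

This only moves a fraction of the data, so it is a fairly safe way to observe recovery impact on a live cluster.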
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com