Hello,

yesterday I upgraded our busiest (in other words, lethally overloaded) production cluster to the latest Firefly release, in preparation for a Hammer upgrade and the subsequent phasing in of a cache tier.

Restarting the OSDs took 3 minutes (1 minute on an immediate repeat, done to test the impact of primed caches), during which the cluster crawled to a near standstill and the dreaded slow requests piled up, causing applications in the VMs to fail.

I had of course set "noout" beforehand, in hopes of staving off this kind of scenario.

Note that the other OSDs and their backing storage were NOT overloaded during that time; only the backing storage of the OSD being restarted was under duress.

I was under the (wishful thinking?) impression that with noout set and a controlled OSD shutdown/restart, operations would be redirected to the new primary for the duration.
I was prepared for the strain on the restarted OSDs while they recovered those operations (which I also saw), but not for the near screeching halt.

Any thoughts on how to mitigate this further, or is this the expected behavior?

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/