On 10/09/2015 22:56, Robert LeBlanc wrote:
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.

It seems you've already exhausted most of the ways I know. When confronted with this situation, I used a simple script to throttle backfills (freezing them, then re-enabling them). This helped our VMs at the time, but you must be prepared for very long migrations and some experimentation with different schedules. You simply pass it the number of seconds backfills are allowed to proceed, then the number of seconds during which they pause. Here's the script, which should be self-explanatory: http://pastebin.com/sy7h1VEy (a rough sketch of the idea is also included after my signature).

Something like:

./throttler 10 120

limited the impact on our VMs (the idea being that during the 10s window the backfills won't be able to trigger filestore syncs, and the 120s pause will allow the filestore syncs to remove "dirty" data from the journals without interfering too much with concurrent writes). I believe you must have a high filestore sync value to hope to benefit from this (we use 30s). At the very least, the long pause will eventually allow the VMs to move data to disk regularly instead of being nearly frozen.

Note that your PGs are more than 10G each; if the OSDs can't stop a backfill before finishing transferring the current PG, this won't help (I assume backfills go through the journals, and they probably won't be able to act as write-back caches anymore, as a single PG will be enough to fill them up).

Best regards,

Lionel
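
PS: a minimal sketch of what a throttler along these lines could look like. This is a reconstruction from the description above, assuming backfills are frozen via the cluster-wide nobackfill flag; the actual script behind the pastebin link may differ:

#!/bin/sh
# throttler: alternately allow and freeze backfills, forever.
# Usage: ./throttler <run_seconds> <pause_seconds>
RUN=$1
PAUSE=$2

# Re-enable backfills if the script is interrupted, so the cluster
# isn't left with the flag set.
trap 'ceph osd unset nobackfill; exit 0' INT TERM

while true; do
    ceph osd unset nobackfill   # let backfills proceed...
    sleep "$RUN"                # ...for RUN seconds
    ceph osd set nobackfill     # ...then freeze them again...
    sleep "$PAUSE"              # ...for PAUSE seconds
done

With ./throttler 10 120 this gives roughly a 10s-on / 120s-off duty cycle.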
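
PPS: the "filestore sync value" mentioned above is presumably filestore_max_sync_interval (5s by default). A 30s setting would look like this in ceph.conf; the exact value is an assumption and depends on how much write traffic your journals can absorb between syncs:

[osd]
filestore max sync interval = 30

If the journals fill up before the interval elapses, syncs will be forced earlier anyway, so the journals must be sized accordingly.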