Hello everyone,

I'm seeing some OSD behavior that I consider unexpected; perhaps someone can shed some insight. This is Ceph giant (0.87.0), with osd max backfills and osd recovery max active both set to 1.

Please take a moment to look at the following "ceph health detail" screen dump:

HEALTH_WARN 14 pgs backfill; 1 pgs backfilling; 15 pgs stuck unclean; recovery 16/65732491 objects degraded (0.000%); 328254/65732491 objects misplaced (0.499%)
pg 20.3db is stuck unclean for 13547.432043, current state active+remapped+wait_backfill, last acting [45,90,157]
pg 15.318 is stuck unclean for 13547.380581, current state active+remapped+wait_backfill, last acting [41,17,120]
pg 15.34a is stuck unclean for 13548.115170, current state active+remapped+wait_backfill, last acting [64,87,80]
pg 20.6f is stuck unclean for 13548.019218, current state active+remapped+wait_backfill, last acting [13,38,98]
pg 20.44c is stuck unclean for 13548.075430, current state active+remapped+wait_backfill, last acting [174,127,139]
pg 20.bc is stuck unclean for 13545.743397, current state active+remapped+wait_backfill, last acting [72,64,104]
pg 15.1ac is stuck unclean for 13548.181461, current state active+remapped+wait_backfill, last acting [121,145,84]
pg 15.1af is stuck unclean for 13547.962269, current state active+remapped+backfilling, last acting [150,62,101]
pg 20.396 is stuck unclean for 13547.835109, current state active+remapped+wait_backfill, last acting [134,49,96]
pg 15.1ba is stuck unclean for 13548.128752, current state active+remapped+wait_backfill, last acting [122,63,162]
pg 15.3fd is stuck unclean for 13547.644431, current state active+remapped+wait_backfill, last acting [156,38,131]
pg 20.41c is stuck unclean for 13548.133470, current state active+remapped+wait_backfill, last acting [78,85,168]
pg 20.525 is stuck unclean for 13545.272774, current state active+remapped+wait_backfill, last acting [76,57,148]
pg 15.1ca is stuck unclean for 13547.944928, current state active+remapped+wait_backfill, last acting [157,19,36]
pg 20.11e is stuck unclean for 13545.368614, current state active+remapped+wait_backfill, last acting [36,134,8]
pg 20.525 is active+remapped+wait_backfill, acting [76,57,148]
pg 20.44c is active+remapped+wait_backfill, acting [174,127,139]
pg 20.41c is active+remapped+wait_backfill, acting [78,85,168]
pg 15.3fd is active+remapped+wait_backfill, acting [156,38,131]
pg 20.3db is active+remapped+wait_backfill, acting [45,90,157]
pg 20.396 is active+remapped+wait_backfill, acting [134,49,96]
pg 15.34a is active+remapped+wait_backfill, acting [64,87,80]
pg 15.318 is active+remapped+wait_backfill, acting [41,17,120]
pg 15.1ca is active+remapped+wait_backfill, acting [157,19,36]
pg 15.1ba is active+remapped+wait_backfill, acting [122,63,162]
pg 15.1ac is active+remapped+wait_backfill, acting [121,145,84]
pg 15.1af is active+remapped+backfilling, acting [150,62,101]
pg 20.11e is active+remapped+wait_backfill, acting [36,134,8]
pg 20.bc is active+remapped+wait_backfill, acting [72,64,104]
pg 20.6f is active+remapped+wait_backfill, acting [13,38,98]
recovery 16/65732491 objects degraded (0.000%); 328254/65732491 objects misplaced (0.499%)

As you can see, there is barely any overlap between the acting OSDs for those PGs. osd max backfills should only limit the number of concurrent backfills out of a single OSD, so in the situation above I would expect the 15 backfills to proceed mostly concurrently.
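In case it matters, here is roughly how I'm reading (and could raise) the two throttles at runtime; the admin socket path is just the stock default, and osd.45 is merely one example OSD picked from the listing above:

    # query the live values on one OSD via its local admin socket
    ceph --admin-daemon /var/run/ceph/ceph-osd.45.asok config get osd_max_backfills
    ceph --admin-daemon /var/run/ceph/ceph-osd.45.asok config get osd_recovery_max_active

    # or raise them cluster-wide on the fly, without restarting any OSDs
    ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'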
As it is, they are being serialized, which seems to needlessly slow down the process and extend the time needed to complete recovery.

I'm pretty sure I'm missing something obvious here, but what is it? All insight greatly appreciated. :)

Thank you!

Cheers,
Florian