Unexpectedly low number of concurrent backfills

Florian Haas <florian@xxxxxxxxxxx> · Tue, 17 Feb 2015 21:09:38 +0100

Hello everyone,

I'm seeing some OSD behavior that I consider unexpected; perhaps
someone can shed some light on it.

This is Ceph giant (0.87.0), with osd max backfills and osd recovery
max active both set to 1.

Please take a moment to look at the following "ceph health detail" output:

HEALTH_WARN 14 pgs backfill; 1 pgs backfilling; 15 pgs stuck unclean;
recovery 16/65732491 objects degraded (0.000%); 328254/65732491
objects misplaced (0.499%)
pg 20.3db is stuck unclean for 13547.432043, current state
active+remapped+wait_backfill, last acting [45,90,157]
pg 15.318 is stuck unclean for 13547.380581, current state
active+remapped+wait_backfill, last acting [41,17,120]
pg 15.34a is stuck unclean for 13548.115170, current state
active+remapped+wait_backfill, last acting [64,87,80]
pg 20.6f is stuck unclean for 13548.019218, current state
active+remapped+wait_backfill, last acting [13,38,98]
pg 20.44c is stuck unclean for 13548.075430, current state
active+remapped+wait_backfill, last acting [174,127,139]
pg 20.bc is stuck unclean for 13545.743397, current state
active+remapped+wait_backfill, last acting [72,64,104]
pg 15.1ac is stuck unclean for 13548.181461, current state
active+remapped+wait_backfill, last acting [121,145,84]
pg 15.1af is stuck unclean for 13547.962269, current state
active+remapped+backfilling, last acting [150,62,101]
pg 20.396 is stuck unclean for 13547.835109, current state
active+remapped+wait_backfill, last acting [134,49,96]
pg 15.1ba is stuck unclean for 13548.128752, current state
active+remapped+wait_backfill, last acting [122,63,162]
pg 15.3fd is stuck unclean for 13547.644431, current state
active+remapped+wait_backfill, last acting [156,38,131]
pg 20.41c is stuck unclean for 13548.133470, current state
active+remapped+wait_backfill, last acting [78,85,168]
pg 20.525 is stuck unclean for 13545.272774, current state
active+remapped+wait_backfill, last acting [76,57,148]
pg 15.1ca is stuck unclean for 13547.944928, current state
active+remapped+wait_backfill, last acting [157,19,36]
pg 20.11e is stuck unclean for 13545.368614, current state
active+remapped+wait_backfill, last acting [36,134,8]
pg 20.525 is active+remapped+wait_backfill, acting [76,57,148]
pg 20.44c is active+remapped+wait_backfill, acting [174,127,139]
pg 20.41c is active+remapped+wait_backfill, acting [78,85,168]
pg 15.3fd is active+remapped+wait_backfill, acting [156,38,131]
pg 20.3db is active+remapped+wait_backfill, acting [45,90,157]
pg 20.396 is active+remapped+wait_backfill, acting [134,49,96]
pg 15.34a is active+remapped+wait_backfill, acting [64,87,80]
pg 15.318 is active+remapped+wait_backfill, acting [41,17,120]
pg 15.1ca is active+remapped+wait_backfill, acting [157,19,36]
pg 15.1ba is active+remapped+wait_backfill, acting [122,63,162]
pg 15.1ac is active+remapped+wait_backfill, acting [121,145,84]
pg 15.1af is active+remapped+backfilling, acting [150,62,101]
pg 20.11e is active+remapped+wait_backfill, acting [36,134,8]
pg 20.bc is active+remapped+wait_backfill, acting [72,64,104]
pg 20.6f is active+remapped+wait_backfill, acting [13,38,98]
recovery 16/65732491 objects degraded (0.000%); 328254/65732491
objects misplaced (0.499%)

As you can see, there is barely any overlap between the acting OSDs
for those PGs: only five OSDs (36, 38, 64, 134 and 157) appear in more
than one acting set. As I understand it, osd max backfills should only
limit the number of concurrent backfills out of a single OSD, and so
in the situation above I would expect the 15 backfills to happen
mostly concurrently. As it is, they are being serialized (one PG is
backfilling while 14 wait), and that seems to needlessly slow down
the process and extend the time needed to complete recovery.
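
To put a number on that expectation, here is a minimal Python sketch
that counts the OSDs shared between the acting sets listed above and
greedily picks PGs that could run side by side. It rests on a
simplifying assumption that may well be wrong: it treats the limit as
"at most one backfill per OSD in the acting set". The output above does
not show the up sets or backfill targets, so OSDs outside the acting
sets are not accounted for. The function names are made up for
illustration and are not part of any Ceph tool.

```python
from collections import Counter

# Acting sets copied from the "ceph health detail" output above.
ACTING = {
    "20.3db": [45, 90, 157],
    "15.318": [41, 17, 120],
    "15.34a": [64, 87, 80],
    "20.6f": [13, 38, 98],
    "20.44c": [174, 127, 139],
    "20.bc": [72, 64, 104],
    "15.1ac": [121, 145, 84],
    "15.1af": [150, 62, 101],
    "20.396": [134, 49, 96],
    "15.1ba": [122, 63, 162],
    "15.3fd": [156, 38, 131],
    "20.41c": [78, 85, 168],
    "20.525": [76, 57, 148],
    "15.1ca": [157, 19, 36],
    "20.11e": [36, 134, 8],
}


def shared_osds(acting):
    """Return {osd: count} for OSDs that appear in more than one acting set."""
    counts = Counter(osd for osds in acting.values() for osd in osds)
    return {osd: n for osd, n in counts.items() if n > 1}


def greedy_concurrent(acting, per_osd_limit=1):
    """Greedily pick PGs so that no OSD is used more than per_osd_limit times.

    ASSUMPTION: the limit applies per OSD in the acting set only. This is a
    toy model of the expectation stated above, not Ceph's reservation logic.
    """
    in_use = Counter()
    chosen = []
    for pg, osds in acting.items():
        if all(in_use[osd] < per_osd_limit for osd in osds):
            chosen.append(pg)
            in_use.update(osds)
    return chosen


if __name__ == "__main__":
    print("OSDs in more than one acting set:", sorted(shared_osds(ACTING)))
    print("PGs that could run together in this model:",
          len(greedy_concurrent(ACTING)))
```

Under that toy model, 11 of the 15 PGs could backfill at the same time
with a per-OSD limit of 1, which is why seeing a single PG in
backfilling state looks surprising to me.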

I'm pretty sure I'm missing something obvious here, but what is it?

All insight greatly appreciated. :) Thank you!

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com