Re: Reducing backfilling/recovery long tail

On 12/12/2014 17:12, Sage Weil wrote:
> On Fri, 12 Dec 2014, Loic Dachary wrote:
>> Hi Sam & Sage,
>>
>> In the context of http://tracker.ceph.com/issues/9566 I'm inclined to 
>> think the best solution would be for the AsyncReserver to choose a PG 
>> instead of just picking the next one in the list when there is a free 
>> slot. It would always choose a PG that must move to/from the OSD for 
>> which there are more PGs waiting in the AsyncReserver than for any 
>> other OSD. The sort involved does not seem too expensive.
>>
>> Calculating priorities before adding the PG to the AsyncReserver seems 
>> wrong because the state of the system will change significantly while 
>> the PG is waiting to be processed. For instance, the first PGs to be 
>> added have a low priority while later ones get increasing priorities 
>> as they accumulate. If reservations are canceled because the OSD map 
>> changed again (maybe another OSD is decommissioned before recovery of 
>> the first one completes), you may end up with high priorities for PGs 
>> that are no longer associated with busy OSDs. That could backfire and 
>> create even more frequent long tails.
>>
>> What do you think ?
> 
> That makes sense.  In order to make that decision, it means that the OSDs 
> need to be sharing the level of recovery work they have pending on a 
> regular basis, right?
>  

It may not be necessary. The local_reserver is populated with all PGs that need to move. Say 50 of them are for osd.0 and 10 are for osd.1. The decision is made to schedule a PG for osd.0 because it has more PGs to go. This PG will then try to get a remote_reserver slot on osd.0: if it turns out that osd.0 is already busy, it will be queued. Up to osd_max_backfill PGs can be queued for a given OSD in the remote_reserver in this way, because only osd_max_backfill PGs will get a slot in the local_reserver.

Since the remote_reserver queue is capped by osd_max_backfill, its length does not accurately reflect the workload associated with an OSD. For this reason the priority could be set when asking for the remote reservation (using the priority field we currently have) to reflect the workload. If the workload changes while PGs are waiting in the remote_reserver queue, these PGs may end up with a priority that is sub-optimal. That is probably an acceptable tradeoff, since it impacts only osd_max_backfill PGs per OSD. In contrast, hundreds of PGs could be queued in the local_reserver, and setting a priority for them at the time they are queued could have lasting undesirable side effects.
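To make this concrete, here is a rough self-contained sketch of what I have in mind. This is plain C++, not the actual AsyncReserver code; PendingPG, pick_next and the priority derivation are made-up names for illustration. It grants the next slot to a PG whose target OSD has the deepest wait queue, and derives the remote reservation priority from that queue depth at grant time rather than at enqueue time:

// Sketch only: choose the next PG to grant a local_reserver slot by
// picking one whose target OSD has the most PGs still waiting, and
// compute the remote reservation priority from the current queue depth.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct PendingPG {
  uint64_t pgid;
  int      target_osd;  // OSD this PG must move to/from
};

// Among waiting PGs, return the index of one whose target OSD
// currently has the deepest wait queue.
static size_t pick_next(const std::vector<PendingPG>& waiting) {
  std::map<int, int> per_osd;  // target_osd -> number of waiting PGs
  for (const auto& pg : waiting)
    ++per_osd[pg.target_osd];
  size_t best = 0;
  for (size_t i = 1; i < waiting.size(); ++i)
    if (per_osd[waiting[i].target_osd] > per_osd[waiting[best].target_osd])
      best = i;
  return best;
}

int main() {
  // 50 PGs headed for osd.0 and 10 for osd.1, as in the example above.
  std::vector<PendingPG> waiting;
  for (uint64_t i = 0; i < 50; ++i) waiting.push_back({i, 0});
  for (uint64_t i = 50; i < 60; ++i) waiting.push_back({i, 1});

  size_t idx = pick_next(waiting);

  // Priority derived from the queue depth *now*, so it reflects the
  // current workload instead of the workload at enqueue time.
  std::map<int, int> per_osd;
  for (const auto& pg : waiting)
    ++per_osd[pg.target_osd];
  int prio = per_osd[waiting[idx].target_osd];

  std::cout << "grant slot to pg " << waiting[idx].pgid
            << " (osd." << waiting[idx].target_osd
            << ", remote priority " << prio << ")\n";
  return 0;
}

The point is that the choice and the priority are both computed from the state of the queues at the moment a slot frees up, so a stale priority can never outlive a single grant.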

I should probably enumerate the steps of an actual situation to clarify my thinking :-)

Cheers

> sage
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


