Re: Improving latency and ordering of the backfilling workload

Loic Dachary <loic@xxxxxxxxxxx> · Mon, 15 Dec 2014 19:13:14 +0100

On 15/12/2014 19:03, Sage Weil wrote:
> On Mon, 15 Dec 2014, Loic Dachary wrote:
>> On 15/12/2014 18:20, Sage Weil wrote:
>>> On Mon, 15 Dec 2014, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> On 15/12/2014 17:44, Sage Weil wrote:
>>>>> On Mon, 15 Dec 2014, Loic Dachary wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>> Here is what could be done (in the context of
>>>>>> http://tracker.ceph.com/issues/9566), please let me know if that makes sense:
>>>>>>
>>>>>> * ordering:
>>>>>>
>>>>>>   * when dequeuing a pending local reservation, choose one that contains 
>>>>>> a PG that belongs to the busiest OSD (i.e. the OSD for which there are 
>>>>>> more PGs waiting for a local reservation than any other)
>>>>>
>>>>> I'm worried the reservation count won't be an accurate enough proxy for 
>>>>> the amount of work the remote OSD has to do.  
>>>>
>>>> Are you thinking about taking into account the number and size of 
>>>> objects in a given PG? The length of the local reservation queue 
>>>> accurately reflects the number of PGs that need work (because the length 
>>>> of the reservation queue is not bounded). But it does not reflect the 
>>>> content of the PGs at all, indeed.
>>>
>>> Including that information could help, yeah, but the main thing is that 
>>> any estimate of "the busiest OSD" based on local information is going to 
>>> be weak if it's only based on reservation requests.  
>>
>> What other information would be relevant in addition to the number of 
>> PGs that need to backfill and their size (objects & bytes) ?
> 
> Maybe the background client workload?  If an OSD is more heavily loaded 
> than others then it should probably start its recovery sooner as its rate 
> of progress will be a bit lower.
> 
>>> Unless that 
>>> information is refreshed periodically by the requesting OSD (I think we 
>>> also discussed that a bit last week).
>>
>> I tried to take that into account by proposing to calculate the priority 
>> when the reservation is dequeued from the waiting list instead of when 
>> it is added to the waiting list. When the local reservation is dequeued, 
>> it gets one of the osd_max_backfill slots in the AsyncReserver and will 
>> then get work to do : the delay between calculating the priority and 
>> actual backfilling is minimal. The delay actually is the latency between 
>> when the remote reservation is sent and when it comes back successfully. 
>> By adding the priority to the remote reservation request, we make the 
>> peer OSD aware of the local priority and compare it with the priority of 
>> the other OSDs asking for a remote reservation. The peer OSD will 
>> grant us a remote reservation quickly if we are the OSD declaring to 
>> have the most work to do.
>>
>> I sense you have something else in mind in terms of algorithm and/or 
>> data sources. Hopefully this explanation will allow you to see what I'm 
>> missing and guide me ;-)
> 
> Oh, I see.  That sounds very reasonable.  I suspect that even with this 
> approach it will help to periodically refresh that reservation, 
> though, as the remote OSD may have lots of people contending for recovery.  
> Whoever is not first in line will be there for a while and their priority 
> will likely be less than accurate by the time the next item is dequeued 
> there?

The priority is attached to each reservation and is relative to one PG reservation request. The remote reservation priority will be reconsidered each time a new PG asks for a remote reservation (because it will use the priority queues of the AsyncReserver). If we want to revise the priority during the backfilling of a given PG that already has a local+remote slot allocated to it, it means we should periodically consider cancelling an ongoing backfill operation to give a chance to another, maybe busier, OSD.

Am I following ?
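
To make the ordering idea concrete, here is a rough Python sketch of "compute the priority when the reservation is dequeued, not when it is queued". It is only an illustration under assumptions: ToyReserver, request(), dequeue() and release() are hypothetical names and not the AsyncReserver interface, max_slots merely plays the role of osd_max_backfill, and "busiest OSD" is taken to mean the OSD appearing in the most waiting PGs.

```python
from collections import Counter


class ToyReserver:
    """Toy stand-in for a local reservation queue (hypothetical, not Ceph code)."""

    def __init__(self, max_slots):
        self.max_slots = max_slots   # plays the role of osd_max_backfill
        self.waiting = []            # (pg_id, peer_osds) in arrival order
        self.in_progress = set()     # PGs currently holding a slot

    def request(self, pg_id, peer_osds):
        # No priority is stored here: it is computed later, at dequeue time.
        self.waiting.append((pg_id, list(peer_osds)))

    def dequeue(self):
        """Grant a slot to a PG that involves the busiest OSD, as seen right now."""
        if not self.waiting or len(self.in_progress) >= self.max_slots:
            return None
        # Number of waiting PGs per OSD, based on the current waiting list.
        load = Counter(osd for _, osds in self.waiting for osd in osds)

        def priority(item):
            return max(load[osd] for osd in item[1])

        # max() keeps the first maximal element, so ties go to the oldest request.
        chosen = max(self.waiting, key=priority)
        self.waiting.remove(chosen)
        self.in_progress.add(chosen[0])
        # The priority is what would be attached to the remote reservation request.
        return chosen[0], priority(chosen)

    def release(self, pg_id):
        self.in_progress.discard(pg_id)


reserver = ToyReserver(max_slots=1)
reserver.request("1.a", [3])
reserver.request("1.b", [5])
reserver.request("1.c", [5, 7])
first = reserver.dequeue()    # OSD 5 has two waiting PGs, so "1.b" goes first
blocked = reserver.dequeue()  # no free slot until "1.b" is released
reserver.release("1.b")
second = reserver.dequeue()   # loads are now equal, the oldest request wins
```

Because the load is recounted on every dequeue, a request that has waited for a long time is ranked with fresh information when a slot frees up. A PG that already holds a slot is never re-ranked, which is exactly the case where cancelling an ongoing backfill would be needed.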

> 
> Sorry if my drive-by suggestions aren't helping; I'm only half following 
> this discussion!

It's helping a lot !

> sage
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
