Re: why degraded object must recover before accept new io

zengran zhang <z13121369189@xxxxxxxxx> · Sat, 4 Aug 2018 09:30:23 +0800

>> On Wed, May 30, 2018 at 9:19 AM, zengran zhang <z13121369189@xxxxxxxxx> wrote:
>>> Hi
>>>     Let's say the acting set is [3, 1, 0] and obj1 was marked missing on
>>> osd.0 after peering. New IO on obj1 will wait until obj1 is recovered.
>>> So my question is: why can't we do the new IO on [3, 1] and let osd.0
>>> keep missing obj1 without waiting for recovery, with osd.0 updating only
>>> its pglog, like backfill does? If the number of OSDs with the newest
>>> object is more than min_size, do we need to wait for recovery?
>>
>> This isn't impossible to do, but it's *yet another* piece of metadata
>> we'd need to keep track of and account for in all the other recovery
>> and IO paths, so nobody's done it yet. Somebody would have to design
>> the algorithms, write the code, and persuade us the UX/performance
>> improvement is worth the ongoing maintenance burden. In particular,
>> note that any write-before-recovery needs to make sure it still obeys
>> the min_size rules, and I don't think any systems are set up to enable
>> that tracking separate from the acting set right now.
>>
>> I haven't looked closely at this code in a while so I don't know how
>> easy it would be to implement, or how high the bar for accepting such
>> a PR might be. It's not a bad thing to look at AFAIK, though! :)
>> -Greg

Let's call a modification of a degraded object a "degraded write". Of course,
such a write must obey the min_size rules, and the object must not be missing
on the primary either.
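
To make that condition concrete, here is a minimal Python sketch. It is not
Ceph code: Shard, can_degraded_write and the ">= min_size" comparison are
hypothetical names and assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical stand-in for one OSD's view of a PG."""
    osd: int
    missing: set = field(default_factory=set)  # names of objects this shard lacks


def can_degraded_write(obj, acting, primary, min_size):
    """Return (allowed, up_to_date_osds) for a write to obj without recovery.

    Assumed rules: the object must not be missing on the primary, and at
    least min_size acting shards must already hold the newest copy.
    """
    by_osd = {s.osd: s for s in acting}
    up_to_date = [s.osd for s in acting if obj not in s.missing]
    if obj in by_osd[primary].missing:
        return False, up_to_date
    return len(up_to_date) >= min_size, up_to_date


# Acting set [3, 1, 0] with obj1 missing on osd.0, as in the original question.
acting = [Shard(3), Shard(1), Shard(0, missing={"obj1"})]
allowed, targets = can_degraded_write("obj1", acting, primary=3, min_size=2)
```

With min_size = 2 the write would go to [3, 1] only; with min_size = 3 it would
still have to wait for recovery.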

Currently we carefully check the past intervals to find out whether an
object is unrecoverable. This works because an interval of OSDs is a promise
that any write in that interval was captured by all the shards of that interval.
But degraded writes break that rule, so there must be a remedy.

I think a possible approach is to record the locations where a degraded write
actually happened in the missing item of the degraded shard. The primary shard
could then gather the location sets of all degraded writes on all shards; I
think this would be done in the GetMissing stage of peering. (Maybe this is not
enough: currently we get missing from actingbackfill only, so unfound items
would need extra verification.) This would let us turn the PG incomplete when
some shards must be probed.
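
A rough Python sketch of this bookkeeping, under my own assumptions: it is not
the code in the PR, and MissingItem, gather_locations and must_probe are
hypothetical names. It also assumes that "must be probed" means that none of
the OSDs recorded for a degraded write is currently available.

```python
from dataclasses import dataclass


@dataclass
class MissingItem:
    """Hypothetical missing-set entry kept by a degraded shard."""
    need: int                           # version the shard still needs
    locations: frozenset = frozenset()  # OSDs that applied the degraded write


def gather_locations(missing_by_shard):
    """Primary side: union the recorded locations per object over all shards."""
    gathered = {}
    for missing in missing_by_shard.values():
        for obj, item in missing.items():
            gathered.setdefault(obj, set()).update(item.locations)
    return gathered


def must_probe(gathered, available_osds):
    """Objects whose recorded locations are all unavailable right now."""
    return {obj: locs for obj, locs in gathered.items()
            if locs and not (locs & available_osds)}


# osd.0 missed a degraded write to obj1 that was applied on osd.3 and osd.1.
missing_by_shard = {0: {"obj1": MissingItem(need=5, locations=frozenset({3, 1}))}}
gathered = gather_locations(missing_by_shard)
pg_incomplete = bool(must_probe(gathered, available_osds={0, 2}))
```

In this sketch the PG would be treated as incomplete only while neither osd.3
nor osd.1 is reachable; once one of them is back, the object has a known source.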

I know there is already the async recovery approach, but I think we could do
better.

Some discussion and a little code are at https://github.com/ceph/ceph/pull/22505.

Regards!

2018-06-01 0:02 GMT+08:00 zengran zhang <z13121369189@xxxxxxxxx>:
> 2018-05-31 6:57 GMT+08:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>
> Heartfelt thanks for the reply. We will remove the restrictions first
> and see what needs to be done next...
>
>>
>>
>>>
>>>     I see that the new async recovery feature moves osd.0 from acting to
>>> async_recover_target, keeping the acting set bigger than min_size, and
>>> osd.0 is chosen because it has more missing objects, so objects
>>> missing on the acting set still need to be recovered first...
>>>
>>>    best regards
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html