Re: Ceph backfilling explained ( maybe )

Loic Dachary <loic@xxxxxxxxxxx> · Sun, 26 May 2013 13:45:16 +0200

Hi,

Although I have yet to fully understand the logic of placement group recovery (I'm eager to read Sam's doc/dev/osd_internals/pg_recovery.rst :-), I wrote down my understanding of backfilling: http://dachary.org/?p=2009 .

Cheers

On 05/25/2013 09:15 PM, Loic Dachary wrote:
> Hi !
> 
> On 05/25/2013 08:06 PM, Samuel Just wrote:
>> Hi, thanks for taking the time to try to get all this documented!
>>
>> Placement groups are assigned to a set of OSDs by crush.
>>
>> (4.1, osdmap(e 1)) --CRUSH--> [3,1,2]
>>
>> where the primary is 3.  When 3 dies, the osdmap is updated to reflect this
>> and we get a new mapping for pg 4.1:
>>
>> (4.1, osdmap(e 2)) --CRUSH--> [1,2,4]
>>
>> Here, 1 and 2 already have up-to-date copies of 4.1.  osd 4, however, needs
>> to be brought up to date.  During peering, osd 1 will learn that osd 4
>> falls into 1 of 2 cases.
>>
>> Case 1 is that osd 4 already had an old copy of pg 4.1 AND its pg log for pg
>> 4.1 happens to overlap osd 1's pg log for pg 4.1.  In that case, by running
>> through the log of operations, we can determine exactly which objects need
>> to be copied over.  We usually refer to this as just "recovery" (or log based
>> recovery).
>>
>> In case 2, either osd 4 has no copy of pg 4.1 at all, or its pg log does not
>> overlap that of osd 1.  In this case, we cannot determine from the log which
>> objects need to be copied over.  To bring osd 4 up to date, we therefore need
>> to backfill.
>>
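
The two cases above can be sketched roughly as follows. This is an illustrative Python model and not Ceph code: the function name choose_recovery and the summary of a pg log as a (tail, head) pair of integer versions are assumptions made for the example. In Ceph the versions are richer than plain integers and the decision is made during peering.

```python
from typing import Optional, Tuple

# Hypothetical model: a pg log is summarized by the (tail, head) versions it covers.
LogBounds = Tuple[int, int]


def choose_recovery(primary_log: LogBounds, peer_log: Optional[LogBounds]) -> str:
    """Return "log" when the peer can be caught up by replaying the primary's
    log, and "backfill" when the logs give no usable common point."""
    if peer_log is None:
        # The peer has no copy of the pg at all, so there is no log to compare.
        return "backfill"
    primary_tail, primary_head = primary_log
    _peer_tail, peer_head = peer_log
    # Assumed overlap test: the peer's newest entry is still covered by the
    # primary's (possibly trimmed) log.
    if primary_tail <= peer_head <= primary_head:
        return "log"
    return "backfill"


print(choose_recovery((100, 200), (50, 150)))  # overlapping logs
print(choose_recovery((100, 200), (10, 40)))   # peer is older than the primary's tail
```

With overlapping logs the model picks log based recovery. When the peer's log ends before the primary's log begins, or the peer has no copy, it falls back to backfill.
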
>> Backfill involves the primary and the backfill peer (there is only ever one in
>> the acting set at a time, see PG::choose_acting) scanning over their pg stores
>> and copying the objects which are different or missing from the primary to the
>> backfill peer.  Because this may take a long time, we track a last_backfill
>> attribute for each local pg copy indicating how far the local copy has been
>> backfilled.  In the case that the copy is complete, last_backfill is
>> hobject_t::max().
> 
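
As a rough illustration of the scan described above, here is a small Python model of a resumable backfill. It is a sketch under simplifying assumptions, not how the OSD implements it: stores are plain dicts mapping an object name to a version, the names backfill, budget and MAX are hypothetical, and last_backfill is treated as the name of the last object already processed.

```python
from typing import Dict, Optional

MAX = None  # hypothetical stand-in for hobject_t::max(): the copy is complete


def backfill(primary: Dict[str, int], peer: Dict[str, int],
             last_backfill: str = "", budget: Optional[int] = None) -> Optional[str]:
    """Walk the objects in sorted order, starting after last_backfill, and make
    the peer match the primary. Stop after `budget` objects to mimic an
    interruption. Return the new last_backfill, or MAX when the scan is done."""
    names = sorted(n for n in set(primary) | set(peer) if n > last_backfill)
    for done, name in enumerate(names):
        if budget is not None and done >= budget:
            return last_backfill          # interrupted: remember the progress
        if name not in primary:
            del peer[name]                # object no longer exists on the primary
        elif peer.get(name) != primary[name]:
            peer[name] = primary[name]    # missing or different: copy it over
        last_backfill = name
    return MAX


primary = {"a": 2, "b": 1, "c": 3}
peer = {"a": 1, "d": 1}
position = backfill(primary, peer, budget=2)        # interrupted after two objects
result = backfill(primary, peer, last_backfill=position)
print(position, result, peer == primary)
```

The second call resumes after the recorded position instead of rescanning from the start, which is the point of tracking last_backfill.
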
> Is it true that if two OSDs briefly disconnect while backfilling, they may end up in case 1 above (i.e. log based recovery) and then resume backfilling when that is done, starting from last_backfill and going up?
> 
>> More exactly, a local pg copy is described by a few pieces of information:
>> 1) the local pg log
> 
> pg_log_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1371
> pg_log_entry_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1277
> 
>> 2) the local last_backfill
> 
> pg_info_t::last_backfill https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1102
> 
>> 3) the local last_complete
> 
> pg_info_t::last_complete https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1089
> 
>> 4) the local missing set
> 
> pg_missing_t https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1468
> 
>> The local pg store reflects all updates up to version last_complete on all
> 
> I assume you mean 'local pg log' instead of 'local pg store'.
> 
>> hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing
>> set.  Comparing the pg logs is used to fill in the missing set for OSDs which
>> were only down for a brief period thus avoiding a costly backfill in many cases.
> 
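
The invariant quoted above, and the way a missing set can be filled in from the log, can be modelled like this. It is a hedged Python sketch: the helper names missing_from_log and reflects_updates are hypothetical, log entries are reduced to (version, object name) pairs, and object names are compared as plain strings.

```python
from typing import Dict, List, Tuple


def missing_from_log(primary_log: List[Tuple[int, str]],
                     peer_last_update: int) -> Dict[str, int]:
    """Map every object modified after peer_last_update to the newest
    version the peer needs for it."""
    missing: Dict[str, int] = {}
    for version, hoid in primary_log:
        if version > peer_last_update:
            missing[hoid] = max(version, missing.get(hoid, 0))
    return missing


def reflects_updates(hoid: str, last_backfill: str, missing: Dict[str, int]) -> bool:
    """True when the local copy of hoid is expected to be current:
    hoid < last_backfill AND hoid is not in the missing set."""
    return hoid < last_backfill and hoid not in missing


log = [(10, "a"), (11, "b"), (12, "a"), (13, "c")]
missing = missing_from_log(log, peer_last_update=11)
print(missing)
print(reflects_updates("b", "z", missing), reflects_updates("a", "z", missing))
```

Only the objects touched after the peer's last update end up in the missing set, so only those need to be copied, which is why a short outage does not require a full scan.
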
> The pg logs are trimmed ( https://github.com/ceph/ceph/blob/master/src/osd/PG.cc#L216 ). Is this why the pg logs of two OSDs that have been disconnected for too long are unlikely to overlap, and why a backfill is then required because the two pg logs cannot be compared?
> 
>> This is a bit of a rough brain dump and may be somewhat misleading/wrong.
> 
> It is very helpful as it is, thanks :-)
> 
>> I'll get it cleaned up and put it into
>> doc/dev/osd_internals/pg_recovery.rst next week.
>>
> 
> That would be great. 
> 
>> Also, rados objects currently have three pieces:
>> 1) data - read, write, writefull, etc.
>> 2) xattrs
>> 3) omap
>> The omap is much like the xattrs except that it can generally store a much
>> larger number of keys and support efficient scans.  It's used at the moment
>> for a few things including rgw bucket indices.  The omap entries are copied
>> over along with the rest of the object in recovery.  Behind the scenes, all
>> omap entries for all objects stored on an OSD are stored prefixed in a single
>> big leveldb instance.
>>
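
To make the "stored prefixed in a single big leveldb instance" idea concrete, here is a toy Python model. It is only a sketch: a dict stands in for leveldb, the class name PrefixedOmap is hypothetical, and the key encoding (object name, a NUL byte, then the omap key) is an assumption for the example and not Ceph's actual key format. It also assumes object names contain no NUL byte.

```python
from typing import Dict, Iterator, Tuple


class PrefixedOmap:
    """Omap entries of many objects kept in one sorted key/value space."""

    def __init__(self) -> None:
        self._kv: Dict[str, bytes] = {}  # stand-in for a single leveldb instance

    @staticmethod
    def _key(obj: str, key: str) -> str:
        # Hypothetical encoding: the object name acts as a prefix of the key.
        return f"{obj}\x00{key}"

    def set(self, obj: str, key: str, value: bytes) -> None:
        self._kv[self._key(obj, key)] = value

    def scan(self, obj: str) -> Iterator[Tuple[str, bytes]]:
        """Yield the omap entries of one object in key order."""
        prefix = self._key(obj, "")
        for full_key in sorted(self._kv):
            if full_key.startswith(prefix):
                yield full_key[len(prefix):], self._kv[full_key]


store = PrefixedOmap()
store.set("bucket.index", "k2", b"2")
store.set("bucket.index", "k1", b"1")
store.set("other", "k1", b"x")
print(list(store.scan("bucket.index")))
```

A real sorted store would seek to the prefix and stop at the first key outside it instead of sorting and filtering everything, which is what makes scans of a single object's omap efficient.
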
>> omap operations probably shouldn't be supported on objects in an
>> ErasureCodedPG :)
> 
> I thought omap / xattrs were mutually exclusive. I did not realize both could be used at the same time.
> 
> Cheers
> 
>> -Sam
>>
>> On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>>
>>> On 05/25/2013 04:48 PM, Leen Besselink wrote:
>>>> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
>>>>> Hi Leen,
>>>>>
>>>>>> - a Ceph object can store keys/values, not just data
>>>>>
>>>>> I did not know that. Could you explain or give me the URL ?
>>>>>
>>>>
>>>> Well, I got that impression from some of the earlier talks and from this blog post:
>>>>
>>>> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
>>>>
>>>> But I haven't read it in a while.
>>>>
>>>> But at this time I only see something like:
>>>>
>>>> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
>>>>
>>>> Which looks like it is storing it in filesystem attributes.
>>>>
>>>> So maybe an object can be a piece of data or a key/value store.
>>>
>>> Thanks for explaining: I did not know about the work of Eleanor Cawthon. I knew about the objects' extended attributes, but I thought you meant that the data inside the object could be structured as key/value pairs. My bad :-)
>>>
>>> Cheers
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> All that is necessary for the triumph of evil is that good people do nothing.
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
