Re: Ceph backfilling explained ( maybe )


Hi, thanks for taking the time to try to get all this documented!

Placement groups are assigned to a set of OSDs by CRUSH.

(4.1, osdmap(e 1)) --CRUSH--> [3,1,2]

where the primary is 3.  When 3 dies, the osdmap is updated to reflect this
and we get a new mapping for pg 4.1:

(4.1, osdmap(e 2)) --CRUSH--> [1,2,4]

Here, 1 and 2 already have up-to-date copies of 4.1.  osd 4, however, needs
to be brought up to date.  During peering, osd 1 (now the primary) will learn
that osd 4 falls into one of two cases.

Case 1 is that osd 4 already had an old copy of pg 4.1 AND its pg log for pg
4.1 happens to overlap osd 1's pg log for pg 4.1.  In that case, by running
through the log of operations, we can determine exactly which objects need
to be copied over.  We usually refer to this as just "recovery" (or log-based
recovery).

In case 2, either osd 4 has no copy of pg 4.1 at all, or its pg log does not
overlap that of osd 1.  In this case, we cannot determine from the log which
objects need to be copied over.  To bring osd 4 up to date, we therefore need
to backfill.
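
As a rough illustration of the two cases, here is a minimal Python sketch.  It
is not the Ceph implementation: versions are plain integers, the log is a list
of (version, object name) pairs, and plan_catch_up, log_tail and
peer_last_update are hypothetical names made up for this example.  It also
ignores divergent log entries.

```python
def plan_catch_up(primary_log, log_tail, peer_last_update):
    """Decide how to bring a peer up to date (toy model, not Ceph code).

    primary_log: list of (version, object_name), ascending by version.
    log_tail: the primary's log covers versions in (log_tail, newest].
    peer_last_update: newest version the peer applied, or None if the
        peer has no copy of the pg at all.
    """
    # Case 2: no copy, or the peer is older than anything the log covers.
    if peer_last_update is None or peer_last_update < log_tail:
        return ("backfill", set())
    # Case 1: the logs overlap, so replaying the newer entries tells us
    # exactly which objects the peer is missing.
    missing = {name for version, name in primary_log
               if version > peer_last_update}
    return ("recovery", missing)


if __name__ == "__main__":
    log = [(11, "a"), (12, "b"), (13, "a"), (14, "c")]
    print(plan_catch_up(log, 10, 12))    # peer briefly down: log based recovery
    print(plan_catch_up(log, 10, None))  # peer never had the pg: backfill
```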

Backfill involves the primary and the backfill peer (there is only ever one in
the acting set at a time, see PG::choose_acting) scanning over their pg stores
and copying the objects which are different or missing from the primary to the
backfill peer.  Because this may take a long time, we track a last_backfill
attribute for each local pg copy indicating how far the local copy has been
backfilled.  In the case that the copy is complete, last_backfill is
hobject_t::max().
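
To make the role of last_backfill concrete, here is a small Python sketch of a
resumable scan.  Everything in it is an assumption for illustration: each pg
store is a dict from object name to version, names sort as plain strings, MAX
stands in for hobject_t::max(), and backfill_step is a hypothetical helper.
Removing stale objects from the peer is not modelled.

```python
MAX = object()  # stand-in for hobject_t::max()


def backfill_step(primary, peer, last_backfill, max_objects):
    """Scan up to max_objects objects and return the new last_backfill.

    Objects named < last_backfill are treated as already backfilled.
    Start with last_backfill = "" (nothing backfilled yet).
    """
    if last_backfill is MAX:
        return MAX  # the copy is already complete
    # Objects not yet covered by last_backfill, in scan order.
    pending = sorted(name for name in primary if name >= last_backfill)
    for name in pending[:max_objects]:
        # Copy only what is missing or different on the backfill peer.
        if peer.get(name) != primary[name]:
            peer[name] = primary[name]
    if len(pending) <= max_objects:
        return MAX  # scanned everything
    # The next unscanned object becomes the new (exclusive) bound.
    return pending[max_objects]


if __name__ == "__main__":
    primary = {"a": 3, "b": 1, "c": 2, "d": 5}
    peer = {"a": 3, "b": 0}
    progress = ""
    while progress is not MAX:
        progress = backfill_step(primary, peer, progress, max_objects=2)
    print(peer == primary)
```

If the scan is interrupted between steps, it can resume from the stored
last_backfill instead of starting over.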

More exactly, a local pg copy is described by a few pieces of information:
1) the local pg log
2) the local last_backfill
3) the local last_complete
4) the local missing set
The local pg store reflects all updates up to version last_complete on all
hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing
set.  Comparing the pg logs is used to fill in the missing set for OSDs which
were only down for a brief period, thus avoiding a costly backfill in many cases.
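
The invariant above can be written down as a small predicate.  This is only a
sketch in Python with made-up names (LocalPGCopy, reflects_last_complete); it
assumes object names compare as plain strings and uses MAX as a stand-in for
hobject_t::max().

```python
from dataclasses import dataclass, field

MAX = object()  # stand-in for hobject_t::max()


@dataclass
class LocalPGCopy:
    """The four pieces of state describing a local pg copy (toy model)."""
    last_backfill: object          # an object name, or MAX when complete
    last_complete: int             # version the store is known to reflect
    missing: set = field(default_factory=set)
    log: list = field(default_factory=list)  # (version, object_name) pairs


def reflects_last_complete(copy, hoid):
    """True if the local store holds hoid as of version last_complete."""
    backfilled = copy.last_backfill is MAX or hoid < copy.last_backfill
    return backfilled and hoid not in copy.missing


if __name__ == "__main__":
    copy = LocalPGCopy(last_backfill="m", last_complete=14, missing={"c"})
    for hoid in ("a", "c", "x"):
        print(hoid, reflects_last_complete(copy, hoid))
```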

This is a bit of a rough brain dump and may be somewhat misleading/wrong.
I'll get it cleaned up and put it into doc/dev/osd_internals/pg_recovery.rst
next week.

Also, rados objects currently have three pieces:
1) data - read, write, writefull, etc.
2) xattrs
3) omap
The omap is much like the xattrs except that it can generally store a much
larger number of keys and support efficient scans.  It's used at the moment
for a few things including rgw bucket indices.  The omap entries are copied
over along with the rest of the object in recovery.  Behind the scenes, all
omap entries for all objects stored on an OSD are stored prefixed in a single
big leveldb instance.
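
To illustrate the idea of keeping every object's omap entries in one ordered
key/value store under a per-object prefix, here is a toy Python model.  It does
not use leveldb and does not reflect the real key encoding, which is not
described here; PrefixedOmapStore and its methods are hypothetical, and the
prefix is simply the object name used as the first element of a tuple key.

```python
import bisect


class PrefixedOmapStore:
    """One sorted key space shared by all objects (toy model)."""

    def __init__(self):
        self._keys = []    # sorted list of (object_name, key)
        self._values = {}  # (object_name, key) -> value

    def set(self, obj, key, value):
        full = (obj, key)
        if full not in self._values:
            bisect.insort(self._keys, full)  # keep the key space ordered
        self._values[full] = value

    def scan(self, obj, start=""):
        """Return obj's (key, value) pairs in key order, from start onward."""
        i = bisect.bisect_left(self._keys, (obj, start))
        out = []
        # All keys for one object are adjacent, so a scan is a range read.
        while i < len(self._keys) and self._keys[i][0] == obj:
            full = self._keys[i]
            out.append((full[1], self._values[full]))
            i += 1
        return out


if __name__ == "__main__":
    store = PrefixedOmapStore()
    store.set("bucket.index", "b.jpg", "2")
    store.set("bucket.index", "a.jpg", "1")
    store.set("other.object", "z", "9")
    print(store.scan("bucket.index"))
```

Because the keys for one object sit next to each other in the ordered store, a
scan touches only that object's range, which is what makes large omaps
practical for things like bucket indices.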

omap operations probably shouldn't be supported on objects in an
ErasureCodedPG :)
-Sam

On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>
> On 05/25/2013 04:48 PM, Leen Besselink wrote:
>> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
>>>
>>>
>>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
>>> Hi Leen,
>>>
>>>> - a Ceph object can store keys/values, not just data
>>>
>>> I did not know that. Could you explain or give me the URL ?
>>>
>>
>> Well, I got that impression from some of the earlier talks and from this blog post:
>>
>> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
>>
>> But I haven't read it in a while.
>>
>> But at this time I only see something like:
>>
>> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
>>
>> Which looks like it is storing it in filesystem attributes.
>>
>> So maybe an object can be a piece of data or a key/value store.
>
> Thanks for explaining: I did not know about the works of Eleanor Cawthon. I knew about the objects xattributes but I thought you meant that the data inside of the object could be structured as key/value pairs. My bad :-)
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>



