Re: Ceph backfilling explained ( maybe )

On Sat, May 25, 2013 at 11:06:08AM -0700, Samuel Just wrote:
> Hi, thanks for taking the time to try to get all this documented!
> 
> Placement groups are assigned to a set of OSDs by crush.
> 

Darn, silly me.

I made a stupid mistake: I meant that CRUSH is the algorithm and RADOS is the protocol.

> (4.1, osdmap(e 1)) --CRUSH--> [3,1,2]
> 
> where the primary is 3.  When 3 dies, the osdmap is updated to reflect this
> and we get a new mapping for pg 4.1:
> 
> (4.1, osdmap(e 2)) --CRUSH--> [1,2,4]
> 
> Here, 1 and 2 already have up-to-date copies of 4.1.  osd 4, however, needs
> to be brought up to date.  During peering, osd 1 will learn that osd 4
> falls into one of two cases.
> 
> Case 1 is that osd 4 already had an old copy of pg 4.1 AND its pg log for pg
> 4.1 happens to overlap osd 1's pg log for pg 4.1.  In that case, by running
> through the log of operations, we can determine exactly which objects need
> to be copied over.  We usually refer to this as just "recovery" (or log based
> recovery).
> 
> In case 2, either osd 4 has no old copy of pg 4.1 or its pg log does not
> overlap that of osd 1.  In this case, we cannot determine from the log which
> objects need to be copied over.  To bring osd 4 up to date, we therefore need
> to backfill.
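
To check my understanding of the two cases, here is a minimal Python sketch. It is not Ceph code: PGLogRange, choose_repair and the overlap test are names and simplifications I made up, and versions are plain integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PGLogRange:
    """Hypothetical stand-in for a pg log: it covers versions (tail, head]."""
    tail: int
    head: int


def choose_repair(primary_log: PGLogRange, peer_log: Optional[PGLogRange]) -> str:
    """Return "recovery" if the peer can be repaired from the log, else "backfill"."""
    if peer_log is None:
        # The peer never had a copy of this pg, so there is no log to compare.
        return "backfill"
    if peer_log.head >= primary_log.tail:
        # Assumed overlap test: the primary's log still reaches back to the
        # last version the peer saw, so replaying it lists every changed object.
        return "recovery"
    # The peer is older than anything the primary's log remembers.
    return "backfill"


print(choose_repair(PGLogRange(100, 200), PGLogRange(50, 150)))  # overlapping logs
print(choose_repair(PGLogRange(100, 200), PGLogRange(10, 80)))   # disjoint logs
print(choose_repair(PGLogRange(100, 200), None))                 # no old copy
```

The sketch ignores the situation where the peer's log is ahead of the primary's.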
> 
> Backfill involves the primary and the backfill peer (there is only ever one in
> the acting set at a time, see PG::choose_acting) scanning over their pg stores
> and copying the objects which are different or missing from the primary to the
> backfill peer.  Because this may take a long time, we track a last_backfill
> attribute for each local pg copy indicating how far the local copy has been
> backfilled.  In the case that the copy is complete, last_backfill is
> hobject_t::max().
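
A rough Python sketch of how I picture the backfill scan and the last_backfill marker. Everything here is an assumption for illustration: stores are plain dicts of object name to version, HOBJECT_MAX is a stand-in for hobject_t::max(), and backfill_step is a made-up name. Removing objects the primary no longer has is my guess, not something stated above.

```python
HOBJECT_MAX = "\U0010ffff"  # stand-in for hobject_t::max(); sorts after ordinary names


def backfill_step(primary, peer, last_backfill, batch=2):
    """Scan the next `batch` objects after last_backfill and return the new marker."""
    # Names on either side that have not been scanned yet, in sorted order.
    pending = sorted(n for n in set(primary) | set(peer) if n > last_backfill)
    for name in pending[:batch]:
        if name not in primary:
            del peer[name]               # assumption: drop stale objects
        elif peer.get(name) != primary[name]:
            peer[name] = primary[name]   # missing or different: copy from primary
    if len(pending) <= batch:
        return HOBJECT_MAX               # nothing left to scan: copy is complete
    return pending[batch - 1]            # everything up to this name is backfilled


primary = {"a": 1, "b": 2, "c": 3}
peer = {"a": 1, "b": 1, "z": 9}
marker = ""                              # nothing backfilled yet
while marker != HOBJECT_MAX:
    marker = backfill_step(primary, peer, marker)
print(peer == primary)
```

Because the marker only moves forward in sorted order, an interrupted backfill could resume from last_backfill instead of starting over.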
> 
> More exactly, a local pg copy is described by a few pieces of information:
> 1) the local pg log
> 2) the local last_backfill
> 3) the local last_complete
> 4) the local missing set
> The local pg store reflects all updates up to version last_complete on all
> hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing
> set.  The pg logs are compared to fill in the missing set for OSDs which
> were only down for a brief period, thus avoiding a costly backfill in many cases.
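
The invariant above, and the way a missing set could be built from the log, can be sketched like this. This is a simplification under my own assumptions: the log is a list of (version, hoid) pairs with integer versions, and missing_from_log and is_current are hypothetical names.

```python
def missing_from_log(primary_log, peer_last_update):
    """Objects the peer lacks, mapped to the newest version it needs.

    primary_log: list of (version, hoid) pairs in increasing version order.
    """
    missing = {}
    for version, hoid in primary_log:
        if version > peer_last_update:
            missing[hoid] = version  # later entries overwrite earlier ones
    return missing


def is_current(hoid, last_backfill, missing):
    """True if the local store holds hoid up to last_complete, per the rule above."""
    return hoid < last_backfill and hoid not in missing


log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
missing = missing_from_log(log, peer_last_update=2)
print(missing)
print(is_current("b", last_backfill="m", missing=missing))
print(is_current("a", last_backfill="m", missing=missing))
```

Only the entries newer than the peer's last update are looked at, which is why this is much cheaper than scanning the whole pg store.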
> 
> This is a bit of a rough brain dump and may be somewhat misleading/wrong.
> I'll get it cleaned up and put it into doc/dev/osd_internals/pg_recovery.rst
> next week.
> 
> Also, rados objects currently have three pieces:
> 1) data - read, write, writefull, etc.
> 2) xattrs
> 3) omap
> The omap is much like the xattrs except that it can generally store a much
> larger number of keys and supports efficient scans.  It's used at the moment
> for a few things including rgw bucket indices.  The omap entries are copied
> over along with the rest of the object in recovery.  Behind the scenes, all
> omap entries for all objects stored on an OSD are stored prefixed in a single
> big leveldb instance.
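
To picture the "stored prefixed" part, here is a toy Python sketch with a dict playing the role of the single key/value store. The key layout (object id, a NUL byte, then the omap key) and the names omap_key and omap_scan are purely my assumptions, not the actual on-disk format.

```python
def omap_key(object_id, key):
    """Build the store key for one omap entry (hypothetical layout)."""
    return f"{object_id}\x00{key}"


def omap_scan(store, object_id):
    """Return one object's omap entries in key order.

    A real ordered store would seek to the prefix; sorting a dict here only
    imitates that behaviour.
    """
    prefix = f"{object_id}\x00"
    return [(k[len(prefix):], v) for k, v in sorted(store.items())
            if k.startswith(prefix)]


store = {}
store[omap_key("obj1", "color")] = "red"
store[omap_key("obj1", "size")] = "10"
store[omap_key("obj10", "color")] = "blue"
print(omap_scan(store, "obj1"))
```

The separator keeps "obj1" from matching the entries of "obj10", which a bare string prefix would do.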
> 
> omap operations probably shouldn't be supported on objects in an
> ErasureCodedPG :)
> -Sam
> 
> On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> >
> >
> > On 05/25/2013 04:48 PM, Leen Besselink wrote:
> >> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
> >>>
> >>>
> >>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
> >>> Hi Leen,
> >>>
> >>>> - a Ceph object can store keys/values, not just data
> >>>
> >>> I did not know that. Could you explain or give me the URL ?
> >>>
> >>
> >> Well, I got that impression from some of the earlier talks and from this blog post:
> >>
> >> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
> >>
> >> But I haven't read it in a while.
> >>
> >> But at this time I only see something like:
> >>
> >> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
> >>
> >> Which looks like it is storing it in filesystem attributes.
> >>
> >> So maybe an object can be a piece of data or a key/value store.
> >
> > Thanks for explaining: I did not know about the work of Eleanor Cawthon. I knew about object xattrs, but I thought you meant that the data inside of the object could be structured as key/value pairs. My bad :-)
> >
> > Cheers
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do nothing.
> >