On Sat, May 25, 2013 at 11:06:08AM -0700, Samuel Just wrote:
> Hi, thanks for taking the time to try to get all this documented!
>
> Placement groups are assigned to a set of OSDs by crush.
>

Darn, silly me. I made a stupid mistake: I meant that CRUSH is the algorithm
and RADOS is the protocol.

> (4.1, osdmap(e 1)) --CRUSH--> [3,1,2]
>
> where the primary is 3.  When 3 dies, the osdmap is updated to reflect this
> and we get a new mapping for pg 4.1:
>
> (4.1, osdmap(e 2)) --CRUSH--> [1,2,4]
>
> Here, 1 and 2 already have up-to-date copies of 4.1.  osd 4, however, needs
> to be brought up to date.  During peering, osd 1 will learn that osd 4
> falls into 1 of 2 cases.
>
> Case 1 is that osd 4 already had an old copy of pg 4.1 AND its pg log for pg
> 4.1 happens to overlap osd 1's pg log for pg 4.1.  In that case, by running
> through the log of operations, we can determine exactly which objects need
> to be copied over.  We usually refer to this as just "recovery" (or log based
> recovery).
>
> In case 2, osd 4 either has no copy of pg 4.1 or its pg log does not overlap
> that of osd 1.  In this case, we cannot determine from the log which objects
> need to be copied over.  To bring osd 4 up to date, we therefore need to
> backfill.
>
> Backfill involves the primary and the backfill peer (there is only ever one in
> the acting set at a time, see PG::choose_acting) scanning over their pg stores
> and copying the objects which are different or missing from the primary to the
> backfill peer.  Because this may take a long time, we track a last_backfill
> attribute for each local pg copy indicating how far the local copy has been
> backfilled.  In the case that the copy is complete, last_backfill is
> hobject_t::max().
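
To check my own understanding of the two cases above, here is a rough Python
sketch. It is not Ceph code: the log representation (a list of
(version, object name) pairs), the overlap test and the names logs_overlap and
plan_catch_up are all my own simplifying assumptions, and it ignores divergent
log entries entirely.

```python
from typing import List, Optional, Tuple

# Hypothetical, simplified model: a pg log is a list of (version, object_name)
# entries in increasing version order. Real Ceph pg logs carry far more detail.
LogEntry = Tuple[int, str]


def logs_overlap(primary_log: List[LogEntry], peer_log: List[LogEntry]) -> bool:
    """Assumed overlap test: the peer's newest entry is still in the primary's log range."""
    if not primary_log or not peer_log:
        return False
    primary_tail = primary_log[0][0]   # oldest version the primary still logs
    peer_head = peer_log[-1][0]        # newest version the peer has seen
    return peer_head >= primary_tail


def plan_catch_up(
    primary_log: List[LogEntry], peer_log: Optional[List[LogEntry]]
) -> Tuple[str, Optional[List[str]]]:
    """Return ("recovery", objects_to_copy) for case 1, ("backfill", None) for case 2."""
    # Case 2: no old copy at all (peer_log is None) or no overlap.
    if peer_log is None or not logs_overlap(primary_log, peer_log):
        return "backfill", None
    # Case 1: replay the primary's log past the peer's newest version.
    peer_head = peer_log[-1][0]
    to_copy = sorted({name for version, name in primary_log if version > peer_head})
    return "recovery", to_copy


primary = [(5, "a"), (6, "b"), (7, "a"), (8, "c")]
print(plan_catch_up(primary, [(4, "x"), (5, "a"), (6, "b")]))
print(plan_catch_up(primary, [(1, "x"), (2, "y")]))
```

The point of the sketch is only that a short outage leaves the logs overlapping,
so the set of objects to copy falls out of the log, while a long outage (or no
copy at all) leaves nothing to compare and forces a full scan.
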
>
> More exactly, a local pg copy is described by a few pieces of information:
> 1) the local pg log
> 2) the local last_backfill
> 3) the local last_complete
> 4) the local missing set
> The local pg store reflects all updates up to version last_complete on all
> hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing
> set.  Comparing the pg logs is used to fill in the missing set for OSDs which
> were only down for a brief period, thus avoiding a costly backfill in many
> cases.
>
> This is a bit of a rough brain dump and may be somewhat misleading/wrong.
> I'll get it cleaned up and put it into
> doc/dev/osd_internals/pg_recovery.rst next week.
>
> Also, rados objects currently have three pieces:
> 1) data - read, write, writefull, etc.
> 2) xattrs
> 3) omap
> The omap is much like the xattrs except that it can generally store a much
> larger number of keys and support efficient scans.  It's used at the moment
> for a few things including rgw bucket indices.  The omap entries are copied
> over along with the rest of the object in recovery.  Behind the scenes, all
> omap entries for all objects stored on an OSD are stored prefixed in a single
> big leveldb instance.
>
> omap operations probably shouldn't be supported on objects in an
> ErasureCodedPG :)
> -Sam
>
> On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> >
> >
> > On 05/25/2013 04:48 PM, Leen Besselink wrote:
> >> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
> >>>
> >>>
> >>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
> >>> Hi Leen,
> >>>
> >>>> - a Ceph object can store keys/values, not just data
> >>>
> >>> I did not know that. Could you explain or give me the URL ?
> >>>
> >>
> >> Well, I got that impression from some of the earlier talks and from this blog post:
> >>
> >> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
> >>
> >> But I haven't read it in a while.
> >>
> >> But at this time I only see something like:
> >>
> >> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
> >>
> >> Which looks like it is storing it in filesystem attributes.
> >>
> >> So maybe an object can be a piece of data or a key/value store.
> >
> > Thanks for explaining: I did not know about the work of Eleanor Cawthon.
> > I knew about the objects' xattributes but I thought you meant that the data
> > inside of the object could be structured as key/value pairs. My bad :-)
> >
> > Cheers
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > All that is necessary for the triumph of evil is that good people do nothing.
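
PS: Sam's description of a local pg copy (pg log, last_backfill, last_complete,
missing set) quoted further up can be pictured with the small Python model
below. Everything in it is an assumption of mine for illustration: hoids are
plain strings, None stands in for hobject_t::max(), the class LocalPGCopy and
the function backfill_step are made-up names, and objects deleted on the primary
are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class LocalPGCopy:
    """Hypothetical model of one OSD's local copy of a pg (pg log omitted)."""
    store: Dict[str, int] = field(default_factory=dict)  # hoid -> version held
    last_backfill: Optional[str] = ""  # "" = nothing backfilled; None = max()
    last_complete: int = 0
    missing: Set[str] = field(default_factory=set)

    def reflects_last_complete(self, hoid: str) -> bool:
        """Invariant from the message: hoid < last_backfill AND hoid not missing."""
        below = self.last_backfill is None or hoid < self.last_backfill
        return below and hoid not in self.missing


def backfill_step(primary: Dict[str, int], peer: LocalPGCopy, batch: int = 2) -> None:
    """Copy the next `batch` objects in sorted order, then advance last_backfill."""
    if peer.last_backfill is None:
        return  # copy is already complete
    pending = sorted(h for h in primary if h >= peer.last_backfill)
    for hoid in pending[:batch]:
        # Copy only objects that are different or missing on the backfill peer.
        if peer.store.get(hoid) != primary[hoid]:
            peer.store[hoid] = primary[hoid]
    # Everything strictly below the new last_backfill has now been scanned.
    peer.last_backfill = pending[batch] if len(pending) > batch else None


primary_store = {"a": 3, "b": 1, "c": 2}
peer = LocalPGCopy(store={"a": 1})
backfill_step(primary_store, peer)
print(peer.last_backfill, peer.reflects_last_complete("a"), peer.reflects_last_complete("c"))
backfill_step(primary_store, peer)
print(peer.last_backfill, peer.store == primary_store)
```

Because last_backfill only moves forward after a batch has been copied, an
interrupted backfill can resume from where it stopped instead of rescanning the
whole pg, which is how I read the reason for tracking it.
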