Hi, thanks for taking the time to try to get all this documented!

Placement groups are assigned to a set of OSDs by CRUSH:

(4.1, osdmap(e 1)) --CRUSH--> [3,1,2]

where the primary is 3. When 3 dies, the osdmap is updated to reflect this and we get a new mapping for pg 4.1:

(4.1, osdmap(e 2)) --CRUSH--> [1,2,4]

Here, 1 and 2 already have up-to-date copies of 4.1. osd 4, however, needs to be brought up to date. During peering, osd 1 will learn that osd 4 falls into one of two cases.

Case 1: osd 4 already had an old copy of pg 4.1 AND its pg log for pg 4.1 happens to overlap osd 1's pg log for pg 4.1. In that case, by running through the log of operations, we can determine exactly which objects need to be copied over. We usually refer to this as just "recovery" (or log based recovery).

Case 2: either osd 4 never had a copy of pg 4.1, or its pg log does not overlap that of osd 1. In this case, we cannot determine from the log which objects need to be copied over. To bring osd 4 up to date, we therefore need to backfill. Backfill involves the primary and the backfill peer (there is only ever one in the acting set at a time, see PG::choose_acting) scanning over their pg stores and copying any objects that are missing or different on the backfill peer from the primary to the peer. Because this may take a long time, we track a last_backfill attribute for each local pg copy indicating how far the local copy has been backfilled. Once the copy is complete, last_backfill is hobject_t::max().

More exactly, a local pg copy is described by a few pieces of information:

1) the local pg log
2) the local last_backfill
3) the local last_complete
4) the local missing set

The local pg store reflects all updates up to version last_complete on all hobject_ts hoid such that hoid < last_backfill AND hoid is not in the missing set. Comparing the pg logs is used to fill in the missing set for OSDs which were only down for a brief period, thus avoiding a costly backfill in many cases.

This is a bit of a rough brain dump and may be somewhat misleading/wrong. I'll get it cleaned up and put it into doc/dev/osd_internals/pg_recovery.rst next week.

Also, rados objects currently have three pieces:

1) data - read, write, writefull, etc.
2) xattrs
3) omap

The omap is much like the xattrs except that it can generally store a much larger number of keys and supports efficient scans. It's used at the moment for a few things including rgw bucket indices. The omap entries are copied over along with the rest of the object in recovery. Behind the scenes, all omap entries for all objects stored on an OSD are stored prefixed in a single big leveldb instance. omap operations probably shouldn't be supported on objects in an ErasureCodedPG :)

A few rough sketches below might make some of this more concrete; they're simplifications using made-up types, not the actual Ceph code.
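First, the mapping step. This toy stand-in for CRUSH (emphatically not the real algorithm; every name here is invented) only illustrates that the pg -> [osds] mapping is a deterministic function of the pg id and the current osdmap, so every OSD that sees a new epoch derives the same new acting set:

// toy_crush.cc: NOT the real CRUSH algorithm, just an illustration that
// the pg -> [osds] mapping is a pure function of the pg id and the
// osdmap, so a new epoch yields a new acting set everywhere.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::vector<int> toy_crush(const std::string& pgid,
                           const std::set<int>& up_osds,  // from the osdmap
                           size_t replicas) {
  std::vector<int> osds(up_osds.begin(), up_osds.end());
  std::vector<int> acting;
  for (uint64_t r = 0; acting.size() < replicas && r < 1000; ++r) {
    size_t h = std::hash<std::string>{}(pgid + "/" + std::to_string(r));
    int candidate = osds[h % osds.size()];
    if (std::find(acting.begin(), acting.end(), candidate) == acting.end())
      acting.push_back(candidate);  // acting[0] becomes the primary
  }
  return acting;
}

int main() {
  std::set<int> epoch1 = {1, 2, 3, 4};  // osdmap e1
  std::set<int> epoch2 = {1, 2, 4};     // osdmap e2: osd 3 died
  for (int osd : toy_crush("4.1", epoch1, 3)) std::cout << osd << ' ';
  std::cout << '\n';
  for (int osd : toy_crush("4.1", epoch2, 3)) std::cout << osd << ' ';
  std::cout << '\n';
}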
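Next, the case 1 / case 2 decision. A hypothetical simplification (the real pg_log_t, pg_missing_t, and eversion_t are much richer than this) of why log overlap is the deciding factor: with overlap we can enumerate exactly which objects the peer is missing; without it we can only backfill.

// Simplified sketch of recovery-vs-backfill; the real structures
// (pg_log_t, pg_missing_t, eversion_t) are stand-ins here.
#include <cstdint>
#include <optional>
#include <set>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t version;   // stand-in for eversion_t
  std::string hoid;   // object modified by this update
};

// Case 1: the peer's log overlaps ours, so replaying our entries newer
// than the peer's head yields its missing set exactly. Case 2: no
// overlap, so we return nullopt, meaning the peer must be backfilled.
std::optional<std::set<std::string>>
missing_from_logs(const std::vector<LogEntry>& primary_log,  // oldest first
                  uint64_t peer_head) {                      // peer's newest version
  if (primary_log.empty() || peer_head < primary_log.front().version)
    return std::nullopt;  // peer fell behind our retained log: backfill
  std::set<std::string> missing;
  for (const auto& e : primary_log)
    if (e.version > peer_head)
      missing.insert(e.hoid);  // an update the peer never saw
  return missing;
}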
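Backfill itself, very roughly (again invented names; the real code batches the scans over ranges and throttles the copies):

// Rough backfill sketch: walk both stores in hobject_t sort order, copy
// anything missing or different over to the peer, and advance
// last_backfill as we go so an interrupted backfill can resume. A real
// hobject_t is not a string; it's a stand-in here.
#include <cstdint>
#include <map>
#include <string>

void backfill_round(const std::map<std::string, uint64_t>& primary_store,
                    std::map<std::string, uint64_t>& peer_store,
                    std::string& last_backfill,  // last hoid fully backfilled
                    bool& complete,
                    size_t max_copies_per_round) {
  size_t copied = 0;
  for (const auto& [hoid, version] : primary_store) {
    if (hoid <= last_backfill) continue;         // handled in an earlier round
    if (copied == max_copies_per_round) return;  // resume from last_backfill
    auto it = peer_store.find(hoid);
    if (it == peer_store.end() || it->second != version) {
      peer_store[hoid] = version;  // "copy" the object primary -> peer
      ++copied;
    }
    last_backfill = hoid;  // everything <= hoid now matches the primary
  }
  complete = true;  // stands in for last_backfill = hobject_t::max()
}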
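The invariant tying last_backfill, last_complete, and the missing set together, written as a predicate over the same stand-in types:

// Which objects does a local pg copy reflect at version last_complete?
// Exactly those below last_backfill and not in the missing set.
#include <cstdint>
#include <set>
#include <string>

struct LocalPgCopy {
  // 1) the local pg log is omitted from this sketch
  std::string last_backfill;      // 2) how far backfill has progressed
  bool backfill_complete;         //    i.e. last_backfill == hobject_t::max()
  uint64_t last_complete;         // 3)
  std::set<std::string> missing;  // 4) filled in by comparing pg logs
};

bool reflects_last_complete(const LocalPgCopy& pg, const std::string& hoid) {
  bool below_backfill = pg.backfill_complete || hoid < pg.last_backfill;
  return below_backfill && pg.missing.count(hoid) == 0;
}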
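And on the omap point: the librados C++ API exposes omap through ObjectWriteOperation/ObjectReadOperation. A minimal untested sketch, with the pool name "mypool" and object name "myobject" as placeholders:

// Minimal librados C++ omap example (untested sketch; error handling
// mostly omitted, pool/object names are placeholders).
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");            // connect as client.admin
  cluster.conf_read_file(nullptr);  // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  cluster.ioctx_create("mypool", ioctx);

  // Set a couple of omap keys on the object.
  std::map<std::string, librados::bufferlist> kv;
  kv["key1"].append("value1");
  kv["key2"].append("value2");
  librados::ObjectWriteOperation wr;
  wr.omap_set(kv);
  ioctx.operate("myobject", &wr);

  // Scan them back; unlike xattrs, this scales to many keys.
  std::map<std::string, librados::bufferlist> vals;
  int rval = 0;
  librados::ObjectReadOperation rd;
  rd.omap_get_vals("", 100, &vals, &rval);  // start_after "", up to 100 keys
  ioctx.operate("myobject", &rd, nullptr);
  for (const auto& [k, v] : vals)
    std::cout << k << " -> " << std::string(v.c_str(), v.length()) << "\n";

  cluster.shutdown();
}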
-Sam

On Sat, May 25, 2013 at 10:37 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>
>
> On 05/25/2013 04:48 PM, Leen Besselink wrote:
>> On Sat, May 25, 2013 at 04:27:16PM +0200, Loic Dachary wrote:
>>>
>>>
>>> On 05/25/2013 02:33 PM, Leen Besselink wrote:
>>> Hi Leen,
>>>
>>>> - a Ceph object can store keys/values, not just data
>>>
>>> I did not know that. Could you explain or give me the URL ?
>>>
>>
>> Well, I got that impression from some of the earlier talks and from this blog post:
>>
>> http://ceph.com/community/my-first-impressions-of-ceph-as-a-summer-intern/
>>
>> But I haven't read it in a while.
>>
>> But at this time I only see something like:
>>
>> http://ceph.com/docs/master/rados/api/librados/?highlight=rados_getxattr#rados_getxattr
>>
>> Which looks like it is storing it in filesystem attributes.
>>
>> So maybe an object can be a piece of data or a key/value store.
>
> Thanks for explaining: I did not know about the works of Eleanor Cawthon. I knew about the object's xattributes but I thought you meant that the data inside of the object could be structured as key/value pairs. My bad :-)
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.