Re: raw rados listing with seek to marker

On Sat, 3 Dec 2016, Yehuda Sadeh-Weinraub wrote:
> rgw uses raw rados listing when doing certain operations. For example
> when listing metadata entries (these aren't being indexed). When doing
> a full metadata sync the rgw at the secondary zone sends a request to
> the master to get a list of all the metadata entries (of a specific
> section), so that it can build an internal index for these entries so
> that it can fetch them.
> One of the main issues here is that since the metadata listing relies
> on raw rados objects listing, the operation cannot be paged.
> I've looked at the nobjects listing code, and this is what I
> understand happening:
>  - we iterate through all the pgs in the pool, one by one and in order
>  - for each pg we're at we read a chunk of the next X (where X is 1024)
> entries via rados pgls operation
>  - for each chunk rados returns a cookie, which is some kind of a
> descriptor that points at either the current, or the next chunk that
> can be used to fetch the next chunk
> 
> rados itself with the nobjects api provides some seek mechanism, but
> it's pretty rudimentary and only seeks by rounding down to the current
> pg. I was looking at introducing a marker for nobjects listing (see:
> https://github.com/yehudasa/ceph/commits/wip-18079), but there are a
> few points I'm not completely sure about:
>  - I was using cookie and current_pg as the marker, but that only
> points to the current chunk that was just read (but not necessarily
> consumed completely). Is there a way to generate a cookie that would
> point at any entry within the current chunk?

The real marker is this:

    collection_list_handle_t cookie;

in Objecter::NListContext, which is defined as

    typedef hobject_t collection_list_handle_t;

In principle, we could use the hobject_t to iterate over the whole pool,
but the mapping onto PGs is fragmented when sortbitwise isn't set, so we
can't really remove current_pg until we drop support for pre-jewel
clusters and !sortbitwise in the client.

>  - Is the order of entries guaranteed within the chunk? E.g., if I know
> that the last cookie was C, and the last object we saw was O, can we
> request C again, and skip to object O and not miss any entries created
> before the original operation started? (it is ok to miss entries
> created after the original operation started, as we're going to get
> these via a separate log)

The OSD sets C either as hobject_t::get_max() (end of pg) or the next
object that it didn't return as part of the result set.  If you are
listing objects and get back A, B, and C with D as the 'next', and a C+1
is created before you continue, you'll "miss" it, but that's generally
okay since it was created during (not before) your listing.
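
To make the cookie semantics above concrete, here is a toy model of one
PG, not the real OSD code.  Plain strings stand in for hobject_t, an
at_end flag stands in for hobject_t::get_max(), and ToyPg, Chunk and
list_chunk are names made up for this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for one PG: object names kept in sorted order.
// (Hypothetical type; real PGs order hobject_t, not plain strings.)
struct ToyPg {
    std::set<std::string> objects;
};

// Result of one listing call: the entries returned plus the cookie.
// at_end == true plays the role of hobject_t::get_max() (end of pg).
struct Chunk {
    std::vector<std::string> entries;
    bool at_end = true;
    std::string next;  // meaningful only when at_end is false
};

// List up to max_entries objects that sort at or after `start`.
Chunk list_chunk(const ToyPg& pg, const std::string& start,
                 std::size_t max_entries) {
    assert(max_entries > 0);
    Chunk out;
    auto it = pg.objects.lower_bound(start);
    while (it != pg.objects.end() && out.entries.size() < max_entries) {
        out.entries.push_back(*it);
        ++it;
    }
    if (it != pg.objects.end()) {
        // Cookie is the first object that was NOT returned.
        out.at_end = false;
        out.next = *it;
    }
    return out;
}
```

With objects A..E and a chunk size of 3, the first call returns A, B, C
with D as the cookie.  An object named "C1" inserted before the second
call sorts ahead of D, so resuming from D skips it, while everything
that existed before the listing started is still returned.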

>  - is there any other field that we need to keep for the marker other
> than these two?

The hobject_t alone should be enough.  When you set the marker, just
calculate current_pg from that.  That means that getting/setting the
marker won't work properly for non-sortbitwise OSDs, but I think that's
okay--starting with kraken we *require* that sortbitwise is set before
allowing new OSDs to start, so jewel is the last version that can ever
have !sortbitwise.  You can put a check in rgw that verifies that the
flag is set before starting...
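
As a rough sketch of "calculate current_pg from that": assuming the
marker carries the object's 32-bit placement hash, the PG number can be
folded out of it with the same rule as ceph_stable_mod().  The helper
names pg_mask_for and pg_for_hash are made up here, and this ignores
how the Objecter actually stores and walks the hash:

```cpp
#include <cassert>
#include <cstdint>

// Smallest (2^n - 1) mask that covers pg_num - 1
// (hypothetical helper mirroring the pool's pg_num_mask).
std::uint32_t pg_mask_for(std::uint32_t pg_num) {
    assert(pg_num > 0);
    std::uint32_t mask = 0;
    while (mask < pg_num - 1) {
        mask = (mask << 1) | 1;
    }
    return mask;
}

// Same folding rule as ceph_stable_mod(): use the full mask when the
// result is a valid PG, otherwise fall back to one bit less.
std::uint32_t stable_mod(std::uint32_t x, std::uint32_t b,
                         std::uint32_t bmask) {
    if ((x & bmask) < b) {
        return x & bmask;
    }
    return x & (bmask >> 1);
}

// Map a placement hash to a PG number within a pool of pg_num PGs.
std::uint32_t pg_for_hash(std::uint32_t hash, std::uint32_t pg_num) {
    return stable_mod(hash, pg_num, pg_mask_for(pg_num));
}
```

For a pool with 12 PGs the mask is 15, so a hash ending in 13 does not
name a valid PG and folds down to 13 & 7 = 5, while a hash ending in 11
maps straight to PG 11.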

>  - what to do with the old object listing api? the code internally
> defaults to using it, so if implementing seek only for nobjects it
> won't work by default. I'm not sure we want to implement it for the
> legacy listing infrastructure.

Hmm... I think we should implement it for the non-"n" variant too (I 
don't think it's necessarily legacy; users may be enumerating 
within a namespace).

sage