Re: rados pool object listing suckage

It occurs to me that, in the non-recovery case, we are relying on the
readdir() guarantee that each object will be returned at most once. If
that is so, why not just create a dangling symlink or something in
the PG directory to represent the objects that are in recovery at the
moment? Because we know readdir() works, we know this will work.
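
Here is a rough sketch of what I mean, in Python for brevity rather than
the OSD's C++. The directory layout, the helper names (mark_missing,
recover, list_pg) and the placeholder symlink target are all made up for
illustration; this is not how the OSD actually stores objects.

```python
import os
import tempfile

# Hypothetical placeholder target; it never exists, so the link dangles.
MISSING_TARGET = "/nonexistent/missing-object"

def mark_missing(pg_dir, name):
    """Create a dangling symlink standing in for an object still in recovery."""
    os.symlink(MISSING_TARGET, os.path.join(pg_dir, name))

def recover(pg_dir, name, data):
    """Write the recovered object, then rename it over the placeholder."""
    tmp = os.path.join(pg_dir, "." + name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(data)
    # rename() replaces the symlink itself; it does not follow it.
    os.replace(tmp, os.path.join(pg_dir, name))

def list_pg(pg_dir):
    """List every object name, whether it is on disk or still missing."""
    names = []
    with os.scandir(pg_dir) as entries:
        for entry in entries:
            if entry.name.startswith("."):
                continue  # skip in-flight temp files
            names.append(entry.name)
    return sorted(names)

if __name__ == "__main__":
    pg = tempfile.mkdtemp()
    open(os.path.join(pg, "obj_a"), "wb").close()
    mark_missing(pg, "obj_b")
    print(list_pg(pg))           # both names, one of them a dangling link
    recover(pg, "obj_b", b"payload")
    print(list_pg(pg))           # same names after recovery
```

The point is that an object has the same directory entry name before and
after recovery, so the listing code only ever has to look at one
directory.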

sincerely,
Colin


On Thu, May 12, 2011 at 11:18 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 12 May 2011, Colin McCabe wrote:
>> On Wed, May 11, 2011 at 2:57 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > The ability to list objects in rados pools is something we originally
>> > added to round out the librados interface and to support the requirements
>> > of radosgw (S3 and swift both let you list objects in buckets).  The
>> > current interface is stateless.  You iterate over the PGs in the pool, and
>> > for each one, request up to N objects.  The result has up to N results and
>> > (optionally) a cookie to continue and get more objects for that pg.
>> > Currently, that cookie is a readdir offset from the directory on the
>> > server.
>> >
>> > The problem is that we currently list objects based only on what we have
>> > on disk.  If the PG is degraded and we're recovering, we miss any objects
>> > that aren't yet copied locally.
>> >
>> > Because the 'list all objects in a pool' operation consists of many
>> > smaller requests, and there is no state to associate them, it's hard to
>> > solve this one correctly.  The osd _knows_ which objects are missing at
>> > all times, but it needs to include those in the results in a sane way.
>> >
>> > My original thought was to use the low 2^63 values of the offset/cookie to
>> > represent any (potentially) missing objects and use that as an index into
>> > the missing set, and then use the high values to represent the readdir
>> > offset.  That's easy enough to do (and I just did), but because recovery
>> > is progressing while the client is doing the series of list-objects calls,
>> > you can get duplicates (client sees an object while it was missing, and
>> > also sees it later when it was on disk).  Conversely, if a pg starts out
>> > as not degraded, we may return no missing objects, and later return an
>> > incomplete set of on-disk objects because the primary changed.
>> >
>> > This approach can be hacked if the client restarts the whole process (for
>> > a given pg) when the primary changes (because, for any given pg mapping,
>> > the missing set will also shrink or stay the same).  That will work, but
>> > will potentially return some objects twice.
>>
>> Can the client keep track of all the objects it has already seen? Then
>> it could eliminate dups from the result it gets from the server.
>>
>> You can't track this information on the server because of the obvious
>> DDOS exploit, but it seems like the client could keep around this
>> state.
>
> So radosgw does that (it has to sort the full list alphanumerically, even,
> blech!).  Objecter and librados don't though, because otherwise any user
> would incur the overhead of doing the filter step (which involves building
> a set<object_t> and is expensive for a very large pool).
>
> I guess barring anything super clever we haven't thought of, I'm inclined
> to leave dedup as the Objecter/librados user's responsibility.  That's
> what I committed yesterday.
>
> This is hard because we're sharding the pool contents across multiple PGs
> on different nodes for scalability.  I assume S3 is also sharding bucket
> contents across multiple nodes, but they always return the object list
> alphanumerically sorted (!), so there must be a separate index stored
> centrally somewhere.  That architectural difference makes meeting their
> API requirement in radosgw expensive.
>
> sage
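
For reference, the client-side dedup discussed above might look roughly
like the following. This is only a sketch: list_pg_page is a
hypothetical callback standing in for one "list up to N objects in this
PG" request, not a real librados call, and the cookie is treated as an
opaque value that is None when the PG has been fully listed.

```python
def list_pool(pgs, list_pg_page, max_entries=100):
    """Yield each object name once, even if the server returns duplicates."""
    seen = set()  # grows with the pool; this is the cost Sage mentions
    for pg in pgs:
        cookie = None
        while True:
            # Hypothetical request: returns (names, next_cookie).
            names, cookie = list_pg_page(pg, cookie, max_entries)
            for name in names:
                if name not in seen:
                    seen.add(name)
                    yield name
            if cookie is None:
                break
```

The seen set is exactly the per-pool state that makes this expensive for
very large pools, which is the argument for keeping it out of Objecter
and librados and leaving it to callers that need it.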