Hi,
we are seeing a trend towards rather large RGW S3 buckets lately.
we've worked on
several clusters with 100 - 500 million objects in a single bucket,
and we've
been asked about the possibilities of buckets with several billion
objects more
than once.
From our experience: buckets with tens of million objects work just
fine with
no big problems usually. Buckets with hundreds of million objects
require some
attention. Buckets with billions of objects? "How about indexless
buckets?" -
"No, we need to list them".
A few stories and some questions:
1. The recommended number of objects per shard is 100k. Why? How was this
default configuration derived?
It doesn't really match my experiences. We know a few clusters running
with
larger shards because resharding isn't possible for various reasons at
the
moment. They sometimes work better than buckets with lots of shards.
So we've been considering to at least double that 100k target shard size
for large buckets, that would make the following point far less annoying.
2. Many shards + ordered object listing = lots of IO
Unfortunately telling people to not use ordered listings when they
don't really
need them doesn't really work as their software usually just doesn't
support
that :(
A listing request for X objects will retrieve up to X objects from
each shard
for ordering them. That will lead to quite a lot of traffic between
the OSDs
and the radosgw instances, even for relatively innocent simple queries
as X
defaults to 1000 usually.
Simple example: just getting the first page of a bucket listing with 4096
shards fetches around 1 GB of data from the OSD to return ~300kb or so
to the
S3 client.
I've got two clusters here that are only used for some relatively
low-bandwidth
backup use case here. However, there are a few buckets with hundreds
of millions
of objects that are sometimes being listed by the backup system.
The result is that this cluster has an average read IO of 1-2 GB/s,
all going
to the index pool. Not a big deal since that's coming from SSDs and
goes over
80 Gbit/s LACP bonds. But it does pose the question about scalability
as the user-
visible load created by the S3 clients is quite low.
3. Deleting large buckets
Someone accidentaly put 450 million small objects into a bucket and
only noticed
when the cluster ran full. The bucket isn't needed, so just delete it
and case
closed?
Deleting is unfortunately far slower than adding objects, also
radosgw-admin leaks
memory during deletion: https://tracker.ceph.com/issues/40700
Increasing --max-concurrent-ios helps with deletion speed (option does
effect
deletion concurrency, documentation says it's only for other specific
commands).
Since the deletion is going faster than new data is being added to
that cluster
the "solution" was to run the deletion command in a memory-limited
cgroup and
restart it automatically after it gets killed due to leaking.
How could the bucket deletion of the future look like? Would it be
possible
to put all objects in buckets into RADOS namespaces and implement some
kind
of efficient namespace deletion on the OSD level similar to how pool
deletions
are handled at a lower level?
4. Common prefixes could filtered in the rgw class on the OSD instead
of in radosgw
Consider a bucket with 100 folders with 1000 objects in each and only
one shard
/p1/1, /p1/2, ..., /p1/1000, /p2/1, /p2/2, ..., /p2/1000, ... /p100/1000
Now a user wants to list / with aggregating common prefixes on the
delimiter / and
wants up to 1000 results.
So there'll be 100 results returned to the client: the common prefixes
p1 to p100.
How much data will be transfered between the OSDs and radosgw for this
request?
How many omap entries does the OSD scan?
radosgw will ask the (single) index object to list the first 1000
objects. It'll
return 1000 objects in a quite unhelpful way: /p1/1, /p1/2, ....,
/p1/1000
radosgw will discard 999 of these and detect one common prefix and
continue the
iteration at /p1/\xFF to skip the remaining entries in /p1/ if there
are any.
The OSD will then return everything in /p2/ in that next request and
so on.
So it'll internally list every single object in that bucket. That will
be a problem
for large buckets and having lots of shards doesn't help either.
This shouldn't be too hard to fix: add an option "aggregate prefixes"
to the RGW
class method and duplicate the fast-forward logic from radosgw in
cls_rgw. It doesn't
even need to change the response type or anything, it just needs to
limit entries in
common prefixes to one result.
Is this a good idea or am I missing something?
IO would be reduced by a factor of 100 for that particular
pathological case. I've
unfortunately seen a real-world setup that I think hits a case like that.
Paul