Maintaining write performance under a steady intake of small objects

Florian Haas <florian@xxxxxxxxxxx> · Mon, 24 Apr 2017 19:52:38 +0200

Hi everyone,

so this will be a long email — it's a summary of several off-list
conversations I've had over the last couple of weeks, but the TL;DR
version is this question:

How can a Ceph cluster maintain near-constant performance
characteristics while supporting a steady intake of a large number of
small objects?

This is probably a very common problem, but we have a bit of a dearth of
truly adequate best practices for it. To clarify, what I'm talking about
is an intake on the order of millions of objects per hour. That might sound like a
lot, but if you consider an intake of 700 objects/s at 20 KiB/object,
that's just 14 MB/s. That's not exactly hammering your cluster — but it
amounts to 2.5 million objects created per hour.
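
As a quick sanity check of those numbers, here is a minimal sketch. It
is plain arithmetic with nothing Ceph-specific in it, and the function
name is made up for illustration:

```python
# Back-of-the-envelope check of the intake figures quoted above.
OBJECTS_PER_SECOND = 700
OBJECT_SIZE_KIB = 20


def intake_summary(objects_per_second, object_size_kib):
    """Return (bandwidth in decimal MB/s, objects created per hour)."""
    bytes_per_second = objects_per_second * object_size_kib * 1024
    megabytes_per_second = bytes_per_second / 1_000_000
    objects_per_hour = objects_per_second * 3600
    return megabytes_per_second, objects_per_hour


mb_per_s, per_hour = intake_summary(OBJECTS_PER_SECOND, OBJECT_SIZE_KIB)
print(f"{mb_per_s:.1f} MB/s, {per_hour:,} objects/hour")
# prints: 14.3 MB/s, 2,520,000 objects/hour
```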

Under those circumstances, two things tend to happen:

(1) There's a predictable decline in insert bandwidth. In other words, a
cluster that may allow inserts at a rate of 2.5M/hr rapidly goes down to
1.8M/hr and then 1.7M/hr ... and by "rapidly" I mean hours, not days. As
I understand it, this is mainly due to the FileStore's propensity to
index whole directories with a readdir() call, which is a linear-time
operation.

(2) FileStore's mitigation strategy for this is to proactively split
directories so they never get so large as for readdir() to become a
significant bottleneck. That's fine, but in a cluster with a steadily
growing number of objects, that tends to lead to lots and lots of
directory splits happening simultaneously, causing inserts to slow to a
crawl.
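
To illustrate why the splits in (2) tend to bunch up, here is a toy
model, not a description of FileStore internals. It assumes objects hash
uniformly across a fixed set of leaf directories, and it assumes the
split rule given in the Ceph documentation, as far as I know it: a
directory splits once it holds more than
filestore_split_multiple * abs(filestore_merge_threshold) * 16 files.
The values 2 and 10 and the directory count are illustrative. The model
only records when each directory first crosses the threshold; it does
not model the child directories that a split creates.

```python
import random


def first_split_points(num_dirs, split_threshold, seed=0):
    """Insert objects into uniformly chosen directories until every
    directory has exceeded split_threshold once. Returns, per directory,
    the total insert count at which it first crossed the threshold."""
    rng = random.Random(seed)
    counts = [0] * num_dirs
    crossed_at = [None] * num_dirs
    remaining = num_dirs
    inserts = 0
    while remaining:
        d = rng.randrange(num_dirs)
        inserts += 1
        counts[d] += 1
        if crossed_at[d] is None and counts[d] > split_threshold:
            crossed_at[d] = inserts
            remaining -= 1
    return crossed_at


# Assumed values, for illustration only.
SPLIT_MULTIPLE = 2
MERGE_THRESHOLD = 10
SPLIT_THRESHOLD = SPLIT_MULTIPLE * abs(MERGE_THRESHOLD) * 16  # 320 files
NUM_DIRS = 256

points = first_split_points(NUM_DIRS, SPLIT_THRESHOLD)
nominal = NUM_DIRS * SPLIT_THRESHOLD  # inserts needed if fill were perfectly even
print(f"first split after {min(points) / nominal:.0%} of nominal fill")
print(f"last split after  {max(points) / nominal:.0%} of nominal fill")
```

Because every directory fills at nearly the same rate, all first splits
land in a fairly narrow window around the nominal fill level rather than
being spread evenly over the life of the pool.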

For (2) there is a workaround: we can initialize a pool with an expected
number of objects, set a pool max_objects quota, and disable on-demand
splitting altogether by setting a negative filestore merge threshold.
That way, all splitting occurs at pool creation time, and before another
split would happen, you hit the pool quota. So you never hit that
brick wall caused by the thundering herd of directory splits. Of course,
it also means that when you want to insert yet more objects, you need
another pool — but you can handle that at the application level.
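
A rough sketch of that workaround follows. It only builds the command
lines and does not run them. The pool name, PG count, object count and
CRUSH rule name are hypothetical, setting the quota equal to the
expected object count is my assumption, and the positional argument
order of "ceph osd pool create" is as I understand the CLI, so please
check it against your release. It also assumes that "filestore merge
threshold" has already been set to a negative value in ceph.conf on the
OSDs.

```python
import shlex


def presplit_pool_commands(pool, pg_num, expected_objects,
                           crush_rule="replicated_ruleset"):
    """Build (but do not run) the ceph CLI calls for a pre-split pool.

    Assumes 'filestore merge threshold' is already negative in ceph.conf
    on the OSDs, so that directories are split at pool creation time.
    The default crush_rule name is an assumption; use your own.
    """
    n = str(expected_objects)
    create = ["ceph", "osd", "pool", "create", pool,
              str(pg_num), str(pg_num),      # pg_num and pgp_num
              "replicated", crush_rule, n]   # n = expected_num_objects
    quota = ["ceph", "osd", "pool", "set-quota", pool, "max_objects", n]
    return [create, quota]


# Hypothetical pool name and sizes, for illustration only.
commands = presplit_pool_commands("smallobjs-001", 1024, 100_000_000)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```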

It's actually a bit of a dilemma: we want directory splits to happen
proactively, so that readdir() doesn't slow things down, but then we
also *don't* want them to happen, because while they do, inserts flatline.

(2) will likely be killed off completely by BlueStore, because there are
no more directories, hence nothing to split.

For (1) there really isn't a workaround that I'm aware of for FileStore.
And at least preliminary testing shows that BlueStore clusters suffer
from similar, if not the same, performance degradation (although, to be
fair, I haven't yet seen tests under the above parameters with rocksdb
and WAL on NVMe hardware).

For (1) however I understand that there would be a potential solution in
FileStore itself, by throwing away Ceph's own directory indexing and
just relying on flat directory lookups, which should be logarithmic-time
operations in both btrfs and XFS, as both use B-trees for directory
indexing. But I understand that that would be a fairly massive operation
that looks even less attractive to undertake with BlueStore around the
corner.

One suggestion that has been made (credit to Greg) was to do object
packing, i.e. bunch up a lot of discrete data chunks into a single RADOS
object. But in terms of distribution and lookup logic that would have to
be built on top, that seems weird to me (CRUSH on top of CRUSH to find
out which RADOS object a chunk belongs to, or some such?)
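
To make the packing idea a bit more tangible, here is a minimal
in-memory sketch. It does not talk to RADOS at all: the bytearrays stand
in for RADOS objects, the class name and the 4 MiB pack size are
assumptions, and it uses an explicit index rather than a hash-based
placement. That index is exactly the extra lookup layer in question, and
a real implementation would have to store it somewhere durable.

```python
class ChunkPacker:
    """Append small chunks to large 'pack' buffers and remember where
    each chunk went. Purely illustrative; no librados involved."""

    def __init__(self, pack_size_limit=4 * 1024 * 1024):
        self.pack_size_limit = pack_size_limit
        self.packs = [bytearray()]  # stand-ins for RADOS objects
        self.index = {}             # chunk name -> (pack number, offset, length)

    def put(self, name, data):
        current = self.packs[-1]
        # Start a new pack when this chunk would overflow a non-empty one.
        if current and len(current) + len(data) > self.pack_size_limit:
            self.packs.append(bytearray())
        pack_no = len(self.packs) - 1
        offset = len(self.packs[pack_no])
        self.packs[pack_no] += data
        self.index[name] = (pack_no, offset, len(data))

    def get(self, name):
        pack_no, offset, length = self.index[name]
        return bytes(self.packs[pack_no][offset:offset + length])


# One second's worth of the intake above: 700 chunks of 20 KiB each.
packer = ChunkPacker()
for i in range(700):
    packer.put(f"chunk-{i}", bytes([i % 256]) * (20 * 1024))

print(f"{len(packer.index)} chunks in {len(packer.packs)} packs")
# prints: 700 chunks in 4 packs
```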

So I'm hoping for the likes of Wido and Dan and Mark to have some
alternate suggestions here: what's your take on this? Do you have
suggestions for people with a constant intake of small objects?

Looking forward to hearing your thoughts.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com