Hello. I have been kicking the tires with Ceph using the librados API and observed some peculiar object access patterns when reading a portion of an object (as opposed to the whole object). First, some background. My use case requires erasure coded pools and large objects (100 MiB - 1 GiB) that are written and read sequentially for optimal performance. However, I also want to be able to efficiently read smaller ranges of these large objects (say 5-10 MiB at a time), i.e. ranged reads.

I am trying to figure out the erasure coded pool settings that give the maximum possible read throughput for the cluster (as opposed to the lowest latency for any single ranged read). I hope to achieve this by minimizing the number of distinct OSDs that each ranged read has to touch, which should minimize the number of disk seeks needed on those OSDs to return the data; that way I should get better tail latencies for reads under high contention. Intuitively, I should be able to achieve this by using large chunk sizes.

Concretely, if my EC pool has k=10, m=3 and the stripe_unit (chunk size) is set to 1 MiB, then reading the first 5 MiB of a large object should only need to read from the first 5 OSDs that hold the object's chunks. However, I have observed (using blktrace on all the OSDs backing the pool) that reads are issued to all k=10 OSDs, and the amount of data read on each OSD is equal to the chunk size. This seems odd: even though I only care about the first 5 MiB, which could be served from the first 5 OSDs, RADOS appears to issue reads for the entire 10 MiB stripe. This can be wasteful under load.

So my question is whether this is by design. Specifically, is it a requirement that RADOS read an entire stripe even when only a portion of it is requested? Is this behavior configurable?
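
For reference, this is roughly how I am issuing the ranged reads from my test client. It is a minimal sketch using the librados C API; the pool name, object name, and client name are just placeholders for my test setup:

/*
 * Minimal sketch of how the ranged reads are issued (librados C API).
 * Pool/object/client names below are placeholders for my test setup.
 */
#include <rados/librados.h>
#include <stdio.h>
#include <stdlib.h>

#define RANGE_LEN (5 * 1024 * 1024UL)   /* 5 MiB ranged read */

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    char *buf = malloc(RANGE_LEN);
    int ret;

    ret = rados_create2(&cluster, "ceph", "client.admin", 0);
    if (ret < 0) { fprintf(stderr, "rados_create2: %d\n", ret); return 1; }

    rados_conf_read_file(cluster, NULL);   /* default ceph.conf locations */

    ret = rados_connect(cluster);
    if (ret < 0) { fprintf(stderr, "rados_connect: %d\n", ret); return 1; }

    /* "ec-test-pool" is the EC pool with k=10, m=3, 1 MiB chunks */
    ret = rados_ioctx_create(cluster, "ec-test-pool", &io);
    if (ret < 0) { fprintf(stderr, "rados_ioctx_create: %d\n", ret); return 1; }

    /* Read only the first 5 MiB of a large (~1 GiB) object: offset 0, length 5 MiB */
    ret = rados_read(io, "large-object-0", buf, RANGE_LEN, 0);
    if (ret < 0)
        fprintf(stderr, "rados_read: %d\n", ret);
    else
        printf("read %d bytes\n", ret);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    free(buf);
    return ret < 0 ? 1 : 0;
}

It is ranged reads like the one above that blktrace shows fanning out to all 10 data OSDs rather than just the first 5.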