Hi Teja,

Yes, this is by design. If you want to optimise IO on EC pools, you need to study the EC algorithms, because they spread IO on top of how OSDs do it. Specifically, they have their own chunk sizes. Apart from that, EC will always spread data over all OSDs to have proper parity information. If you want to improve throughput, here are the first things you should look at:

- always use a power of 2 for k (e.g. 2, 4, 8, 16). k=10 contains 5 as a factor, which will lead to tail chunks when reads/writes get split up into the power-of-2 buckets used to store data

- combat tail latencies by using a large m>=3 and enabling fast read on the pool. Your approach will not eliminate tail latencies, as it doesn't drop reads from slow OSDs. It will still expect that all OSDs contacted return data, including the slow ones. To eliminate tail latencies you need to issue reads to more OSDs than you need to reconstruct the data and use the data from the first k fast-responding OSDs to serve the request. With EC 8+3 and fast read, read requests are sent to 11 OSDs and the 8 fastest responders serve the read request.

You could try (and benchmark) these:

- use larger bluestore min_alloc_sizes than the default; note that this will only be good if you *never* modify large files (WORM workloads)

- try other EC algorithms that allow wide EC profiles like 8+3

- read the documentation for these EC profiles; their chunk sizes can be modified and this does have an influence on IOP/s and bandwidth

Do extensive use-case benchmarking. Don't use the built-in benchmarks, they are both unstable and unrealistic. Deploy your system in the way you will use it and use the API that clients will see for testing. Results are heavily dependent on the specific hardware and OSD config deployed. Hence, the only general recommendations that make some sense are the ones above. Everything else comes from reading all the documentation, finding what might apply to your use case, and testing, testing, testing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Teja A <tejaseattle@xxxxxxxxx>
Sent: 25 August 2022 01:25:27
To: ceph-users@xxxxxxx
Subject: Fwd: Erasure coded pools and reading ranges of objects.

Hello. I have been kicking the tires with Ceph using the librados API and observed some peculiar object access patterns when reading a portion of an object (as opposed to the whole object). First, I want to offer some background.

My use case requires that I use erasure coded pools and store large objects (100 MiB - 1 GiB) meant to be written/read sequentially for optimal performance. However, I also want to be able to read smaller ranges of these large objects (say 5-10 MiB at a time) efficiently (ranged reads). I am trying to figure out the optimal configuration for my erasure coded pool settings that results in the maximum possible read throughput for the cluster (as opposed to the lowest latency for any given ranged read). I hope to achieve this by minimizing the number of distinct OSDs that need to be read from for these ranged reads; this should minimize the number of disk seeks needed to return data (on these OSDs) and give better tail latencies for reads under high contention. Intuitively, I should be able to achieve this by using large chunk sizes.
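For reference, the kind of ranged read I am describing is roughly equivalent to the following minimal sketch (shown here with the Python rados bindings; the pool and object names are just placeholders):

    import rados

    # Connect to the cluster and open the EC data pool (placeholder pool name).
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("ec-data-pool")

    # Read a 5 MiB range of a large, previously written object,
    # starting at a 100 MiB offset (object name is a placeholder).
    MiB = 1024 * 1024
    chunk = ioctx.read("large-object-0", length=5 * MiB, offset=100 * MiB)
    print("read %d bytes" % len(chunk))

    ioctx.close()
    cluster.shutdown()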
Concretely, if my EC pool settings have k=10, m=3 and the stripe_unit (or chunk size) is set to 1 MiB, then reading the first 5 MiB range of a large object should only need to read from the first 5 OSDs that contain the object. However, I have observed (using blktrace on all the OSDs that make up the pool) that reads are being issued to all of the k=10 OSDs, and the amount of data being read on each OSD is equal to the chunk size. This seems weird because even though I only care about the first 5 MiB of data, which could be read back from the first 5 OSDs, rados seems to be issuing reads for the entire 10 MiB stripe. This can be wasteful under load.

So my question is whether this is by design. Specifically, is it a requirement that rados issue reads for an entire stripe even though only a portion of it is requested to be read? Is this behavior configurable?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx