Hi Teja,

Yes, this is by design. If you want to optimise IO on EC pools, you need to study the EC algorithms, because they spread IO on top of how OSDs do it. Specifically, they have their own chunk sizes. Apart from that, EC will always spread data over all OSDs to have proper parity information. If you want to improve throughput, here are the first things you should look at:

- always use a power of 2 for k (e.g. 2, 4, 8, 16). k=10 contains 5 as a factor, which will lead to tail chunks when reads/writes get split up into the power-of-2 buckets used to store data

- combat tail latencies by using a large m>=3 and enabling fast read on the pool. Your approach will not eliminate tail latencies, as it doesn't drop reads from slow OSDs. It will still expect that all OSDs contacted return data, including the slow ones. To eliminate tail latencies you need to issue reads to more OSDs than you need to reconstruct the data and use the data from the first k fast-responding OSDs to serve the request. With EC 8+3 and fast read, read requests are sent to 11 OSDs and the 8 fastest responders serve the read request.

You could try (and benchmark) these:

- use larger bluestore min_alloc_sizes than the default; note that this will only be good if you *never* modify large files (WORM workloads)

- try other EC algorithms that allow wide EC profiles like 8+3

- read the documentation for these EC profiles; their chunk sizes can be modified and this does have an influence on IOP/s and bandwidth

Do extensive use-case benchmarking. Don't use the built-in benchmarks, they are both unstable and unrealistic. Deploy your system in the way you will use it and use the API that clients will see for testing. Results are heavily dependent on the specific hardware and OSD config deployed. Hence, the only general recommendations that make some sense are the ones above. Everything else comes from reading all the documentation, finding what might apply to your use case, and testing, testing, testing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Teja A <tejaseattle@xxxxxxxxx>
Sent: 25 August 2022 01:25:27
To: ceph-users@xxxxxxx
Subject: Fwd: Erasure coded pools and reading ranges of objects.

Hello. I have been kicking the tires with Ceph using the librados API and observed some peculiar object access patterns when reading a portion of an object (as opposed to the whole object). First, I want to offer some background.

My use case requires that I use erasure coded pools and store large objects (100 MiB - 1 GiB) meant to be written/read sequentially for optimal performance. However, I also want to be able to read smaller ranges of these large objects (say 5-10 MiB at a time) efficiently (ranged reads). I am trying to figure out the optimal configuration for my erasure coded pool settings that results in the maximum possible read throughput for the cluster (as opposed to the lowest latency for any given ranged read). I hope to achieve this by minimizing the number of distinct OSDs that need to be read from for these ranged reads; this should minimize the number of disk seeks needed to return data (on these OSDs) and give better tail latencies for reads under high contention. Intuitively, I should be able to achieve this by using large chunk sizes.
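For reference, the kind of ranged read I am describing is roughly equivalent to the following minimal sketch (shown here with the Python rados bindings; the pool and object names are just placeholders):

    import rados

    # Connect to the cluster and open the EC data pool (placeholder pool name).
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("ec-data-pool")

    # Read a 5 MiB range of a large, previously written object,
    # starting at a 100 MiB offset (object name is a placeholder).
    MiB = 1024 * 1024
    chunk = ioctx.read("large-object-0", length=5 * MiB, offset=100 * MiB)
    print("read %d bytes" % len(chunk))

    ioctx.close()
    cluster.shutdown()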
Concretely, if my EC pool settings have k=10, m=3 and the stripe_unit (or chunk size) is set to 1 MiB, then reading the first 5 MiB range of a large object should only need to read from the first 5 OSDs that contain the object. However, I have observed (using blktrace on all the OSDs that make up the pool) that reads are being issued to all of the k=10 OSDs, and the amount of data being read on each OSD is equal to the chunk size. This seems weird because even though I only care about the first 5 MiB of data, which could be read back from the first 5 OSDs, rados seems to be issuing reads for the entire 10 MiB stripe. This can be wasteful under load.

So my question is whether this is by design. Specifically, is it a requirement that rados issue reads for an entire stripe even though only a portion of it is requested to be read? Is this behavior configurable?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx