Re: Any concerns using EC with CLAY in Quincy (or Pacific)?

Hi Sean,

My use of EC is specifically for slow, bulk storage. I did test jerasure
some years ago, but I don't think I kept my results, and I'm having issues
today reaching arxiv.org, which had the relevant papers…  I wanted to
reduce disk usage primarily and network I/O secondarily, and in my case I
preferred the reduced disk I/O of CLAY. I recall running a bunch of
scenarios for specific values of k and m in small clusters.

https://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC) <-- I did not compare
https://docs.ceph.com/en/quincy/rados/operations/erasure-code-clay/ <-- has a comparison with LRC

In actual practice, I have no problems running a variety of interactive
services on it, so I ended up using it for cephfs. I use simple replicas
for IOPS-sensitive applications.

plugin=clay
k=4
m=2
d=5

This is about as small as is practical. I'm using an OSD failure domain due
to the physical layout of OSDs per node (some larger, some smaller). In
practice this increases the likelihood of data going offline due to a host
failure, but it is an acceptable level of risk for this application.
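
In case it's useful, here's a rough sketch of how a profile like that can be
created and attached to a pool (the profile and pool names below are just
examples, not what I actually use):

    # Define the CLAY profile with the values above; failure domain is osd
    ceph osd erasure-code-profile set clay_k4_m2 \
        plugin=clay k=4 m=2 d=5 \
        crush-failure-domain=osd

    # Create an EC pool backed by that profile
    ceph osd pool create bulk_ec erasure clay_k4_m2

    # EC pools need overwrites enabled before CephFS (or RBD) can use them
    ceph osd pool set bulk_ec allow_ec_overwrites true

Note that d can only be set when the profile is created, and with k=4, m=2
the valid range is 5 to 5 anyway (k+1 through k+m-1).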

My $.02,
Jeremy

On Wed, Nov 16, 2022 at 6:47 PM Sean Matheny <sean.matheny@xxxxxxxxxxx>
wrote:

> Hi Jeremy,
>
> Thanks for the feedback, and good to know that clay has been stable for
> you. Would you mind sharing what your motivation was going with clay? Was
> it for the recovery tail performance of clay versus jerasure, or some other
> reason(s)? Did you happen to do any benchmarking of clay vs jerasure (either
> in normal write and read, or in recovery scenarios)?
>
> Ngā mihi,
>
> Sean Matheny
> HPC Cloud Platform DevOps Lead
> New Zealand eScience Infrastructure (NeSI)
>
> e: sean.matheny@xxxxxxxxxxx
>
> On 12/11/2022, at 9:43 AM, Jeremy Austin <jhaustin@xxxxxxxxx> wrote:
>
> I'm running 16.2.9 and have been using clay for 3 or 4 years. I can't
> speak to your scale, but I have had no long-term reliability problems at
> small scale, including one or two hard power-down scenarios. (Alaska power
> is not too great! Not so much a grid as a very short stepladder.)
>
> On Thu, Oct 20, 2022 at 12:05 PM Sean Matheny <sean.matheny@xxxxxxxxxxx>
> wrote:
>
>> Hi all,
>>
>> We've deployed a new cluster on Quincy 17.2.3 with 260x 18TB spinners
>> across 11 chassis that will be used exclusively as an S3 store for the
>> next year or so. 100Gb per chassis shared by both cluster and public
>> networks, NVMe DB/WAL, 32 physical cores @ 2.3GHz base, 192GB chassis RAM
>> (per 24 OSDs).
>>
>> We're looking to use the clay EC plugin for our RGW (data) pool, as it
>> appears to require fewer reads in recovery, which might be beneficial. I'm
>> going to be benchmarking recovery scenarios ahead of production, but that
>> of course doesn't give a view on longer-term reliability. :)  Has anyone
>> heard of any bad experiences, or any reason not to use it over jerasure?
>> Any reason to use cauchy-good instead of reed-solomon for the use case
>> above?
>>
>>
>> Ngā mihi,
>>
>> Sean Matheny
>> HPC Cloud Platform DevOps Lead
>> New Zealand eScience Infrastructure (NeSI)
>>
>> e: sean.matheny@xxxxxxxxxxx
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
>
> --
> Jeremy Austin
> jhaustin@xxxxxxxxx
>
>
>

-- 
Jeremy Austin
jhaustin@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



