Re: CephFS with erasure coding, do I need a cache-pool?

I think the P4600 should be fine, although 2 TB is probably way overkill for 15 OSDs. 


Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the WAL and DB get filled up at 2GB/10GB each. Our newer nodes use the Intel Optane 900P 480GB, which is actually faster than the P4600, significantly cheaper in our country (we bought ~100 OSD nodes recently and that was a big saving), and rated for a high 10 DWPD. For NLSAS OSDs, even the older P3700 is more than enough, but for our flash OSDs the Optane 900P performs a lot better. It's about 2x faster than the P3700 we had and allows us to get more out of our flash drives. 
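
In case it's useful, here is a minimal sketch of how the WAL/DB can be
placed on separate NVMe partitions with ceph-volume. The device names,
partition layout and sizes below are only examples (using the 2GB/10GB
split mentioned above), not our exact setup:

    # carve a 2 GB WAL and a 10 GB DB partition per OSD out of the NVMe
    # (/dev/nvme0n1 and /dev/sdb are example devices)
    sgdisk -n 0:0:+2G  -c 0:"osd0-wal" /dev/nvme0n1
    sgdisk -n 0:0:+10G -c 0:"osd0-db"  /dev/nvme0n1

    # create the BlueStore OSD with data on the HDD and WAL/DB on the NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb \
        --block.wal /dev/nvme0n1p1 --block.db /dev/nvme0n1p2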


From: Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
Sent: Wednesday, 18 July 2018 12:00:14 PM
To: Linh Vu; ceph-users
Subject: Re: [ceph-users] CephFS with erasure coding, do I need a cache-pool?
 
Thanks, Linh!

A question regarding the choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any rule of thumb for how many 4k IOPS one
should have for WAL+DB per OSD?

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for the WAL. We wanted
to use them for the DB too - only to learn that, while
fast, they're just too small to hold the DBs of several
OSDs ... so I hope a "regular" NVMe is fast enough?
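
(As an aside: if I understand correctly, one can check how much of the DB
and WAL an OSD actually uses via the BlueFS perf counters - a minimal
sketch, "osd.0" is just an example ID:

    # on the OSD host, dump the perf counters and look at the "bluefs" section
    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 20 '"bluefs"'

db_used_bytes / db_total_bytes and wal_used_bytes / wal_total_bytes show
the current usage; a non-zero slow_used_bytes means the DB has spilled
over onto the slow data device.)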

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good HDD-price to server-price ratio.
But it can only fit a single NVMe drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2 PCIe slot on the mainboard).


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
> On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and
> DBs (we accept the risk of that single card failing, since it is low, and
> our failure domain is host anyway). Each OSD (16 per host) gets 2GB of WAL
> and 10GB of DB.
>
>
> On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
> and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL
> and 40GB of DB.
>
>
> On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such
> special allocation. 😊
>
>
> Cheers,
>
> Linh
>
>
> ------------------------------------------------------------------------
> *From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
> *Sent:* Tuesday, 17 July 2018 11:39:26 PM
> *To:* Linh Vu; ceph-users
> *Subject:* Re: [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> Dear Linh,
>
> another question, if I may:
>
> How do you handle Bluestore WAL and DB, and
> how much SSD space do you allocate for them?
>
>
> Cheers,
>
> Oliver
>
>
> On 17.07.2018 08:55, Linh Vu wrote:
> > Hi Oliver,
> >
> >
> > We have several CephFS on EC pool deployments; one has been in production
> > for a while, the others are about to go in, pending all the BlueStore+EC
> > fixes in 12.2.7 😊
> >
> >
> > Firstly, as John and Greg have said, you don't need an SSD cache pool at all.
> >
> >
> > Secondly, regarding k/m, it depends on how many hosts or racks you have,
> > and how many failures you want to tolerate.
> >
> >
> > For our smallest pool with only 8 hosts in 4 different racks and 2
> > different pairs of switches (note: we consider switch failure more
> > common than rack cooling or power failure), we're using 4:2 with failure
> > domain = host. We currently use this for SSD scratch storage for HPC.
> >
> >
> > For one of our larger pools, with 24 hosts over 6 different racks and 6
> > different pairs of switches, we're using 4:2 with failure domain = rack.
> >
> >
> > For another pool with similar host count but not spread over so many
> > pairs of switches, we're using 6:3 and failure domain = host.
> >
> >
> > Also keep in mind that higher values of k and m may give you more
> > throughput but increase latency, especially for small files, so it also
> > depends on how important performance is and what file sizes you
> > store on your CephFS.
> >
> >
> > Cheers,
> >
> > Linh
> >
> > ------------------------------------------------------------------------
> > *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
> > Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
> > *Sent:* Sunday, 15 July 2018 9:46:16 PM
> > *To:* ceph-users
> > *Subject:* [ceph-users] CephFS with erasure coding, do I need a
> cache-pool?
> > Dear all,
> >
> > we're planning a new Ceph cluster, with CephFS as the
> > main workload, and would like to use erasure coding to
> > use the disks more efficiently. Access pattern will
> > probably be more read- than write-heavy, on average.
> >
> > I don't have any practical experience with erasure-
> > coded pools so far.
> >
> > I'd be glad for any hints / recommendations regarding
> > these questions:
> >
> > * Is an SSD cache pool recommended/necessary for
> > CephFS on an erasure-coded HDD pool (using Ceph
> > Luminous and BlueStore)?
> >
> > * What are good values for k/m for erasure coding in
> > practice (assuming a cluster of about 300 OSDs), to
> > make things robust and ease maintenance (ability to
> > take a few nodes down)? Is k/m = 6/3 a good choice?
> >
> > * Will it be sufficient to have k+m racks (or, more
> > generally, k+m failure domains)?
> >
> >
> > Cheers and thanks for any advice,
> >
> > Oliver
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
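
P.S. For anyone following along, a minimal sketch of the EC-backed CephFS
data pool setup discussed above (k=6/m=3 and failure domain = host are just
the example values from this thread; the filesystem name, pool name, PG
count and mount path are placeholders):

    # erasure-code profile with k=6, m=3 and host as the failure domain
    ceph osd erasure-code-profile set ec63 k=6 m=3 crush-failure-domain=host

    # EC data pool (size the PG count for your ~300 OSDs)
    ceph osd pool create cephfs_data_ec 1024 1024 erasure ec63

    # overwrites must be enabled for CephFS data on an EC pool (Luminous+, BlueStore)
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true

    # add the pool to the filesystem and point a directory at it via a file layout
    ceph fs add_data_pool cephfs cephfs_data_ec
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/ec_data
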
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
