Re: CephFS with erasure coding, do I need a cache-pool?

Thanks, Linh!

A question regarding the choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any rule of thumb for how many 4k IOPS the
WAL+DB device should provide per OSD?

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that, while fast,
they're just too small to hold the DB for several OSDs ...
so I hope a "regular" NVMe is fast enough?
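
If it helps, this is roughly how I'd compare candidate
devices for that kind of workload - just a sketch, the
device path and job parameters are placeholders, and the
run overwrites whatever is on the device:

   # 4k synchronous writes, as a rough proxy for WAL/DB
   # traffic (destructive - writes to the raw device!)
   fio --name=waltest --filename=/dev/nvme0n1 \
       --direct=1 --sync=1 --rw=write --bs=4k \
       --numjobs=4 --iodepth=1 --runtime=60 \
       --time_based --group_reporting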

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good HDD-price to server-price ratio.
But it can only fit a single NVMe drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2 PCIe slot on the mainboard).


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and DBs (we accept that the risk of 1 card failing is low, and our failure domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB of DB.


On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node, and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL and 40GB of DB.
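
(For illustration only: one way such a WAL/DB split can be
provisioned with ceph-volume - device paths below are
placeholders, and the NVMe would be pre-partitioned into
one WAL and one DB partition per OSD, sized as above:)

   # one OSD shown; repeat with the next pair of NVMe
   # partitions for each further OSD on the host
   ceph-volume lvm create --bluestore \
       --data /dev/sdb \
       --block.wal /dev/nvme0n1p1 \
       --block.db  /dev/nvme0n1p2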


On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such special allocation. 😊


Cheers,

Linh


------------------------------------------------------------------------
*From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
*Sent:* Tuesday, 17 July 2018 11:39:26 PM
*To:* Linh Vu; ceph-users
*Subject:* Re: CephFS with erasure coding, do I need a cache-pool?
Dear Linh,

another question, if I may:

How do you handle BlueStore WAL and DB, and
how much SSD space do you allocate for them?


Cheers,

Oliver


On 17.07.2018 08:55, Linh Vu wrote:
 > Hi Oliver,
 >
 >
 > We have several CephFS-on-EC-pool deployments; one has been in production
 > for a while, and the others are about to go in, pending all the
 > BlueStore+EC fixes in 12.2.7 😊
 >
 >
 > Firstly as John and Greg have said, you don't need SSD cache pool at all.
 >
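(A minimal sketch of what that looks like on Luminous with
BlueStore - pool, filesystem and directory names are
placeholders; the EC pool is added as an extra data pool
with overwrites enabled, while the default data pool stays
replicated:)

   ceph osd pool set cephfs_data_ec allow_ec_overwrites true
   ceph fs add_data_pool cephfs cephfs_data_ec
   setfattr -n ceph.dir.layout.pool -v cephfs_data_ec \
       /mnt/cephfs/some_dir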
 >
 > Secondly, regarding k/m, it depends on how many hosts or racks you have,
 > and how many failures you want to tolerate.
 >
 >
 > For our smallest pool with only 8 hosts in 4 different racks and 2
 > different pairs of switches (note: we consider switch failure more
 > common than rack cooling or power failure), we're using 4/2 with failure
 > domain = host. We currently use this for SSD scratch storage for HPC.
 >
 >
 > For one of our larger pools, with 24 hosts over 6 different racks and 6
 > different pairs of switches, we're using 4/2 with failure domain = rack.
 >
 >
 > For another pool with similar host count but not spread over so many
 > pairs of switches, we're using 6/3 and failure domain = host.
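
(For illustration, the EC profiles behind those pools could
be defined roughly like this - profile and pool names are
made up, PG counts are placeholders, only the k/m values
and failure domains follow the numbers above:)

   ceph osd erasure-code-profile set ec42_host k=4 m=2 crush-failure-domain=host
   ceph osd erasure-code-profile set ec42_rack k=4 m=2 crush-failure-domain=rack
   ceph osd erasure-code-profile set ec63_host k=6 m=3 crush-failure-domain=host
   ceph osd pool create scratch_ec 2048 2048 erasure ec42_host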
 >
 >
 > Also keep in mind that higher values of k and m may give you more
 > throughput but increase latency, especially for small files, so it also
 > depends on how important performance is and what file sizes you
 > store on your CephFS.
 >
 >
 > Cheers,
 >
 > Linh
 >
 > ------------------------------------------------------------------------
 > *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
 > Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
 > *Sent:* Sunday, 15 July 2018 9:46:16 PM
 > *To:* ceph-users
 > *Subject:* CephFS with erasure coding, do I need a cache-pool?
 > Dear all,
 >
 > we're planning a new Ceph cluster, with CephFS as the
 > main workload, and would like to use erasure coding to
 > use the disks more efficiently. Access pattern will
 > probably be more read- than write-heavy, on average.
 >
 > I don't have any practical experience with erasure-
 > coded pools so far.
 >
 > I'd be glad for any hints / recommendations regarding
 > these questions:
 >
 > * Is an SSD cache pool recommended/necessary for
 > CephFS on an erasure-coded HDD pool (using Ceph
 > Luminous and BlueStore)?
 >
 > * What are good values for k/m for erasure coding in
 > practice (assuming a cluster of about 300 OSDs), to
 > make things robust and ease maintenance (ability to
 > take a few nodes down)? Is k/m = 6/3 a good choice?
 >
 > * Is it sufficient to have k+m racks or, more generally,
 > k+m failure domains?
 >
 >
 > Cheers and thanks for any advice,
 >
 > Oliver
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



