Re: CephFS with erasure coding, do I need a cache-pool?


 



Yes, I'd love to go with Optanes ... do you think 480 GB will be
fine for WAL+DB for 15x12 TB OSDs, long term? I only hesitate because
I've seen the recommendation of "10 GB of DB per 1 TB of HDD" several times.
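
Taking that rule of thumb at face value (just back-of-the-envelope
arithmetic on my side, nothing measured):

    15 OSDs x 12 TB x 10 GB DB per TB  =  1800 GB of DB in total
    480 GB Optane / 15 OSDs            =   ~32 GB of DB per OSD  (~2.7 GB per TB)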

How much total HDD capacity do you have per Optane 900P 480GB?

Cheers,

Oliver


On 18.07.2018 10:23, Linh Vu wrote:
I think the P4600 should be fine, although 2 TB is probably way overkill for 15 OSDs.


Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the WAL and DB getting filled up at 2GB/10GB each. Our newer nodes use the Intel Optane 900P 480GB, which is actually faster than the P4600, significantly cheaper in our country (we bought ~100 OSD nodes recently and that was a big saving), and rated for a high 10 DWPD endurance. For NLSAS OSDs, even the older P3700 is more than enough, but for our flash OSDs, the Optane 900P performs a lot better: it's about 2x faster than the P3700 we had, and allows us to get more out of our flash drives.
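
(For reference, the 2 GB / 10 GB split corresponds roughly to settings like
the following in ceph.conf - sizes are in bytes and are only honoured when
the OSD is created, so take this as a sketch rather than our exact config:)

    [osd]
    bluestore_block_wal_size = 2147483648     # 2 GB WAL per OSD
    bluestore_block_db_size  = 10737418240    # 10 GB DB per OSD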

------------------------------------------------------------------------
*From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
*Sent:* Wednesday, 18 July 2018 12:00:14 PM
*To:* Linh Vu; ceph-users
*Subject:* Re: CephFS with erasure coding, do I need a cache-pool?
Thanks, Linh!

A question regarding choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?

Is there any experience on how many 4k IOPS one should
have for WAL+DB per OSD?
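
(To compare candidate drives I'd probably just run a small sync-write test
with fio, something along the lines below - the device path is a placeholder,
and the test overwrites data, so only on an empty drive:)

    fio --name=waltest --filename=/dev/nvme0n1 --ioengine=libaio \
        --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 \
        --numjobs=1 --runtime=60 --time_based --group_reporting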

We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that while fast,
they're just too small for the DB of several OSDs ...
so I hope a "regular" NVMe is fast enough?

We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of the high storage density
and the good HDD-to-server price ratio.
But it can only fit a single NVMe drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2-PCIe slot on the mainboard).
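
(If we go that route, my current idea would be to carve the NVMe into one
DB logical volume per OSD and hand it to ceph-volume, roughly like this -
device names and sizes are only placeholders, and with block.db on the NVMe
the WAL should end up there as well unless placed separately:)

    vgcreate ceph-db /dev/nvme0n1
    lvcreate -L 120G -n db-sdb ceph-db
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-sdb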


Cheers,

Oliver


On 18.07.2018 09:11, Linh Vu wrote:
 > On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and
 > DBs (we accept that the risk of 1 card failing is low, and our failure
 > domain is host anyway). Each OSD (16 per host) gets 2GB of WAL and 10GB
 > of DB.
 >
 >
 > On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
 > and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL
 > and 40GB of DB.
 >
 >
 > On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any such
 > special allocation. 😊
 >
 >
 > Cheers,
 >
 > Linh
 >
 >
 > ------------------------------------------------------------------------
 > *From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
 > *Sent:* Tuesday, 17 July 2018 11:39:26 PM
 > *To:* Linh Vu; ceph-users
 > *Subject:* Re:  CephFS with erasure coding, do I need a
 > cache-pool?
 > Dear Linh,
 >
 > another question, if I may:
 >
 > How do you handle Bluestore WAL and DB, and
 > how much SSD space do you allocate for them?
 >
 >
 > Cheers,
 >
 > Oliver
 >
 >
 > On 17.07.2018 08:55, Linh Vu wrote:
 > > Hi Oliver,
 > >
 > >
 > > We have several CephFS-on-EC-pool deployments: one has been in production
 > > for a while, the others are about to go in, pending all the Bluestore+EC
 > > fixes in 12.2.7 😊
 > >
 > >
 > > Firstly, as John and Greg have said, you don't need an SSD cache pool at all.
 > >
 > >
> > Secondly, regarding k/m, it depends on how many hosts or racks you have,
 > > and how many failures you want to tolerate.
 > >
 > >
 > > For our smallest pool with only 8 hosts in 4 different racks and 2
 > > different pairs of switches (note: we consider switch failure more
> > common than rack cooling or power failure), we're using 4/2 with failure
 > > domain = host. We currently use this for SSD scratch storage for HPC.
 > >
 > >
 > > For one of our larger pools, with 24 hosts over 6 different racks and 6
 > > different pairs of switches, we're using 4/2 with failure domain = rack.
 > >
 > >
 > > For another pool with similar host count but not spread over so many
 > > pairs of switches, we're using 6/3 and failure domain = host.
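 > >
 > > (As a concrete sketch, a profile and data pool like the 4/2-over-racks one
 > > above would be created along these lines - profile/pool names and PG counts
 > > are just placeholders; allow_ec_overwrites is what removes the need for a
 > > cache tier on Luminous with BlueStore:)
 > >
 > >     ceph osd erasure-code-profile set ec42rack k=4 m=2 crush-failure-domain=rack
 > >     ceph osd pool create cephfs_data 1024 1024 erasure ec42rack
 > >     ceph osd pool set cephfs_data allow_ec_overwrites true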
 > >
 > >
 > > Also keep in mind that a higher value of k/m may give you more
 > > throughput but increase latency especially for small files, so it also
 > > depends on how important performance is and what kind of file size you
 > > store on your CephFS.
 > >
 > >
 > > Cheers,
 > >
 > > Linh
 > >
> > ------------------------------------------------------------------------
 > > *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
 > > Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
 > > *Sent:* Sunday, 15 July 2018 9:46:16 PM
 > > *To:* ceph-users
 > > *Subject:*  CephFS with erasure coding, do I need a
 > cache-pool?
 > > Dear all,
 > >
 > > we're planning a new Ceph cluster, with CephFS as the
 > > main workload, and would like to use erasure coding to
 > > use the disks more efficiently. Access pattern will
 > > probably be more read- than write-heavy, on average.
 > >
 > > I don't have any practical experience with erasure-
 > > coded pools so far.
 > >
 > > I'd be glad for any hints / recommendations regarding
 > > these questions:
 > >
 > > * Is an SSD cache pool recommended/necessary for
 > > CephFS on an erasure-coded HDD pool (using Ceph
 > > Luminous and BlueStore)?
 > >
 > > * What are good values for k/m for erasure coding in
 > > practice (assuming a cluster of about 300 OSDs), to
 > > make things robust and ease maintenance (ability to
 > > take a few nodes down)? Is k/m = 6/3 a good choice?
 > >
 > > * Will it be sufficient to have k+m racks, i.e. failure
 > > domains?
 > >
 > >
 > > Cheers and thanks for any advice,
 > >
 > > Oliver
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



