Yes, I'd love to go with Optanes ... do you think 480 GB will be
fine for WAL+DB for 15 x 12 TB HDDs, long term? I only hesitate because
I've seen the recommendation of "10 GB of DB per 1 TB of HDD" several times.
How much total HDD capacity do you have per Optane 900P 480GB?
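Rough arithmetic behind my hesitation, assuming that 10 GB per TB rule
holds (just the rule of thumb, not measurements):

    15 HDDs x 12 TB x 10 GB/TB = 1800 GB of DB in total, whereas
    480 GB / 15 OSDs = 32 GB of DB per OSD on the Optane.
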
Cheers,
Oliver
On 18.07.2018 10:23, Linh Vu wrote:
I think the P4600 should be fine, although 2TB is probably way overkill
for 15 OSDs.
Our older nodes use the P3700 400GB for 16 OSDs. I have yet to see the
WAL and DB fill up at 2GB/10GB each. Our newer nodes use the
Intel Optane 900P 480GB, which is actually faster than the P4600,
significantly cheaper in our country (we bought ~100 OSD nodes recently
and that was a big saving), and rated for a solid 10 DWPD. For NLSAS
OSDs, even the older P3700 is more than enough, but for our flash OSDs,
the Optane 900P performs a lot better. It's about 2x faster than the
P3700 we had, and allows us to get more out of our flash drives.
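If you want to double-check the fill level on your own OSDs, something
like the following (run on the OSD host; osd.0 is just a placeholder)
should dump the BlueFS counters, including DB and WAL bytes used:

    ceph daemon osd.0 perf dump bluefs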
------------------------------------------------------------------------
*From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
*Sent:* Wednesday, 18 July 2018 12:00:14 PM
*To:* Linh Vu; ceph-users
*Subject:* Re: CephFS with erasure coding, do I need a
cache-pool?
Thanks, Linh!
A question regarding the choice of NVMe - do you think an
Intel P4510 or P4600 would do well for WAL+DB? I'm
thinking about using a single 2 TB NVMe for 15 OSDs.
Would you recommend a different model?
Is there any experience on how many 4k IOPS one should
have for WAL+DB per OSD?
We have a few new BlueStore nodes in an older
cluster, and we use Intel Optanes for WAL. We wanted to
use them for DB too - only to learn that while fast,
they're just too small to hold the DB for several OSDs ...
so I hope a "regular" NVMe is fast enough?
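For context, the kind of layout I mean (DB on a "regular" NVMe, WAL on
the Optane) would be created with ceph-volume roughly like this - the
device names are only placeholders:

    ceph-volume lvm create --bluestore --data /dev/sdb \
        --block.db /dev/nvme0n1 --block.wal /dev/nvme1n1
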
We currently use the Gigabyte D120-C21 server barebone
(https://b2b.gigabyte.com/Storage-Server/D120-C21-rev-100)
for our OSD nodes, and we'd like to use it in our
next cluster too, because of its high storage density
and good HDD-price to server-price ratio.
But it can only fit a single NVMe drive (we use one of
the 16 HDD slots for a U.2 drive and connect it to the
single M.2 PCIe slot on the mainboard).
Cheers,
Oliver
On 18.07.2018 09:11, Linh Vu wrote:
> On our NLSAS OSD nodes, there is 1x NVMe PCIe card for all the WALs and
> DBs (we accept the risk of a single card failing, since it's low and
> our failure domain is host anyway). Each OSD (16 per host) gets 2GB of
> WAL and 10GB of DB.
>
>
> On our Flash (SSD but not NVMe) OSD nodes, there are 8 OSDs per node,
> and 2x NVMe PCIe cards for the WALs and DBs. Each OSD gets 4GB of WAL
> and 40GB of DB.
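> One way to get those sizes is to set them in ceph.conf before creating
> the OSDs; a minimal sketch for the 2GB/10GB case (values in bytes):
>
>     [osd]
>     bluestore_block_wal_size = 2147483648
>     bluestore_block_db_size = 10737418240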
>
>
> On our upcoming NVMe OSD nodes, for obvious reasons, we don't do any
> such special allocation. 😊
>
>
> Cheers,
>
> Linh
>
>
> ------------------------------------------------------------------------
> *From:* Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
> *Sent:* Tuesday, 17 July 2018 11:39:26 PM
> *To:* Linh Vu; ceph-users
> *Subject:* Re: CephFS with erasure coding, do I need a
> cache-pool?
> Dear Linh,
>
> another question, if I may:
>
> How do you handle Bluestore WAL and DB, and
> how much SSD space do you allocate for them?
>
>
> Cheers,
>
> Oliver
>
>
> On 17.07.2018 08:55, Linh Vu wrote:
> > Hi Oliver,
> >
> >
> > We have several CephFS-on-EC-pool deployments; one has been in
> > production for a while, the others are about to go in, pending all the
> > Bluestore+EC fixes in 12.2.7 😊
> >
> >
> > Firstly, as John and Greg have said, you don't need an SSD cache pool
> > at all.
> >
> >
> > Secondly, regarding k/m, it depends on how many hosts or racks you
> > have, and how many failures you want to tolerate.
> >
> >
> > For our smallest pool, with only 8 hosts in 4 different racks and 2
> > different pairs of switches (note: we consider switch failure more
> > common than rack cooling or power failure), we're using 4/2 with
> > failure domain = host. We currently use this for SSD scratch storage
> > for HPC.
> >
> >
> > For one of our larger pools, with 24 hosts over 6 different racks and
> > 6 different pairs of switches, we're using 4/2 with failure domain =
> > rack.
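> > For illustration, a profile and pool like that can be created roughly
> > as follows (profile, pool and filesystem names as well as PG counts
> > are just placeholders):
> >
> >     ceph osd erasure-code-profile set ec42rack \
> >         k=4 m=2 crush-failure-domain=rack
> >     ceph osd pool create cephfs_data_ec 1024 1024 erasure ec42rack
> >     ceph osd pool set cephfs_data_ec allow_ec_overwrites true
> >     ceph fs add_data_pool cephfs cephfs_data_ec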
> >
> >
> > For another pool with a similar host count but not spread over so many
> > pairs of switches, we're using 6/3 and failure domain = host.
> >
> >
> > Also keep in mind that higher values of k/m may give you more
> > throughput but increase latency, especially for small files, so it
> > also depends on how important performance is and what file sizes you
> > store on your CephFS.
> >
> >
> > Cheers,
> >
> > Linh
> >
> >
> > ------------------------------------------------------------------------
> > *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of
> > Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>
> > *Sent:* Sunday, 15 July 2018 9:46:16 PM
> > *To:* ceph-users
> > *Subject:* CephFS with erasure coding, do I need a cache-pool?
> > Dear all,
> >
> > we're planning a new Ceph cluster, with CephFS as the
> > main workload, and would like to use erasure coding to
> > use the disks more efficiently. The access pattern will
> > probably be more read- than write-heavy, on average.
> >
> > I don't have any practical experience with erasure-
> > coded pools so far.
> >
> > I'd be glad for any hints / recommendations regarding
> > these questions:
> >
> > * Is an SSD cache pool recommended/necessary for
> > CephFS on an erasure-coded HDD pool (using Ceph
> > Luminous and BlueStore)?
> >
> > * What are good values for k/m for erasure coding in
> > practice (assuming a cluster of about 300 OSDs), to
> > make things robust and ease maintenance (ability to
> > take a few nodes down)? Is k/m = 6/3 a good choice?
> >
> > * Will it be sufficient to have k+m racks, i.e.,
> >    k+m failure domains?
> >
> >
> > Cheers and thanks for any advice,
> >
> > Oliver
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com