On Sun, 19 Jun 2022 at 02:29, Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Greetings folks,
>
> We are planning to build Ceph storage, mostly CephFS for an HPC workload,
> and in the future we are planning to expand to S3 style, but that is yet
> to be decided. Because we need mass storage, we bought the following HW.
>
> 15 total servers, and each server has 12x18TB HDDs (spinning disk). We
> understand SSD/NVMe would be the best fit, but it's way out of budget.

I hope you have extra HW on hand for Monitor and MDS servers.

> Ceph recommends using a faster disk for WAL/DB if the data disk is slow,
> and in my case I do have slower disks for data.
>
> Question:
> 1. Let's say I want to put in an NVMe disk for WAL/DB - what size should
> I buy?

The official recommendation is to budget 4% of the OSD size for the WAL/DB -
so in your case that would be 720 GB per OSD. Especially if you want to go to
S3 later, you should stick close to that figure, since RGW is a heavy
metadata user. Also, with 12 OSDs per node you should have at least 2 NVMe
devices - so 2x4TB might do, or maybe 3x3TB (see the back-of-the-envelope
numbers at the end of this mail).

The WAL/DB device is a single point of failure for all OSDs attached to it -
in other words, if the WAL/DB device fails, then all OSDs that have their
WAL/DB located there need to be rebuilt.

Make sure you budget for a good number of DWPD (I assume that in an HPC
scenario you'll have a lot of scratch data), and test the drive with O_DIRECT
and fsync at QD=1 and BS=4K to find one that can reliably handle high IOPS
under those conditions (there is a minimal test sketch at the end of this
mail).

> 2. Do I need a WAL/DB partition for each OSD, or can a single partition
> be shared by all OSDs?

You need one partition per OSD.

> 3. Can I put the OS on the same disk where the WAL/DB is going to sit?
> (This way I don't need to spend extra money on an extra disk.)

Yes, you can, but in your case that would mean putting the WAL/DB on the
HDD - I would predict your HPC users not being very impressed with the
resulting performance, but YMMV.

> Any suggestions you have for this kind of storage would be much
> appreciated.

Budget plenty of RAM to deal with recovery scenarios - I'd say in your case
256 GB minimum.

Normally you would build a PoC and test the heck out of it to cover your
usage scenarios. You have already bought the HW, so there is not a lot you
can change now - but you should still test and tune your setup before you put
production data on it, to make sure you have a good idea of how the system is
going to behave when it gets under load. Make sure you test failure scenarios
(failing OSDs, failing nodes, network cuts, failing MDS, etc.) so you know
what to expect and how to handle them.

Another bottleneck in CephFS setups tends to be the MDS - again, in your
setup you probably want at least 2 MDS in active-active (i.e. shared load)
plus 1 or 2 on standby as failover, but others on this list have more
experience with that.

> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
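
P.S. The back-of-the-envelope arithmetic behind the 720 GB / 2x4TB numbers
above, written out as a small Python sketch. The 4% figure is the usual rule
of thumb, and the 18 TB / 12 OSD numbers are taken from your mail - adjust
them if your layout differs:

    # Back-of-the-envelope WAL/DB sizing (numbers from the setup described above).
    osd_size_tb = 18          # one OSD per 18 TB HDD
    osds_per_node = 12
    db_fraction = 0.04        # the commonly cited 4% rule of thumb

    db_per_osd_gb = osd_size_tb * 1000 * db_fraction        # -> 720 GB per OSD
    db_per_node_tb = db_per_osd_gb * osds_per_node / 1000   # -> ~8.6 TB per node

    print(f"WAL/DB per OSD : {db_per_osd_gb:.0f} GB")
    print(f"WAL/DB per node: {db_per_node_tb:.2f} TB  (hence roughly 2x4TB or 3x3TB NVMe)")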
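
P.P.S. A minimal sketch of the QD=1 / BS=4K sync-write test mentioned above,
assuming Linux and Python 3. fio (with --direct=1 --fsync=1 --iodepth=1
--bs=4k) is the usual tool and will give you much better numbers; this only
illustrates the access pattern. The path /mnt/nvme/testfile is a placeholder
for a scratch file on the NVMe under test - don't point it at a device that
is already in use:

    #!/usr/bin/env python3
    # Sync-write test: 4K writes at queue depth 1, O_DIRECT, fsync after every
    # write. Prints the sustained fsync'd write rate - a drive that holds up
    # here is a reasonable WAL/DB candidate.
    import mmap
    import os
    import time

    PATH = "/mnt/nvme/testfile"  # placeholder: scratch file on the NVMe under test
    BLOCK = 4096                 # BS=4K
    SECONDS = 10                 # test duration
    WINDOW = 25600               # stay within a 100 MB region of the file

    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(os.urandom(BLOCK))

    fd = os.open(PATH, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
    try:
        done = 0
        start = time.monotonic()
        while time.monotonic() - start < SECONDS:
            os.pwrite(fd, buf, (done % WINDOW) * BLOCK)  # QD=1: one write in flight
            os.fsync(fd)                                 # flush after every write
            done += 1
        elapsed = time.monotonic() - start
        print(f"{done / elapsed:.0f} fsync'd 4K writes per second at QD=1")
    finally:
        os.close(fd)
        os.unlink(PATH)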