On Sun, 19 Jun 2022 at 02:29, Satish Patel <satish.txt@xxxxxxxxx> wrote:

> Greetings folks,
>
> We are planning to build Ceph storage, mostly CephFS for an HPC workload,
> and in the future we are planning to expand to S3 style, but that is yet
> to be decided. Because we need mass storage, we bought the following HW.
>
> 15 total servers, and each server has 12x18TB HDDs (spinning disk). We
> understand SSD/NVMe would be the best fit, but it's way out of budget.

I hope you have extra HW on hand for Monitor and MDS servers.

> Ceph recommends using a faster disk for WAL/DB if the data disk is slow,
> and in my case I do have slower disks for data.
>
> Question:
> 1. Let's say I want to put in an NVMe disk for WAL/DB - what size should
> I buy?

The official recommendation is to budget 4% of the OSD size for the WAL/DB -
so in your case that would be 720 GB per OSD. Especially if you want to go to
S3 later, you should stick close to that figure, since RGW is a heavy
metadata user. Also, with 12 OSDs per node you should have at least 2 NVMe
devices - so 2x4TB might do, or maybe 3x3TB (see the back-of-the-envelope
numbers at the end of this mail).

The WAL/DB device is a single point of failure for all OSDs attached to it -
in other words, if the WAL/DB device fails, then all OSDs that have their
WAL/DB located there need to be rebuilt.

Make sure you budget for a good number of DWPD (I assume that in an HPC
scenario you'll have a lot of scratch data), and test the drive with O_DIRECT
and fsync at QD=1 and BS=4K to find one that can reliably handle high IOPS
under those conditions (there is a minimal test sketch at the end of this
mail).

> 2. Do I need a WAL/DB partition for each OSD, or can a single partition
> be shared by all OSDs?

You need one partition per OSD.

> 3. Can I put the OS on the same disk where the WAL/DB is going to sit?
> (This way I don't need to spend extra money on an extra disk.)

Yes, you can, but in your case that would mean putting the WAL/DB on the
HDD - I would predict your HPC users not being very impressed with the
resulting performance, but YMMV.

> Any suggestions you have for this kind of storage would be much
> appreciated.

Budget plenty of RAM to deal with recovery scenarios - I'd say in your case
256 GB minimum.

Normally you would build a PoC and test the heck out of it to cover your
usage scenarios. You have already bought the HW, so there is not a lot you
can change now - but you should still test and tune your setup before you put
production data on it, to make sure you have a good idea of how the system is
going to behave when it gets under load. Make sure you test failure scenarios
(failing OSDs, failing nodes, network cuts, failing MDS, etc.) so you know
what to expect and how to handle them.

Another bottleneck in CephFS setups tends to be the MDS - again, in your
setup you probably want at least 2 MDS in active-active (i.e. shared load)
plus 1 or 2 on standby as failover, but others on this list have more
experience with that.

> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
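
P.S. The back-of-the-envelope arithmetic behind the 720 GB / 2x4TB numbers
above, written out as a small Python sketch. The 4% figure is the usual rule
of thumb, and the 18 TB / 12 OSD numbers are taken from your mail - adjust
them if your layout differs:

    # Back-of-the-envelope WAL/DB sizing (numbers from the setup described above).
    osd_size_tb = 18          # one OSD per 18 TB HDD
    osds_per_node = 12
    db_fraction = 0.04        # the commonly cited 4% rule of thumb

    db_per_osd_gb = osd_size_tb * 1000 * db_fraction        # -> 720 GB per OSD
    db_per_node_tb = db_per_osd_gb * osds_per_node / 1000   # -> ~8.6 TB per node

    print(f"WAL/DB per OSD : {db_per_osd_gb:.0f} GB")
    print(f"WAL/DB per node: {db_per_node_tb:.2f} TB  (hence roughly 2x4TB or 3x3TB NVMe)")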
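
P.P.S. A minimal sketch of the QD=1 / BS=4K sync-write test mentioned above,
assuming Linux and Python 3. fio (with --direct=1 --fsync=1 --iodepth=1
--bs=4k) is the usual tool and will give you much better numbers; this only
illustrates the access pattern. The path /mnt/nvme/testfile is a placeholder
for a scratch file on the NVMe under test - don't point it at a device that
is already in use:

    #!/usr/bin/env python3
    # Sync-write test: 4K writes at queue depth 1, O_DIRECT, fsync after every
    # write. Prints the sustained fsync'd write rate - a drive that holds up
    # here is a reasonable WAL/DB candidate.
    import mmap
    import os
    import time

    PATH = "/mnt/nvme/testfile"  # placeholder: scratch file on the NVMe under test
    BLOCK = 4096                 # BS=4K
    SECONDS = 10                 # test duration
    WINDOW = 25600               # stay within a 100 MB region of the file

    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(os.urandom(BLOCK))

    fd = os.open(PATH, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o600)
    try:
        done = 0
        start = time.monotonic()
        while time.monotonic() - start < SECONDS:
            os.pwrite(fd, buf, (done % WINDOW) * BLOCK)  # QD=1: one write in flight
            os.fsync(fd)                                 # flush after every write
            done += 1
        elapsed = time.monotonic() - start
        print(f"{done / elapsed:.0f} fsync'd 4K writes per second at QD=1")
    finally:
        os.close(fd)
        os.unlink(PATH)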