Re: Dear Abby: Why Is Architecting CEPH So Hard?


 



If you want the lowest cost per TB then you will be going with larger nodes in your cluster, but it does mean your minimum cluster size is going to be many PBs in size.

There are a number of fixed costs associated with a node.

Motherboard, network cards, disk controllers: the more disks you spread these fixed costs across, the lower the overhead per disk and therefore the lower the cost per TB.

So let's say our hypothetical server costs 1000 for the motherboard, 1000 for the network card and 1000 for the disk controller. Just trying to keep the maths simple here.

So 3000 is the fixed cost of the server. If you spread that cost across 60 disks then you end up with an additional cost per disk of 50. Spread it across 24 disks and the additional cost is 125, and across 12 disks it is 250.
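To make the arithmetic concrete, here is a quick back-of-envelope sketch in Python. It only uses the hypothetical 3 x 1000 figures above plus the 18TB drive size I mention further down; swap in your own quotes as needed.

# Fixed node cost amortised across different chassis densities.
FIXED_NODE_COST = 1000 + 1000 + 1000   # motherboard + NIC + disk controller
DRIVE_TB = 18                          # drive size assumed later in this post

for disks in (60, 24, 12):
    per_disk = FIXED_NODE_COST / disks
    per_tb = per_disk / DRIVE_TB
    print(f"{disks} disks: {per_disk:.0f} fixed cost per disk, {per_tb:.2f} per TB")

That prints the 50/125/250 per disk above, i.e. roughly 2.78, 6.94 and 13.89 of fixed cost per TB respectively.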

I have left memory out of this as memory is a fixed amount per OSD device. I have also left the chassis out, but a 60 drive chassis is not 3x the price of a 24 drive chassis and not 5x the cost of a 12 drive chassis. If it is then you need to be looking for a new chassis vendor.

CPU is the one variable here which is not linear, and the CPU vendor tax for higher core counts can be significant. A 60 drive chassis needs roughly 60 threads available, which puts you into dual socket on the Intel side of things. AMD would allow you to get to a single socket motherboard with a 32-core CPU for that 60 drive chassis. A single socket motherboard is lower cost than a dual socket one, and that feeds back into the calculation above.
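If you want to sanity-check the socket count for your own drive counts, here is a rough sketch of that one-thread-per-HDD-OSD rule of thumb (the 16-core Intel and 32-core AMD parts are just illustrative assumptions):

# Sockets needed if every HDD OSD should get roughly one hardware thread.
def sockets_needed(osd_count, cores_per_socket, threads_per_core=2):
    threads_per_socket = cores_per_socket * threads_per_core
    return -(-osd_count // threads_per_socket)   # ceiling division

print(sockets_needed(60, cores_per_socket=16))   # e.g. mid-range Intel part -> 2 sockets
print(sockets_needed(60, cores_per_socket=32))   # e.g. 32-core AMD part -> 1 socket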

Now the question is what tax a particular chassis vendor is charging you. I know from the configs we do on a regular basis that a 60 drive chassis will give you the lowest cost per TB. BUT it has implications: your cluster size needs to be on the order of 10PB minimum. 60 x 18TB gives you around 1PB per node. Notice that we are also going for the bigger disk drives here; the more data you can spread your fixed costs across, the lower the overall cost per GB.
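Again as a quick sketch, the node capacity and minimum cluster size fall straight out of those numbers (the 10-node floor is my assumption for riding out a whole-node failure comfortably, not a Ceph-mandated figure):

# Raw capacity per node and the resulting minimum sensible cluster size.
DRIVES_PER_NODE = 60
DRIVE_TB = 18
MIN_NODES = 10    # assumed floor for absorbing a whole-node failure

raw_tb_per_node = DRIVES_PER_NODE * DRIVE_TB          # 1080 TB, roughly 1PB
min_cluster_pb = MIN_NODES * raw_tb_per_node / 1000
print(f"{raw_tb_per_node} TB raw per node, ~{min_cluster_pb:.1f} PB minimum cluster")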

If you have a node failure you will have to recreate 1PB of lost data. This pushes you to 25G networking or faster; in many cases I would be looking at 100G. 100G top-of-rack switches are so cheap now, why wouldn't you go down that route?
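To see why link speed matters, here is a deliberately pessimistic sketch that treats recovery as limited by a single node's link at about 80% usable throughput; real recovery fans out across the whole cluster, so treat these as worst-case numbers:

# Time to move ~1PB of re-created data over a single link.
LOST_TB = 1000
for gbit in (10, 25, 100):
    usable_gb_per_s = gbit / 8 * 0.8      # GB/s after protocol overhead (assumed)
    hours = LOST_TB * 1000 / usable_gb_per_s / 3600
    print(f"{gbit}G link: ~{hours:.0f} hours for {LOST_TB} TB")

That works out to roughly 278, 111 and 28 hours for 10G, 25G and 100G respectively.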

You will get into power, weight and cooling issues with many DCs though, and this is something to consider.

The amount of NVMe space for RocksDB and WAL is also a fixed amount based on the number of OSD devices, so it has no effect on the cost per TB when deciding between different chassis densities.
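As a sketch of why it cancels out (the 4% block.db sizing sometimes quoted for BlueStore is only an illustrative assumption here; size yours to your workload):

# NVMe DB/WAL space scales with the data you host, not with chassis density.
DRIVE_TB = 18
DB_FRACTION = 0.04    # assumed block.db rule of thumb

for drives in (60, 24, 12):
    data_tb = drives * DRIVE_TB
    db_tb = data_tb * DB_FRACTION
    print(f"{drives} OSDs: {data_tb} TB data, ~{db_tb:.1f} TB NVMe ({DB_FRACTION:.0%} of data)")

The NVMe-per-TB-of-data ratio is the same for every chassis, so it drops out of the comparison.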

So if your requirement is the lowest cost per TB, then a 60 drive chassis is the way to go and will give you the lowest price point.








From: Martin Verges <martin.verges@xxxxxxxx>
Date: Thursday, 23 April 2020 at 06:39
To: lin.yunfan <lin.yunfan@xxxxxxxxx>
Cc: brian.topping@xxxxxxxxx <brian.topping@xxxxxxxxx>, ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject:  Re: Dear Abby: Why Is Architecting CEPH So Hard?
From all our calculations of clusters, going with smaller systems reduced
the TCO because of much cheaper hardware.
Having 100 Ceph nodes is not an issue, therefore you can scale small and
large clusters with the exact same hardware.

But please, prove me wrong. I would love to see a way to reduce the TCO
even more and if you have a way, I would love to hear about it.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, 23 Apr 2020 at 05:18, lin.yunfan <lin.yunfan@xxxxxxxxx> wrote:

> I have seen a lot of people saying not to go with big nodes.
> What is the exact reason for that?
> I can understand that if the cluster is not big enough then the total
> node count could be too small to withstand a node failure, but if the
> cluster is big enough, wouldn't the big node be more cost effective?
>
>
> lin.yunfan
> lin.yunfan@xxxxxxxxx
>
> On 4/23/2020 06:33, Brian Topping <brian.topping@xxxxxxxxx> wrote:
>
> Great set of suggestions, thanks! One to consider:
>
> On Apr 22, 2020, at 4:14 PM, Jack <ceph@xxxxxxxxxxxxxx> wrote:
>
> I use 32GB flash-based satadom devices for root device
> They are basically SSD, and do not take front slots
> As they are never burning up, we never replace them
> Ergo, the need to "open" the server is not an issue
>
>
>
> This is probably the wrong forum to understand how you are not burning
> them out. Any kind of logs or monitor databases on a small SATADOM will
> cook them quick, especially an MLC. There is no extra space for wear
> leveling and the like. I tried making it work with fancy systemd logging to
> memory and having those logs pulled by a log scraper storing to the actual
> data drives, but there was no place for the monitor DB. No monitor DB means
> Ceph doesn’t load, and if a monitor DB gets corrupted, it’s perilous for
> the cluster and instant death if the monitors aren’t replicated.
>
> My node chassis have two motherboards and each is hard limited to four
> SSDs. On each node, `/boot` is mirrored (RAID1) on partition 1, `/` is
> stripe/mirrored (RAID10) on partition 2, and whatever is left on partition 3
> of each disk is used for Ceph data. This way any disk could fail and I could
> still boot. By merging the volumes (i.e. no SATADOM), wear leveling was
> statistically more effective, and I don't have to get into crazy system
> configurations that nobody would want to maintain or document.
>
> $0.02…
>
> Brian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



