Well, for starters, more nodes means more aggregate network bandwidth, and "more network" = "faster cluster." (I've also appended a rough sketch of the partition layout Brian describes, below the quoted thread.)

On Wed, Apr 22, 2020 at 11:18 PM lin.yunfan <lin.yunfan@xxxxxxxxx> wrote:
> I have seen a lot of people saying not to go with big nodes.
> What is the exact reason for that?
> I can understand that if the cluster is not big enough then the total
> node count could be too small to withstand a node failure, but if the
> cluster is big enough, wouldn't a big node be more cost effective?
>
> lin.yunfan
> lin.yunfan@xxxxxxxxx
>
> On 4/23/2020 06:33, Brian Topping <brian.topping@xxxxxxxxx> wrote:
>
> Great set of suggestions, thanks! One to consider:
>
> On Apr 22, 2020, at 4:14 PM, Jack <ceph@xxxxxxxxxxxxxx> wrote:
>
> I use 32GB flash-based SATADOM devices for the root device.
> They are basically SSDs, and do not take up front slots.
> As they never burn out, we never replace them.
> Ergo, the need to "open" the server is not an issue.
>
> This is probably the wrong forum to understand how you are not burning
> them out. Any kind of logs or monitor databases on a small SATADOM will
> cook them quickly, especially MLC flash: there is no spare capacity for
> wear leveling and the like. I tried to make it work with fancy systemd
> logging to memory, with those logs pulled by a log scraper and stored on
> the actual data drives, but there was no place for the monitor DB. No
> monitor DB means Ceph doesn't come up, and a corrupted monitor DB is
> perilous for the cluster, and instant death if the monitors aren't
> replicated.
>
> My node chassis have two motherboards, and each is hard-limited to four
> SSDs. On each node, `/boot` is mirrored (RAID1) on partition 1, `/` is
> striped/mirrored (RAID10) on partition 2, and whatever is left on
> partition 3 of each disk is used for Ceph data. This way any disk can
> fail and I can still boot. With the volumes merged (i.e. no SATADOM),
> wear leveling is statistically more effective. And I don't have to get
> into crazy system configurations that nobody would want to maintain or
> document.
>
> $0.02…
>
> Brian
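
For anyone wanting to copy that layout, a minimal sketch follows, under stated assumptions: four identical SSDs named /dev/sda..sdd and placeholder partition sizes (1G for /boot, 64G for /). The device names and sizes are illustrative, not Brian's actual values; adjust for your hardware.

  # Partition each of the four SSDs identically (sizes are placeholders):
  #   p1 -> /boot member (RAID1), p2 -> / member (RAID10), p3 -> remainder for Ceph
  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
      sgdisk -n 1:0:+1G  -t 1:fd00 "$d"
      sgdisk -n 2:0:+64G -t 2:fd00 "$d"
      sgdisk -n 3:0:0    -t 3:8300 "$d"
  done

  # /boot mirrored across all four disks, so any single disk can still boot the node
  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[abcd]1

  # / striped+mirrored across all four disks
  mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]2

  # Hand the third partition of each disk to Ceph as an OSD
  for p in /dev/sd[abcd]3; do
      ceph-volume lvm create --data "$p"
  done

Depending on the bootloader you may prefer --metadata=1.0 on the /boot array so the md superblock sits at the end of the partition, and filesystems, fstab, and bootloader entries still have to be set up as usual.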