Re: 3-node Ceph with DAS storage and multipath

This helps.  So this is a unique type of blade / CI chassis.  You describe it as a PoC: would you use similar hardware for production?  That chassis/frame could have a large blast radius.  One of the great things about Ceph, of course, is that it’s adaptable to a wide variety of hardware, but there are some caveats:

* When using dense hardware that packs a lot into a single chassis, consider what happens when that chassis smokes, is down for maintenance, etc.  Making the chassis an explicit CRUSH failure domain helps contain that (sketch below).
* Large / deep chassis can pull a lot of power, so evaluate your production configuration against the kW available to your racks / PDUs.  It is not uncommon for racks with large / dense chassis to be only half filled because of power limitations or the weight capacity of a raised floor.  I’ve even seen DCs with strict policies that all racks must have front and rear doors, and sometimes deep chassis prevent doors from closing unless the racks are extra deep themselves.
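
If you do end up with multiple such frames in production, one way to limit the blast radius is to make the chassis an explicit CRUSH failure domain so that no two replicas land in the same frame.  A minimal sketch, assuming hypothetical bucket / host / pool names (chassis1, node1, mypool):

  # declare a chassis bucket and hang the hosts under it
  ceph osd crush add-bucket chassis1 chassis
  ceph osd crush move chassis1 root=default
  ceph osd crush move node1 chassis=chassis1

  # replicate across chassis rather than hosts
  ceph osd crush rule create-replicated by-chassis default chassis
  ceph osd pool set mypool crush_rule by-chassis

With a single frame this obviously buys you nothing; it’s a consideration for the production design.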

Something like a DL360 or DL380 is common for Ceph.  I’m happy to discuss infrastructure off-list if you like.


> 
> Hello Anthony,
> Thanks for the quick feedback and suggestions, really appreciated!
> Sorry for not being fully clear with my setup.
> 
> Here is my full  PoC HW configuration:
> 
> 1 x HPE Synergy 12000 Configure-to-order Frame with 10x Fans
> 2 (for redundancy) x HPE Synergy 12Gb SAS Connection Module with 12 Internal Ports 1 and 4

I’m not familiar with this hardware, but my sense is that Ceph inherently provides redundancy, and that additional redundancy in the drive data plane is at best wasted, and is probably money you don’t need to spend.

> 2 (for redundancy) x HPE Virtual Connect SE 100Gb F32 Module for Synergy
> 1 x HPE Synergy D3940 12Gb SAS CTO Drive Enclosure (2.5in) Drive Bays with 2 I/O adapters
> 24 x HPE 1.6TB SAS 12G Mixed Use SFF SC Multi Vendor SSD (Inside D3940 DAS storage)
> 3 x HPE Synergy 480 Gen10 Compute Module with the below specifications
>   -- 2 x Intel Xeon-Gold 6252 (2.1GHz/24-core/150W) FIO Processor Kit for HPE Synergy 480/660 Gen10.

Those CPUs are previous-generation FWIW, not entirely surprising for a loaner.
For a greenfield production deployment I’d consider some changes:

* Don’t spend extra for multipathing, and disable it on your PoC if you can for simplicity
* Consider larger SSDs, 3.84TB or 7.6TB, though in a production deployment you also need to weigh the discrete number of drives for fault tolerance, balancing, and software bottlenecks.
* Seriously consider NVMe instead of SAS for a new design.  With a judicious design you might be surprised at how competitive the cost can be.  SAS market share is steadily declining, and the selection of available / new drives will continue to shrink.

Also, “mixed use” probably means 3 DWPD-class; for Ceph purposes I’ve personally seen 1 DWPD-class drives be plenty.  YMMV depending on your use-case and intended lifetime.
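
As a rough back-of-envelope, assuming a 5-year service life:

  1.6 TB x 3 DWPD x 365 x 5 ≈ 8.8 PB written   (mixed use)
  1.6 TB x 1 DWPD x 365 x 5 ≈ 2.9 PB written   (read intensive)

so the cheaper endurance class still has a lot of headroom unless your write workload is unusually heavy.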


>   -- (192 GB RAM per compute module) HPE 16GB (1x16GB) Single Rank x4 DDR4-2933 CAS-21-21-21 Registered Smart Memory Kit
>   -- 2 x HPE 300GB SAS 12G Enterprise 15K SFF (2.5in) SC 3yr Wty Digitally Signed Firmware HDD
>   -- 1 x HPE Smart Array P416ie-m SR Gen10 (8 Int 8 Ext Lanes/2GB Cache) 12G SAS Mezzanine Controller

See the list archives for a litany of reasons why I don’t like RoC HBAs.

>   -- 1 x HPE Synergy 6820C 25/50Gb Converged Network Adapter
> 
> 
> HW was configured as per below:
> 
> Each compute module:
> - 2 x 300GB HDD (the internal drives in the Synergy Gen10 servers) configured in RAID 1 array (for OS)
> - 2 x 1,6TB (from the DAS storage) per each compute module in RAID 1 array (for additional local storage to each compute module)

What are you using these for?  You wouldn’t need them for Ceph alone.

> - 6 x 1,6TB (from the DAS storage) per each compute module as JBOD (raw drives, no RAID) (for software-defined storage Ceph)

The default osd_memory_target is 4 GB; I like to provision 2x that for peak usage, so with 6 OSDs per node that’s 48 GB of RAM for OSDs, plus an allowance for the OS and anything else running on the node.
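
If you ever do want to adjust it, it’s a centralized config option; a minimal sketch (the value is in bytes, and 8 GiB here is just an example, not a blanket recommendation):

  # check the current value
  ceph config get osd osd_memory_target

  # set 8 GiB for all OSDs
  ceph config set osd osd_memory_target 8589934592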

> - 4 network interfaces each with Speed: 25000Mb/s configured in 1 bonding for Ceph communication and 1 bonding for all other communication
> - Installed OS RHEL 8.6
> 
> 
>> More RAM and cores than you need, but you probably know that. I’m guessing that this is repurposed hardware?
> 
> This is temporary demo HW which we are using to learn what Ceph is, how to configure it, and whether it will suit our purposes. After the demo period is over we will return it to HP.
> 
>> You call it DAS but don’t say anything about the interface. SAS? SATA? AoE? NVMe? FC?
> 
> Sorry for not being clear here. The DAS storage is 'HPE Synergy D3940 12Gb SAS CTO Drive Enclosure (2.5in) Drive Bays with 2 I/O adapters'
> The drives inside the DAS are 'HPE 1.6TB SAS 12G Mixed Use SFF SC Multi Vendor SSD (Inside D3940 DAS storage)'.
> They are SAS SSD drives.
> 
>> Why not try Quincy?
> 
> Great suggestion indeed, we want to use latest stable Ceph release.
> I was looking here https://docs.ceph.com/en/quincy/releases/index.html
> and I saw only 'Pacific' and 'Octopus', so this is the reason I overlooked Quincy.

That is being fixed right now actually.

> 
>> I don’t think Ceph has any concept of multipathing. Since Ceph keeps multiple copies of data, it’s usually overkill.
> 
> I suspected the same, that Ceph doesn't have any concept of multipathing.
> Shall I configure the DAS storage drives with RHEL 8.6 multipath and present to Ceph the resulting multipath device (mpath0)?

My only experience with multipathing was with FC in a Sun lab years ago, so I can’t answer that confidently.  What I can say is that you’d probably either want to do exactly that, or, if you can, configure without multipathing at all.  KISS principle.
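
If you do go the no-multipath route on RHEL, a minimal sketch would be to either not run multipathd at all, or blacklist just the D3940 paths (the sd[a-l] names below come from your lsblk output and may differ per node; in practice blacklisting by wwid is more robust since sdX names can change across reboots):

  # simplest: don't run multipath at all
  systemctl disable --now multipathd

  # or, blacklist only the DAS paths in /etc/multipath.conf ...
  blacklist {
      devnode "^sd[a-l]$"
  }
  # ... then flush and reload the maps
  multipath -F
  systemctl reload multipathd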

> 
>> Swap is an anachronism from decades ago, suggest you not waste any of your small boot drives on it.
>> Also, why carve up the boot drives into so many, small partitions?
>> 
>> I would suggest instead something like
>> 
>> |-sdn1 8:209 0 953M 0 part /boot/efi
>> |-sdn2 8:210 0 953M 0 part /boot
>> |-sdn3 8:211 0 264.5G 0 part
>> | |-rhel_compute1-root 253:0 0 grow 0 lvm /
>> 
>> period. If that makes you nervous, partition 40GB for / and make the rest /home, with a bind mount of /home/varlog onto /var/log.
> 
> Thank you, I will reconsider my partition layout again and will take it into account for further improvement.

There’s a lot of personal preference in partitioning; sysadmins are a strongly opinionated breed.  The practice of having a small / dates back to the days when drives were things like Fujitsu Eagles, SMD with 14” platters.  RAID wasn’t a thing yet, nor were journaling filesystems widespread, so the small / reduced the probability that a medium error would land in the most vital partition.  With modern drives, and especially with RAID, that need is no longer there.  ISTR HP-UX, for example, decades ago defaulting to one drive-spanning filesystem, with maybe some swap carved out.

A common reason for a separate /var or /var/log is to fence in runaway logs.  With modern logrotate and drive sizes, my experience is that runaways are rare, and I’d rather fix the problem than address the symptom by partitioning, especially since smaller partitions actually increase the likelihood of filling up.
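
For completeness, the bind mount I suggested above is just one fstab line; a minimal sketch, ideally done before services start writing logs (the /home/varlog path is only an example):

  mkdir -p /home/varlog
  rsync -a /var/log/ /home/varlog/                        # preserve existing logs
  echo '/home/varlog  /var/log  none  bind  0 0' >> /etc/fstab
  mount /var/log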

> 
> Regards,
> Kosta
> 
> On Saturday, 21 May 2022 10:33:37 am (+03:00), Anthony D'Atri wrote:
> 
>> inline.
>> 
>>> 
>>> Hello ceph-users,
>>> 
>>> Recently I started preparing 3-node Ceph cluster (on bare metal hardware)
>>> We have the HW configuration ready - 3 servers HPE Synergy 480 Gen10
>>> Compute Module, each server with 2xCPUs Intel Xeon-Gold 6252
>>> (2.1GHz/24-core), 192GB RAM
>> 
>> More RAM and cores than you need, but you probably know that. I’m guessing that this is repurposed hardware?
>> 
>>> 2x300GB HDD for OS RHEL 8.6 (already
>>> installed) and we have DAS (direct-attached-storage) with 18 x 1.6TB SSD
>>> drives inside. I attached 6 x1.6TB SSD from the DAS to each of the 3
>>> servers (as JBOD).
>> 
>> You call it DAS but don’t say anything about the interface. SAS? SATA? AoE? NVMe? FC?
>> 
>> 
>>> Now I can see these 6 SSDs as 12 devices because the DAS storage has two
>>> paths for redundancy to each of the disks (sda, sdb, sdc, sdd, sde, sdf,
>>> sdg, sdh, sdi, sdj, sdk, sdl).
>>> I'm not sure how to handle the DAS storage multipath properly and according
>>> to best practices.
>>> For installation I will use cephadm with latest Ceph release Pacific 16.2.7
>> 
>> Why not try Quincy?
>> 
>>> My question is shall I configure multipath from RHEL 8.6 OS in advance (for
>>> example sda+sdb=md0) or should I leave cephadm to handle the multipath by
>>> itself?
>> 
>> I don’t think Ceph has any concept of multipathing. Since Ceph keeps multiple copies of data, it’s usually overkill.
>> 
>>> 
>>> | |-rhel_compute1-root 253:0 0 18.6G 0 lvm /
>>> | |-rhel_compute1-var_log 253:2 0 9.3G 0 lvm /var/log
>>> | |-rhel_compute1-var_tmp 253:3 0 4.7G 0 lvm /var/tmp
>>> | |-rhel_compute1-tmp 253:4 0 4.7G 0 lvm /tmp
>>> | |-rhel_compute1-var 253:5 0 37.3G 0 lvm /var
>>> | |-rhel_compute1-opt 253:6 0 37.3G 0 lvm /opt
>>> | |-rhel_compute1-aux1 253:7 0 107.1G 0 lvm /aux1
>>> | |-rhel_compute1-home 253:8 0 20.5G 0 lvm /home
>>> | `-rhel_compute1-aux0 253:9 0 25.2G 0 lvm /aux0
>>> |-sdn4 8:212 0 7.5G 0 part [SWAP]
>>> `-sdn5 8:213 0 4.7G 0 part /var/log/audit
>>> sdo 8:224 0 1.5T 0 disk
>>> `-sdo1 8:225 0 1.5T 0 part
>>> `-rhel_local_VG-localstor 253:1 0 1.5T 0 lvm /localstor
>>> 
>>> 
>> 
>> Swap is an anachronism from decades ago, suggest you not waste any of your small boot drives on it.
>> Also, why carve up the boot drives into so many, small partitions?
>> 
>> I would suggest instead something like
>> 
>> |-sdn1 8:209 0 953M 0 part /boot/efi
>> |-sdn2 8:210 0 953M 0 part /boot
>> |-sdn3 8:211 0 264.5G 0 part
>> | |-rhel_compute1-root 253:0 0 grow 0 lvm /
>> 
>> period. If that makes you nervous, partition 40GB for / and make the rest /home, with a bind mount of /home/varlog onto /var/log.
>> 
>> 
>> 
>> 
>> 
>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



