Re: 3-node Ceph with DAS storage and multipath

Hello,


@Anthony


> This helps. So this is a unique type of blade / CI chassis. You describe it as a PoC: would you use similar hardware for production? That Chassis/Frame could have a large blast radius. One of the great things about Ceph of course is that it’s adaptable to a wide variety of hardware, but there are some caveats:
>


Correct, this is the next generation of blade chassis from HPE, called Synergy. It is the successor of the old C7000 blade chassis and blade servers, which are already discontinued. Yes, our intention is to use a similar (if not the same) setup for production platforms. This is why we are trying to configure Ceph 'the proper way' and according to best practices.


> * When using dense hardware that packs a lot into a single chassis, consider what happens when that chassis smokes, is down for maintenance, etc.
> * Large / deep chassis can pull a lot of power, so evaluate your production configuration and the KW available to your racks / PDUs. It is not uncommon for racks with large / dense chassis to only be half filled because of available power limitations, or the weight capacity of a raised floor. I’ve even seen DCs with strict policies that all racks must have front and rear doors, and sometimes deep chassis prevent doors from closing unless the racks are extra deep themselves.
>


You are absolutely right! The Synergy frame (chassis) has redundant components (power supplies, fans, interconnect modules, uplinks, etc.). When there is maintenance on the frame (a firmware update, for example), it is done component by component to avoid outages. We take into account the power (kW) needed by the frame and install the required PDUs (fed from power sources A and B for redundancy), and the same goes for the rack's static and dynamic load. Our goal is to have one Synergy frame (10U), and in rare cases a maximum of two Synergy frames, per rack cabinet.

> Something like a DL360 or DL380 is common for Ceph. I’m happy to discuss infrastructure off-list if you like.
>


We are using HPE DL380 Gen10 servers as well for our bare-metal and VM infrastructure platforms. One advantage of the Synergy frame is that you get an all-in-one 10U box (servers, DAS storage, network modules, power supplies, etc.), which saves space in the rack cabinet compared with rack-mountable DL380 Gen10 servers. The other advantage is that you can expand it quite easily without laying power or network cables: you just plug in the next Synergy compute module. This is why we are focused on the HPE Synergy frame at the moment.


> * Don’t spend extra for multipathing, and disable it on your PoC if you can for simplicity


All the SSD drives in the DAS storage are accessible via two modules, I/O adapter 1 and I/O adapter 2. The easiest option for me is to pull out one of the I/O adapters of the D3940 DAS enclosure and leave only one path, or to remove one of the SAS interconnect modules. My concern is that a single I/O adapter inside the DAS would be a single point of failure: if that I/O module fails for some reason, all compute modules in the frame lose connectivity to the SSD drives in the DAS, causing a full outage. This is why we are trying to set up a highly available, redundant configuration without a single point of failure.
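
For what it's worth, this is roughly how I check that both paths are really visible from the OS and which sdX devices are two paths to the same physical SSD (just a sketch from my lab; the exact output and device names will of course differ):

# Two sdX entries sharing the same WWN are two paths to the same SSD
lsblk -o NAME,SIZE,WWN,HCTL,TRAN

# With device-mapper-multipath installed, show the assembled maps and path states
multipath -ll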

> * Consider larger SSDs, 3.84TB or 7.6TB, but in a production deployment you also need to consider the discrete number of drives for fault tolerance, balancing, and software bottlenecks.


Thanks, I will check larger SSD drives. Usually we size the capacity according to the specific requirements, which can change from installation to installation.

> * Seriously consider NVMe instead of SAS for a new design. With a judicious design you might be surprised at how competitive the cost can be. SAS market share is progressively declining and the choice available / new drives will continue to shrink
>


The reason we selected SAS is that the drives were cheaper than NVMe (HPE prices can be pretty high sometimes), but I will check NVMe drive prices and compare them again.


> Also, “mixed use” probably means 3DWPD-class, for Ceph purposes I’ve personally seen 1DWPD-class drives be plenty. ymmv depending on your use-case and intended lifetime


'Mixed use' means that the SSD drive is suitable for both writing and reading, i.e. a balance between the two. Our applications are quite write-intensive, so our main goal is to have resilient SSDs instead of slow but reliable HDDs.
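
For completeness, the wear level of the SAS SSDs can be checked with something like the following (just a sketch; behind the Smart Array controller it may need a pass-through option such as -d cciss,N, and the device name is only an example):

# Report the "percentage used endurance indicator" of a SAS SSD
smartctl -a /dev/sdc | grep -i endurance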






@Javier,


> Kosta, ¿Can you manage dual paths to disk through multipathd?
>


Yes, I can configure the DAS drives with multipath in the RHEL 8.6 OS.
I haven't done it yet because I'm not sure whether this is the 'proper' way for Ceph storage.
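
In case it helps the discussion, what I had in mind is roughly the following (only a sketch on RHEL 8.6 with cephadm; the host and device names are examples from my lab):

# Enable multipathd with the built-in defaults (creates /etc/multipath.conf)
mpathconf --enable --with_multipathd y

# Check that every SSD shows up as one mpath device with two active paths
multipath -ll

# Hand the multipath device (not the underlying sdX paths) to Ceph, e.g. one OSD on compute1
ceph orch daemon add osd compute1:/dev/mapper/mpatha

But I would still like to know whether cephadm/ceph-volume handles the /dev/mapper devices cleanly, or whether it is better to drop multipath entirely as Anthony suggests.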


Regards,
Kosta  




On Saturday, 21 May 2022 8:45:18 pm (+03:00), Anthony D'Atri wrote:

>
> This helps. So this is a unique type of blade / CI chassis. You describe it as a PoC: would you use similar hardware for production? That Chassis/Frame could have a large blast radius. One of the great things about Ceph of course is that it’s adaptable to a wide variety of hardware, but there are some caveats:
>
> * When using dense hardware that packs a lot into a single chassis, consider what happens when that chassis smokes, is down for maintenance, etc.
> * Large / deep chassis can pull a lot of power, so evaluate your production configuration and the KW available to your racks / PDUs. It is not uncommon for racks with large / dense chassis to only be half filled because of available power limitations, or the weight capacity of a raised floor. I’ve even seen DCs with strict policies that all racks must have front and rear doors, and sometimes deep chassis prevent doors from closing unless the racks are extra deep themselves.
>
> Something like a DL360 or DL380 is common for Ceph. I’m happy to discuss infrastructure off-list if you like.
>
>
> >
> > Hello Anthony,
> > Thanks for the quick feedback and suggestions, really appreciated!
> > Sorry for not being fully clear with my setup.
> >
> > Here is my full PoC HW configuration:
> >
> > 1 x HPE Synergy 12000 Configure-to-order Frame with 10x Fans
> > 2 (for redundancy) x HPE Synergy 12Gb SAS Connection Module with 12 Internal Ports 1 and 4
>
> I’m not familiar with this hardware, but my sense is that Ceph inherently provides redundancy, and that additional redundancy in the drive data plane is at best wasted, and is probably money you don’t need to spend.
>
> > 2 (for redundancy) x HPE Virtual Connect SE 100Gb F32 Module for Synergy
> > 1 x HPE Synergy D3940 12Gb SAS CTO Drive Enclosure (2.5in) Drive Bays with 2 I/O adapters
> > 24 x HPE 1.6TB SAS 12G Mixed Use SFF SC Multi Vendor SSD (Inside D3940 DAS storage)
> > 3 x HPE Synergy 480 Gen10 Compute Module with the below specifications
> > -- 2 x Intel Xeon-Gold 6252 (2.1GHz/24-core/150W) FIO Processor Kit for HPE Synergy 480/660 Gen10.
>
> Those CPUs are previous-generation FWIW, not entirely surprising for a loaner.
> For a greenfield production deployment I’d consider some changes:
>
> * Don’t spend extra for multipathing, and disable it on your PoC if you can for simplicity
> * Consider larger SSDs, 3.84TB or 7.6TB, but in a production deployment you also need to consider the discrete number of drives for fault tolerance, balancing, and software bottlenecks.
> * Seriously consider NVMe instead of SAS for a new design. With a judicious design you might be surprised at how competitive the cost can be. SAS market share is progressively declining and the choice available / new drives will continue to shrink
>
> Also, “mixed use” probably means 3DWPD-class, for Ceph purposes I’ve personally seen 1DWPD-class drives be plenty. ymmv depending on your use-case and intended lifetime.
>
>
> > -- (192 GB RAM per compute module) HPE 16GB (1x16GB) Single Rank x4 DDR4-2933 CAS-21-21-21 Registered Smart Memory Kit
> > -- 2 x HPE 300GB SAS 12G Enterprise 15K SFF (2.5in) SC 3yr Wty Digitally Signed Firmware HDD
> > -- 1 x HPE Smart Array P416ie-m SR Gen10 (8 Int 8 Ext Lanes/2GB Cache) 12G SAS Mezzanine Controller
>
> See the list archives for a litany of reasons why I don’t like RoC HBAs.
>
> > -- 1 x HPE Synergy 6820C 25/50Gb Converged Network Adapter
> >
> >
> > HW was configured as per below:
> >
> > Each compute module:
> > - 2 x 300GB HDD (the internal drives in the Synergy Gen10 servers) configured in RAID 1 array (for OS)
> > - 2 x 1,6TB (from the DAS storage) per each compute module in RAID 1 array (for additional local storage to each compute module)
>
> What are you using these for? You wouldn’t need them for Ceph alone.
>
> > - 6 x 1,6TB (from the DAS storage) per each compute module as JBOD (raw drives, no RAID) (for software-defined storage Ceph)
>
> The default osd_memory_target is 4GB, I like to provision 2x that for peak usage, so 48GB of RAM for OSDs + an allowance for OS & anything else running on the node.
>
> > - 4 network interfaces each with Speed: 25000Mb/s configured in 1 bonding for Ceph communication and 1 bonding for all other communication
> > - Installed OS RHEL 8.6
> >
> >
> >> More RAM and cores than you need, but you probably know that. I’m guessing that this is repurposed hardware?
> >
> > This is a temporary demo HW which we are using to learn what is Ceph and how to configure it and if will suite for our purposes. After the demo period is over we will return it to HP.
> >
> >> You call it DAS but don’t say anything about the interface. SAS? SATA? AoE? NVMe? FC?
> >
> > Sorry for not being clear here. The DAS storage is 'HPE Synergy D3940 12Gb SAS CTO Drive Enclosure (2.5in) Drive Bays with 2 I/O adapters'
> > The drives inside the DAS are 'HPE 1.6TB SAS 12G Mixed Use SFF SC Multi Vendor SSD (Inside D3940 DAS storage)'.
> > They are SAS SSD drives.
> >
> >> Why not try Quincy?
> >
> > Great suggestion indeed, we want to use latest stable Ceph release.
> > I was looking here https://docs.ceph.com/en/quincy/releases/index.html
> > and I saw only 'Pacific' and 'Octopus', so this is the reason I overlooked Quincy.
>
> That is being fixed right now actually.
>
> >
> >> I don’t think Ceph has any concept of multipathing. Since Ceph keeps multiple copies of data, it’s usually overkill.
> >
> > I suspected the same, that Ceph doesn't has any concept for multipathing.
> > Shall I configure the DAS storage drives with RHEL 8.6 multipath and present to Ceph the mod0 (multipath device 0)?
>
> My only experience with multipathing was with FC in a Sun lab years ago, so I can’t answer that confidently, but I can say that you’d probably either want to do that, or I’d configure without multipathing if you can. KISS principle.
>
> >
> >> Swap is an anachronism from decades ago, suggest you not waste any of your small boot drives on it.
> >> Also, why carve up the boot drives into so many, small partitions?
> >>
> >> I would suggest instead something like
> >>
> >> |-sdn1 8:209 0 953M 0 part /boot/efi
> >> |-sdn2 8:210 0 953M 0 part /boot
> >> |-sdn3 8:211 0 264.5G 0 part
> >> | |-rhel_compute1-root 253:0 0 grow 0 lvm /
> >>
> >> period. If that makes you nervous, partition 40GB for / and make the rest /home, with a bind mount of /home/varlog onto /var/log.
> >
> > Thank you, I will reconsider my partition layout again and will take it into account for further improvement.
>
> There’s a lot of personal preference in partitioning, sysadmins are a strongly opinionated breed. The practice of having a small / dates back to the days when drives were things like Fujitsu Eagles, SMD with 14” platters. RAID wasn’t a thing yet, nor were widespread journaling filesystems, so the small / reduced the probability that a medium error would land in the most vital partition. With modern drives, and especially with RAID, the need for this is no longer there. ISTR HPUX for example decades ago defaulting to one drive-spanning filesystem, with maybe some swap carved out.
>
> A common reason for a separate /var or /var/log is to fence the potential for runaway logs. With modern logrotate and drive sizes, my personal preference is that runaways are rare and I’d rather fix the problem than address the symptom by partitioning, especially since smaller partitions actually increase the likelihood of filling up.
>
> >
> > Regards,
> > Kosta
> >
> > On Saturday, 21 May 2022 10:33:37 am (+03:00), Anthony D'Atri wrote:
> >
> >> inline.
> >>
> >>>
> >>> Hello ceph-users,
> >>>
> >>> Recently I started preparing 3-node Ceph cluster (on bare metal hardware)
> >>> We have the HW configuration ready - 3 servers HPE Synergy 480 Gen10
> >>> Compute Module, each server with 2xCPUs Intel Xeon-Gold 6252
> >>> (2.1GHz/24-core), 192GB RAM
> >>
> >> More RAM and cores than you need, but you probably know that. I’m guessing that this is repurposed hardware?
> >>
> >>> 2x300GB HDD for OS RHEL 8.6 (already
> >>> installed) and we have DAS (direct-attached-storage) with 18 x 1.6TB SSD
> >>> drives inside. I attached 6 x1.6TB SSD from the DAS to each of the 3
> >>> servers (as JBOD).
> >>
> >> You call it DAS but don’t say anything about the interface. SAS? SATA? AoE? NVMe? FC?
> >>
> >>
> >>> Now I can see these 6 SSDs as 12 devices because the DAS storage has two
> >>> paths for redundancy to each of the disks (sda, sdb, sdc, sdd, sde, sdf,
> >>> sdg, sdh, sdi, sdj, sdk, sdl).
> >>> I'm not sure how to handle the DAS storage multipath properly and according
> >>> to best practices.
> >>> For installation I will use cephadm with latest Ceph release Pacific 16.2.7
> >>
> >> Why not try Quincy?
> >>
> >>> My question is shall I configure multipath from RHEL 8.6 OS in advance (for
> >>> example sda+sdbb=md0) or I should leave cephadm to handle the multipath by
> >>> itself?
> >>
> >> I don’t think Ceph has any concept of multipathing. Since Ceph keeps multiple copies of data, it’s usually overkill.
> >>
> >>>
> >>> | |-rhel_compute1-root 253:0 0 18.6G 0 lvm /
> >>> | |-rhel_compute1-var_log 253:2 0 9.3G 0 lvm /var/log
> >>> | |-rhel_compute1-var_tmp 253:3 0 4.7G 0 lvm /var/tmp
> >>> | |-rhel_compute1-tmp 253:4 0 4.7G 0 lvm /tmp
> >>> | |-rhel_compute1-var 253:5 0 37.3G 0 lvm /var
> >>> | |-rhel_compute1-opt 253:6 0 37.3G 0 lvm /opt
> >>> | |-rhel_compute1-aux1 253:7 0 107.1G 0 lvm /aux1
> >>> | |-rhel_compute1-home 253:8 0 20.5G 0 lvm /home
> >>> | `-rhel_compute1-aux0 253:9 0 25.2G 0 lvm /aux0
> >>> |-sdn4 8:212 0 7.5G 0 part [SWAP]
> >>> `-sdn5 8:213 0 4.7G 0 part /var/log/audit
> >>> sdo 8:224 0 1.5T 0 disk
> >>> `-sdo1 8:225 0 1.5T 0 part
> >>> `-rhel_local_VG-localstor 253:1 0 1.5T 0 lvm /localstor
> >>>
> >>>
> >>
> >> Swap is an anachronism from decades ago, suggest you not waste any of your small boot drives on it.
> >> Also, why carve up the boot drives into so many, small partitions?
> >>
> >> I would suggest instead something like
> >>
> >> |-sdn1 8:209 0 953M 0 part /boot/efi
> >> |-sdn2 8:210 0 953M 0 part /boot
> >> |-sdn3 8:211 0 264.5G 0 part
> >> | |-rhel_compute1-root 253:0 0 grow 0 lvm /
> >>
> >> period. If that makes you nervous, partition 40GB for / and make the rest /home, with a bind mount of /home/varlog onto /var/log.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> > --
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
-- 
Sent with Vivaldi Mail. Download Vivaldi for free at vivaldi.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



