Re: New best practices for osds???

On 7/25/19 9:27 PM, Anthony D'Atri wrote:
We run a few hundred HDD OSDs for our backup cluster; we set up one RAID 0 per HDD in order to be able
to use the (battery-protected) write cache from the RAID controller. It really improves performance, for both
bluestore and filestore OSDs.
Having run something like 6000 HDD-based FileStore OSDs with colocated journals on RAID HBAs, I'd like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.

Details:

* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone; try as I might, I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the OS.  When the architecture was conceived, the HBAs in question didn't have JBOD/passthrough, though a firmware update shortly thereafter did bring that ability.  That the controller's write cache was only available for VDs wasn't known at the time.
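For anyone who hasn't set one of these up, the per-drive VD creation being described looks roughly like the sketch below, using Broadcom/LSI storcli.  Illustrative only: the storcli path, controller index, and enclosure:slot IDs are placeholders, and wb/ra/cached is just the policy combination typically used to get the FBWC involved.

#!/usr/bin/env python3
"""Sketch only: create one single-drive RAID0 VD per HDD with writeback cache
via Broadcom/LSI storcli.  The storcli path, controller index, and
enclosure:slot IDs are placeholders; adjust before running anything."""
import subprocess

STORCLI = "/opt/MegaRAID/storcli/storcli64"      # assumed install path
CONTROLLER = "/c0"                               # assumed controller index
DRIVES = [f"252:{slot}" for slot in range(12)]   # assumed enclosure 252, slots 0-11

for drive in DRIVES:
    # One RAID0 VD per HDD, writeback + readahead + cached I/O, i.e. the
    # configuration this thread is debating.
    cmd = [STORCLI, CONTROLLER, "add", "vd", "type=raid0",
           f"drives={drive}", "wb", "ra", "cached"]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)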

* My sense was that the FBWC (flash-backed write cache) did offer some throughput benefit for at least some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the presence and status of the BBU/supercap.

* The utility needed for that monitoring, when invoked with ostensibly innocuous parameters, would lock up the HBA for several seconds.
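For what it's worth, that sort of check looks something like the sketch below.  The storcli path and the "Optimal" state string are assumptions, and per the point above, even read-only queries can stall some firmware, so poll sparingly.

#!/usr/bin/env python3
"""Sketch only: poll BBU / CacheVault (supercap) health via storcli.  The
storcli path and the 'Optimal' state string are assumptions; some firmware
stalls even on read-only queries, so poll infrequently."""
import subprocess

STORCLI = "/opt/MegaRAID/storcli/storcli64"   # assumed install path

for target in ("/c0/bbu", "/c0/cv"):          # battery unit and CacheVault supercap
    out = subprocess.run([STORCLI, target, "show"],
                         capture_output=True, text=True).stdout
    if "Optimal" not in out:
        print(f"WARNING: {target} is not reporting an Optimal state")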

* Traditional BBUs are rated for a lifespan of *only* one year; FBWCs maybe for … three?  Significant cost to RMA or replace them:  time and karma wasted fighting with the system vendor CSO, engineer and remote-hands time to take the system down and swap.  And then the connectors for the supercap were touchy; 15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost … two or three hundred dollars more than an IT / JBOD equivalent.

* There was a little-known flaw in secondary firmware that caused FBWC / supercap modules to be falsely reported bad.  The system vendor acted like I was making this up and washed their hands of it, even when I provided them the HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when a system rebooted or lost power.  There was a field notice for this, which required harvesting serial numbers and checking each.  The affected range of serials was quite a bit larger than what the validation tool admitted.  I had to manage the replacement of 302+ of these in production use, each needing engineer time to manage Ceph, hands time to do the swap, and the hassle of RMA paperwork.

* There was a firmware / utility design flaw that caused the HDD's onboard volatile write cache to be silently turned on, despite an HBA config dump showing a setting that should have left it off.  Again, data was lost when a node crashed hard or lost power.

* There was another firmware flaw that prevented booting when pinned / preserved cache data remained after a reboot or power loss during which a drive had failed or been yanked.  The HBA's option ROM utility would block booting and wait for input on the console.  One could get in and tell it to discard that cache, but it would not actually do so, instead looping back to the same screen.  The only way to get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD deployment / removal / replacement.  A smartctl hack to access SMART attributes below the VD layer would work on some systems but not others.
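The hack in question, for the curious, is smartctl's megaraid passthrough; a rough sketch is below.  The block device name and DID range are assumptions, and on some HBA/firmware combinations it simply doesn't work.  As a bonus, the same passthrough is one way to see whether the drive's own volatile write cache is really off, per the cache bullet above.

#!/usr/bin/env python3
"""Sketch only: read SMART data below the VD layer with smartctl's
'-d megaraid,<DID>' passthrough.  The block device name and DID range are
assumptions; on some HBA/firmware combinations this does not work at all."""
import subprocess

BLOCKDEV = "/dev/sda"       # any block device exposed by the controller
DEVICE_IDS = range(12)      # MegaRAID device IDs (DIDs); assumed 0..11

for did in DEVICE_IDS:
    out = subprocess.run(
        ["smartctl", "-x", "-d", f"megaraid,{did}", BLOCKDEV],
        capture_output=True, text=True).stdout
    for line in out.splitlines():
        # Surface reallocations plus the drive's own write-cache state,
        # which is what silently got flipped on in the earlier bullet.
        if "Reallocated_Sector_Ct" in line or "Write cache" in line:
            print(f"DID {did}: {line.strip()}")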

* The HBA model in question would work normally with a certain CPU generation, but not with slightly newer servers with the next CPU generation.  On roughly one boot out of five they would negotiate PCIe gen3, which they weren't capable of handling properly, and would silently run at about 20% of normal speed.  Granted, this isn't necessarily specific to an IR HBA.
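One cheap mitigation is to at least make the negotiated link rate visible at boot rather than silent, e.g. by logging the sysfs link attributes for the HBA.  A sketch is below; the PCI address is a placeholder, find yours with lspci.

#!/usr/bin/env python3
"""Sketch only: log the negotiated PCIe link speed/width for the HBA so a bad
negotiation is visible instead of silent.  The PCI address is a placeholder."""
from pathlib import Path

HBA_BDF = "0000:03:00.0"    # assumed PCI address of the RAID HBA
dev = Path("/sys/bus/pci/devices") / HBA_BDF

for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(attr, "=", (dev / attr).read_text().strip())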



Add it all up, and my assertion is that the money, time, karma, and user impact you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for OSDs instead.


This is worse than I feared, but very much in the realm of concerns I had with using single-disk RAID0 setups.  Thank you very much for posting your experience!  My money would still be on using *high write endurance* NVMes for DB/WAL and whatever I could afford for block.  I still have vague hopes that in the long run we move away from the idea of distinct block/db/wal devices and toward pools of resources that the OSD makes its own decisions about.  I'd like to be able to hand the OSD a pile of hardware and say "go".  That might mean something like an internal caching scheme, but with slow eviction and initial placement hints (i.e. L0 SST files should nearly always end up on fast storage).
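For reference, that DB/WAL-on-NVMe layout with ceph-volume looks roughly like the sketch below.  The device names and the HDD-to-NVMe-partition mapping are made up for illustration; size the DB partitions for your own workload.

#!/usr/bin/env python3
"""Sketch only: HDD for block, a high write endurance NVMe partition for
DB/WAL, deployed via ceph-volume.  Device names and the HDD-to-partition
mapping are made up for illustration."""
import subprocess

# assumed mapping: one NVMe partition per HDD-backed OSD
OSD_LAYOUT = [
    ("/dev/sdb", "/dev/nvme0n1p1"),
    ("/dev/sdc", "/dev/nvme0n1p2"),
]

for data_dev, db_dev in OSD_LAYOUT:
    # With --block.db and no separate --block.wal, BlueStore keeps the WAL
    # on the DB device, which is the usual arrangement for this layout.
    subprocess.run(["ceph-volume", "lvm", "create", "--bluestore",
                    "--data", data_dev, "--block.db", db_dev], check=True)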


If it were structured like the PriorityCacheManager, we'd have SSTs for different column family prefixes (OMAP, onodes, etc.) competing with bluestore for fast BlueFS device storage at different priority levels (so, for example, onode L0 would be very high priority), with an independent LRU for each.  I'm hoping some of Igor's work on SST placement might help make this possible down the road.  On the other hand, maybe crimson, pmem, and cheap high-capacity flash will make all of that less necessary.  I guess we'll find out. :)
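To make the idea concrete, a toy sketch of that kind of arrangement follows.  This is not Ceph code, just an illustration of independent LRUs at different priorities competing for a fixed fast-device budget, with eviction draining the lowest priority first.

#!/usr/bin/env python3
"""Toy sketch, NOT Ceph code: consumers (e.g. onode L0 SSTs, OMAP SSTs)
compete for a fixed budget of fast-device bytes, each with its own priority
and its own LRU; eviction drains the lowest-priority tier first."""
from collections import OrderedDict

class Tier:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority
        self.lru = OrderedDict()                   # key -> size, oldest first

    def add(self, key, size):
        self.lru[key] = size
        self.lru.move_to_end(key)

    def used(self):
        return sum(self.lru.values())

    def evict_one(self):
        self.lru.popitem(last=False)               # drop least recently used

class FastDeviceBudget:
    def __init__(self, capacity, tiers):
        self.capacity = capacity
        self.tiers = sorted(tiers, key=lambda t: t.priority)   # lowest first

    def place(self, tier, key, size):
        tier.add(key, size)
        # Reclaim space from the lowest-priority non-empty tier until we fit.
        while sum(t.used() for t in self.tiers) > self.capacity:
            victim = next(t for t in self.tiers if t.lru)
            victim.evict_one()     # in reality: demote the SST to slow storage

onode_l0 = Tier("onode L0", priority=10)   # very high priority, per the text
omap     = Tier("omap",     priority=1)
budget = FastDeviceBudget(capacity=100, tiers=[omap, onode_l0])
budget.place(omap, "omap-sst-1", 60)
budget.place(onode_l0, "onode-sst-1", 70)  # forces eviction of the OMAP SST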


Thanks,

Mark






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com