Re: Do not use SSDs with (small) SLC cache


 



A bit late to the game, but I'm not sure it is your drives. I had a very similar issue to yours on enterprise drives (not that that means much outside of support).

What I was seeing is that a rebuild would kick off, PGs would almost instantly become laggy, and then our clients (OpenStack RBD) would start getting hit by slow requests, since the OSD would start read-locking because of the expiring lease. This was felt by the clients and became a production issue that cost the company. I spent weeks trying to tune the mClock profile, including the overall IO cap, the rebuild cap, and the recovery cap (clients were unlimited the whole time). None of it really worked, so I switched to wpq. With that one configuration switch, all the problems went away, with no real impact on rebuild time.
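In case it helps anyone wanting to try the same thing: the switch itself is just configuration (commands from memory, so double-check against the docs for your release; osd.0, the profile value, and the rest below are only examples):

  # see what a given OSD is currently running with (osd.0 is an example)
  ceph config show osd.0 osd_op_queue
  ceph config show osd.0 osd_mclock_profile

  # the mClock profile knob I was fighting with (value is just an example)
  ceph config set osd osd_mclock_profile high_client_ops

  # the change that actually made the problem go away for us
  ceph config set osd osd_op_queue wpq

Note that osd_op_queue is only picked up when an OSD starts, so the OSDs have to be restarted (systemd units or the orchestrator, depending on how the cluster is deployed) for the scheduler change to take effect.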

To clarify, I don't really care about fast rebuilds as long as the rebuild time stays reasonable, but the client impact was just killing us.

I also could never get this to trip except during a recovery or rebuild. I could slam our cluster with 100k IOPS (random or sequential) from a bunch of different clients, which is about 50x our normal load (yeah, I know this cluster is massively overbuilt in terms of performance), and there were zero issues.
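For anyone wanting to generate that kind of client-side load for their own testing, fio's rbd engine is one easy way; a rough sketch, where the pool, image, and client names are placeholders:

  # random 4k reads against an RBD image via librbd (names are examples)
  fio --name=client-load --ioengine=rbd --clientname=admin --pool=volumes \
      --rbdname=bench-img --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
      --time_based --runtime=300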

For our use case we have flagged mClock as unstable. Since we want this to work (the concept is awesome), we will retest on 18.2.

________________________________
From: Michael Wodniok <wodniok@xxxxxxx>
Sent: Tuesday, February 21, 2023 12:53 AM
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject:  Do not use SSDs with (small) SLC cache

Hi all,

digging around to debug why our (small: 10 hosts / ~60 OSDs) cluster is so slow even while recovering, I found out that one of our key issues is some SSDs with SLC cache (in our case Samsung SSD 870 EVO), which we had recycled from other use cases in the hope of speeding up our mainly HDD-based cluster. We know it's a little bit random which objects end up accelerated when such SSDs are not used as a cache tier.

However, the opposite was the case. This type of SSD is only fast while operating within its SLC cache, which is only several gigabytes on a multi-TB SSD [1]. When doing a big write or a backfill onto these SSDs, we got really low IO rates (around 10 MB/s, even with 4M objects).
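You can see that cliff outside of Ceph as well: a long sequential write that blows through the cache shows it immediately, for example with fio against the raw device (careful, this is destructive; the device path and size below are just examples):

  # WARNING: writes directly to the device and destroys its contents
  fio --name=slc-cliff --filename=/dev/sdX --rw=write --bs=4M --direct=1 \
      --ioengine=libaio --iodepth=8 --size=300G

Throughput starts at the cached speed and then drops to the much lower sustained rate once the SLC cache is exhausted.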

But it got even worse. Disclaimer: this is my view as a user; maybe someone more technically involved can correct me. The cause seems to be the mClock scheduler, which measures the IOPS an OSD is able to do. As measured in the blog [2], this is usually a good thing, since the OSD is profiled and queuing is handled accordingly. In our case, however, osd_mclock_max_capacity_iops_ssd was very low for most of the corresponding OSDs, but not for all of them; I assume it depends on when the mClock scheduler happened to measure the IOPS capacity. That led to broken scheduling where backfills ran at low speed while the SSD itself showed nearly no disk utilization, because it was operating within its cache again and could have worked faster. The issue could be solved by switching back to the wpq scheduler for the affected OSDs; that scheduler seems to just queue up IOs without throttling them at a supposed maximum IOPS. Now we still see a bad IO situation because of the slow SSDs, but at least they are operating at their maximum (with typical settings like osd_recovery_max_active and osd_recovery_sleep* tuned).
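For reference, the capacity values the OSDs measure for themselves end up in the mon config database, so you can check what mClock thinks each OSD can do and correct or bypass the bogus ones. Roughly like this; the OSD id and the number are only examples:

  # list the measured capacity values
  ceph config dump | grep osd_mclock_max_capacity_iops

  # inspect / override a single suspicious OSD (osd.12 and 15000 are examples)
  ceph config get osd.12 osd_mclock_max_capacity_iops_ssd
  ceph config set osd.12 osd_mclock_max_capacity_iops_ssd 15000

  # or drop the bad value so it gets re-measured (as far as I understand
  # the startup benchmark) on the next OSD start
  ceph config rm osd.12 osd_mclock_max_capacity_iops_ssd

  # what we ended up doing: wpq just for the affected OSDs (restart required)
  ceph config set osd.12 osd_op_queue wpq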

We are going to replace these SSDs with more consistently performing ones (even if their peak performance is not as good).

I hope this may help somebody in the future who is stuck in a low-performance recovery.

Refs:

[1] https://www.tomshardware.com/reviews/samsung-870-evo-sata-ssd-review-the-best-just-got-better
[2] https://ceph.io/en/news/blog/2022/mclock-vs-wpq-testing-with-background-ops-part1/

Happy Storing!
Michael Wodniok

--

Michael Wodniok M.Sc.
WorNet AG
Bürgermeister-Graf-Ring 28
82538 Geretsried

Simply42 and SecuMail are trademarks of WorNet AG.
http://www.wor.net/

Commercial register: Amtsgericht München (HRB 129882)
Management board: Christian Eich
Chairman of the supervisory board: Dirk Steinkopf


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



