Hi all,

I am running into an odd situation that I cannot easily explain. I am currently in the midst of destroying and rebuilding OSDs to move from filestore to bluestore. With my HDDs I am seeing expected behavior, but with my SSDs I am seeing unexpected behavior. The HDDs and SSDs are set in crush accordingly.

My path to replacing the OSDs is to set the noout, norecover, and norebalance flags, destroy the OSD, create the OSD back (iterate n times, all within a single failure domain), unset the flags, and let it go. It finishes, rinse, repeat.

The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, and 16G partitions for block.db (previously filestore journals). 2x10GbE networking between the nodes. The SATA backplane caps out at around 10 Gb/s, as it's 2x 6 Gb/s controllers. Luminous 12.2.2.

When the flags are unset, recovery starts and I see a very large rush of traffic. However, after the first machine completed, the performance tapered off at a rapid pace and now trickles. Comparatively, I’m getting 100-200 recovery ops on 3 HDDs, backfilling from 21 other HDDs, whereas I’m getting 150-250 recovery ops on 5 SSDs, backfilling from 40 other SSDs. Every once in a while I will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few hundred recovery ops from one OSD, and 8-15 ops from the others that are backfilling.

This is a far cry from the 15-30k recovery ops that it started off with, at 1-3k recovery ops from a single OSD to the backfilling OSD(s). And an even farther cry from the >15k recovery ops I was sustaining for an hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a time and still keep it at roughly an hour to backfill all of them, but then I hit a roadblock after the first machine, when I tried to do 10 at a time (single machine).
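
For reference, the replacement loop described above can be sketched roughly as below. This is only an illustration of the order of operations, not the exact commands that were run: the device paths and OSD ids are hypothetical, unmounting and wiping the old filestore device is omitted, and the `--osd-id` option of `ceph-volume lvm create` is assumed to be available in the installed release. The function only builds the command list and does not execute anything.

```python
from typing import Dict, List

# Flags held for the duration of the destroy/create loop.
FLAGS = ["noout", "norecover", "norebalance"]


def replacement_plan(osds: List[Dict[str, str]]) -> List[List[str]]:
    """Build the ordered commands for rebuilding OSDs in one failure domain.

    Each entry of `osds` is a hypothetical dict with the keys
    'id' (OSD id), 'data' (SATA SSD device) and 'db' (NVMe partition
    to use for block.db).
    """
    # 1. Set the flags so nothing moves while OSDs are being rebuilt.
    plan = [["ceph", "osd", "set", flag] for flag in FLAGS]

    # 2. Destroy and recreate each OSD, reusing its id.
    for osd in osds:
        osd_id = str(osd["id"])
        plan.append(["systemctl", "stop", f"ceph-osd@{osd_id}"])
        plan.append(["ceph", "osd", "destroy", osd_id, "--yes-i-really-mean-it"])
        # Assumption: this ceph-volume accepts --osd-id to reuse the id.
        plan.append([
            "ceph-volume", "lvm", "create", "--bluestore",
            "--data", osd["data"],
            "--block.db", osd["db"],
            "--osd-id", osd_id,
        ])

    # 3. Unset the flags and let backfill run.
    plan += [["ceph", "osd", "unset", flag] for flag in FLAGS]
    return plan


if __name__ == "__main__":
    # Hypothetical ids and devices, for illustration only.
    example = [
        {"id": "24", "data": "/dev/sdb", "db": "/dev/nvme0n1p1"},
        {"id": "25", "data": "/dev/sdc", "db": "/dev/nvme0n1p2"},
    ]
    for cmd in replacement_plan(example):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to check that every OSD in a batch sits in the same failure domain before anything is actually destroyed.
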
I am now still experiencing the same thing on the third node, while doing 5 OSDs at a time.

The pools associated with these SSDs are cephfs-metadata, as well as a pure rados object pool we use for our own internal applications. Both are size=3, min_size=2.

It appears I am not the first to run into this, but it looks like there was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html

Recovery parameters for the OSDs match what was in the previous thread, sans the osd conf block listed. Current values are osd_max_backfills = 30 and osd_recovery_max_active = 35. There is very little activity on the OSDs during this period, so there should not be any contention for iops on the SSDs.

The only oddity that I can attribute things to is that we had a few periods where the disk load on one of the mons was high enough to cause the mon to briefly drop out of quorum, a few times. But I wouldn’t think backfills would just get throttled due to mons flapping.

Hopefully someone has some experience with this or can steer me toward improving the performance of the backfills, so that I’m not stuck in backfill purgatory longer than I need to be.

I am linking an imgur album with some screen grabs of the recovery ops over time for the first machine versus the second and third machines, to demonstrate the delta between them. I am also including a ceph osd df of the SSDs; highlighted in red are the OSDs currently backfilling.

Could this possibly be PG overdose? I never run into ‘stuck activating’ PGs; it’s just the painfully slow backfills, as if they are being throttled by ceph, that are causing me to worry. The drives aren’t worn (<30 P/E cycles), so there is plenty of life left in them.

Thanks,
Reed