On Mon, Feb 26, 2018 at 2:23 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
Quick turnaround: changing/injecting osd_recovery_sleep_hdd into the running SSD OSDs on bluestore opened the floodgates.
Oh right, the OSD doesn't (think it can) have anything it can really do if you've got a rotational journal and an SSD main device, and since BlueStore was misreporting itself as having a rotational journal, the OSD falls back to the hard drive settings. Sorry I didn't work through that ahead of time; glad this works around it for you!
-Greg
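As a side note, a minimal sketch of that workaround, assuming the affected OSDs are all SSD-backed BlueStore OSDs (the OSD id and values below are illustrative, not commands copied from this thread). The sleeps can be injected into running daemons, and set in ceph.conf for future conversions:

  # inject at runtime; only do this cluster-wide if you have no real hybrid/HDD
  # OSDs you want throttled, otherwise target the affected OSDs individually
  ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_hybrid 0'
  ceph tell osd.24 injectargs '--osd_recovery_sleep_hybrid 0'

  # or per daemon via the admin socket on the OSD host
  ceph daemon osd.24 config set osd_recovery_sleep_hybrid 0

  # persist in ceph.conf (global [osd] section affects every OSD, including
  # real HDDs; scope it per daemon, e.g. [osd.24], if that matters)
  [osd]
  osd_recovery_sleep_hdd = 0
  osd_recovery_sleep_hybrid = 0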
pool objects-ssd id 20
  recovery io 1512 MB/s, 21547 objects/s

pool fs-metadata-ssd id 16
  recovery io 0 B/s, 6494 keys/s, 271 objects/s
  client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr

Graph of performance jump. Extremely marked. So at least we now have the gun to go with the smoke.

Thanks for the help, and I appreciate you pointing me in some directions that I was able to use to figure out the issue. Adding this to ceph.conf for future OSD conversions.

Thanks,
Reed

On Feb 26, 2018, at 4:12 PM, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

For the record, I am not seeing a demonstrable fix by injecting the value of 0 into the running OSDs:

osd_recovery_sleep_hybrid = '0.000000' (not observed, change may require restart)

If it does indeed require a restart, I will need to wait for the current backfills to finish, as restarting an OSD would bring me under min_size. However, doing config show on the OSD daemon shows it appears to have taken the value of 0:

ceph daemon osd.24 config show | grep recovery_sleep
    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.100000",
    "osd_recovery_sleep_hybrid": "0.000000",
    "osd_recovery_sleep_ssd": "0.000000",

I may take the restart as an opportunity to also move to 12.2.3 at the same time, since that is not expected to affect this issue. I could also attempt to change osd_recovery_sleep_hdd as well; since these are SSD OSDs it shouldn't make a difference, but it's a free move.

Thanks,
Reed

On Feb 26, 2018, at 3:42 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Mon, Feb 26, 2018 at 12:26 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim solution to getting the metadata configured correctly.

Yes, that's a good workaround as long as you don't have any actual hybrid OSDs (or aren't worried about them sleeping... I'm not sure if that setting came from experience or not).

For reference, here is the complete metadata for osd.24, a bluestore SATA SSD with NVMe block.db:

{
    "id": 24,
    "arch": "x86_64",
    "back_addr": "",
    "back_iface": "bond0",
    "bluefs": "1",
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "259:0",
    "bluefs_db_dev_node": "nvme0n1",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_db_partition_path": "/dev/nvme0n1p4",
    "bluefs_db_rotational": "0",
    "bluefs_db_serial": " ",
    "bluefs_db_size": "16000221184",
    "bluefs_db_type": "nvme",
    "bluefs_single_shared_device": "0",
    "bluefs_slow_access_mode": "blk",
    "bluefs_slow_block_size": "4096",
    "bluefs_slow_dev": "253:8",
    "bluefs_slow_dev_node": "dm-8",
    "bluefs_slow_driver": "KernelDevice",
    "bluefs_slow_model": "",
    "bluefs_slow_partition_path": "/dev/dm-8",
    "bluefs_slow_rotational": "0",
    "bluefs_slow_size": "1920378863616",
    "bluefs_slow_type": "ssd",
    "bluestore_bdev_access_mode": "blk",
    "bluestore_bdev_block_size": "4096",
    "bluestore_bdev_dev": "253:8",
    "bluestore_bdev_dev_node": "dm-8",
    "bluestore_bdev_driver": "KernelDevice",
    "bluestore_bdev_model": "",
    "bluestore_bdev_partition_path": "/dev/dm-8",
    "bluestore_bdev_rotational": "0",
    "bluestore_bdev_size": "1920378863616",
    "bluestore_bdev_type": "ssd",
    "ceph_version": "ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
    "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
    "default_device_class": "ssd",
    "distro": "ubuntu",
    "distro_description": "Ubuntu 16.04.3 LTS",
    "distro_version": "16.04",
    "front_addr": "",
    "front_iface": "bond0",
    "hb_back_addr": "",
    "hb_front_addr": "",
    "hostname": "host00",
    "journal_rotational": "1",
    "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 2018",
    "kernel_version": "4.13.0-26-generic",
    "mem_swap_kb": "124999672",
    "mem_total_kb": "131914008",
    "os": "Linux",
    "osd_data": "/var/lib/ceph/osd/ceph-24",
    "osd_objectstore": "bluestore",
    "rotational": "0"
}

So it looks like it correctly guessed(?) the bluestore_bdev_type/default_device_class (though that may have been an inherited value?), and bluefs_db_type was correctly set to nvme. So I'm not sure why journal_rotational is still showing 1. Maybe something in the ceph-volume lvm piece isn't correctly setting that flag on OSD creation?

It also seems like the journal_rotational field should have been deprecated in bluestore, since bluefs_db_rotational should cover that, and if there were a WAL partition as well I assume there would be something to the tune of bluefs_wal_rotational, and journal would never be used for bluestore?

Thanks to both of you for helping diagnose this issue. I created a ticket and have a PR up to fix it: http://tracker.ceph.com/issues/23141, https://github.com/ceph/ceph/pull/20602

Until that gets backported into another Luminous release you'll need to do some kind of workaround though. :/
-Greg

Appreciate the help.

Thanks,
Reed

On Feb 26, 2018, at 1:28 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Mon, Feb 26, 2018 at 11:21 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

The 'good perf' that I reported below was the result of beginning 5 new bluestore conversions, which produces a leading edge of 'good' performance before trickling off. This performance lasted about 20 minutes, during which it backfilled a small set of PGs off of non-bluestore OSDs. Current performance is now hovering around:

pool objects-ssd id 20
  recovery io 14285 kB/s, 202 objects/s

pool fs-metadata-ssd id 16
  recovery io 0 B/s, 262 keys/s, 12 objects/s
  client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr

What are you referencing when you talk about recovery ops per second?

These are recovery ops as reported by ceph -s, or via stats exported via the influx plugin in mgr, and via local collectd collection.

Also, what are the values for osd_recovery_sleep_hdd and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that your BlueStore SSD OSDs are correctly reporting both themselves and their journals as non-rotational?

This yields more interesting results. Pasting results for 3 sets of OSDs, in this order:
{0}  hdd + nvme block.db
{24} ssd + nvme block.db
{59} ssd + nvme journal

ceph osd metadata | grep 'id\|rotational'

    "id": 0,
    "bluefs_db_rotational": "0",
    "bluefs_slow_rotational": "1",
    "bluestore_bdev_rotational": "1",
    "journal_rotational": "1",
    "rotational": "1"

    "id": 24,
    "bluefs_db_rotational": "0",
    "bluefs_slow_rotational": "0",
    "bluestore_bdev_rotational": "0",
    "journal_rotational": "1",
    "rotational": "0"

    "id": 59,
    "journal_rotational": "0",
    "rotational": "0"

I wonder if it matters/is correct to see "journal_rotational": "1" for the bluestore OSDs {0,24} with nvme block.db. Hope this may be helpful in determining the root cause.

If you have an SSD main store and a hard drive ("rotational") journal, the OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid config option. By default that is .025 (seconds). I believe you can override the setting (I'm not sure how), but you really want to correct that flag at the OS layer. Generally when we see this there's a RAID card or something between the solid-state device and the host which is lying about the state of the world.
-Greg
If it helps, all of the OSDs were originally deployed with ceph-deploy, but are now being redone with ceph-volume locally on each host.

Thanks,
Reed

On Feb 26, 2018, at 1:00 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Mon, Feb 26, 2018 at 9:12 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

After my last round of backfills completed, I started 5 more bluestore conversions, which helped me recognize a very specific pattern of performance.

pool objects-ssd id 20
  recovery io 757 MB/s, 10845 objects/s

pool fs-metadata-ssd id 16
  recovery io 0 B/s, 36265 keys/s, 1633 objects/s
  client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr

The "non-throttled" backfills are only coming from filestore SSD OSDs. When backfilling from bluestore SSD OSDs, they appear to be throttled at the aforementioned <20 ops per OSD.

Wait, is that the current state? What are you referencing when you talk about recovery ops per second?

Also, what are the values for osd_recovery_sleep_hdd and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that your BlueStore SSD OSDs are correctly reporting both themselves and their journals as non-rotational?
-Greg

This would corroborate why the first batch of SSDs I migrated to bluestore were all at "full" speed, as all of the OSDs they were backfilling from were filestore based; with increasingly many bluestore backfill targets, backfill times get increasingly long as I move from one host to the next.

Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd values across bluestore and filestore OSDs are showing as 0, which means no sleep/throttle if I am reading everything correctly.

sudo ceph daemon osd.73 config show | grep recovery
    "osd_allow_recovery_below_min_size": "true",
    "osd_debug_skip_full_check_in_recovery": "false",
    "osd_force_recovery_pg_log_entries_factor": "1.300000",
    "osd_min_recovery_priority": "0",
    "osd_recovery_cost": "20971520",
    "osd_recovery_delay_start": "0.000000",
    "osd_recovery_forget_lost_objects": "false",
    "osd_recovery_max_active": "35",
    "osd_recovery_max_chunk": "8388608",
    "osd_recovery_max_omap_entries_per_chunk": "64000",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_op_priority": "3",
    "osd_recovery_op_warn_multiple": "16",
    "osd_recovery_priority": "5",
    "osd_recovery_retry_interval": "30.000000",
    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.100000",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.000000",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_thread_timeout": "30",
    "osd_scrub_during_recovery": "false",

As far as I know the device class is configured correctly; it all shows as ssd/hdd correctly in ceph osd tree. So hopefully this may be enough of a smoking gun to help narrow down where this may be stemming from.

Thanks,
Reed
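As an aside, rather than grepping the full dump, the per-OSD flags Greg asked about can be scanned across every OSD at once with jq, assuming jq is installed; the field selection here is only an example (filestore OSDs will show null for the bluestore-specific fields):

  ceph osd metadata | jq '.[] | {id, osd_objectstore, rotational, journal_rotational, bluestore_bdev_rotational, bluefs_db_rotational}'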
On Feb 23, 2018, at 10:04 AM, David Turner <drakonstein@xxxxxxxxx> wrote:

Here is a [1] link to a ML thread tracking some slow backfilling on bluestore. It came down to the backfill sleep setting for them. Maybe it will help.

On Fri, Feb 23, 2018 at 10:46 AM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

Probably unrelated, but I do keep seeing this odd negative objects degraded message on the fs-metadata pool:

pool fs-metadata-ssd id 16
  -34/3 objects degraded (-1133.333%)
  recovery io 0 B/s, 89 keys/s, 2 objects/s
  client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don't mean to clutter the ML/thread, but it did seem odd; maybe it's a culprit? Or maybe it's some weird sampling interval issue that's been solved in 12.2.3?

Thanks,
Reed

On Feb 23, 2018, at 8:26 AM, Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

Below is ceph -s:

  cluster:
    id:     {id}
    health: HEALTH_WARN
            noout flag(s) set
            260610/1068004947 objects misplaced (0.024%)
            Degraded data redundancy: 23157232/1068004947 objects degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized

  services:
    mon: 3 daemons, quorum mon02,mon01,mon03
    mgr: mon03(active), standbys: mon02
    mds: cephfs-1/1/1 up {0=mon03=up:active}, 1 up:standby
    osd: 74 osds: 74 up, 74 in; 332 remapped pgs
         flags noout

  data:
    pools:   5 pools, 5316 pgs
    objects: 339M objects, 46627 GB
    usage:   154 TB used, 108 TB / 262 TB avail
    pgs:     23157232/1068004947 objects degraded (2.168%)
             260610/1068004947 objects misplaced (0.024%)
             4984 active+clean
             183  active+undersized+degraded+remapped+backfilling
             145  active+undersized+degraded+remapped+backfill_wait
             3    active+remapped+backfill_wait
             1    active+remapped+backfilling

  io:
    client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
    recovery: 37057 kB/s, 50 keys/s, 217 objects/s

Also, the two pools on the SSDs are the objects pool at 4096 PGs and the fs-metadata pool at 32 PGs.

Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?

The objects should not vary wildly in size. Even if they were differing in size, the SSDs are roughly idle in their current state of backfilling when examining wait in iotop, atop, or sysstat/iostat. This compares to when I was fully saturating the SATA backplane with over 1000 MB/s of writes to multiple disks when the backfills were going "full speed."

Here is a breakdown of recovery io by pool:

pool objects-ssd id 20
  recovery io 6779 kB/s, 92 objects/s
  client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr

pool fs-metadata-ssd id 16
  recovery io 0 B/s, 28 keys/s, 2 objects/s
  client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr

pool cephfs-hdd id 17
  recovery io 40542 kB/s, 158 objects/s
  client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr

So the 24 HDDs are outperforming the 50 SSDs for recovery and client traffic at the moment, which seems conspicuous to me. Most of the OSDs with recovery ops to the SSDs are reporting 8-12 ops, with one OSD occasionally spiking up to 300-500 for a few minutes. Stats are being pulled by both local collectd instances on each node and the influx plugin in mgr, as we evaluate that against collectd.

Thanks,
Reed

On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

What's the output of "ceph -s" while this is happening? Is there some identifiable difference between these two states, like you get a lot of throughput on the data pools but then metadata recovery is slower? Are you sure the recovery is actually going slower, or are the individual ops larger or more expensive?

My WAG is that recovering the metadata pool, composed mostly of directories stored in omap objects, is going much slower for some reason. You can adjust the cost of those individual ops some by changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure which way you want to go, or indeed if this has anything to do with the problem you're seeing. (E.g., it could be that reading out the omaps is expensive, so you can get higher recovery op numbers by turning down the number of entries per request, but not actually see faster backfilling because you have to issue more requests.)
-Greg
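If anyone wants to experiment with that knob, it can be injected into running OSDs the same way as the sleep settings; the value below is purely illustrative, and the per-pool recovery rates can be compared before and after:

  # example value only: move fewer omap entries per recovery request
  ceph tell osd.* injectargs '--osd_recovery_max_omap_entries_per_chunk 4096'

  # watch whether keys/s and objects/s on the metadata pool change
  ceph osd pool stats fs-metadata-ssd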
On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

Hi all,

I am running into an odd situation that I cannot easily explain. I am currently in the midst of destroying and rebuilding OSDs from filestore to bluestore. With my HDDs I am seeing expected behavior, but with my SSDs I am seeing unexpected behavior. The HDDs and SSDs are set in crush accordingly.

My path to replacing the OSDs is to set the noout, norecover, and norebalance flags, destroy the OSD, create the OSD back (iterate n times, all within a single failure domain), unset the flags, and let it go. It finishes, rinse, repeat.

The SSD OSDs are SATA SSDs (Samsung SM863a), 10 to a node, with 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, and 16G partitions for block.db (previously filestore journals). There is 2x10GbE networking between the nodes; the SATA backplane caps out at around 10 Gb/s as it is 2x 6 Gb/s controllers. Luminous 12.2.2.

When the flags are unset, recovery starts and I see a very large rush of traffic; however, after the first machine completed, the performance tapered off at a rapid pace and now trickles. Comparatively, I'm getting 100-200 recovery ops on 3 HDDs backfilling from 21 other HDDs, whereas I'm getting 150-250 recovery ops on 5 SSDs backfilling from 40 other SSDs. Every once in a while I will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few hundred recovery ops from one OSD and 8-15 ops from the others that are backfilling.

This is a far cry from the more than 15-30k recovery ops that it started off recovering with, with 1-3k recovery ops from a single OSD to the backfilling OSD(s). And it is an even farther cry from the >15k recovery ops I was sustaining for over an hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little under an hour, and I could do about 5 at a time and still keep it at roughly an hour to backfill all of them, but then I hit a roadblock after the first machine, when I tried to do 10 at a time (a single machine). I am now still experiencing the same thing on the third node, while doing 5 OSDs at a time.

The pools associated with these SSDs are cephfs-metadata, as well as a pure rados object pool we use for our own internal applications. Both are size=3, min_size=2.

It appears I am not the first to run into this, but it looks like there was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html

Recovery parameters for the OSDs match what was in the previous thread, sans the osd conf block listed, and currently osd_max_backfills = 30 and osd_recovery_max_active = 35. There is very little activity on the OSDs during this period, so there should not be any contention for iops on the SSDs.

The only oddity that I can attribute to things is that we had a few periods of time where the disk load on one of the mons was high enough to cause the mon to drop out of quorum for a brief amount of time, a few times. But I wouldn't think backfills would just get throttled due to mons flapping.
Hopefully someone has some experience with this, or can steer me in a direction to improve the performance of the backfills, so that I'm not stuck in backfill purgatory longer than I need to be.

Linking an imgur album with some screen grabs of the recovery ops over time for the first machine versus the second and third machines, to demonstrate the delta between them. Also including a ceph osd df of the SSDs; highlighted in red are the OSDs currently backfilling.

Could this possibly be PG overdose? I don't ever run into 'stuck activating' PGs; it's just the painfully slow backfills, like they are being throttled by ceph, that are causing me to worry. Drives aren't worn (<30 P/E cycles), so plenty of life left in them.

Thanks,
Reed

$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
24   ssd 1.76109  1.00000 1803G 1094G  708G 60.69 1.08 260
25   ssd 1.76109  1.00000 1803G 1136G  667G 63.01 1.12 271
26   ssd 1.76109  1.00000 1803G 1018G  785G 56.46 1.01 243
27   ssd 1.76109  1.00000 1803G 1065G  737G 59.10 1.05 253
28   ssd 1.76109  1.00000 1803G 1026G  776G 56.94 1.02 245
29   ssd 1.76109  1.00000 1803G 1132G  671G 62.79 1.12 270
30   ssd 1.76109  1.00000 1803G  944G  859G 52.35 0.93 224
31   ssd 1.76109  1.00000 1803G 1061G  742G 58.85 1.05 252
32   ssd 1.76109  1.00000 1803G 1003G  799G 55.67 0.99 239
33   ssd 1.76109  1.00000 1803G 1049G  753G 58.20 1.04 250
34   ssd 1.76109  1.00000 1803G 1086G  717G 60.23 1.07 257
35   ssd 1.76109  1.00000 1803G  978G  824G 54.26 0.97 232
36   ssd 1.76109  1.00000 1803G 1057G  745G 58.64 1.05 252
37   ssd 1.76109  1.00000 1803G 1025G  777G 56.88 1.01 244
38   ssd 1.76109  1.00000 1803G 1047G  756G 58.06 1.04 250
39   ssd 1.76109  1.00000 1803G 1031G  771G 57.20 1.02 246
40   ssd 1.76109  1.00000 1803G 1029G  774G 57.07 1.02 245
41   ssd 1.76109  1.00000 1803G 1033G  770G 57.28 1.02 245
42   ssd 1.76109  1.00000 1803G  993G  809G 55.10 0.98 236
43   ssd 1.76109  1.00000 1803G 1072G  731G 59.45 1.06 256
44   ssd 1.76109  1.00000 1803G 1039G  763G 57.64 1.03 248
45   ssd 1.76109  1.00000 1803G  992G  810G 55.06 0.98 236
46   ssd 1.76109  1.00000 1803G 1068G  735G 59.23 1.06 254
47   ssd 1.76109  1.00000 1803G 1020G  783G 56.57 1.01 242
48   ssd 1.76109  1.00000 1803G  945G  857G 52.44 0.94 225
49   ssd 1.76109  1.00000 1803G  649G 1154G 36.01 0.64 139
50   ssd 1.76109  1.00000 1803G  426G 1377G 23.64 0.42  83
51   ssd 1.76109  1.00000 1803G  610G 1193G 33.84 0.60 131
52   ssd 1.76109  1.00000 1803G  558G 1244G 30.98 0.55 118
53   ssd 1.76109  1.00000 1803G  731G 1072G 40.54 0.72 161
54   ssd 1.74599  1.00000 1787G  859G  928G 48.06 0.86 229
55   ssd 1.74599  1.00000 1787G  942G  844G 52.74 0.94 252
56   ssd 1.74599  1.00000 1787G  928G  859G 51.94 0.93 246
57   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
58   ssd 1.74599  1.00000 1787G  963G  824G 53.87 0.96 255
59   ssd 1.74599  1.00000 1787G  909G  877G 50.89 0.91 241
60   ssd 1.74599  1.00000 1787G 1039G  748G 58.15 1.04 277
61   ssd 1.74599  1.00000 1787G  892G  895G 49.91 0.89 238
62   ssd 1.74599  1.00000 1787G  927G  859G 51.90 0.93 245
63   ssd 1.74599  1.00000 1787G  864G  922G 48.39 0.86 229
64   ssd 1.74599  1.00000 1787G  968G  819G 54.16 0.97 257
65   ssd 1.74599  1.00000 1787G  892G  894G 49.93 0.89 237
66   ssd 1.74599  1.00000 1787G  951G  836G 53.23 0.95 252
67   ssd 1.74599  1.00000 1787G  878G  908G 49.16 0.88 232
68   ssd 1.74599  1.00000 1787G  899G  888G 50.29 0.90 238
69   ssd 1.74599  1.00000 1787G  948G  839G 53.04 0.95 252
70   ssd 1.74599  1.00000 1787G  914G  873G 51.15 0.91 246
71   ssd 1.74599  1.00000 1787G 1004G  782G 56.21 1.00 266
72   ssd 1.74599  1.00000 1787G  812G  974G 45.47 0.81 216
73   ssd 1.74599  1.00000 1787G  932G  855G 52.15 0.93 247
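For readers trying to follow the conversion workflow described above, here is a rough sketch of one destroy-and-rebuild cycle, assuming the ceph-volume lvm workflow on Luminous; the OSD id and device paths are illustrative only:

  # keep the cluster from shuffling data mid-rebuild
  ceph osd set noout
  ceph osd set norecover
  ceph osd set norebalance

  # destroy the filestore OSD but keep its id, then recreate it as bluestore
  # (repeat for each OSD in the failure domain; id/devices are examples)
  systemctl stop ceph-osd@24
  ceph osd destroy 24 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdb
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p4 --osd-id 24

  # unset the flags and let backfill run
  ceph osd unset noout
  ceph osd unset norecover
  ceph osd unset norebalance

Note that --osd-id is only honored if the installed ceph-volume supports it; otherwise the destroyed id is generally reused as the lowest available id.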
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com