Hello,

I just wanted to share that the following command also helped us move the slow used bytes back to the fast device (without using bluefs-bdev-expand), when several compactions couldn't:

$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source /var/lib/ceph/osd/ceph-${osd}/block --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db

slow_used_bytes is now back to 0 in the perf dump and the BLUEFS_SPILLOVER alert has cleared, but 'bluefs stats' doesn't quite agree:

$ ceph tell osd.451 bluefs stats
1 : device size 0x1effbfe000 : using 0x309600000(12 GiB)
2 : device size 0x746dfc00000 : using 0x3abd77d2000(3.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         22 MiB      0 B         0 B         0 B         3.9 MiB     1
WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB      2
DB          0 B         12 GiB      0 B         0 B         0 B         12 GiB      196
SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB     1
TOTAL       0 B         12 GiB      0 B         0 B         0 B         0 B         200
MAXIMUMS:
LOG         0 B         22 MiB      0 B         0 B         0 B         17 MiB
WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB
DB          0 B         24 GiB      0 B         0 B         0 B         24 GiB
SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB
TOTAL       0 B         24 GiB      0 B         0 B         0 B         0 B
>> SIZE <<  0 B         118 GiB     6.9 TiB

Any idea? Is this something to worry about?

Regards,
Frédéric.

----- On 16 Oct 23, at 14:46, Igor Fedotov igor.fedotov@xxxxxxxx wrote:

> Hi Chris,
>
> For the first question (osd.76) you might want to try ceph-volume's "lvm
> migrate --from data --target <db lvm>" command. It looks like some
> persistent DB remnants are still kept on the main device, causing the alert.
>
> W.r.t. osd.85's question - the line "SLOW   0 B   3.0 GiB   59 GiB" means
> that RocksDB higher-level data (usually L3+) is spread over the DB and
> main (aka slow) devices as 3 GB and 59 GB respectively.
>
> In other words, the SLOW row refers to DB data which is originally supposed
> to be on the SLOW device (due to RocksDB data mapping mechanics). But
> improved bluefs logic (introduced by
> https://github.com/ceph/ceph/pull/29687) permitted extra DB disk usage
> for a part of this data.
>
> Resizing the DB volume and a subsequent DB compaction should do the trick
> and move all the data to the DB device. Alternatively, ceph-volume's lvm
> migrate command should do the same, but the result will be rather temporary
> without a DB volume resize.
>
> Hope this helps.
>
> Thanks,
>
> Igor
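
In case it helps anyone following this thread, Igor's two suggestions translate into roughly the commands below. This is only a sketch: $cid, the osd id and the VG/LV names are placeholders, the OSD has to be stopped before running ceph-bluestore-tool or ceph-volume against it, and started again before the compaction.

Resize + compact:

$ lvextend -L <new-size>G <vg>/<db-lv>
$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-${osd}
$ ceph tell osd.${osd} compact

Migrate only (per Igor, temporary unless the DB volume is also resized):

$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-volume lvm migrate --osd-id ${osd} --osd-fsid <osd-fsid> --from data --target <vg>/<db-lv>
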
> On 06/10/2023 06:55, Chris Dunlop wrote:
>> Hi,
>>
>> tl;dr why are my osds still spilling?
>>
>> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
>> bluefs spillover warnings (due to the "fix spillover alert" item in the
>> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on
>> one of these (there are a few):
>>
>> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
>> 60 GiB) to slow device
>>
>> This is a 15T HDD with only a 60G SSD for the db, so it's not
>> surprising it spilled: that's way below the recommended db size of
>> 1-2% of the storage size for rbd usage.
>>
>> There was some spare space on the db ssd, so I increased the size of
>> the db LV to over 400G and did a bluefs-bdev-expand.
>>
>> However, days later, I'm still getting the spillover warning for that
>> osd, including after running a manual compact:
>>
>> # ceph tell osd.76 compact
>>
>> See attached perf-dump-76 for the perf dump output:
>>
>> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'
>>
>> In particular, if my understanding is correct, that's telling me the
>> db available size is 487G (i.e. the LV expand worked), of which it's
>> using 59G, and there's 128K spilled to the slow device:
>>
>>   "db_total_bytes": 512309059584,   # 487G
>>   "db_used_bytes": 63470305280,     # 59G
>>   "slow_used_bytes": 131072,        # 128K
>>
>> A "bluefs stats" also says the db is using 128K of slow storage
>> (although perhaps it's getting the info from the same place as the
>> perf dump?):
>>
>> # ceph tell osd.76 bluefs stats
>> 1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
>> 2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>> LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
>> WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
>> DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
>> TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
>> MAXIMUMS:
>> LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
>> WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
>> DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B
>> TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
>> >> SIZE <<  0 B         453 GiB     14 TiB
>>
>> I had a look at the "DUMPING STATS" output in the logs but I don't
>> know how to interpret it. I did try calculating the total of the sizes
>> on the "Sum" lines, but that comes to 100G, so I don't know what that
>> all means. See attached log-stats-76.
>>
>> I also tried "ceph-kvstore-tool bluestore-kv ... stats":
>>
>> $ {
>>   cephadm unit --fsid $clusterid --name osd.76 stop
>>   cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
>>   cephadm unit --fsid $clusterid --name osd.76 start
>> }
>>
>> Output attached as bluestore-kv-stats-76. I can't see anything
>> interesting in there, although again I don't really know how to
>> interpret it.
>>
>> So... why is this osd db still spilling onto slow storage, and how do
>> I fix things so it's no longer using the slow storage?
>>
>>
>> And a bonus issue... on another osd that hasn't yet been resized
>> (i.e. again with a grossly undersized 60G db on SSD with a 15T HDD)
>> I'm also getting a spillover warning. The "bluefs stats" seems to be
>> saying the db is NOT currently spilling (i.e. "0 B" in the DB/SLOW
>> position in the matrix), but there's "something" currently using 59G
>> on the slow device:
>>
>> $ ceph tell osd.85 bluefs stats
>> 1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
>> 2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>> LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
>> WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
>> DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
>> SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
>> TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
>> MAXIMUMS:
>> LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
>> WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
>> DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
>> SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
>> TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
>> >> SIZE <<  0 B         57 GiB      14 TiB
>>
>> Is there anywhere that describes how to interpret this output, and
>> specifically, what stuff is going into the SLOW row? Seemingly there's
>> 898 "files" there, but not LOG, WAL or DB files - so what are they?
>>
>> Cheers,
>>
>> Chris
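
PS: a quick way to see whether data actually moves after a compaction or a migrate is to poll the slow_used_bytes counter before and after, with something along the lines of (osd id is just an example, reusing the same perf dump approach as above):

$ cephadm enter --name osd.${osd} ceph daemon osd.${osd} perf dump | jq -r '.bluefs.slow_used_bytes'

Anything non-zero there is DB data still sitting on the slow device.
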
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx