One more manual compaction updated the 'bluefs stats' figures accordingly. So in the end, the full procedure is:

1/ ceph orch daemon stop osd.${osd}
2/ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source /var/lib/ceph/osd/ceph-${osd}/block --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
3/ ceph orch daemon start osd.${osd}
4/ ceph tell osd.${osd} compact

Regards,
Frédéric.

----- On 8 Jul 24, at 17:39, Frédéric Nass frederic.nass@xxxxxxxxxxxxxxxx wrote:

> Hello,
>
> I just wanted to share that the following command also helped us move slow used
> bytes back to the fast device (without using bluefs-bdev-expand), when several
> compactions couldn't:
>
> $ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool \
>       bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} \
>       --devs-source /var/lib/ceph/osd/ceph-${osd}/block \
>       --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
>
> slow_used_bytes is now back to 0 in the perf dump and the BLUEFS_SPILLOVER
> alert got cleared, but 'bluefs stats' is not on par:
>
> $ ceph tell osd.451 bluefs stats
> 1 : device size 0x1effbfe000 : using 0x309600000(12 GiB)
> 2 : device size 0x746dfc00000 : using 0x3abd77d2000(3.7 TiB)
> RocksDBBlueFSVolumeSelector Usage Matrix:
> DEV/LEV     WAL        DB         SLOW       *          *          REAL       FILES
> LOG         0 B        22 MiB     0 B        0 B        0 B        3.9 MiB    1
> WAL         0 B        33 MiB     0 B        0 B        0 B        32 MiB     2
> DB          0 B        12 GiB     0 B        0 B        0 B        12 GiB     196
> SLOW        0 B        4 MiB      0 B        0 B        0 B        3.8 MiB    1
> TOTAL       0 B        12 GiB     0 B        0 B        0 B        0 B        200
> MAXIMUMS:
> LOG         0 B        22 MiB     0 B        0 B        0 B        17 MiB
> WAL         0 B        33 MiB     0 B        0 B        0 B        32 MiB
> DB          0 B        24 GiB     0 B        0 B        0 B        24 GiB
> SLOW        0 B        4 MiB      0 B        0 B        0 B        3.8 MiB
> TOTAL       0 B        24 GiB     0 B        0 B        0 B        0 B
> >> SIZE <<  0 B        118 GiB    6.9 TiB
>
> Any idea? Is this something to worry about?
>
> Regards,
> Frédéric.
>
> ----- On 16 Oct 23, at 14:46, Igor Fedotov igor.fedotov@xxxxxxxx wrote:
>
>> Hi Chris,
>>
>> for the first question (osd.76) you might want to try ceph-volume's "lvm
>> migrate --from data --target <db lvm>" command. Looks like some persistent
>> DB remnants are still kept on the main device, causing the alert.
>>
>> W.r.t. osd.85's question - the line "SLOW 0 B 3.0 GiB 59 GiB" means that
>> RocksDB higher-level data (usually L3+) is spread over the DB and main
>> (aka slow) devices as 3 GB and 59 GB respectively.
>>
>> In other words, the SLOW row refers to DB data which would originally be
>> placed on the SLOW device (due to RocksDB data mapping mechanics), but
>> improved bluefs logic (introduced by
>> https://github.com/ceph/ceph/pull/29687) permitted extra DB disk usage
>> for a part of this data.
>>
>> Resizing the DB volume and a subsequent DB compaction should do the trick
>> and move all the data to the DB device. Alternatively, ceph-volume's lvm
>> migrate command should do the same, but the result will be rather
>> temporary without a DB volume resize.
>>
>> Hope this helps.
>>
>> Thanks,
>>
>> Igor
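
On a cephadm deployment, Igor's resize-then-compact suggestion maps to roughly the following sequence - a sketch only, assuming the db sits on an LVM volume with free space left in its VG; the VG/LV name and the 400G target size are placeholders, not values from this thread:

$ osd=76
$ ceph orch daemon stop osd.${osd}
$ lvextend -L 400G ceph-db-vg/osd-${osd}-db    # hypothetical VG/LV holding this OSD's db
$ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- \
      ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-${osd}
$ ceph orch daemon start osd.${osd}
$ ceph tell osd.${osd} compact

bluefs-bdev-expand makes bluefs aware of the enlarged LV, and the compaction then gives RocksDB a chance to rewrite the higher levels onto the now-larger db device.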
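
To apply Frédéric's four-step procedure to a batch of affected OSDs, a wrapper along these lines could work - an untested sketch, assuming a cephadm-managed cluster; the OSD IDs are placeholders, and the loop deliberately handles one OSD at a time so only one is down at any moment:

#!/usr/bin/env bash
set -euo pipefail

fsid=$(ceph fsid)

# consider 'ceph osd set noout' first if the stop/start cycles are slow
for osd in 451 76 85; do    # placeholder OSD IDs
    ceph orch daemon stop osd.${osd}
    cephadm shell --fsid ${fsid} --name osd.${osd} -- \
        ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-${osd} \
        --devs-source /var/lib/ceph/osd/ceph-${osd}/block \
        --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
    ceph orch daemon start osd.${osd}
    # wait until the OSD answers again before asking it to compact
    until ceph tell osd.${osd} version >/dev/null 2>&1; do sleep 5; done
    ceph tell osd.${osd} compact
done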
>> On 06/10/2023 06:55, Chris Dunlop wrote:
>>> Hi,
>>>
>>> tl;dr why are my osds still spilling?
>>>
>>> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
>>> bluefs spillover warnings (due to the "fix spillover alert" per the
>>> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on
>>> one of these (there are a few):
>>>
>>> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
>>> 60 GiB) to slow device
>>>
>>> This is a 15T HDD with only a 60G SSD for the db, so it's not
>>> surprising it spilled, as that's way below the recommended db size of
>>> 1-2% of the storage size for rbd usage.
>>>
>>> There was some spare space on the db ssd, so I increased the size of
>>> the db LV up over 400G and did a bluefs-bdev-expand.
>>>
>>> However, days later, I'm still getting the spillover warning for that
>>> osd, including after running a manual compact:
>>>
>>> # ceph tell osd.76 compact
>>>
>>> See attached perf-dump-76 for the perf dump output:
>>>
>>> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'
>>>
>>> In particular, if my understanding is correct, that's telling me the
>>> db available size is 487G (i.e. the LV expand worked), of which it's
>>> using 59G, and there's 128K spilled to the slow device:
>>>
>>> "db_total_bytes": 512309059584,   # 487G
>>> "db_used_bytes": 63470305280,     # 59G
>>> "slow_used_bytes": 131072,        # 128K
>>>
>>> A "bluefs stats" also says the db is using 128K of slow storage
>>> (although perhaps it's getting the info from the same place as the
>>> perf dump?):
>>>
>>> # ceph tell osd.76 bluefs stats
>>> 1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
>>> 2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
>>> RocksDBBlueFSVolumeSelector Usage Matrix:
>>> DEV/LEV     WAL        DB         SLOW       *          *          REAL       FILES
>>> LOG         0 B        10 MiB     0 B        0 B        0 B        8.8 MiB    1
>>> WAL         0 B        2.5 GiB    0 B        0 B        0 B        751 MiB    8
>>> DB          0 B        56 GiB     128 KiB    0 B        0 B        50 GiB     842
>>> SLOW        0 B        0 B        0 B        0 B        0 B        0 B        0
>>> TOTAL       0 B        58 GiB     128 KiB    0 B        0 B        0 B        850
>>> MAXIMUMS:
>>> LOG         0 B        22 MiB     0 B        0 B        0 B        18 MiB
>>> WAL         0 B        3.9 GiB    0 B        0 B        0 B        1.0 GiB
>>> DB          0 B        71 GiB     282 MiB    0 B        0 B        62 GiB
>>> SLOW        0 B        0 B        0 B        0 B        0 B        0 B
>>> TOTAL       0 B        74 GiB     282 MiB    0 B        0 B        0 B
>>> >> SIZE <<  0 B        453 GiB    14 TiB
>>>
>>> I had a look at the "DUMPING STATS" output in the logs but I don't
>>> know how to interpret it. I did try calculating the total of the sizes
>>> on the "Sum" lines but that comes to 100G, so I don't know what it
>>> all means. See attached log-stats-76.
>>>
>>> I also tried "ceph-kvstore-tool bluestore-kv ... stats":
>>>
>>> $ {
>>>     cephadm unit --fsid $clusterid --name osd.76 stop
>>>     cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool \
>>>         bluestore-kv /var/lib/ceph/osd/ceph-76 stats
>>>     cephadm unit --fsid $clusterid --name osd.76 start
>>> }
>>>
>>> Output attached as bluestore-kv-stats-76. I can't see anything
>>> interesting in there, although again I don't really know how to
>>> interpret it.
>>>
>>> So... why is this osd db still spilling onto slow storage, and how do
>>> I fix things so it's no longer using the slow storage?
>>>
>>> And a bonus issue... on another osd that hasn't yet been resized
>>> (i.e. again with a grossly undersized 60G db on SSD with a 15T HDD)
>>> I'm also getting a spillover warning. The "bluefs stats" seems to be
>>> saying the db is NOT currently spilling (i.e. "0 B" in the DB/SLOW
>>> position of the matrix), but there's "something" currently using 59G
>>> on the slow device:
>>>
>>> $ ceph tell osd.85 bluefs stats
>>> 1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
>>> 2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
>>> RocksDBBlueFSVolumeSelector Usage Matrix:
>>> DEV/LEV     WAL        DB         SLOW       *          *          REAL       FILES
>>> LOG         0 B        10 MiB     0 B        0 B        0 B        7.4 MiB    1
>>> WAL         0 B        564 MiB    0 B        0 B        0 B        132 MiB    2
>>> DB          0 B        11 GiB     0 B        0 B        0 B        8.1 GiB    177
>>> SLOW        0 B        3.0 GiB    59 GiB     0 B        0 B        56 GiB     898
>>> TOTAL       0 B        13 GiB     59 GiB     0 B        0 B        0 B        1072
>>> MAXIMUMS:
>>> LOG         0 B        24 MiB     0 B        0 B        0 B        20 MiB
>>> WAL         0 B        2.8 GiB    0 B        0 B        0 B        1.0 GiB
>>> DB          0 B        22 GiB     448 KiB    0 B        0 B        18 GiB
>>> SLOW        0 B        3.3 GiB    62 GiB     0 B        0 B        62 GiB
>>> TOTAL       0 B        27 GiB     62 GiB     0 B        0 B        0 B
>>> >> SIZE <<  0 B        57 GiB     14 TiB
>>>
>>> Is there anywhere that describes how to interpret this output, and
>>> specifically, what stuff is going into the SLOW row? Seemingly there's
>>> 898 "files" there, but they're not LOG, WAL or DB files - so what are
>>> they?
>>>
>>> Cheers,
>>>
>>> Chris