Hi Chris,
For the first question (osd.76) you might want to try ceph-volume's "lvm
migrate --from data --target <db lvm>" command. It looks like some
persistent DB remnants are still kept on the main device, which is causing the alert.
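Roughly something like the following (the OSD has to be stopped first; the
target VG/LV below is just a placeholder for osd.76's actual DB volume, and
the osd fsid can be taken from "ceph-volume lvm list"):

$ cephadm unit --fsid $clusterid --name osd.76 stop
$ cephadm shell --fsid $clusterid --name osd.76 -- \
    ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd fsid> \
    --from data --target <db vg>/<db lv>
$ cephadm unit --fsid $clusterid --name osd.76 start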
W.r.t. the osd.85 question - the line "SLOW 0 B 3.0 GiB 59 GiB ..."
means that RocksDB higher-level data (usually L3+) is spread
over the DB and main (aka slow) devices as 3 GB and 59 GB respectively.
In other words, the SLOW row refers to DB data which is originally supposed
to live on the SLOW device (due to RocksDB's data-to-device mapping
mechanics), but the improved bluefs logic (introduced by
https://github.com/ceph/ceph/pull/29687) permits extra DB disk usage
for a part of this data.
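For a rough sense of scale (assuming default rocksdb level sizing, i.e. ~256 MB
for L1 and a 10x growth factor per level): the level targets are on the order
of 0.25 GB, 2.5 GB, 25 GB and 250 GB, so a 60 GB DB volume cannot hold the
higher levels at all - L4 alone wants ~250 GB - which is why that data is
nominally mapped to the slow device and only borrows free DB space when it
happens to be available.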
Resizing the DB volume followed by a DB compaction should do the trick and
move all the data to the DB device. Alternatively, ceph-volume's lvm migrate
command should do the same, but the result will be rather temporary
without resizing the DB volume.
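I.e. something along these lines for a cephadm deployment (the LV name and
new size below are placeholders):

$ cephadm unit --fsid $clusterid --name osd.85 stop
$ lvextend -L 300G /dev/<db vg>/<db lv>
$ cephadm shell --fsid $clusterid --name osd.85 -- \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-85
$ cephadm unit --fsid $clusterid --name osd.85 start
$ ceph tell osd.85 compact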
Hope this helps.
Thanks,
Igor
On 06/10/2023 06:55, Chris Dunlop wrote:
Hi,
tl;dr why are my osds still spilling?
I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
bluefs spillover warnings (due to the "fix spillover alert" per the
16.2.14 release notes). E.g. from 'ceph health detail', the warning on
one of these (there are a few):
osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
60 GiB) to slow device
This is a 15T HDD with only a 60G SSD for the db, so it's not
surprising it spilled: that's way below the recommended db size for rbd
usage of 1-2% of the storage size (i.e. 150-300G for a 15T device).
There was some spare space on the db SSD so I increased the size of
the db LV to over 400G and did a bluefs-bdev-expand.
However, days later, I'm still getting the spillover warning for that
osd, including after running a manual compact:
# ceph tell osd.76 compact
See attached perf-dump-76 for the perf dump output:
# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq
-r '.bluefs'
In particular, if my understanding is correct, that's telling me the
db available size is 477G (i.e. the LV expand worked), of which it's
using 59G, and there's 128K spilled to the slow device:
"db_total_bytes": 512309059584, # 487G
"db_used_bytes": 63470305280, # 59G
"slow_used_bytes": 131072, # 128K
A "bluefs stats" also says the db is using 128K of slow storage
(although perhaps it's getting the info from the same place as the
perf dump?):
# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
MAXIMUMS:
LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
SLOW        0 B         0 B         0 B         0 B         0 B         0 B
TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
SIZE <<     0 B         453 GiB     14 TiB
I had a look at the "DUMPING STATS" output in the logs but I don't
know how to interpret it. I did try calculating the total of the sizes
on the "Sum" lines but that comes to 100G, so I don't know what that
all means. See attached log-stats-76.
I also tried "ceph-kvstore-tool bluestore-kv ... stats":
$ {
    cephadm unit --fsid $clusterid --name osd.76 stop
    cephadm shell --fsid $clusterid --name osd.76 -- \
        ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
    cephadm unit --fsid $clusterid --name osd.76 start
}
Output attached as bluestore-kv-stats-76. I can't see anything
interesting in there, although again I don't really know how to
interpret it.
So... why is this osd db still spilling onto slow storage, and how do
I fix things so it's no longer using the slow storage?
And a bonus issue... on another osd that hasn't yet been resized
(i.e. again with a grossly undersized 60G db on SSD with a 15T HDD)
I'm also getting a spillover warning. The "bluefs stats" output seems to be
saying the db is NOT currently spilling (i.e. "0 B" in the DB/SLOW
position in the matrix), but there's "something" currently using 59G
on the slow device:
$ ceph tell osd.85 bluefs stats
1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
MAXIMUMS:
LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
SIZE <<     0 B         57 GiB      14 TiB
Is there anywhere that describes how to interpret this output, and
specifically, what stuff is going into the SLOW row? Seemingly there's
898 "files" there, but not LOG, WAL or DB files - so what are they?
Cheers,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx