Re: Fixing BlueFS spillover (pacific 16.2.14)

Hi Chris,

for the first question (osd.76) you might want to try ceph-volume's "lvm migrate --from data --target <db lvm>" command. It looks like some persistent DB remnants are still kept on the main device, which is causing the alert.
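Something along these lines should do it (a rough sketch only -- the cluster fsid, osd fsid and VG/LV names are placeholders, and the OSD has to be stopped while migrating):

# cephadm unit --fsid <cluster fsid> --name osd.76 stop
# cephadm shell --fsid <cluster fsid> --name osd.76 -- \
    ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd fsid> --from data --target <vg name>/<db lv name>
# cephadm unit --fsid <cluster fsid> --name osd.76 start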

W.r.t osd.86's question - the line "SLOW        0 B         3.0 GiB     59 GiB" means that RocksDB's higher-level data (usually L3+) is spread over the DB and main (aka slow) devices, as 3 GiB and 59 GiB respectively.

In other words, the SLOW row refers to DB data that would normally reside on the SLOW device (due to RocksDB's level-to-device mapping mechanics), but the improved bluefs logic (introduced by https://github.com/ceph/ceph/pull/29687) permits part of this data to use spare space on the DB device.

Resizing the DB volume followed by a DB compaction should do the trick and move all the data to the DB device. Alternatively, ceph-volume's lvm migrate command should do the same, but without resizing the DB volume the result will only be temporary.
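Roughly like this (again just a sketch with placeholder names/sizes; the OSD should be stopped for the LV extend and bluefs expand, then restarted before compacting):

# cephadm unit --fsid <cluster fsid> --name osd.NNN stop
# lvextend -L <new size> <vg name>/<db lv name>
# cephadm shell --fsid <cluster fsid> --name osd.NNN -- \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-NNN
# cephadm unit --fsid <cluster fsid> --name osd.NNN start
# ceph tell osd.NNN compact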

Hope this helps.


Thanks,

Igor

On 06/10/2023 06:55, Chris Dunlop wrote:
Hi,

tl;dr why are my osds still spilling?

I've recently upgraded to 16.2.14 from 16.2.9 and started receiving bluefs spillover warnings (due to the "fix spillover alert" per the 16.2.14 release notes). E.g. from 'ceph health detail', the warning on one of these (there are a few):

osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of 60 GiB) to slow device

This is a 15T HDD with only a 60G SSD for the db, so it's not surprising it spilled: that's way below the recommendation of a db sized at 1-2% of the storage size for rbd usage (1-2% of 15T would be roughly 150-300G).

There was some spare space on the db SSD, so I increased the size of the db LV to over 400G and did a bluefs-bdev-expand.

However, days later, I'm still getting the spillover warning for that osd, including after running a manual compact:

# ceph tell osd.76 compact

See attached perf-dump-76 for the perf dump output:

# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'

In particular, if my understanding is correct, that's telling me the db available size is 487G (i.e. the LV expand worked), of which it's using 59G, and there's 128K spilled to the slow device:

"db_total_bytes": 512309059584,  # 487G
"db_used_bytes": 63470305280,    # 59G
"slow_used_bytes": 131072,       # 128K

A "bluefs stats" also says the db is using 128K of slow storage (although perhaps it's getting the info from the same place as the perf dump?):

# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
MAXIMUMS:
LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
SLOW        0 B         0 B         0 B         0 B         0 B         0 B
TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
>> SIZE <<  0 B         453 GiB     14 TiB

I had a look at the "DUMPING STATS" output in the logs but I don't know how to interpret it. I did try calculating the total of the sizes on the "Sum" lines, but that comes to 100G, so I don't know what that all means. See attached log-stats-76.

I also tried "ceph-kvstore-tool bluestore-kv ... stats":

$ {
  cephadm unit --fsid $clusterid --name osd.76 stop
  cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
  cephadm unit --fsid $clusterid --name osd.76 start
}

Output attached as bluestore-kv-stats-76. I can't see anything interesting in there, although again I don't really know how to interpret it.

So... why is this osd db still spilling onto slow storage, and how do I fix things so it's no longer using the slow storage?


And a bonus issue... on another osd that hasn't yet been resized (i.e. again with a grossly undersized 60G db on SSD alongside a 15T HDD) I'm also getting a spillover warning. The "bluefs stats" output seems to be saying the db is NOT currently spilling (i.e. "0 B" at the DB/SLOW position in the matrix), but there's "something" currently using 59G on the slow device:

$ ceph tell osd.85 bluefs stats
1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
MAXIMUMS:
LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
>> SIZE <<  0 B         57 GiB      14 TiB

Is there anywhere that describes how to interpret this output, and specifically, what stuff is going into the SLOW row? Seemingly there's 898 "files" there, but not LOG, WAL or DB files - so what are they?

Cheers,

Chris

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



