Re: Fixing BlueFS spillover (pacific 16.2.14)

Hello,

I just wanted to share that the following command also helped us move the spilled-over data (slow used bytes) back to the fast device, without resorting to bluefs-bdev-expand, when several compactions couldn't:

$ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source /var/lib/ceph/osd/ceph-${osd}/block --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
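
For anyone wanting to verify the counter afterwards, something along these lines should work (the jq filter is just one way to read it, adjust to your deployment):

$ cephadm enter --name osd.${osd} ceph daemon osd.${osd} perf dump | jq -r '.bluefs.slow_used_bytes'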

slow_used_bytes is now back to 0 in the perf dump and the BLUEFS_SPILLOVER alert got cleared, but 'bluefs stats' does not quite agree:

$ ceph tell osd.451 bluefs stats
1 : device size 0x1effbfe000 : using 0x309600000(12 GiB)
2 : device size 0x746dfc00000 : using 0x3abd77d2000(3.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES       
LOG         0 B         22 MiB      0 B         0 B         0 B         3.9 MiB     1           
WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB      2           
DB          0 B         12 GiB      0 B         0 B         0 B         12 GiB      196         
SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB     1           
TOTAL       0 B         12 GiB      0 B         0 B         0 B         0 B         200         
MAXIMUMS:
LOG         0 B         22 MiB      0 B         0 B         0 B         17 MiB      
WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB      
DB          0 B         24 GiB      0 B         0 B         0 B         24 GiB      
SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB     
TOTAL       0 B         24 GiB      0 B         0 B         0 B         0 B         
>> SIZE <<  0 B         118 GiB     6.9 TiB    

Any idea? Is this something to worry about?

Regards,
Frédéric.

----- On 16 Oct 23, at 14:46, Igor Fedotov igor.fedotov@xxxxxxxx wrote:

> Hi Chris,
> 
> For the first question (osd.76) you might want to try ceph-volume's "lvm
> migrate --from data --target <db lvm>" command. It looks like some
> persistent DB remnants are still kept on the main device, causing the alert.
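> 
> With the OSD stopped, something along these lines should do it (the OSD
> fsid and the VG/LV name of the DB volume are placeholders here):
> 
> $ ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd-fsid> --from data --target <vg_name>/<db_lv_name>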
> 
> W.r.t. osd.86's question - the line "SLOW  0 B  3.0 GiB  59 GiB" means
> that RocksDB higher-level data (usually L3+) is spread over the DB and
> main (aka slow) devices, as 3 GB and 59 GB respectively.
> 
> In other words, the SLOW row refers to DB data that is originally
> supposed to reside on the SLOW device (due to RocksDB's data mapping
> mechanics), but the improved bluefs logic (introduced by
> https://github.com/ceph/ceph/pull/29687) permits extra DB disk usage
> for a part of this data.
> 
> Resizing the DB volume followed by a DB compaction should do the trick
> and move all the data to the DB device. Alternatively, ceph-volume's lvm
> migrate command should do the same, but without resizing the DB volume
> the result will only be temporary.
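> 
> Roughly, with the OSD stopped (the size and VG/LV names below are just
> placeholders, adjust them to your layout):
> 
> $ lvextend -L +300G <vg_name>/<db_lv_name>     # example size only
> $ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-<id>
> 
> and then, once the OSD is running again:
> 
> $ ceph tell osd.<id> compact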
> 
> Hope this helps.
> 
> 
> Thanks,
> 
> Igor
> 
> On 06/10/2023 06:55, Chris Dunlop wrote:
>> Hi,
>>
>> tl;dr why are my osds still spilling?
>>
>> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
>> bluefs spillover warnings (due to the "fix spillover alert" per the
>> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on
>> one of these (there are a few):
>>
>> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
>> 60 GiB) to slow device
>>
>> This is a 15T HDD with only a 60G SSD for the db, so it's not
>> surprising it spilled: that's way below the recommendation for rbd
>> usage of a db sized at 1-2% of the storage size (i.e. roughly
>> 150-300G for a 15T device).
>>
>> There was some spare space on the db ssd so I increased the size of
>> the db LV to over 400G and did a bluefs-bdev-expand.
>>
>> However, days later, I'm still getting the spillover warning for that
>> osd, including after running a manual compact:
>>
>> # ceph tell osd.76 compact
>>
>> See attached perf-dump-76 for the perf dump output:
>>
>> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq
>> -r '.bluefs'
>>
>> In particular, if my understanding is correct, that's telling me the
>> db available size is 487G (i.e. the LV expand worked), of which it's
>> using 59G, and there's 128K spilled to the slow device:
>>
>> "db_total_bytes": 512309059584,  # 487G
>> "db_used_bytes": 63470305280,    # 59G
>> "slow_used_bytes": 131072,       # 128K
>>
>> A "bluefs stats" also says the db is using 128K of slow storage
>> (although perhaps it's getting the info from the same place as the
>> perf dump?):
>>
>> # ceph tell osd.76 bluefs stats
>> 1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
>> 2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>> LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
>> WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
>> DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
>> TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
>> MAXIMUMS:
>> LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
>> WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
>> DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B
>> TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
>> >> SIZE <<  0 B         453 GiB     14 TiB
>>
>> I had a look at the "DUMPING STATS" output in the logs but I don't
>> know how to interpret it. I did try calculating the total of the sizes
>> on the "Sum" lines, but that only comes to 100G, so I don't know what
>> it all means. See attached log-stats-76.
>>
>> I also tried "ceph-kvstore-tool bluestore-kv ... stats":
>>
>> $ {
>>   cephadm unit --fsid $clusterid --name osd.76 stop
>>   cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
>>   cephadm unit --fsid $clusterid --name osd.76 start
>> }
>>
>> Output attached as bluestore-kv-stats-76. I can't see anything
>> interesting in there, although again I don't really know how to
>> interpret it.
>>
>> So... why is this osd db still spilling onto slow storage, and how do
>> I fix things so it's no longer using the slow storage?
>>
>>
>> And a bonus issue... on another osd that hasn't yet been resized
>> (i.e. again with a grossly undersized 60G db on SSD with a 15T HDD)
>> I'm also getting a spillover warning. The "bluefs stats" seems to be
>> saying the db is NOT currently spilling (i.e. "0 B" at the DB/SLOW
>> position in the matrix), but there's "something" currently using 59G
>> on the slow device:
>>
>> $ ceph tell osd.85 bluefs stats
>> 1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
>> 2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
>> RocksDBBlueFSVolumeSelector Usage Matrix:
>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>> LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
>> WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
>> DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
>> SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
>> TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
>> MAXIMUMS:
>> LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
>> WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
>> DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
>> SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
>> TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
>> >> SIZE <<  0 B         57 GiB      14 TiB
>>
>> Is there anywhere that describes how to interpret this output, and
>> specifically, what stuff is going into the SLOW row? Seemingly there's
>> 898 "files" there, but not LOG, WAL or DB files - so what are they?
>>
>> Cheers,
>>
>> Chris
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



