Re: Fixing BlueFS spillover (pacific 16.2.14)

One more manual compaction updated the 'bluefs stats' figures accordingly.

So, in the end, the full procedure is:

1/ ceph orch daemon stop osd.${osd}
2/ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source /var/lib/ceph/osd/ceph-${osd}/block --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
3/ ceph orch daemon start osd.${osd}
4/ ceph tell osd.${osd} compact
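
To verify afterwards, something like this should show slow_used_bytes back
at 0 and the spillover warning gone (a sketch, assuming jq is available and
that 'perf dump' is reachable via 'ceph tell' on pacific):

$ ceph tell osd.${osd} perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'
$ ceph health detail | grep -i spillover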

Regards,
Frédéric.

----- On 8 Jul 24, at 17:39, Frédéric Nass frederic.nass@xxxxxxxxxxxxxxxx wrote:

> Hello,
> 
> I just wanted to share that the following command also helped us move slow used
> bytes back to the fast device (without using bluefs-bdev-expand), when several
> compactions couldn't:
> 
> $ cephadm shell --fsid $cid --name osd.${osd} -- ceph-bluestore-tool
> bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source
> /var/lib/ceph/osd/ceph-${osd}/block --dev-target
> /var/lib/ceph/osd/ceph-${osd}/block.db
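> 
> (For context, $cid here is the cluster fsid, e.g. obtained with:
> 
> $ cid=$(ceph fsid)
> 
> and the OSD has to be stopped while ceph-bluestore-tool runs.)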
> 
> slow_used_bytes is now back to 0 in the perf dump and the BLUEFS_SPILLOVER
> alert has cleared, but 'bluefs stats' is not on par:
> 
> $ ceph tell osd.451 bluefs stats
> 1 : device size 0x1effbfe000 : using 0x309600000(12 GiB)
> 2 : device size 0x746dfc00000 : using 0x3abd77d2000(3.7 TiB)
> RocksDBBlueFSVolumeSelector Usage Matrix:
> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
> LOG         0 B         22 MiB      0 B         0 B         0 B         3.9 MiB     1
> WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB      2
> DB          0 B         12 GiB      0 B         0 B         0 B         12 GiB      196
> SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB     1
> TOTAL       0 B         12 GiB      0 B         0 B         0 B         0 B         200
> MAXIMUMS:
> LOG         0 B         22 MiB      0 B         0 B         0 B         17 MiB
> WAL         0 B         33 MiB      0 B         0 B         0 B         32 MiB
> DB          0 B         24 GiB      0 B         0 B         0 B         24 GiB
> SLOW        0 B         4 MiB       0 B         0 B         0 B         3.8 MiB
> TOTAL       0 B         24 GiB      0 B         0 B         0 B         0 B
>>> SIZE <<  0 B         118 GiB     6.9 TiB
> 
> Any idea? Is this something to worry about?
> 
> Regards,
> Frédéric.
> 
> ----- On 16 Oct 23, at 14:46, Igor Fedotov igor.fedotov@xxxxxxxx wrote:
> 
>> Hi Chris,
>> 
>> for the first question (osd.76) you might want to try ceph-volume's "lvm
>> migrate --from data --target <db lvm>" command. It looks like some
>> persistent DB remnants are still kept on the main device, causing the alert.
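>> 
>> A full invocation would be along these lines (a sketch; the OSD must be
>> stopped first, and <vg/db-lv> stands for your actual DB LV):
>> 
>> # ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd fsid> --from data --target <vg/db-lv>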
>> 
>> W.r.t. osd.85's question - the line "SLOW    0 B    3.0 GiB    59 GiB"
>> means that RocksDB's higher-level data (usually L3+) is spread over the
>> DB and main (aka slow) devices, as 3 GB and 59 GB respectively.
>> 
>> In other words, the SLOW row refers to DB data which would originally be
>> placed on the SLOW device (due to RocksDB's data-mapping mechanics), but
>> the improved BlueFS logic (introduced by
>> https://github.com/ceph/ceph/pull/29687) permits part of this data to use
>> spare DB disk space instead.
>> 
>> Resizing the DB volume followed by a DB compaction should do the trick
>> and move all the data to the DB device. Alternatively, ceph-volume's lvm
>> migrate command will do the same, but without a DB volume resize the
>> result will be rather temporary.
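>> 
>> E.g., as a sketch (assuming an LVM-backed DB volume; names illustrative,
>> and the OSD stopped for the bluefs-bdev-expand step):
>> 
>> # lvextend -L <new size> <vg>/<db-lv>
>> # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-<id>
>> # ceph tell osd.<id> compact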
>> 
>> Hope this helps.
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 06/10/2023 06:55, Chris Dunlop wrote:
>>> Hi,
>>>
>>> tl;dr why are my osds still spilling?
>>>
>>> I've recently upgraded to 16.2.14 from 16.2.9 and started receiving
>>> bluefs spillover warnings (due to the "fix spillover alert" per the
>>> 16.2.14 release notes). E.g. from 'ceph health detail', the warning on
>>> one of these (there are a few):
>>>
>>> osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of
>>> 60 GiB) to slow device
>>>
>>> This is a 15T HDD with only a 60G SSD for the db, so it's not
>>> surprising it spilled: that's way below the recommendation for rbd
>>> usage of a db sized at 1-2% of the storage size.
>>>
>>> There was some spare space on the db SSD, so I increased the size of
>>> the db LV to over 400G and did a bluefs-bdev-expand.
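>>>
>>> (That the expand took effect shows up as db_total_bytes in the perf
>>> dump below, or can be checked with e.g.:
>>>
>>> # lvs -o lv_name,lv_size | grep <db lv>
>>>
>>> name illustrative.)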
>>>
>>> However, days later, I'm still getting the spillover warning for that
>>> osd, including after running a manual compact:
>>>
>>> # ceph tell osd.76 compact
>>>
>>> See attached perf-dump-76 for the perf dump output:
>>>
>>> # cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'
>>>
>>> In particular, if my understanding is correct, that's telling me the
>>> db available size is 477G (i.e. the LV expand worked), of which it's
>>> using 59G, and there's 128K spilled to the slow device:
>>>
>>> "db_total_bytes": 512309059584,  # 477G
>>> "db_used_bytes": 63470305280,    # 59G
>>> "slow_used_bytes": 131072,       # 128K
>>>
>>> A "bluefs stats" also says the db is using 128K of slow storage
>>> (although perhaps it's getting the info from the same place as the
>>> perf dump?):
>>>
>>> # ceph tell osd.76 bluefs stats
>>> 1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
>>> 2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
>>> RocksDBBlueFSVolumeSelector Usage Matrix:
>>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>>> LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
>>> WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
>>> DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
>>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
>>> TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
>>> MAXIMUMS:
>>> LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
>>> WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
>>> DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
>>> SLOW        0 B         0 B         0 B         0 B         0 B         0 B
>>> TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
>>>>> SIZE <<  0 B         453 GiB     14 TiB
>>>
>>> I had a look at the "DUMPING STATS" output in the logs but I don't
>>> know how to interpret it. I did try calculating the total of the sizes
>>> on the "Sum" lines, but that comes to 100G, so I don't know what it
>>> all means. See attached log-stats-76.
>>>
>>> I also tried "ceph-kvstore-tool bluestore-kv ... stats":
>>>
>>> $ {
>>>   cephadm unit --fsid $clusterid --name osd.76 stop
>>>   cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
>>>   cephadm unit --fsid $clusterid --name osd.76 start
>>> }
>>>
>>> Output attached as bluestore-kv-stats-76. I can't see anything
>>> interesting in there, although again I don't really know how to
>>> interpret it.
>>>
>>> So... why is this osd db still spilling onto slow storage, and how do
>>> I fix things so it's no longer using the slow storage?
>>>
>>>
>>> And a bonus issue... on another osd that hasn't yet been resized
>>> (i.e. again with a grossly undersized 60G db on SSD with a 15T HDD),
>>> I'm also getting a spillover warning. The "bluefs stats" output seems
>>> to say the db is NOT currently spilling (i.e. "0 B" in the DB/SLOW
>>> position of the matrix), but there's "something" currently using 59G
>>> on the slow device:
>>>
>>> $ ceph tell osd.85 bluefs stats
>>> 1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
>>> 2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
>>> RocksDBBlueFSVolumeSelector Usage Matrix:
>>> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
>>> LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
>>> WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
>>> DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
>>> SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
>>> TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
>>> MAXIMUMS:
>>> LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
>>> WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
>>> DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
>>> SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
>>> TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
>>>>> SIZE <<  0 B         57 GiB      14 TiB
>>>
>>> Is there anywhere that describes how to interpret this output and,
>>> specifically, what goes into the SLOW row? Seemingly there are 898
>>> "files" there, but no LOG, WAL or DB files - so what are they?
>>>
>>> Cheers,
>>>
>>> Chris
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



