Re: Fixing BlueFS spillover (pacific 16.2.14)

Hi Chris,

for the first question (osd.76) you might want to try ceph-volume's "lvm migrate --from data --target <db lvm>" command. It looks like some persistent DB remnants are still kept on the main device, which is causing the alert.
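Something along these lines should do it (a rough sketch only -- the cluster fsid, osd fsid and VG/LV names are placeholders, and the OSD has to be stopped while migrating):

# cephadm unit --fsid <cluster fsid> --name osd.76 stop
# cephadm shell --fsid <cluster fsid> --name osd.76 -- \
    ceph-volume lvm migrate --osd-id 76 --osd-fsid <osd fsid> --from data --target <vg name>/<db lv name>
# cephadm unit --fsid <cluster fsid> --name osd.76 start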

W.r.t osd.86's question - the line "SLOW        0 B         3.0 GiB     59 GiB" means that RocksDB's higher-level data (usually L3+) is spread over the DB and main (aka slow) devices, as 3 GiB and 59 GiB respectively.

In other words, the SLOW row refers to DB data that would normally reside on the SLOW device (due to RocksDB's level-to-device mapping mechanics), but the improved bluefs logic (introduced by https://github.com/ceph/ceph/pull/29687) permits part of this data to use spare space on the DB device.

Resizing the DB volume followed by a DB compaction should do the trick and move all the data to the DB device. Alternatively, ceph-volume's lvm migrate command should do the same, but without resizing the DB volume the result will only be temporary.
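Roughly like this (again just a sketch with placeholder names/sizes; the OSD should be stopped for the LV extend and bluefs expand, then restarted before compacting):

# cephadm unit --fsid <cluster fsid> --name osd.NNN stop
# lvextend -L <new size> <vg name>/<db lv name>
# cephadm shell --fsid <cluster fsid> --name osd.NNN -- \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-NNN
# cephadm unit --fsid <cluster fsid> --name osd.NNN start
# ceph tell osd.NNN compact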

Hope this helps.


Thanks,

Igor

On 06/10/2023 06:55, Chris Dunlop wrote:
Hi,

tl;dr why are my osds still spilling?

I've recently upgraded to 16.2.14 from 16.2.9 and started receiving bluefs spillover warnings (due to the "fix spillover alert" per the 16.2.14 release notes). E.g. from 'ceph health detail', the warning on one of these (there are a few):

osd.76 spilled over 128 KiB metadata from 'db' device (56 GiB used of 60 GiB) to slow device

This is a 15T HDD with only a 60G SSD for the db, so it's not surprising it spilled: that's way below the recommendation of a db sized at 1-2% of the storage size for rbd usage (1-2% of 15T would be roughly 150-300G).

There was some spare space on the db SSD, so I increased the size of the db LV to over 400G and did a bluefs-bdev-expand.

However, days later, I'm still getting the spillover warning for that osd, including after running a manual compact:

# ceph tell osd.76 compact

See attached perf-dump-76 for the perf dump output:

# cephadm enter --name 'osd.76' ceph daemon 'osd.76' perf dump | jq -r '.bluefs'

In particular, if my understanding is correct, that's telling me the db available size is 487G (i.e. the LV expand worked), of which it's using 59G, and there's 128K spilled to the slow device:

"db_total_bytes": 512309059584,  # 487G
"db_used_bytes": 63470305280,    # 59G
"slow_used_bytes": 131072,       # 128K

A "bluefs stats" also says the db is using 128K of slow storage (although perhaps it's getting the info from the same place as the perf dump?):

# ceph tell osd.76 bluefs stats
1 : device size 0x7747ffe000 : using 0xea6200000(59 GiB)
2 : device size 0xe8d7fc00000 : using 0x6554d689000(6.3 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         8.8 MiB     1
WAL         0 B         2.5 GiB     0 B         0 B         0 B         751 MiB     8
DB          0 B         56 GiB      128 KiB     0 B         0 B         50 GiB      842
SLOW        0 B         0 B         0 B         0 B         0 B         0 B         0
TOTAL       0 B         58 GiB      128 KiB     0 B         0 B         0 B         850
MAXIMUMS:
LOG         0 B         22 MiB      0 B         0 B         0 B         18 MiB
WAL         0 B         3.9 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         71 GiB      282 MiB     0 B         0 B         62 GiB
SLOW        0 B         0 B         0 B         0 B         0 B         0 B
TOTAL       0 B         74 GiB      282 MiB     0 B         0 B         0 B
>> SIZE <<  0 B         453 GiB     14 TiB

I had a look at the "DUMPING STATS" output in the logs but I don't know how to interpret it. I did try calculating the total of the sizes on the "Sum" lines, but that comes to 100G, so I don't know what that all means. See attached log-stats-76.

I also tried "ceph-kvstore-tool bluestore-kv ... stats":

$ {
  cephadm unit --fsid $clusterid --name osd.76 stop
  cephadm shell --fsid $clusterid --name osd.76 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-76 stats
  cephadm unit --fsid $clusterid --name osd.76 start
}

Output attached as bluestore-kv-stats-76. I can't see anything interesting in there, although again I don't really know how to interpret it.

So... why is this osd db still spilling onto slow storage, and how do I fix things so it's no longer using the slow storage?


And a bonus issue... on another osd that hasn't yet been resized (i.e. again with a grossly undersized 60G db on SSD alongside a 15T HDD) I'm also getting a spillover warning. The "bluefs stats" output seems to be saying the db is NOT currently spilling (i.e. "0 B" at the DB/SLOW position in the matrix), but there's "something" currently using 59G on the slow device:

$ ceph tell osd.85 bluefs stats
1 : device size 0xeffffe000 : using 0x3a3900000(15 GiB)
2 : device size 0xe8d7fc00000 : using 0x7aea7434000(7.7 TiB)
RocksDBBlueFSVolumeSelector Usage Matrix:
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         10 MiB      0 B         0 B         0 B         7.4 MiB     1
WAL         0 B         564 MiB     0 B         0 B         0 B         132 MiB     2
DB          0 B         11 GiB      0 B         0 B         0 B         8.1 GiB     177
SLOW        0 B         3.0 GiB     59 GiB      0 B         0 B         56 GiB      898
TOTAL       0 B         13 GiB      59 GiB      0 B         0 B         0 B         1072
MAXIMUMS:
LOG         0 B         24 MiB      0 B         0 B         0 B         20 MiB
WAL         0 B         2.8 GiB     0 B         0 B         0 B         1.0 GiB
DB          0 B         22 GiB      448 KiB     0 B         0 B         18 GiB
SLOW        0 B         3.3 GiB     62 GiB      0 B         0 B         62 GiB
TOTAL       0 B         27 GiB      62 GiB      0 B         0 B         0 B
>> SIZE <<  0 B         57 GiB      14 TiB

Is there anywhere that describes how to interpret this output, and specifically, what stuff is going into the SLOW row? Seemingly there's 898 "files" there, but not LOG, WAL or DB files - so what are they?

Cheers,

Chris

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



