Re: Nautilus: BlueFS spillover

Eugen Block <eblock@xxxxxx> · Fri, 27 Sep 2019 12:24:30 +0000

Hi,

Generally, expanding existing DB devices isn't enough to immediately
eliminate the spillover alert, because the data that has already
spilled over doesn't immediately move as a result of the expansion.
Theoretically, the alert will eventually disappear once RocksDB has
completely rewritten all the data on the slow device.
So you should either wait a while for things to stabilize, perhaps
monitoring the spilled-over volumes in the alerts, which presumably
should decrease (or at least not grow). Please note that this will
most probably happen only under some load.
Or migrate the BlueFS data from the main device with
ceph-bluestore-tool, which I'd recommend only in an emergency.

I understand. We'll monitor that for a couple of days (or weeks)  
before taking any further steps. It's not an emergency, so the  
migration won't be necessary, but I also thought about that.

Try it with slightly larger DB devices. The DB device also holds
the WAL, which has a default size of 1 GB afaik, and you also need
to consider gigabytes vs. gibibytes. We ran into the same problem in
our setup.
You might also want to increase the size even further, since RocksDB
needs some free space during compaction. The worst-case scenario is
~60 GB per device to take compaction into account. If you do not
have an extremely metadata- or omap-heavy workload, you won't need
more capacity than that for these partitions.
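
To make the gigabyte vs. gibibyte point concrete, here is a minimal
Python sketch. The 1 GiB WAL figure is only the "afaik" value quoted
above, and the helper names are made up for illustration:

```python
GB = 10**9    # decimal gigabyte, as used by most partitioning tools
GIB = 2**30   # binary gibibyte, as reported in the Ceph health output


def gb_to_gib(size_gb):
    """Convert a size in decimal GB to binary GiB."""
    return size_gb * GB / GIB


def db_space_gib(partition_gb, wal_gib=1.0):
    """Space left for RocksDB data after setting aside the WAL.

    wal_gib=1.0 is an assumption taken from the post above,
    not a verified Ceph default.
    """
    return gb_to_gib(partition_gb) - wal_gib


for size_gb in (30, 60):
    print(f"{size_gb} GB = {gb_to_gib(size_gb):.2f} GiB, "
          f"{db_space_gib(size_gb):.2f} GiB left after the WAL")
```

A partition created as 30 GB is therefore only about 27.94 GiB, before
anything is subtracted for the WAL.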

Our workload is not that extreme; the cluster is a backend for
OpenStack and we also use CephFS. If the warnings don't go away with
30 GB RocksDB devices, we should have enough space to double that,
but that would be the limit.

Thanks for your input, I appreciate it!

Regards,
Eugen

Quoting Igor Fedotov <ifedotov@xxxxxxx>:

Hi Eugen,

Generally, expanding existing DB devices isn't enough to immediately
eliminate the spillover alert, because the data that has already
spilled over doesn't immediately move as a result of the expansion.
Theoretically, the alert will eventually disappear once RocksDB has
completely rewritten all the data on the slow device.

I've never tested how this plays out in practice, though...

So you should either wait a while for things to stabilize, perhaps
monitoring the spilled-over volumes in the alerts, which presumably
should decrease (or at least not grow). Please note that this will
most probably happen only under some load.

Or migrate the BlueFS data from the main device with
ceph-bluestore-tool, which I'd recommend only in an emergency.

Or try to compact the DB with ceph-kvstore-tool, which is unlikely
to help.
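
For reference, a rough Python sketch of what the two offline options
could look like as command lines. It only builds and prints the
commands and does not run anything. The OSD directory layout
(/var/lib/ceph/osd/ceph-<id> with block and block.db inside) is an
assumption about a default deployment, the helper names are
hypothetical, and the subcommands and options should be checked
against the man pages of the installed release. Both tools expect the
OSD to be stopped first.

```python
def osd_path(osd_id):
    """Assumed default OSD data directory; adjust to your deployment."""
    return f"/var/lib/ceph/osd/ceph-{osd_id}"


def migrate_cmd(osd_id):
    """Command to move BlueFS data from the main (slow) device to the DB device."""
    path = osd_path(osd_id)
    return [
        "ceph-bluestore-tool", "bluefs-bdev-migrate",
        "--path", path,
        "--devs-source", f"{path}/block",    # slow device holding spilled data
        "--dev-target", f"{path}/block.db",  # existing DB device
    ]


def compact_cmd(osd_id):
    """Command for an offline RocksDB compaction of a BlueStore OSD."""
    return ["ceph-kvstore-tool", "bluestore-kv", osd_path(osd_id), "compact"]


# Print the commands for review instead of executing them.
for cmd in (migrate_cmd(0), compact_cmd(0)):
    print(" ".join(cmd))
```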

Thanks,

Igor

On 9/27/2019 11:54 AM, Eugen Block wrote:

Update: I expanded all rocksDB devices, but the warnings still appear:

BLUEFS_SPILLOVER BlueFS spillover detected on 10 OSD(s)
     osd.0 spilled over 2.5 GiB metadata from 'db' device (2.4 GiB used of 30 GiB) to slow device
     osd.19 spilled over 66 MiB metadata from 'db' device (818 MiB used of 15 GiB) to slow device
     osd.25 spilled over 2.2 GiB metadata from 'db' device (2.6 GiB used of 30 GiB) to slow device
     osd.26 spilled over 1.6 GiB metadata from 'db' device (1.9 GiB used of 30 GiB) to slow device
     osd.27 spilled over 2.6 GiB metadata from 'db' device (2.5 GiB used of 30 GiB) to slow device
     osd.28 spilled over 2.4 GiB metadata from 'db' device (1.3 GiB used of 30 GiB) to slow device
     osd.29 spilled over 2.9 GiB metadata from 'db' device (1.7 GiB used of 30 GiB) to slow device
     osd.31 spilled over 2.2 GiB metadata from 'db' device (2.7 GiB used of 30 GiB) to slow device
     osd.32 spilled over 2.4 GiB metadata from 'db' device (1.7 GiB used of 30 GiB) to slow device
     osd.33 spilled over 2.2 GiB metadata from 'db' device (2.0 GiB used of 30 GiB) to slow device
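
To track whether the spilled-over amounts shrink over time, the alert
text can be parsed and the per-OSD values compared between runs. A
minimal Python sketch, assuming the lines keep exactly the format shown
above; the function names and the regular expression are illustrative
and not part of Ceph:

```python
import re

# Binary unit multipliers as they appear in the alert text.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

SPILL_RE = re.compile(
    r"osd\.(?P<osd>\d+) spilled over (?P<spill>[\d.]+) (?P<spill_u>\w+) metadata "
    r"from 'db' device \((?P<used>[\d.]+) (?P<used_u>\w+) used of "
    r"(?P<total>[\d.]+) (?P<total_u>\w+)\) to slow device"
)


def to_bytes(value, unit):
    """Convert e.g. ('2.5', 'GiB') to an integer number of bytes."""
    return int(float(value) * UNITS[unit])


def parse_spillover(text):
    """Return {osd_id: {'spilled', 'used', 'total'}} with sizes in bytes."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    result = {}
    for m in SPILL_RE.finditer(flat):
        result[int(m.group("osd"))] = {
            "spilled": to_bytes(m.group("spill"), m.group("spill_u")),
            "used": to_bytes(m.group("used"), m.group("used_u")),
            "total": to_bytes(m.group("total"), m.group("total_u")),
        }
    return result


SAMPLE = """
BLUEFS_SPILLOVER BlueFS spillover detected on 10 OSD(s)
     osd.0 spilled over 2.5 GiB metadata from 'db' device (2.4 GiB
used of 30 GiB) to slow device
     osd.19 spilled over 66 MiB metadata from 'db' device (818 MiB
used of 15 GiB) to slow device
"""

for osd, info in sorted(parse_spillover(SAMPLE).items()):
    print(f"osd.{osd}: {info['spilled']} bytes spilled")
```

Saving the parsed result once a day and comparing the "spilled" values
would show whether the spillover is decreasing under load.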

OSD.19 can be ignored as it's currently not in use, but the other  
devices have been expanded from 20 to 30 GB (following the  
explanations about the compaction levels).
According to the OSD logs these are the sizes we're dealing with:

Level  Size
L0     31.84 MB
L1     183.86 MB
L2     923.67 MB
L3     3.62 GB
Sum    4.74 GB
Int    0.00

Is there any sign that these OSDs would require even larger DB
devices (300 GB)? Unfortunately, that would not be possible with the
SSDs currently in use.
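
As background for the 30 GB vs. 300 GB question, here is a simplified
Python sketch of the level arithmetic. It assumes the RocksDB defaults
of a 256 MiB target for L1 and a multiplier of 10 per level, and it
assumes that a level is only useful on the DB device if it fits there
completely. L0, the WAL and compaction headroom are ignored, and the
function names are made up for illustration:

```python
GIB = 2**30
MIB = 2**20


def level_targets(base_bytes=256 * MIB, multiplier=10, levels=4):
    """Target sizes in bytes for L1..Ln under the assumed RocksDB defaults."""
    return [base_bytes * multiplier**i for i in range(levels)]


def levels_that_fit(db_bytes, targets):
    """Count how many levels, starting at L1, fit cumulatively into db_bytes.

    Returns (number_of_levels, bytes_needed_for_those_levels).
    """
    total = 0
    count = 0
    for target in targets:
        if total + target > db_bytes:
            break
        total += target
        count += 1
    return count, total


targets = level_targets()
for size_gib in (20, 30, 300):
    count, needed = levels_that_fit(size_gib * GIB, targets)
    print(f"{size_gib} GiB DB device: L1..L{count} fit, "
          f"using {needed / GIB:.2f} GiB")
```

Under these assumptions L1 to L3 add up to 27.75 GiB, so a device of
30 decimal GB (about 27.94 GiB) leaves very little margin, and the next
step up would need roughly 278 GiB for L1 to L4.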

Is there anything else I can do without recreating the OSDs?

Thanks,
Eugen

Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:

On 9/26/19 9:45 PM, Eugen Block wrote:
I'm following the discussion for a tracker issue [1] about  
spillover warnings that affects our upgraded Nautilus cluster.
Just to clarify, would a resize of the rocksDB volume (and  
expanding with 'ceph-bluestore-tool bluefs-bdev-expand...')  
resolve that or do we have to recreate every OSD?

Yes, this has worked since Luminous 12.2.11.

k

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
