Re: dealing with spillovers

Igor Fedotov <ifedotov@xxxxxxx> · Fri, 12 Jun 2020 14:56:20 +0300

Hi Reed,

thanks for the log.

Nothing much of interest there though. It's just a regular SST file that 
RocksDB asked to be placed on the "slow" device. Presumably it belongs to a 
higher level, hence the desire to put it that "far". Or (which is less 
likely) RocksDB lacked free space during compaction at some point 
and spilled some data out. So I was wrong - the output of 
ceph-kvstore-tool's stats command might be helpful after all...
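
A rough sketch of the level arithmetic behind the "higher level" remark follows. It is not taken from this thread: it assumes RocksDB's leveled compaction with max_bytes_for_level_base = 256 MiB and max_bytes_for_level_multiplier = 10 (the values are assumptions here and should be checked against the OSD's actual RocksDB options), and it treats the DB device as holding whole levels only, which simplifies how BlueFS really places files. The function names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def level_targets(base_bytes=256 * MIB, multiplier=10, levels=5):
    """Target size in bytes for levels L1..L<levels> (assumed tuning values)."""
    return [base_bytes * multiplier ** i for i in range(levels)]


def levels_that_fit(db_bytes, base_bytes=256 * MIB, multiplier=10, levels=5):
    """Count how many consecutive levels, starting at L1, fit on the DB device.

    Returns (number_of_levels, cumulative_bytes_of_those_levels).
    """
    total = 0
    fit = 0
    for target in level_targets(base_bytes, multiplier, levels):
        if total + target > db_bytes:
            break  # this level and everything above it would not fit
        total += target
        fit += 1
    return fit, total


if __name__ == "__main__":
    fit, total = levels_that_fit(34 * GIB)
    print(f"levels fitting in 34 GiB: L1..L{fit}, {total / GIB:.2f} GiB in total")
```

Under these assumptions a 34 GiB DB device has room for L1 to L3 but not L4, so a file that RocksDB assigns to a higher level would be directed to the slow device regardless of how much space is currently free.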

Thanks,

Igor

On 6/11/2020 5:14 PM, Reed Dier wrote:
Apologies for the delay Igor,

Hopefully you are still interested in taking a look.

Attached is the bluestore bluefs-log-dump output.
I gzipped it as the log was very large.
Let me know if there is anything else I can do to help track this down.

Thanks,

Reed

On Jun 8, 2020, at 8:04 AM, Igor Fedotov <ifedotov@xxxxxxx> wrote:

Reed,

No, "ceph-kvstore-tool stats" isn't of any interest.

To better understand the issue it might be interesting to 
have a bluefs log dump, obtained via ceph-bluestore-tool's 
bluefs-log-dump command. This will give some insight into which RocksDB 
files have spilled over. The root cause of the issue is still not clear. 
It's neither frequent nor dangerous though, so there is no 
active investigation into it...

Wondering if migration has helped though?

Thanks,

Igor

On 6/6/2020 8:00 AM, Reed Dier wrote:
The WAL/DB was part of the OSD deployment.

OSD is running 14.2.9.

Would grabbing the output of ceph-kvstore-tool bluestore-kv <path-to-osd> 
stats, as in that ticket, be of any use here?

Thanks,

Reed

On Jun 5, 2020, at 5:27 PM, Igor Fedotov <ifedotov@xxxxxxx> wrote:

This might help - see comment #4 at 
https://tracker.ceph.com/issues/44509

And just for the sake of information collection - what Ceph version 
is used in this cluster?

Did you set up the DB volume along with the OSD deployment, or was it 
added later, as was done in the ticket above?

Thanks,

Igor

On 6/6/2020 1:07 AM, Reed Dier wrote:
I'm going to piggyback on this somewhat.

I've battled RocksDB spillovers over the life of the 
cluster since moving to bluestore, but I have always been able 
to compact it well enough.

But now I am stumped at getting this one to compact via 
ceph tell osd.$osd compact, which has always worked in the past.

No matter how many times I compact it, it always spills over exactly 
192 KiB.
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
     osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
     osd.36 spilled over 192 KiB metadata from 'db' device (16 GiB used of 34 GiB) to slow device
     osd.36 spilled over 192 KiB metadata from 'db' device (22 GiB used of 34 GiB) to slow device
     osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device

The multiple entries are from different attempts to compact it.
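
For tracking these warnings across compaction attempts, a minimal Python sketch that pulls the numbers out of the health detail text could look like the following. The pattern is modeled only on the lines quoted above and may not match other Ceph releases; parse_spillover and to_bytes are hypothetical helper names.

```python
import re

# Binary unit multipliers as printed in the health output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}

# Modeled on the spillover lines quoted in this thread.
SPILL_RE = re.compile(
    r"osd\.(\d+) spilled over ([\d.]+) (\w+) metadata from 'db' device "
    r"\(([\d.]+) (\w+) used of ([\d.]+) (\w+)\) to slow device"
)


def to_bytes(value, unit):
    """Convert a printed size such as ('192', 'KiB') to bytes."""
    return int(float(value) * UNITS[unit])


def parse_spillover(text):
    """Return one dict per spillover line found in the health detail text."""
    # Collapse whitespace first so lines wrapped by a mail client still match.
    flat = " ".join(text.split())
    results = []
    for m in SPILL_RE.finditer(flat):
        osd, sp, sp_u, used, used_u, total, total_u = m.groups()
        results.append({
            "osd": int(osd),
            "spilled_bytes": to_bytes(sp, sp_u),
            "db_used_bytes": to_bytes(used, used_u),
            "db_total_bytes": to_bytes(total, total_u),
        })
    return results


if __name__ == "__main__":
    sample = """osd.36 spilled over 192 KiB metadata from 'db' device (26
    GiB used of 34 GiB) to slow device"""
    for entry in parse_spillover(sample):
        print(entry)
```

Feeding it the saved output of ceph health detail after each compaction would show that the spilled amount stays constant while the DB usage varies.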

The OSD is a 1.92TB SATA SSD, the WAL/DB is a 36GB partition on NVMe.
I tailed and tee'd the OSD's logs during a manual compaction here: 
https://pastebin.com/bcpcRGEe
This is with the normal logging level.
I have no idea how to make heads or tails of that log data, but 
maybe someone can figure out why this one OSD just refuses to compact?

OSD is 14.2.9.
OS is U18.04.
Kernel is 4.15.0-96.

I haven't played with ceph-bluestore-tool or ceph-kvstore-tool but 
after seeing the above mention in this thread, I do see 
ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which sounds 
like it may be the same thing that ceph tell compact does under 
the hood?
compact
Subcommand compact is used to compact all data of kvstore. It 
will open the database, and trigger a database's compaction. 
After compaction, some disk space may be released.

Also, not sure if this is helpful:
osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
36 ssd   1.77879 1.00000  1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB 7.2 GiB 603 GiB 66.88 0.94 85  up          osd.36
You can see the breakdown between OMAP data and META data.

After compacting again:
osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
36 ssd   1.77879 1.00000  1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB 20 GiB  603 GiB 66.88 0.94 85  up          osd.36

So the OMAP size remained the same, while the metadata ballooned 
(while still conspicuously spilling over 192KiB exactly)
These OSDs have a few RBD images, cephfs metadata, and librados 
objects (not RGW) stored.

The breakdown of OMAP size is pretty widely binned, but the GiB 
sizes are definitely the minority.
Looking at the breakdown with some simple bash-fu, the number of OSDs per OMAP size unit is:
KiB = 147
MiB = 105
GiB = 24

To further divide that, all of the GiB-sized OMAPs are on SSD OSDs:

        SSD   HDD   TOTAL
KiB       0   147     147
MiB      36    69     105
GiB      24     0      24
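
The binning described above was done with shell one-liners that are not shown in the thread. A minimal Python sketch of the same idea follows, assuming the device class and OMAP column have already been extracted from ceph osd df into (class, size) pairs. The function name and the input shape are hypothetical, and the sample rows other than osd.36's 6.2 GiB are made up for illustration.

```python
from collections import Counter


def bin_omap_by_unit(rows):
    """Count OSDs per (size unit, device class).

    rows: iterable of (device_class, omap_size) pairs, for example
    ("ssd", "6.2 GiB"), where omap_size is the OMAP column as printed.
    """
    counts = Counter()
    for device_class, omap_size in rows:
        unit = omap_size.split()[-1]  # "6.2 GiB" -> "GiB"
        counts[(unit, device_class)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only; the first row is osd.36 from this thread.
    rows = [("ssd", "6.2 GiB"), ("ssd", "40 MiB"), ("hdd", "512 KiB"), ("hdd", "3.1 MiB")]
    counts = bin_omap_by_unit(rows)
    for unit in ("KiB", "MiB", "GiB"):
        ssd, hdd = counts[(unit, "ssd")], counts[(unit, "hdd")]
        print(f"{unit}  ssd={ssd}  hdd={hdd}  total={ssd + hdd}")
```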

I have no idea if any of these data points are pertinent or 
helpful, but I want to give as clear a picture as possible to 
prevent chasing the wrong thread.
Appreciate any help with this.

Thanks,
Reed

On May 26, 2020, at 9:48 AM, thoralf schulze <t.schulze@xxxxxxxxxxxx> wrote:

hi there,

trying to get my head around rocksdb spillovers and how to deal with
them … in particular, i have one osd which does not have any pools
associated (as per ceph pg ls-by-osd $osd ), yet it does show up in ceph
health detail as:

    osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB used of 37 GiB) to slow device

compaction doesn't help. i am well aware of
https://tracker.ceph.com/issues/38745 , yet find it really
counter-intuitive that an empty osd with a more-or-less optimally sized db
volume can't fit its rocksdb on the former.

is there any way to repair this, apart from re-creating the osd? fwiw,
dumping the database with

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump > bluestore_kv.dump

yields a file of less than 100mb in size.

and, while we're at it, a few more related questions:

- am i right to assume that the leveldb and rocksdb arguments to
ceph-kvstore-tool are only relevant for osds with filestore-backend?
- does ceph-kvstore-tool bluestore-kv … also deal with rocksdb-items for
osds with bluestore-backend?

thank you very much & with kind regards,
thoralf.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
