Hi Reed,
you might want to use the bluefs-bdev-migrate command, which simply moves
BlueFS files from the source path to the destination - i.e. from the main
device to DB in your case.
It needs neither OSD redeployment nor creation of an additional/new device.
It doesn't guarantee that the spillover won't reoccur one day, though.
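For example, something along these lines should do it (untested here, so
please double-check against your environment; the OSD has to be stopped
first and the OSD id substituted):

  # stop the OSD so the tool gets exclusive access to the store
  systemctl stop ceph-osd@<id>

  # move BlueFS files that ended up on the main (slow) device back to block.db
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-<id> \
      --devs-source /var/lib/ceph/osd/ceph-<id>/block \
      --dev-target /var/lib/ceph/osd/ceph-<id>/block.db

  systemctl start ceph-osd@<id>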
Thanks,
Igor
On 6/12/2020 6:29 PM, Reed Dier wrote:
Would this imply backing up the current block.db, then re-creating the
block.db and moving the backup to the new block.db?
Just asking because I have never touched moving the block.db/WAL, and
was actually under the impression that it could not be done until the
last few years, as more people keep having spillovers.
Previously, when I was expanding my block.db, I was just re-paving the
OSDs, which was my likely course of action for this OSD if I was
unsuccessful in clearing this as is.
Would that be bluefs-export and then bluefs-bdev-new-db?
Though that doesn't exactly look like it would work.
I don't think I could do a migrate, since I don't have another block
device to migrate from and to.
Should/could I try bluefs-bdev-expand to see if it sees a bigger
partition and tries to use it?
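I.e., if I'm reading the man page right, something like this with the OSD
stopped first (I haven't actually run it, so treat the invocation as a guess):

  systemctl stop ceph-osd@36
  # re-read the size of the block.db partition and let BlueFS use any new space
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-36
  systemctl start ceph-osd@36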
Otherwise at this point I feel like re-paving may be the best path
forward, I just wanted to provide any possible data points before
doing that.
Thanks again for the help,
Reed
On Jun 12, 2020, at 9:34 AM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
hmm, RocksDB reports 13GB at L4:
"": "Level Files Size Score Read(GB) Rn(GB) Rnp1(GB)
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec)
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop",
"":
"----------------------------------------------------------------------------------------------------------------------------------------------------------------------------",
"": " L0 2/0 29.39 MB 0.5 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
"": " L1 1/0 22.31 MB 0.6 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
"": " L2 2/0 94.03 MB 0.3 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
"": " L3 12/0 273.29 MB 0.3 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
"": " L4 205/0 12.82 GB 0.1 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
"": " Sum 222/0 13.23 GB 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.00
0.00 0 0.000 0 0",
which is unlikely to be correct...
No more ideas other than doing the data migration using ceph-bluestore-tool.
I would appreciate it if you could share whether it helps in both the
short and the long term. Will this reappear or not?
Thanks,
Igor
On 6/12/2020 5:17 PM, Reed Dier wrote:
Thanks for sticking with me Igor.
Attached is the ceph-kvstore-tool stats output.
Hopefully something interesting in here.
Thanks,
Reed
On Jun 12, 2020, at 6:56 AM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Reed,
thanks for the log.
Nothing much of interest there though. Just a regular SST file that
RocksDB instructed to be put on the "slow" device. Presumably it belongs
to a higher level, hence the desire to put it that "far". Or (which
is less likely) RocksDB lacked free space when doing compaction at
some point and spilled some data out. So I was wrong -
ceph-kvstore-tool's stats command output might be helpful...
Thanks,
Igor
On 6/11/2020 5:14 PM, Reed Dier wrote:
Apologies for the delay Igor,
Hopefully you are still interested in taking a look.
Attached is the bluestore bluefs-log-dump output.
I gzipped it as the log was very large.
Let me know if there is anything else I can do to help track this
down.
Thanks,
Reed
On Jun 8, 2020, at 8:04 AM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
Reed,
No, "ceph-kvstore-tool stats" isn't of any interest.
For the sake of better understanding the issue, it might be
interesting to have the BlueFS log dump obtained via
ceph-bluestore-tool's bluefs-log-dump command. This will give
some insight into which RocksDB files are spilled over. It's still not
clear what the root cause of the issue is. It's not that frequent
or dangerous though, so there's no active investigation on it...
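Something like this should produce it (I'm quoting the invocation from
memory, so double-check against the tool's help; run it with the OSD
stopped, and capture the output as it can be rather large):

  systemctl stop ceph-osd@36
  # dump the BlueFS log, including which device each RocksDB file lives on;
  # capture stdout+stderr since I don't recall which stream the dump goes to
  ceph-bluestore-tool bluefs-log-dump --path /var/lib/ceph/osd/ceph-36 &> bluefs-log-dump.txt
  systemctl start ceph-osd@36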
Wondering whether the migration has helped, though?
Thanks,
Igor
On 6/6/2020 8:00 AM, Reed Dier wrote:
The WAL/DB was part of the OSD deployment.
OSD is running 14.2.9.
Would grabbing the ceph-kvstore-tool bluestore-kv <path-to-osd> stats
output, as in that ticket, be of any use for this?
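I.e., going off the ticket, something like this with the OSD stopped
(guessing at the exact invocation):

  systemctl stop ceph-osd@36
  # print key-value store statistics from the OSD's RocksDB
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 stats > kvstore-stats.txt
  systemctl start ceph-osd@36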
Thanks,
Reed
On Jun 5, 2020, at 5:27 PM, Igor Fedotov <ifedotov@xxxxxxx> wrote:
This might help - see comment #4 at
https://tracker.ceph.com/issues/44509
And just for the sake of information collection - what Ceph
version is used in this cluster?
Did you set up the DB volume along with the OSD deployment, or was it
added later, as was done in the ticket above?
Thanks,
Igor
On 6/6/2020 1:07 AM, Reed Dier wrote:
I'm going to piggyback on this somewhat.
I've battled RocksDB spillovers over the course of the life of
the cluster since moving to bluestore; however, I have always
been able to compact it well enough.
But now I am stumped at getting this to compact via ceph tell
osd.$osd compact, which has always worked in the past.
No matter how many times I compact it, I always spill over
exactly 192 KiB.
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
osd.36 spilled over 192 KiB metadata from 'db' device (26
GiB used of 34 GiB) to slow device
osd.36 spilled over 192 KiB metadata from 'db' device (16
GiB used of 34 GiB) to slow device
osd.36 spilled over 192 KiB metadata from 'db' device (22
GiB used of 34 GiB) to slow device
osd.36 spilled over 192 KiB metadata from 'db' device (13
GiB used of 34 GiB) to slow device
The multiple entries are from different attempts at compacting it.
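For reference, each attempt is basically just the following, plus a peek at
the bluefs perf counters as a sanity check (counter names from memory, so
treat those as approximate):

  ceph tell osd.36 compact
  ceph health detail | grep spilled
  # on the OSD host: compare db_used_bytes vs slow_used_bytes for this OSD
  ceph daemon osd.36 perf dump | grep -E '(db|slow)_used_bytes'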
The OSD is a 1.92TB SATA SSD, the WAL/DB is a 36GB partition
on NVMe.
I tailed and tee'd the OSD's logs during a manual compaction
here: https://pastebin.com/bcpcRGEe
This is with the normal logging level.
I have no idea how to make heads or tails of that log data,
but maybe someone can figure out why this one OSD just refuses
to compact?
OSD is 14.2.9.
OS is U18.04.
Kernel is 4.15.0-96.
I haven't played with ceph-bluestore-tool or ceph-kvstore-tool,
but after seeing the above mention in this thread, I do see
ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which
sounds like it may be the same thing that ceph tell compact
does under the hood?
compact
Subcommand compact is used to compact all data of kvstore. It
will open the database, and trigger a database's compaction.
After compaction, some disk space may be released.
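If I were to try it, I assume it would be something like this, with the
OSD stopped first (just my reading of the man page, not something I've run):

  systemctl stop ceph-osd@36
  # offline compaction of the OSD's RocksDB via the bluestore-kv backend
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 compact
  systemctl start ceph-osd@36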
Also, not sure if this is helpful:
osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB
used of 34 GiB) to slow device
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
36 ssd   1.77879 1.00000  1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB 7.2 GiB 603 GiB 66.88 0.94  85 up     osd.36
You can see the breakdown between OMAP data and META data.
After compacting again:
osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB
used of 34 GiB) to slow device
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS TYPE NAME
36 ssd   1.77879 1.00000  1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB  20 GiB 603 GiB 66.88 0.94  85 up     osd.36
So the OMAP size remained the same, while the metadata ballooned
(while still conspicuously spilling over exactly 192 KiB).
These OSDs have a few RBD images, cephfs metadata, and
librados objects (not RGW) stored.
The breakdown of OMAP size is pretty widely binned, but the
GiB sizes are definitely the minority.
Looking at the breakdown with some simple bash-fu
KiB = 147
MiB = 105
GiB = 24
To further divide that, all of the GiB-sized OMAPs are on SSD OSDs:

        SSD   HDD   TOTAL
  KiB     0   147     147
  MiB    36    69     105
  GiB    24     0      24
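(For completeness, the "bash-fu" above is nothing fancy - roughly something
like the following, which just counts the unit of the OMAP column in ceph
osd df; the awk field number assumes the column layout shown above:)

  # count OSDs by the unit (KiB/MiB/GiB) of their OMAP column
  ceph osd df | awk '/^ *[0-9]+ +(ssd|hdd)/ {units[$12]++} END {for (u in units) print u, units[u]}'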
I have no idea if any of these data points are pertinent or
helpful, but I want to give as clear a picture as possible to
prevent chasing the wrong thread.
Appreciate any help with this.
Thanks,
Reed
On May 26, 2020, at 9:48 AM, thoralf schulze <t.schulze@xxxxxxxxxxxx> wrote:
hi there,
trying to get my head around rocksdb spillovers and how to deal with
them … in particular, i have one osd which does not have any pools
associated (as per ceph pg ls-by-osd $osd), yet it does show up in ceph
health detail as:
osd.$osd spilled over 2.9 MiB metadata from 'db' device
(49 MiB
used of 37 GiB) to slow device
compaction doesn't help. i am well aware of
https://tracker.ceph.com/issues/38745 , yet i find it really
counter-intuitive that an empty osd with a more-or-less optimally
sized db volume can't fit its rocksdb on that volume.
is there any way to repair this, apart from re-creating the osd? fwiw,
dumping the database with
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump > bluestore_kv.dump
yields a file of less than 100 MB in size.
and, while we're at it, a few more related questions:
- am i right to assume that the leveldb and rocksdb arguments to
  ceph-kvstore-tool are only relevant for osds with a filestore backend?
- does ceph-kvstore-tool bluestore-kv … also deal with rocksdb items for
  osds with a bluestore backend?
thank you very much & with kind regards,
thoralf.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx