Thanks, Igor,
I did see that L4 sizing and thought it seemed suspicious. Though after looking at a couple of other OSDs with this, I think the sum of L0-L4 appears to match a rounded-off version of the metadata size reported in ceph osd df tree. So I'm not sure whether that is actually showing the size of the level store, or just what is stored in each level?
As for "No more ideas but to do data migration using ceph-bluestore-tool" - would this imply backing up the current block.db, then re-creating the block.db and moving the backup onto the new block.db? Just asking because I have never touched moving the block.db/WAL, and was actually under the impression that it could not be done until the last few years, as more people kept having spillovers.
Previously, when I needed to expand a block.db, I just re-paved the OSDs, which was my likely course of action for this OSD if I was unsuccessful in clearing this as-is.
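For reference, the comparison I was making was nothing more rigorous than eyeballing the META column for this OSD against the sum of the per-level sizes, along the lines of:

    $ ceph osd df tree | egrep 'NAME|osd\.36'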
Would that be bluefs-export and then bluefs-bdev-new-db? Though that doesn't exactly look like it would work.
I don't think I could use bluefs-bdev-migrate, since I don't have another block device to migrate from and to.
Should/could I try bluefs-bdev-expand, to see if it detects a bigger partition and tries to use it?
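In case it helps to be concrete, what I had in mind is purely a sketch from reading the docs - I haven't run these against this OSD, and the paths are just the standard locations:

    # with the OSD stopped, check what BlueFS currently thinks the device sizes are
    $ systemctl stop ceph-osd@36
    $ ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36

    # if the block.db partition had actually been grown, this should pick up the new size
    $ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-36
    $ systemctl start ceph-osd@36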
Otherwise, at this point I feel like re-paving may be the best path forward; I just wanted to provide any possible data points before doing that.
Thanks again for the help,
Reed
hmm, RocksDB reports 13GB at L4 (other compaction-stat columns omitted - they are all zero here):

    Level   Files   Size       Score
    L0        2/0    29.39 MB   0.5
    L1        1/0    22.31 MB   0.6
    L2        2/0    94.03 MB   0.3
    L3       12/0   273.29 MB   0.3
    L4      205/0    12.82 GB   0.1
    Sum     222/0    13.23 GB   0.0

which is unlikely to be correct... No more ideas but to do data migration using ceph-bluestore-tool.
I would appreciate it if you would share whether it helps, both short-term and long-term - will this reappear or not?
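Roughly something like this, with the OSD stopped (the paths here are the standard OSD symlinks - adjust for your layout, this is just a sketch):

    $ systemctl stop ceph-osd@36
    # move BlueFS data that ended up on the slow (main) device back to the fast DB device
    $ ceph-bluestore-tool bluefs-bdev-migrate \
          --path /var/lib/ceph/osd/ceph-36 \
          --devs-source /var/lib/ceph/osd/ceph-36/block \
          --dev-target /var/lib/ceph/osd/ceph-36/block.db
    $ systemctl start ceph-osd@36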
Thanks, Igor
On 6/12/2020 5:17 PM, Reed Dier wrote:
Thanks for sticking with me, Igor.
Attached is the ceph-kvstore-tool stats output.
Hopefully something interesting in here.
Thanks,
Reed
Hi Reed, thanks for the log. Nothing much of interest there, though. It is just a regular SST file that RocksDB instructed to be put on the "slow" device. Presumably it belongs to a higher level, hence the desire to put it that "far". Or (which is less likely) RocksDB lacked free space when doing compaction at some point and spilled some data out. So I was wrong - ceph-kvstore-tool's stats command output might be helpful...
Thanks, Igor
On 6/11/2020 5:14 PM, Reed Dier wrote:
Apologies for the delay, Igor. Hopefully you are still interested in taking a look.
Attached is the bluestore bluefs-log-dump output. I gzipped it, as the log was very large.
Let me know if there is anything else I can do to help track this down.
Thanks,
Reed
Reed,
No, "ceph-kvstore-tool stats" isn't of any interest. For the sake of better understanding the issue, it might be interesting to have the bluefs log dump obtained via ceph-bluestore-tool's bluefs-log-dump command. This will give some insight into which RocksDB files are spilled over. It's still not clear what the root cause of the issue is. It's not that frequent or dangerous, though, so there is no active investigation into it...
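Something along these lines, with the OSD stopped (the path is just an example):

    $ ceph-bluestore-tool bluefs-log-dump --path /var/lib/ceph/osd/ceph-<id> > bluefs-log-dump.txt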
Wondering if the migration has helped, though?
Thanks, Igor
On 6/6/2020 8:00 AM, Reed Dier wrote:
The WAL/DB was part of the OSD deployment.
OSD is running 14.2.9.
Would grabbing the ceph-kvstore-tool bluestore-kv <path-to-osd> stats output, as in that ticket, be of any usefulness to this?
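i.e., if I'm reading the ticket right, something along these lines with the OSD stopped (the output filename is just my own):

    $ systemctl stop ceph-osd@36
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 stats > osd36-kvstore-stats.txt
    $ systemctl start ceph-osd@36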
Thanks,
Reed
This might help - see comment #4 at https://tracker.ceph.com/issues/44509
And just for the sake of information collection - what Ceph version is used in this cluster? Did you set up the DB volume along with OSD deployment, or was it added later, as was done in the ticket above?
Thanks, Igor
On 6/6/2020 1:07 AM, Reed Dier wrote:
I'm going to piggyback on this somewhat.
I've battled RocksDB spillovers over the course of the life of the cluster since moving to bluestore; however, I have always been able to compact it well enough.
But now I am stumped at getting this one to compact via $ ceph tell osd.$osd compact, which has always worked in the past.
No matter how many times I compact it, I always spill over exactly 192 KiB.
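For completeness, the exact command is nothing more than the below; I assume the admin-socket form is equivalent under the hood, though that's a guess on my part:

    $ ceph tell osd.36 compact
    # presumably the same thing, via the admin socket on the OSD host:
    $ ceph daemon osd.36 compact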
    BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
        osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (16 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (22 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device
The multiple entries are from different times trying to compact it.
The OSD is a 1.92TB SATA SSD; the WAL/DB is a 36GB partition on NVMe.
This is with the normal logging level. I have no idea how to make heads or tails of that log data, but maybe someone can figure out why this one OSD just refuses to compact?
OSD is 14.2.9.
OS is Ubuntu 18.04.
Kernel is 4.15.0-96.
I haven't played with ceph-bluestore-tool or ceph-kvstore-tool, but after seeing the mention above in this thread, I do see ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which sounds like it may be the same thing that ceph tell compact does under the hood?

    compact
        Subcommand compact is used to compact all data of kvstore. It will open the database, and trigger a database's compaction. After compaction, some disk space may be released.
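i.e., if I understand the man page, the offline equivalent would be something like this, with the OSD stopped first (I haven't actually tried it yet):

    $ systemctl stop ceph-osd@36
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 compact
    $ systemctl start ceph-osd@36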
Also, not sure if this is helpful:

    osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
    36  ssd    1.77879  1.00000   1.8 TiB  1.2 TiB  1.2 TiB  6.2 GiB  7.2 GiB  603 GiB  66.88  0.94  85   up      osd.36
You can see the breakdown between OMAP data and META data.
After compacting again:

    osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META    AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
    36  ssd    1.77879  1.00000   1.8 TiB  1.2 TiB  1.2 TiB  6.2 GiB  20 GiB  603 GiB  66.88  0.94  85   up      osd.36
So the OMAP size remained the same, while the metadata ballooned (while still conspicuously spilling over 192 KiB exactly).
These OSDs have a few RBD images, CephFS metadata, and librados objects (not RGW) stored.
The breakdown of OMAP size is pretty widely binned, but the GiB sizes are definitely the minority.
Looking at the breakdown with some simple bash-fu:

    KiB = 147
    MiB = 105
    GiB = 24
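For what it's worth, the "bash-fu" was nothing fancier than counting the unit of the OMAP column, roughly like this (the awk field number depends on the column layout of your ceph version, so treat it as a sketch):

    # count OSDs by the unit (KiB/MiB/GiB) of their OMAP column
    $ ceph osd df | awk '/^ *[0-9]+ +(ssd|hdd)/ {print $12}' | sort | uniq -c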
To further divide that, all of the GiB-sized OMAPs are on SSD OSDs:

           SSD   HDD   TOTAL
    KiB      0   147     147
    MiB     36    69     105
    GiB     24     0      24
I have no idea if any of these data points are pertinent or helpful, but I want to give as clear a picture as possible to prevent chasing the wrong thread.
Appreciate any help with this.
Thanks,
Reed
hi there,
trying to get my head around rocksdb spillovers and how to deal with them … in particular, i have one osd which does not have any pools associated (as per ceph pg ls-by-osd $osd), yet it does show up in ceph health detail as:

    osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB used of 37 GiB) to slow device
compaction doesn't help. i am well aware of https://tracker.ceph.com/issues/38745, yet find it really counter-intuitive that an empty osd with a more-or-less optimally sized db volume can't fit its rocksdb on the former.
is there any way to repair this, apart from re-creating the osd? fwiw, dumping the database with

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump > bluestore_kv.dump

yields a file of less than 100mb in size.
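in case it is useful, i guess one could also check what bluefs itself reports for the devices, e.g. (osd stopped, path adjusted):

    $ ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-$osd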
and, while we're at it, a few more related questions (example invocations sketched below):
- am i right to assume that the leveldb and rocksdb arguments to ceph-kvstore-tool are only relevant for osds with a filestore backend?
- does ceph-kvstore-tool bluestore-kv … also deal with rocksdb items for osds with a bluestore backend?
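i.e. the distinction i mean is between these two invocation styles (paths are just examples):

    # open a bare rocksdb / leveldb store at an explicit path
    $ ceph-kvstore-tool rocksdb /path/to/kv/store list

    # open the rocksdb embedded in a bluestore osd
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd list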
thank you very much & with kind regards,
thoralf.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx