Thanks, Igor,
I did see that L4 sizing and thought it seemed suspicious. Though after looking at a couple of other OSDs with this, I think the sum of L0-L4 appears to match a rounded-off version of the metadata size reported in ceph osd df tree. So I'm not sure whether that is actually showing the size of the level store, or just what is stored in each level?
As for "No more ideas but to do data migration using ceph-bluestore-tool" - would this imply backing up the current block.db, then re-creating the block.db and moving the backup onto the new block.db? Just asking because I have never touched moving the block.db/WAL, and was actually under the impression that it could not be done until the last few years, as more people kept having spillovers.
Previously, when I needed to expand a block.db, I just re-paved the OSDs, which was my likely course of action for this OSD if I was unsuccessful in clearing this as-is.
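For reference, the comparison I was making was nothing more rigorous than eyeballing the META column for this OSD against the sum of the per-level sizes, along the lines of:

    $ ceph osd df tree | egrep 'NAME|osd\.36'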
Would that be bluefs-export and then bluefs-bdev-new-db? Though that doesn't exactly look like it would work.
I don't think I could use bluefs-bdev-migrate, since I don't have another block device to migrate from and to.
Should/could I try bluefs-bdev-expand, to see if it detects a bigger partition and tries to use it?
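In case it helps to be concrete, what I had in mind is purely a sketch from reading the docs - I haven't run these against this OSD, and the paths are just the standard locations:

    # with the OSD stopped, check what BlueFS currently thinks the device sizes are
    $ systemctl stop ceph-osd@36
    $ ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36

    # if the block.db partition had actually been grown, this should pick up the new size
    $ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-36
    $ systemctl start ceph-osd@36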
Otherwise, at this point I feel like re-paving may be the best path forward; I just wanted to provide any possible data points before doing that.
Thanks again for the help,
Reed
hmm, RocksDB reports 13GB at L4 (other compaction-stat columns omitted - they are all zero here):

    Level   Files   Size       Score
    L0        2/0    29.39 MB   0.5
    L1        1/0    22.31 MB   0.6
    L2        2/0    94.03 MB   0.3
    L3       12/0   273.29 MB   0.3
    L4      205/0    12.82 GB   0.1
    Sum     222/0    13.23 GB   0.0

which is unlikely to be correct... No more ideas but to do data migration using ceph-bluestore-tool.
I would appreciate it if you would share whether it helps, both short-term and long-term - will this reappear or not?
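Roughly something like this, with the OSD stopped (the paths here are the standard OSD symlinks - adjust for your layout, this is just a sketch):

    $ systemctl stop ceph-osd@36
    # move BlueFS data that ended up on the slow (main) device back to the fast DB device
    $ ceph-bluestore-tool bluefs-bdev-migrate \
          --path /var/lib/ceph/osd/ceph-36 \
          --devs-source /var/lib/ceph/osd/ceph-36/block \
          --dev-target /var/lib/ceph/osd/ceph-36/block.db
    $ systemctl start ceph-osd@36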
Thanks, Igor
On 6/12/2020 5:17 PM, Reed Dier wrote:
Thanks for sticking with me, Igor.
Attached is the ceph-kvstore-tool stats output.
Hopefully something interesting in here.
Thanks,
Reed
Hi Reed, thanks for the log. Nothing much of interest there, though. It is just a regular SST file that RocksDB instructed to be put on the "slow" device. Presumably it belongs to a higher level, hence the desire to put it that "far". Or (which is less likely) RocksDB lacked free space when doing compaction at some point and spilled some data out. So I was wrong - ceph-kvstore-tool's stats command output might be helpful...
Thanks, Igor
On 6/11/2020 5:14 PM, Reed Dier wrote:
Apologies for the delay, Igor. Hopefully you are still interested in taking a look.
Attached is the bluestore bluefs-log-dump output. I gzipped it, as the log was very large.
Let me know if there is anything else I can do to help track this down.
Thanks,
Reed
Reed,
No, "ceph-kvstore-tool stats" isn't of any interest. For the sake of better understanding the issue, it might be interesting to have the bluefs log dump obtained via ceph-bluestore-tool's bluefs-log-dump command. This will give some insight into which RocksDB files are spilled over. It's still not clear what the root cause of the issue is. It's not that frequent or dangerous, though, so there is no active investigation into it...
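Something along these lines, with the OSD stopped (the path is just an example):

    $ ceph-bluestore-tool bluefs-log-dump --path /var/lib/ceph/osd/ceph-<id> > bluefs-log-dump.txt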
Wondering if the migration has helped, though?
Thanks, Igor
On 6/6/2020 8:00 AM, Reed Dier wrote:
The WAL/DB was part of the OSD deployment.
OSD is running 14.2.9.
Would grabbing the ceph-kvstore-tool bluestore-kv <path-to-osd> stats output, as in that ticket, be of any usefulness to this?
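i.e., if I'm reading the ticket right, something along these lines with the OSD stopped (the output filename is just my own):

    $ systemctl stop ceph-osd@36
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 stats > osd36-kvstore-stats.txt
    $ systemctl start ceph-osd@36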
Thanks,
Reed
This might help - see comment #4 at https://tracker.ceph.com/issues/44509
And just for the sake of information collection - what Ceph version is used in this cluster? Did you set up the DB volume along with OSD deployment, or was it added later, as was done in the ticket above?
Thanks, Igor
On 6/6/2020 1:07 AM, Reed Dier wrote:
I'm going to piggyback on this somewhat.
I've battled RocksDB spillovers over the course of the life of the cluster since moving to bluestore; however, I have always been able to compact it well enough.
But now I am stumped at getting this one to compact via $ ceph tell osd.$osd compact, which has always worked in the past.
No matter how many times I compact it, I always spill over exactly 192 KiB.
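For completeness, the exact command is nothing more than the below; I assume the admin-socket form is equivalent under the hood, though that's a guess on my part:

    $ ceph tell osd.36 compact
    # presumably the same thing, via the admin socket on the OSD host:
    $ ceph daemon osd.36 compact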
    BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
        osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (16 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (22 GiB used of 34 GiB) to slow device
        osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device
The multiple entries are from different times trying to compact it.
The OSD is a 1.92TB SATA SSD; the WAL/DB is a 36GB partition on NVMe.
This is with the normal logging level. I have no idea how to make heads or tails of that log data, but maybe someone can figure out why this one OSD just refuses to compact?
OSD is 14.2.9.
OS is Ubuntu 18.04.
Kernel is 4.15.0-96.
I haven't played with ceph-bluestore-tool or ceph-kvstore-tool, but after seeing the mention above in this thread, I do see ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which sounds like it may be the same thing that ceph tell compact does under the hood?

    compact
        Subcommand compact is used to compact all data of kvstore. It will open the database, and trigger a database's compaction. After compaction, some disk space may be released.
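i.e., if I understand the man page, the offline equivalent would be something like this, with the OSD stopped first (I haven't actually tried it yet):

    $ systemctl stop ceph-osd@36
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 compact
    $ systemctl start ceph-osd@36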
Also, not sure if this is helpful:

    osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB used of 34 GiB) to slow device

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
    36  ssd    1.77879  1.00000   1.8 TiB  1.2 TiB  1.2 TiB  6.2 GiB  7.2 GiB  603 GiB  66.88  0.94  85   up      osd.36
You can see the breakdown between OMAP data and META data.
After compacting again:

    osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB used of 34 GiB) to slow device

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META    AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
    36  ssd    1.77879  1.00000   1.8 TiB  1.2 TiB  1.2 TiB  6.2 GiB  20 GiB  603 GiB  66.88  0.94  85   up      osd.36
So the OMAP size remained the same, while the metadata ballooned (while still conspicuously spilling over 192 KiB exactly).
These OSDs have a few RBD images, CephFS metadata, and librados objects (not RGW) stored.
The breakdown of OMAP size is pretty widely binned, but the GiB sizes are definitely the minority.
Looking at the breakdown with some simple bash-fu:

    KiB = 147
    MiB = 105
    GiB = 24
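For what it's worth, the "bash-fu" was nothing fancier than counting the unit of the OMAP column, roughly like this (the awk field number depends on the column layout of your ceph version, so treat it as a sketch):

    # count OSDs by the unit (KiB/MiB/GiB) of their OMAP column
    $ ceph osd df | awk '/^ *[0-9]+ +(ssd|hdd)/ {print $12}' | sort | uniq -c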
To further divide that, all of the GiB-sized OMAPs are on SSD OSDs:

           SSD   HDD   TOTAL
    KiB      0   147     147
    MiB     36    69     105
    GiB     24     0      24
I have no idea if any of these data points are pertinent or helpful, but I want to give as clear a picture as possible to prevent chasing the wrong thread.
Appreciate any help with this.
Thanks,
Reed
hi there,
trying to get my head around rocksdb spillovers and how to deal with them … in particular, i have one osd which does not have any pools associated (as per ceph pg ls-by-osd $osd), yet it does show up in ceph health detail as:

    osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB used of 37 GiB) to slow device
compaction doesn't help. i am well aware of https://tracker.ceph.com/issues/38745, yet find it really counter-intuitive that an empty osd with a more-or-less optimally sized db volume can't fit its rocksdb on the former.
is there any way to repair this, apart from re-creating the osd? fwiw, dumping the database with

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump > bluestore_kv.dump

yields a file of less than 100mb in size.
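in case it is useful, i guess one could also check what bluefs itself reports for the devices, e.g. (osd stopped, path adjusted):

    $ ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-$osd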
and, while we're at it, a few more related questions (example invocations sketched below):
- am i right to assume that the leveldb and rocksdb arguments to ceph-kvstore-tool are only relevant for osds with a filestore backend?
- does ceph-kvstore-tool bluestore-kv … also deal with rocksdb items for osds with a bluestore backend?
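i.e. the distinction i mean is between these two invocation styles (paths are just examples):

    # open a bare rocksdb / leveldb store at an explicit path
    $ ceph-kvstore-tool rocksdb /path/to/kv/store list

    # open the rocksdb embedded in a bluestore osd
    $ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd list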
thank you very much & with kind regards,
thoralf.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx