Re: Ceph OSD's take 10+ minutes to start on reboot

Igor Fedotov <igor.fedotov@xxxxxxxx> · Tue, 22 Mar 2022 17:34:42 +0300

Yes, that's apparently true, provided you set them through the config
file and not through the monitor config DB (i.e. via the ceph config
command).

Just in case, though, I would recommend restarting the OSDs one by one
and making sure each OSD starts properly before proceeding to the next
one. Who knows what bug/issue might come up with such a change.
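
A minimal sketch of the one-at-a-time restart described above. It assumes a non-containerized deployment where each OSD runs as the systemd unit ceph-osd@<id> (cephadm clusters name their units differently) and that `ceph osd dump --format json` lists each OSD with an integer `up` flag. The helper names, the 600 second timeout and the polling interval are hypothetical, not taken from this thread.

```python
import json
import subprocess
import time


def osd_is_up(osd_dump, osd_id):
    """Return True if osd_id is marked up in a parsed `ceph osd dump`."""
    for osd in osd_dump.get("osds", []):
        if osd.get("osd") == osd_id:
            return osd.get("up") == 1
    return False


def run(cmd):
    """Run a command and return its stdout (raises on a non-zero exit)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def rolling_restart(osd_ids, runner=run, timeout=600.0, poll=5.0, sleep=time.sleep):
    """Restart OSDs one at a time; stop at the first one that fails to come up."""
    for osd_id in osd_ids:
        # Assumption: systemd unit naming of a non-containerized deployment.
        runner(["systemctl", "restart", f"ceph-osd@{osd_id}"])
        waited = 0.0
        while True:
            # Wait before each check. A freshly restarted OSD may still be
            # reported as up for a moment; this sketch does not guard
            # against that beyond the initial wait.
            sleep(poll)
            waited += poll
            dump = json.loads(runner(["ceph", "osd", "dump", "--format", "json"]))
            if osd_is_up(dump, osd_id):
                break
            if waited >= timeout:
                raise RuntimeError(f"osd.{osd_id} did not come up within {timeout}s")
```

Calling `rolling_restart([0, 1, 2])` would then stop at the first OSD that does not return, leaving the remaining OSDs untouched. The `up` flag alone says nothing about PG state, so checking cluster health between restarts is still advisable.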

Thanks,

Igor

On 3/22/2022 5:27 PM, Chris Page wrote:
Thanks Igor,

Hopefully this is the cause of the problem!

Is it safe to simply remove 'bluestore_rocksdb_options' from the
[osd] section of my ceph config file and restart the OSDs?

Many thanks,
Chris.

On Tue, 22 Mar 2022 at 14:25, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Chris,

    Yeah, this apparently reveals most of the root cause: WAL files
    aren't recycled properly, and RocksDB replays them on startup.

    At this point I'm pretty sure your rocksdb settings are the
    culprit. So please remove these custom settings and revert to
    the defaults. Then restart all the OSDs and monitor how it goes
    for some time.

    Thanks,

    Igor

    On 3/22/2022 5:07 PM, Chris Page wrote:
    Hi,

    I have reduced the log levels slightly and restarted an OSD that
    had 40GB of metadata. This seems to follow what I have been seeing.

    Interestingly, I get this near the start of the procedure -

    image.png

    So it looks as though there are a large number of write ahead
    logs, and if the size value is in bytes then some of these are
    100MB+ in size.

    It then proceeds to spend a long time recovering these logs -

    2022-03-22T13:59:20.772+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8277 mode 2
    2022-03-22T13:59:20.784+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8278 mode 2
    2022-03-22T13:59:20.824+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8280 mode 2
    2022-03-22T13:59:21.188+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8282 mode 2
    2022-03-22T13:59:21.412+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8284 mode 2
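
To get a feel for how much of the startup time goes into WAL replay, the "Recovering log" entries can be counted and timed. The following is a rough sketch: the regular expression is inferred only from the five lines quoted here (with the mail client's line wrapping undone) and may not match other Ceph or RocksDB versions, and the function name is hypothetical.

```python
import re
from datetime import datetime

# The five entries quoted in this message, one per line.
LOG = """\
2022-03-22T13:59:20.772+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8277 mode 2
2022-03-22T13:59:20.784+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8278 mode 2
2022-03-22T13:59:20.824+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8280 mode 2
2022-03-22T13:59:21.188+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8282 mode 2
2022-03-22T13:59:21.412+0000 7f23eeabcf00  4 rocksdb: [db_impl/db_impl_open.cc:758] Recovering log #8284 mode 2
"""

# Assumed layout: timestamp, thread id, debug level, subsystem, source
# location, then the message itself.
PATTERN = re.compile(
    r"^(\S+) \S+\s+\d+ rocksdb: \[\S+\] Recovering log #(\d+) mode \d+$"
)


def parse_recoveries(text):
    """Return (timestamp, log_number) pairs for each 'Recovering log' line."""
    entries = []
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
            entries.append((stamp, int(match.group(2))))
    return entries


entries = parse_recoveries(LOG)
elapsed = (entries[-1][0] - entries[0][0]).total_seconds()
print(f"{len(entries)} WAL files, {elapsed:.3f}s between first and last entry")
```

Run against a full OSD log instead of this excerpt, the same count and time span would show how many WAL files are replayed and roughly how long the replay phase lasts.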

    Followed by a lot of compaction stats -

    image.png

    I hope this reveals the nature of the issue.

    Thanks,
    Chris.

    On Tue, 22 Mar 2022 at 13:43, Chris Page <sirhc.page@xxxxxxxxx>
    wrote:

        Hi Igor,

        Thanks for your email and your assistance.

        > And IIUC you've got custom rocksdb settings, right? What's
        > the rationale for that? I would strongly discourage altering
        > them without deep understanding of the consequences...

        This was a recommended configuration which, I must admit, I
        didn't know enough about to be applying.

        - - - - - - - -

        > Could you please share the output for the following command:
        > ceph tell osd.1 bluefs stats

        1 : device size 0x37e3ec00000 : using 0x61230f9000(389 GiB)
        wal_total:0, db_total:3648715856281, slow_total:0

        - - - - - - - -

        > Additionally you might want to share rocksdb stats; these are
        > to be collected on an offline OSD:
        > ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-1 stats

        I've attached bluestore-kv.txt

        > Then please set debug-rocksdb & debug-bluestore to 10 and
        > bring up osd.1 again. Which apparently will need some time.
        > What's in OSD log then?

        The log raced up to 285MB in a matter of a minute or so.
        Would you like me to send this over via a WeTransfer link?
        However, the restart was quick - most probably because OSD 1
        was restarted last week and had only generated 4GB of
        metadata. Some of the OSDs seem to have maintained a
        small-ish metadata size while others are back up at ~40GB or
        larger (one is 185GB!)

        > Once restarted - please collect a fresh report from 'bluefs
        > stats' command and share the results.

        It appears I'm getting the same output, although the size has
        dropped by 4GB (the meta was only at 4GB when I restarted)

        1 : device size 0x37e3ec00000 : using 0x6033ade000(385 GiB)
        wal_total:0, db_total:3648715856281, slow_total:0
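
The hexadecimal figures in the two 'bluefs stats' outputs can be compared directly. This small sketch assumes they are plain byte counts and that the GiB value in parentheses is the same number rounded; the function name is hypothetical.

```python
GIB = 1 << 30  # bytes per GiB


def hex_bytes_to_gib(value):
    """Convert a hexadecimal byte count such as '0x61230f9000' to GiB."""
    return int(value, 16) / GIB


before = hex_bytes_to_gib("0x61230f9000")   # 'using' before the restart
after = hex_bytes_to_gib("0x6033ade000")    # 'using' after the restart
device = hex_bytes_to_gib("0x37e3ec00000")  # 'device size'

print(f"before: {before:.2f} GiB, after: {after:.2f} GiB")
print(f"freed by the restart: {before - after:.2f} GiB")
print(f"device size: {device:.2f} GiB")
```

The two 'using' values round to the 389 GiB and 385 GiB reported by the command, and the difference is a little under 4 GiB, in line with the drop mentioned above.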

        On Mon, 21 Mar 2022 at 13:11, Igor Fedotov
        <igor.fedotov@xxxxxxxx> wrote:

            Hi Chris,

            Such meta growth is completely unexpected to me.

            And IIUC you've got custom rocksdb settings, right?
            What's the rationale for that? I would strongly
            discourage altering them without deep understanding of
            the consequences...

            My current working hypothesis is that DB compaction is
            not performed properly during regular operation and is
            postponed till OSD restart. Let's try to confirm that.

            Could you please share the output for the following command:

            ceph tell osd.1 bluefs stats

            Additionally you might want to share rocksdb stats; these
            are to be collected on an offline OSD:

            ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-1 stats

            Then please set debug-rocksdb & debug-bluestore to 10 and
            bring up osd.1 again. This will apparently need some
            time. What's in the OSD log then?

            Once restarted - please collect a fresh report from
            'bluefs stats' command and share the results.

            And finally, I would suggest leaving the other OSDs (as
            well as the rocksdb settings) intact for a while to be
            able to troubleshoot the issue to the end.

            Thanks,

            Igor

            On 3/18/2022 5:38 PM, Chris Page wrote:
            This certainly seems to be the case as running a manual
            compaction and restarting works.

            And `ceph tell osd.0 compact` reduces metadata
            consumption from ~160GB (for 380GB worth of data) to
            just 750MB. Below is a snippet of my osd stats -

            image.png

            Is this expected behaviour or is my metadata growing
            abnormally? OSDs 1, 4 & 11 haven't been restarted in a
            couple of weeks.

            Here are my rocksdb settings -

            bluestore_rocksdb_options =
            compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,write_buffer_size=64M,compaction_readahead_size=2M
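
To make the option string easier to reason about, it can be split into key/value pairs. The sketch below also works out the product of max_write_buffer_number and write_buffer_size, which in RocksDB is an upper bound on memtable memory per column family. It assumes the "M" suffix means MiB; the helper names are hypothetical.

```python
# The option string quoted above, joined back into one value.
OPTIONS = (
    "compression=kNoCompression,max_write_buffer_number=32,"
    "min_write_buffer_number_to_merge=2,recycle_log_file_num=32,"
    "write_buffer_size=64M,compaction_readahead_size=2M"
)

# Assumption: size suffixes are binary multiples (K = 2**10, M = 2**20, ...).
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(value):
    """Turn '64M' into a byte count; plain integers pass through."""
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)


def parse_options(text):
    """Split 'k1=v1,k2=v2' into a dict of strings."""
    return dict(item.split("=", 1) for item in text.split(","))


opts = parse_options(OPTIONS)
memtable_bytes = (
    int(opts["max_write_buffer_number"]) * parse_size(opts["write_buffer_size"])
)
print(f"memtable upper bound: {memtable_bytes // (1 << 20)} MiB per column family")
```

With these settings the bound works out to 32 x 64 MiB = 2048 MiB. Whether that alone explains the metadata growth is not established in this thread.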

            I hope you can help with this one - I'm at a bit of a loss!

            Thanks,
            Chris.

            On Fri, 18 Mar 2022 at 14:25, Chris Page
            <sirhc.page@xxxxxxxxx> wrote:

                Hi,

                Following up from this, is it just normal for them
                to take a while? I notice that once I have restarted
                an OSD, the 'meta' value drops right down to empty
                and slowly builds back up. The restarted OSDs start
                with just 1GB or so of metadata and increase over
                time to 160/170GB of metadata.

                So perhaps the delay is just the rebuilding of this
                metadata pool?

                Thanks,
                Chris.

            -- 
            Igor Fedotov
            Ceph Lead Developer

            Looking for help with your Ceph cluster? Contact us at https://croit.io

            croit GmbH, Freseniusstr. 31h, 81247 Munich
            CEO: Martin Verges - VAT-ID: DE310638492
            Com. register: Amtsgericht Munich HRB 231263
            Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx