Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

Hey Zakhar,

You do need to restart OSDs to bring performance back to normal anyway, don't you? So yeah, we're not aware of a better way so far - all the information I have is from you and Nikola, and you both tell us about the need for a restart.

Apparently there is no need to restart every OSD, only the "degraded/slow" ones - we still need to verify that, though. So please identify the slowest OSDs (in terms of subop_w_lat) and restart those first. Hopefully only a fraction of your OSDs will require this.
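
Something along these lines should do for picking the candidates (a
rough sketch - it assumes jq is installed, that the counter shows up as
subop_w_latency under the "osd" section of the perf dump on your
release, and plain systemd-managed OSDs; adjust for cephadm/containerized
setups, and on older releases use "ceph daemon osd.<id> ..." on the OSD
host instead of "ceph tell"):

    # rank OSDs by average subop write latency, worst first
    for id in $(ceph osd ls); do
      lat=$(ceph tell osd.$id perf dump | jq -r '.osd.subop_w_latency.avgtime')
      echo "$lat osd.$id"
    done | sort -rn | head

    # then restart only the worst offenders, one at a time, e.g.:
    # systemctl restart ceph-osd@<id>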


Thanks,
Igor

On 5/10/2023 6:01 AM, Zakhar Kirpichenko wrote:
Thank you, Igor. I will try to see how to collect the perf values. I'm not sure about restarting all OSDs, as it's a production cluster - is there a less invasive way?

/Z

On Tue, 9 May 2023 at 23:58, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

    Hi Zakhar,

    Let's leave questions regarding cache usage/tuning for a different
    topic for now and concentrate on the performance drop.

    Could you please run the same experiment I asked Nikola for, once
    your cluster reaches the "bad performance" state (Nikola, could you
    please use this improved scenario as well?):

    - collect perf counters for every OSD

    - reset perf counters for every OSD

    - leave the cluster running for 10 mins and collect perf counters
    again.

    - then restart OSDs one by one, starting with the worst OSD (in
    terms of subop_w_lat from the previous step). Perhaps restarting
    just a few OSDs will be enough to get the cluster back to normal?

    - if a partial OSD restart is sufficient, please leave the
    remaining OSDs running as-is, without a restart.

    - after the restart (no matter whether partial or complete - the
    key thing is that it is successful), reset all the perf counters,
    leave the cluster running for 30 mins, and collect perf counters again.

    - wait 24 hours and collect the counters one more time

    - share all four counter snapshots.
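
    For reference, a rough sketch of the collection part (assuming
    "ceph tell" can reach the OSD daemons on your release - otherwise
    run the equivalent "ceph daemon osd.<id> ..." commands on each OSD
    host - and that /tmp/osd-perf is a suitable place for the output):

        mkdir -p /tmp/osd-perf
        # snapshot 1: counters as they are now
        for id in $(ceph osd ls); do
            ceph tell osd.$id perf dump > /tmp/osd-perf/osd.$id.1.json
        done
        # reset counters on every OSD
        for id in $(ceph osd ls); do ceph tell osd.$id perf reset all; done
        # let the cluster run for 10 minutes, then take snapshot 2
        sleep 600
        for id in $(ceph osd ls); do
            ceph tell osd.$id perf dump > /tmp/osd-perf/osd.$id.2.json
        done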


    Thanks,

    Igor

    On 5/8/2023 11:31 PM, Zakhar Kirpichenko wrote:
    I don't mean to hijack the thread, but I may be observing something
    similar with 16.2.12: OSD performance noticeably peaks after an OSD
    restart and then gradually degrades over 10-14 days, while commit
    and apply latencies increase across the board.

    Non-default settings are:

            "bluestore_cache_size_hdd": {
                "default": "1073741824",
                "mon": "4294967296",
                "final": "4294967296"
            },
            "bluestore_cache_size_ssd": {
                "default": "3221225472",
                "mon": "4294967296",
                "final": "4294967296"
            },
    ...
            "osd_memory_cache_min": {
                "default": "134217728",
                "mon": "2147483648",
                "final": "2147483648"
            },
            "osd_memory_target": {
                "default": "4294967296",
                "mon": "17179869184",
                "final": "17179869184"
            },
            "osd_scrub_sleep": {
                "default": 0,
                "mon": 0.10000000000000001,
                "final": 0.10000000000000001
            },
            "rbd_balance_parent_reads": {
                "default": false,
                "mon": true,
                "final": true
            },

    All other settings are default; the usage is rather simple
    OpenStack / RBD.
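
    (In case anyone wants to compare on their own cluster: a listing
    like the above should come out of the config diff on the OSD host,
    e.g.

        ceph daemon osd.12 config diff | less

    where osd.12 is just an example ID.)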

    I also noticed that OSD cache usage doesn't increase over time
    (see my message "Ceph 16.2.12, bluestore cache doesn't seem to be
    used much" dated 26 April 2023, which received no comments),
    despite the OSDs being used rather heavily and there being plenty
    of host and OSD cache / target memory available. It may be worth
    checking whether the available memory is being used effectively.
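
    A quick way to check how much of osd_memory_target actually ends up
    in the bluestore caches is to look at the mempools, e.g. (a sketch -
    run on the OSD's host, jq assumed to be installed, and the exact
    pool names can differ slightly between releases):

        ceph daemon osd.<id> dump_mempools | jq '.mempool.by_pool'
        # compare the "bytes" of the bluestore_cache_* pools against
        # osd_memory_target / bluestore_cache_size_*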

    /Z

    On Mon, 8 May 2023 at 22:35, Igor Fedotov <igor.fedotov@xxxxxxxx>
    wrote:

        Hey Nikola,

        On 5/8/2023 10:13 PM, Nikola Ciprich wrote:
        > OK, I'm starting to collect those for all OSDs..
        > I have hourly samples of perf dumps for all OSDs loaded in a
        > DB, so I can easily examine, sort, whatever..
        >
        You didn't reset the counters every hour, did you? If not, an
        average subop_w_latency growing that way means the current
        values are much higher than before - a long-running average can
        only keep rising if the recent samples exceed it.

        I'm curious whether subop latencies were growing for every OSD
        or just a subset of them (maybe even just a single one)?


        Next time you reach the bad state please do the following if
        possible:

        - reset perf counters for every OSD

        - leave the cluster running for 10 mins and collect perf
        counters again.

        - then start restarting OSDs one by one, starting with the
        worst OSD (in terms of subop_w_lat from the previous step).
        Perhaps restarting just a few OSDs will be enough to get the
        cluster back to normal?

        >> currently values for avgtime are around 0.0003 for
        >> subop_w_lat and 0.001-0.002 for op_w_lat
        > OK, so there is no visible trend in op_w_lat, still between
        > 0.001 and 0.002
        >
        > subop_w_lat seems to have increased since yesterday though!
        > I see values from 0.0004 to as high as 0.001
        >
        > If some other perf data might be interesting, please let me
        know..
        >
        > During OSD restarts, I noticed a strange thing - restarts on
        > the first 6 machines went smoothly, but then on another 3 I
        > saw rocksdb log recovery on all SSD OSDs, yet at first I
        > didn't see any mention of a daemon crash in ceph -s
        >
        > later, crash info appeared, but only for 3 daemons (at least
        > 20 of them crashed in total, though)
        >
        > crash report was similar for all three OSDs:
        >
        > [root@nrbphav4a ~]# ceph crash info
        2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3
        > {
        >      "backtrace": [
        >          "/lib64/libc.so.6(+0x54d90) [0x7f64a6323d90]",
        > "(BlueStore::_txc_create(BlueStore::Collection*,
        BlueStore::OpSequencer*, std::__cxx11::list<Context*,
        std::allocator<Context*> >*,
        boost::intrusive_ptr<TrackedOp>)+0x413) [0x55a1c9d07c43]",
        >
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
        std::vector<ceph::os::Transaction,
        std::allocator<ceph::os::Transaction> >&,
        boost::intrusive_ptr<TrackedOp>,
        ThreadPool::TPHandle*)+0x22b) [0x55a1c9d27e9b]",
        > "(ReplicatedBackend::submit_transaction(hobject_t const&,
        object_stat_sum_t const&, eversion_t const&,
        std::unique_ptr<PGTransaction,
        std::default_delete<PGTransaction> >&&, eversion_t const&,
        eversion_t const&, std::vector<pg_log_entry_t,
        std::allocator<pg_log_entry_t> >&&,
        std::optional<pg_hit_set_history_t>&, Context*, unsigned
        long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x8ad)
        [0x55a1c9bbcfdd]",
        > "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
        PrimaryLogPG::OpContext*)+0x38f) [0x55a1c99d1cbf]",
        >
        "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext,
        std::default_delete<PrimaryLogPG::OpContext> >)+0x57)
        [0x55a1c99d6777]",
        >
        "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb73)
        [0x55a1c99da883]",
        >          "/usr/bin/ceph-osd(+0x58794e) [0x55a1c992994e]",
        > "(CommonSafeTimer<std::mutex>::timer_thread()+0x11a)
        [0x55a1c9e226aa]",
        >          "/usr/bin/ceph-osd(+0xa80eb1) [0x55a1c9e22eb1]",
        >          "/lib64/libc.so.6(+0x9f802) [0x7f64a636e802]",
        >          "/lib64/libc.so.6(+0x3f450) [0x7f64a630e450]"
        >      ],
        >      "ceph_version": "17.2.6",
        >      "crash_id":
        "2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3",
        >      "entity_name": "osd.98",
        >      "os_id": "almalinux",
        >      "os_name": "AlmaLinux",
        >      "os_version": "9.0 (Emerald Puma)",
        >      "os_version_id": "9.0",
        >      "process_name": "ceph-osd",
        >      "stack_sig":
        "b1a1c5bd45e23382497312202e16cfd7a62df018c6ebf9ded0f3b3ca3c1dfa66",
        >      "timestamp": "2023-05-08T17:45:47.056675Z",
        >      "utsname_hostname": "nrbphav4h",
        >      "utsname_machine": "x86_64",
        >      "utsname_release": "5.15.90lb9.01",
        >      "utsname_sysname": "Linux",
        >      "utsname_version": "#1 SMP Fri Jan 27 15:52:13 CET 2023"
        > }
        >
        >
        > I was trying to figure out why these particular 3 nodes could
        > behave differently and found out from colleagues that those 3
        > nodes were added to the cluster recently with a direct
        > install of 17.2.5 (the others were installed with 15.2.16 and
        > later upgraded)
        >
        > not sure whether this is related to our problem though..
        >
        > I see a very similar crash reported here:
        > https://tracker.ceph.com/issues/56346
        > so I'm not reporting it separately..
        >
        > Do you think this might somehow be the cause of the problem?
        > Anything else I should check in perf dumps or elsewhere?

        Hmm... I don't know yet. Could you please share the last 20K
        lines prior to the crash from, e.g., two sample OSDs' logs?

        And the crash isn't permanent - the OSDs are able to start on
        the second(?) attempt, aren't they?
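
        Something like this should do, run on the OSD's host (default
        log location assumed - containerized/cephadm deployments keep
        logs elsewhere, e.g. under /var/log/ceph/<fsid>/):

            tail -n 20000 /var/log/ceph/ceph-osd.98.log > osd.98-pre-crash.log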

        > with best regards
        >
        > nik


--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | Twitter <https://twitter.com/croit_io>

Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte <https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



