Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

Hello Igor,

So I've been checking the performance every day since Tuesday. Every day
it seemed to be the same: ~60-70 kOPS on random writes from a single VM.
Yesterday it finally dropped to 20 kOPS, and today to 10 kOPS. I also tried
with a newly created volume; the result (after prefill) is the same, so
that doesn't make any difference.
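
For reference, the test itself is nothing fancy: 4k random writes with fio
inside the VM. A minimal sketch of the job (the device path and queue
depths here are placeholders, not my exact parameters):

  # 4k randwrite against the RBD-backed volume; /dev/vdb is a placeholder
  fio --name=randwrite --filename=/dev/vdb --rw=randwrite --bs=4k \
      --ioengine=libaio --iodepth=64 --numjobs=4 --direct=1 \
      --time_based --runtime=60 --group_reporting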

So I reverted all of the mentioned options to their defaults and restarted
all OSDs. Performance immediately returned to better values (I suppose
this is, again, caused by the restart alone).

The good news is that setting osd_fast_shutdown_timeout to 0 really helped
with the OSD crashes during restarts, which speeds them up a lot. But I
have some new crashes; more on this below.
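
For the record, I set it cluster-wide roughly like this (sketching from
memory):

  # 0 disables the fast-shutdown timeout check;
  # revert later with: ceph config rm osd osd_fast_shutdown_timeout
  ceph config set osd osd_fast_shutdown_timeout 0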

> > I'd suggest starting to monitor perf counters for your osds.
> > op_w_lat/subop_w_lat ones specifically. I presume they rise eventually,
> > don't they?
> OK, starting to collect those for all OSDs..
I now have hourly samples of all OSDs' perf dumps loaded into a DB, so I
can easily examine them, sort them, whatever.
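
In case it's useful to anyone, the collection is essentially this, run
hourly from cron on each OSD host (a sketch; the jq paths match the perf
dump output, the file name and plumbing are just my choice):

  # sample op_w_lat/subop_w_lat from every local OSD admin socket
  for sock in /var/run/ceph/ceph-osd.*.asok; do
      id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
      ceph daemon "osd.$id" perf dump | \
          jq -c --arg osd "$id" '{ts: (now | todate), osd: $osd,
              op_w_lat: .osd.op_w_lat.avgtime,
              subop_w_lat: .osd.subop_w_lat.avgtime}'
  done >> /var/log/ceph/osd-perf-samples.jsonl  # then loaded into the DB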


> 
> currently values for avgtime are around 0.0003 for subop_w_lat and 0.001-0.002
> for op_w_lat
OK, so there is no visible trend in op_w_lat; it's still between 0.001
and 0.002.

subop_w_lat seems to have increased since yesterday, though! I see values
from 0.0004 up to as high as 0.001.
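
To check the trend I just average per day. Assuming, purely for
illustration, that the samples above sit in a sqlite table called
"samples", something like:

  # daily average of subop_w_lat across all OSDs; db/table names are
  # my assumptions, not the actual setup
  sqlite3 perf.db "SELECT substr(ts,1,10) AS day, avg(subop_w_lat)
                   FROM samples GROUP BY day ORDER BY day;"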

If any other perf data might be interesting, please let me know.

During the OSD restarts, I noticed a strange thing: restarts on the first
6 machines went smoothly, but then on another 3, I saw RocksDB log
recovery on all SSD OSDs. At first I didn't see any mention of a daemon
crash in ceph -s.

Later, crash info appeared, but only for 3 daemons (at least 20 of them
crashed in total, though).
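
They show up when querying the crash module directly, e.g.:

  # list recorded crashes; ls-new shows ones not yet archived
  ceph crash ls-new
  ceph crash ls | grep 2023-05-08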

The crash report was similar for all three OSDs:

[root@nrbphav4a ~]# ceph crash info 2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54d90) [0x7f64a6323d90]",
        "(BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x413) [0x55a1c9d07c43]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x22b) [0x55a1c9d27e9b]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x8ad) [0x55a1c9bbcfdd]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x38f) [0x55a1c99d1cbf]",
        "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x57) [0x55a1c99d6777]",
        "(PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0xb73) [0x55a1c99da883]",
        "/usr/bin/ceph-osd(+0x58794e) [0x55a1c992994e]",
        "(CommonSafeTimer<std::mutex>::timer_thread()+0x11a) [0x55a1c9e226aa]",
        "/usr/bin/ceph-osd(+0xa80eb1) [0x55a1c9e22eb1]",
        "/lib64/libc.so.6(+0x9f802) [0x7f64a636e802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f64a630e450]"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-05-08T17:45:47.056675Z_a5759fe9-60c6-423a-88fc-57663f692bd3",
    "entity_name": "osd.98",
    "os_id": "almalinux",
    "os_name": "AlmaLinux",
    "os_version": "9.0 (Emerald Puma)",
    "os_version_id": "9.0",
    "process_name": "ceph-osd",
    "stack_sig": "b1a1c5bd45e23382497312202e16cfd7a62df018c6ebf9ded0f3b3ca3c1dfa66",
    "timestamp": "2023-05-08T17:45:47.056675Z",
    "utsname_hostname": "nrbphav4h",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.90lb9.01",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Fri Jan 27 15:52:13 CET 2023"
}


I was trying to figure out why these particular 3 nodes could behave
differently, and found out from colleagues that those 3 nodes were added
to the cluster recently with a direct install of 17.2.5 (the others were
installed with 15.2.16 and upgraded later).

I'm not sure whether this is related to our problem, though.

I see a very similar crash reported here:
https://tracker.ceph.com/issues/56346
so I'm not reporting it again.

Do you think this might somehow be the cause of the problem? Anything else I should
check in perf dumps or elsewhere?

with best regards

nik






-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------


