Hi Igor,

Thanks for the valuable advice! I just wanted to provide feedback that it was indeed one single OSD causing the issues, which I could triangulate as you suggested. After removing this OSD, the slow ops have not occurred anymore.

Best regards,
Tim

> On 1 Oct 2024, at 12:42, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
> Hi Tim,
>
> first of all - given the provided logs - all the slow operations are stuck in the 'waiting for sub ops' state.
>
> Which apparently means that the reporting OSDs aren't suffering from local issues but are stuck on replication operations to their peer OSDs.
>
> From my experience, even a single "faulty" OSD can cause such issues on multiple other daemons. The way to troubleshoot is to find out which OSD(s) are the actual culprits.
>
> To do that, one might try the following approach:
>
> 1. When (or shortly after) the issue is happening, run 'ceph daemon osd.N dump_historic_ops' (or even 'dump_ops_in_flight') against the OSDs reporting slow operations.
>
> 2. From the above reports, choose operations with an extraordinarily high duration, e.g. > 5 seconds, and note the PG ids they've been run against, e.g. PG = 1.a in the following sample:
>
> "description": "osd_op(client.24184.0:23 >>>>1.a<<<<< 1:54253539:::benchmark_data_coalmon_70932_object22:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e19)",
>
> 3. For the affected PG(s), learn which OSDs are backing them, e.g. by running 'ceph pg map <pgid>'.
>
> 4. If the PGs from the above step share a specific OSD that is common to all (or the majority) of them, it is highly likely a good candidate for further investigation - particularly inspection of that OSD's logs.
>
>
> Thanks,
>
> Igor
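For anyone landing on this thread later, here is a minimal Python sketch of the triage loop Igor describes in steps 1-4. It is an illustration, not a tested tool: the OSD list and the 5-second threshold are placeholder values, the JSON field names ("ops", "duration", "description", "acting") reflect the commonly seen layout of dump_historic_ops and 'ceph pg map --format json' output and may differ per release, and 'ceph daemon' has to be run on the host holding each OSD's admin socket.

#!/usr/bin/env python3
"""Sketch of Igor's triage steps: collect slow historic ops from the OSDs
reporting them, extract the PG ids those ops touch, map each PG to its
acting OSDs, and count which OSD backs the most slow PGs."""
import json
import re
import subprocess
from collections import Counter

SLOW_SECONDS = 5.0           # "extraordinarily high duration" threshold from step 2
REPORTING_OSDS = [3, 7, 12]  # placeholder: OSDs currently reporting slow ops


def ceph_json(args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args])
    return json.loads(out)


def slow_pgs(osd_id):
    """Return PG ids touched by historic ops slower than SLOW_SECONDS."""
    # 'ceph daemon' talks to the local admin socket, so run this on the
    # host where osd.<osd_id> lives; the output is JSON by default.
    dump = ceph_json(["daemon", f"osd.{osd_id}", "dump_historic_ops"])
    pgs = set()
    for op in dump.get("ops", []):
        if op.get("duration", 0) > SLOW_SECONDS:
            # The PG id is the second token inside "osd_op(...)", e.g. "1.a".
            m = re.search(r"osd_op\(\S+ (\d+\.[0-9a-f]+)", op.get("description", ""))
            if m:
                pgs.add(m.group(1))
    return pgs


counts = Counter()
for osd in REPORTING_OSDS:
    for pgid in slow_pgs(osd):
        # Step 3: 'ceph pg map <pgid>' with JSON output exposes the acting set.
        mapping = ceph_json(["pg", "map", pgid, "--format", "json"])
        for acting_osd in mapping.get("acting", []):
            counts[acting_osd] += 1

# Step 4: an OSD that backs most of the slow PGs is the candidate to inspect.
for osd, hits in counts.most_common():
    print(f"osd.{osd} backs {hits} slow PG(s)")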