Hi Milind,

Here are the outputs of top and a pstack backtrace:
https://gist.github.com/olliRJL/5f483c6bc4ad50178c8c9871370b26d3
https://gist.github.com/olliRJL/b83a743eca098c05d244e5c1def9046c

I uploaded the debug log using ceph-post-file - hope someone can access that :)
ceph-post-file: 30f9b38b-a62c-44bb-9e00-53edf483a415

Tnx!
---------------------------
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---------------------------


On Mon, Nov 7, 2022 at 2:30 PM Milind Changire <mchangir@xxxxxxxxxx> wrote:

> Maybe:
>
> - use the top program to look at a threaded listing of the ceph-mds
>   process and see which thread(s) are consuming the most CPU
> - use gstack to attach to the ceph-mds process and dump the backtrace
>   into a file; we can then map the thread with the highest CPU consumption
>   to the gstack output
> - enable debug logs (level 20) for the ceph-mds process for a few
>   seconds and look at what's happening in there, or share the logs with
>   the team here
>
> But I wonder if you could do this on your production system.
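(For reference, a minimal sketch of what those three steps can look like on the MDS host - assuming a single, non-containerized ceph-mds whose daemon name matches the short hostname, as elsewhere in this thread, and the default log location under /var/log/ceph; adjust names and paths to your deployment.)

# 1) Per-thread CPU usage of the MDS (batch mode, one sample, threads shown)
MDS_PID=$(pidof ceph-mds)
top -b -H -n 1 -p "$MDS_PID" > mds-top-threads.txt

# 2) Backtrace of all MDS threads (gstack/pstack ship with gdb)
gstack "$MDS_PID" > mds-backtrace.txt

# 3) Debug logging at level 20 for a short window, then back to the default 1/5
ceph tell mds.$(hostname -s) config set debug_mds 20
sleep 30
ceph tell mds.$(hostname -s) config set debug_mds 1/5

# Share the resulting log with the list
ceph-post-file /var/log/ceph/ceph-mds.$(hostname -s).log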
> On Mon, Nov 7, 2022 at 4:34 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>
>> I might have spoken too soon :(
>>
>> Now, about 60h after dropping the caches, the write bandwidth has gone up
>> linearly from those initial hundreds of kB/s to nearly 10MB/s.
>>
>> I don't think this could be caused by the cache just filling up again
>> either. After dropping the cache I tested whether filling the cache would
>> show any bw increase by running "tree" at the root of one of the mounts,
>> and it didn't affect anything at the time. So basically the cache has been
>> fully saturated all this time now.
>>
>> Boggled,
>> ---------------------------
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---------------------------
>>
>>
>> On Sat, Nov 5, 2022 at 12:47 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>
>>> Oh Lordy,
>>>
>>> Seems like I finally got this resolved. And all it needed in the end was
>>> to drop the mds caches with:
>>> ceph tell mds.`hostname` cache drop
>>>
>>> The funny thing is that whatever the issue with the cache was, it had
>>> persisted through several Ceph upgrades and node reboots. It's been a
>>> live production system, so I guess there has just never been a moment
>>> where all mds would have been down and thus made it fully rebuild the
>>> cache... maybe :|
>>>
>>> Unfortunately I don't remember when this issue arose, and my metrics
>>> don't reach far enough back... but I wonder if this could have started
>>> already when I did the Octopus->Pacific upgrade...
>>>
>>> Cheers,
>>> ---------------------------
>>> Olli Rajala - Lead TD
>>> Anima Vitae Ltd.
>>> www.anima.fi
>>> ---------------------------
>>>
>>>
>>> On Mon, Oct 24, 2022 at 9:36 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>>
>>>> I tried my luck and upgraded to 17.2.4, but unfortunately that didn't
>>>> make any difference here either.
>>>>
>>>> I also looked again at all kinds of client op and request stats and
>>>> whatnot, which only made me even more certain that this io is not
>>>> caused by any clients.
>>>>
>>>> What internal mds operation or mechanism could cause such high idle
>>>> write io? I've tried to fiddle a bit with some of the mds cache trim
>>>> and memory settings, but I haven't noticed any effect there. Any
>>>> pointers appreciated.
>>>>
>>>> Cheers,
>>>> ---------------------------
>>>> Olli Rajala - Lead TD
>>>> Anima Vitae Ltd.
>>>> www.anima.fi
>>>> ---------------------------
>>>>
>>>>
>>>> On Mon, Oct 17, 2022 at 10:28 AM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops" and/or
>>>>> "ceph tell mds.pve-core-1 objecter_requests"? Both of these show very
>>>>> few requests/ops - many times just returning empty lists. I'm pretty
>>>>> sure this I/O isn't generated by any clients - I've earlier tried to
>>>>> isolate this by shutting down all cephfs clients, and it didn't have
>>>>> any noticeable effect.
>>>>>
>>>>> I tried to watch what is going on with that "perf dump", but to be
>>>>> honest all I can see is some numbers going up in the different
>>>>> sections :) ...I don't have a clue what to focus on or how to
>>>>> interpret it.
>>>>>
>>>>> Here's a perf dump if you or anyone could make something out of it:
>>>>> https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
>>>>>
>>>>> Tnx!
>>>>> o.
>>>>>
>>>>> ---------------------------
>>>>> Olli Rajala - Lead TD
>>>>> Anima Vitae Ltd.
>>>>> www.anima.fi
>>>>> ---------------------------
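(A sketch of one way to see which counters are actually moving between two perf dumps - assuming jq is installed and the active MDS is pve-core-1 as above; the 60-second interval and file names are arbitrary.)

# Capture two snapshots of the MDS perf counters one minute apart
ceph tell mds.pve-core-1 perf dump > perf1.json
sleep 60
ceph tell mds.pve-core-1 perf dump > perf2.json

# Flatten each dump into "section.counter: value" lines and diff them;
# the counters that keep climbing while clients are idle are the ones to focus on
flatten() { jq -r 'paths(scalars) as $p | "\($p | map(tostring) | join(".")): \(getpath($p))"' "$1" | sort; }
diff <(flatten perf1.json) <(flatten perf2.json) | grep '^[<>]'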
>>>>> On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>>>>>
>>>>>> Hello Olli,
>>>>>>
>>>>>> On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm seeing constant 25-50MB/s writes to the metadata pool even when
>>>>>>> all clients and the cluster are idle and in a clean state. This
>>>>>>> surely can't be normal?
>>>>>>>
>>>>>>> There are no apparent issues with the performance of the cluster,
>>>>>>> but this write rate seems excessive and I don't know where to look
>>>>>>> for the culprit.
>>>>>>>
>>>>>>> The setup is Ceph 16.2.9 running on a hyperconverged 3-node core
>>>>>>> cluster and 6 HDD OSD nodes.
>>>>>>>
>>>>>>> Here's a typical status when pretty much all clients are idling.
>>>>>>> Most of that write bandwidth, and maybe a fifth of the write IOPS,
>>>>>>> is hitting the metadata pool.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------------------------------
>>>>>>> root@pve-core-1:~# ceph -s
>>>>>>>   cluster:
>>>>>>>     id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
>>>>>>>     health: HEALTH_OK
>>>>>>>
>>>>>>>   services:
>>>>>>>     mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
>>>>>>>     mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
>>>>>>>     mds: 1/1 daemons up, 2 standby
>>>>>>>     osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
>>>>>>>
>>>>>>>   data:
>>>>>>>     volumes: 1/1 healthy
>>>>>>>     pools:   10 pools, 625 pgs
>>>>>>>     objects: 70.06M objects, 46 TiB
>>>>>>>     usage:   95 TiB used, 182 TiB / 278 TiB avail
>>>>>>>     pgs:     625 active+clean
>>>>>>>
>>>>>>>   io:
>>>>>>>     client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
>>>>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> Here's some daemonperf output:
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------------------------------
>>>>>>> root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
>>>>>>> ----------------------------------------mds----------------------------------------- --mds_cache--- ------mds_log------ -mds_mem- -------mds_server------- mds_ -----objecter------ purg
>>>>>>> req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc  hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr  cre  cat |sess|actv rd   wr   rdwr|purg|
>>>>>>>  40    0    0  767k  78k   0    0    0    1    6    1    0    0    5    5    3    7 |1.1k   0    0 | 17  3.7k  134   0 |767k 767k| 40    5    0    0    0 |110 |  4    2   21    0 |  2
>>>>>>>  57    2    0  767k  78k   0    0    0    3   16    3    0    0   11   11    0   17 |1.1k   0    0 | 45  3.7k  137   0 |767k 767k| 57    8    0    0    0 |110 |  0    2   28    0 |  4
>>>>>>>  57    4    0  767k  78k   0    0    0    4   34    4    0    0   34   33    2   26 |1.0k   0    0 |134  3.9k  139   0 |767k 767k| 57   13    0    0    0 |110 |  0    2  112    0 | 19
>>>>>>>  67    3    0  767k  78k   0    0    0    6   32    6    0    0   22   22    0   32 |1.1k   0    0 | 78  3.9k  141   0 |767k 768k| 67    4    0    0    0 |110 |  0    2   56    0 |  2
>>>>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> Any ideas where to look?
>>>>>>
>>>>>> Check the perf dump output of the mds:
>>>>>>
>>>>>>     ceph tell mds.<fs_name>:0 perf dump
>>>>>>
>>>>>> over a period of time to identify what's going on. You can also look
>>>>>> at the objecter_ops (another tell command) for the MDS.
>>>>>>
>>>>>> --
>>>>>> Patrick Donnelly, Ph.D.
>>>>>> He / Him / His
>>>>>> Principal Software Engineer
>>>>>> Red Hat, Inc.
>>>>>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

> --
> Milind

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
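(For anyone following up: a rough sketch for confirming whether the metadata-pool write rate actually settles after an MDS cache drop. The pool name cephfs_metadata is an assumption here - substitute whatever your filesystem's metadata pool is called, e.g. as listed by "ceph fs status".)

# Per-pool client io rates; the metadata pool is the one of interest
ceph osd pool stats cephfs_metadata    # pool name is an assumption - check "ceph fs status"

# Drop the MDS cache (the step that resolved things earlier in this thread)
ceph tell mds.$(hostname -s) cache drop

# Watch the metadata-pool write rate over the following minutes/hours
watch -n 10 'ceph osd pool stats cephfs_metadata'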