Re: CephFS constant high write I/O to the metadata pool

I might have spoken too soon :(

Now, about 60 hours after dropping the caches, the write bandwidth has
climbed roughly linearly from those initial hundreds of kB/s to nearly
10MB/s.

I don't think this is caused by the cache simply filling up again either.
Right after dropping the cache I tested whether refilling it would show any
bandwidth increase by running "tree" at the root of one of the mounts, and
it had no effect at the time. So the cache has effectively been fully
populated this whole time.
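
(For reference, this is roughly how the rate can be watched over time - the
metadata pool name below is an assumption, adjust to your setup:)

watch -n 10 'ceph osd pool stats cephfs_metadata'

# or log it for a longer trend:
while true; do
    echo "$(date -Is) $(ceph osd pool stats cephfs_metadata | grep 'client io')"
    sleep 60
done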

Boggled,
---------------------------
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---------------------------


On Sat, Nov 5, 2022 at 12:47 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:

> Oh Lordy,
>
> Seems like I finally got this resolved. All it needed in the end was to
> drop the MDS cache with:
> ceph tell mds.`hostname` cache drop
>
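> (In case someone else runs into this - a rough per-rank version with the
> filesystem name as a placeholder; check cache usage before and after:)
>
> ceph tell mds.cephfs:0 cache drop             # rank 0; optionally append a timeout in seconds
> ceph daemon mds.`hostname -s` cache status    # on the MDS host, for a before/after comparison
>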
> The funny thing is that whatever the issue with the cache was, it had
> persisted through several Ceph upgrades and node reboots. It's a live
> production system, so I guess there has simply never been a moment when all
> MDS daemons were down at once, which would have forced a full rebuild of
> the cache... maybe :|
>
> Unfortunately I don't remember when this issue arose, and my metrics don't
> reach far enough back... but I wonder if it could have started already with
> the Octopus->Pacific upgrade...
>
> Cheers,
> ---------------------------
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---------------------------
>
>
> On Mon, Oct 24, 2022 at 9:36 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>
>> I tried my luck and upgraded to 17.2.4, but unfortunately that didn't make
>> any difference here either.
>>
>> I also took another, closer look at all kinds of client op and request
>> stats and whatnot, which only made me more certain that this I/O is not
>> caused by any clients.
>>
>> What internal MDS operation or mechanism could cause such high idle write
>> I/O? I've tried to fiddle a bit with some of the MDS cache trim and memory
>> settings, but I haven't noticed any effect. Any pointers appreciated.
>>
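>> (For reference, the settings I mean are along these lines - current values
>> can be checked with "ceph config get", option names as in the docs:)
>>
>> ceph config get mds mds_cache_memory_limit
>> ceph config get mds mds_cache_trim_threshold
>> ceph config get mds mds_cache_trim_decay_rate
>>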
>> Cheers,
>> ---------------------------
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---------------------------
>>
>>
>> On Mon, Oct 17, 2022 at 10:28 AM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>
>>> Hi Patrick,
>>>
>>> With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops" and/or
>>> "ceph tell mds.pve-core-1 objecter_requests"? Both these show very few
>>> requests/ops - many times just returning empty lists. I'm pretty sure that
>>> this I/O isn't generated by any clients - I've earlier tried to isolate
>>> this by shutting down all cephfs clients and this didn't have any
>>> noticeable effect.
>>>
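>>> (A quick way to double-check that no client is actually doing anything is
>>> to list the sessions and eyeball their request counters, e.g.:)
>>>
>>> ceph tell mds.pve-core-1 session ls
>>>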
>>> I tried to watch what is going on with that "perf dump", but to be honest
>>> all I can see is some numbers going up in the different sections :)
>>> ...I don't have a clue what to focus on or how to interpret it.
>>>
>>> Here's a perf dump in case you or anyone else can make something out of
>>> it: https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
>>>
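>>> (For anyone digging into the dump: a rough way to spot the counters that
>>> are actually moving is to diff two dumps taken a minute apart - jq assumed,
>>> and the counter paths below are only examples:)
>>>
>>> ceph tell mds.pve-core-1 perf dump > dump1.json; sleep 60
>>> ceph tell mds.pve-core-1 perf dump > dump2.json
>>> for c in .mds_log.evadd .objecter.op_w .mds.request; do
>>>   echo "$c: $(jq "$c" dump1.json) -> $(jq "$c" dump2.json)"
>>> done
>>>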
>>> Tnx!
>>> o.
>>>
>>> ---------------------------
>>> Olli Rajala - Lead TD
>>> Anima Vitae Ltd.
>>> www.anima.fi
>>> ---------------------------
>>>
>>>
>>> On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>>>
>>>> Hello Olli,
>>>>
>>>> On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I'm seeing constant 25-50MB/s writes to the metadata pool even when all
>>>> > clients and the cluster are idling and in a clean state. This surely
>>>> > can't be normal?
>>>> >
>>>> > There are no apparent issues with the performance of the cluster, but
>>>> > this write rate seems excessive, and I don't know where to look for the
>>>> > culprit.
>>>> >
>>>> > The setup is Ceph 16.2.9 running on a hyperconverged 3-node core cluster
>>>> > plus 6 HDD OSD nodes.
>>>> >
>>>> > Here's a typical status when pretty much all clients are idling. Most of
>>>> > that write bandwidth and maybe a fifth of the write IOPS is hitting the
>>>> > metadata pool.
>>>> >
>>>> > ---------------------------------------------------------------------------------------------------
>>>> > root@pve-core-1:~# ceph -s
>>>> >   cluster:
>>>> >     id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
>>>> >     health: HEALTH_OK
>>>> >
>>>> >   services:
>>>> >     mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
>>>> >     mgr: pve-core-1(active, since 4w), standbys: pve-core-2,
>>>> pve-core-3
>>>> >     mds: 1/1 daemons up, 2 standby
>>>> >     osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
>>>> >
>>>> >   data:
>>>> >     volumes: 1/1 healthy
>>>> >     pools:   10 pools, 625 pgs
>>>> >     objects: 70.06M objects, 46 TiB
>>>> >     usage:   95 TiB used, 182 TiB / 278 TiB avail
>>>> >     pgs:     625 active+clean
>>>> >
>>>> >   io:
>>>> >     client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
>>>> >
>>>> > ---------------------------------------------------------------------------------------------------
>>>> >
>>>> > Here's some daemonperf dump:
>>>> >
>>>> >
>>>> > ---------------------------------------------------------------------------------------------------
>>>> > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
>>>> > ----------------------------------------mds----------------------------------------- --mds_cache--- ------mds_log------ -mds_mem- -------mds_server------- mds_ -----objecter------ purg
>>>> > req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc  hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr  cre  cat |sess|actv rd   wr   rdwr|purg|
>>>> >  40    0    0  767k  78k   0    0    0    1    6    1    0    0    5    5    3    7 |1.1k   0    0 | 17  3.7k 134    0 |767k 767k| 40    5    0    0    0 |110 |  4    2   21    0 |  2
>>>> >  57    2    0  767k  78k   0    0    0    3   16    3    0    0   11   11    0   17 |1.1k   0    0 | 45  3.7k 137    0 |767k 767k| 57    8    0    0    0 |110 |  0    2   28    0 |  4
>>>> >  57    4    0  767k  78k   0    0    0    4   34    4    0    0   34   33    2   26 |1.0k   0    0 |134  3.9k 139    0 |767k 767k| 57   13    0    0    0 |110 |  0    2  112    0 | 19
>>>> >  67    3    0  767k  78k   0    0    0    6   32    6    0    0   22   22    0   32 |1.1k   0    0 | 78  3.9k 141    0 |767k 768k| 67    4    0    0    0 |110 |  0    2   56    0 |  2
>>>> > ---------------------------------------------------------------------------------------------------
>>>> > Any ideas on where to look?
>>>>
>>>> Check the perf dump output of the mds:
>>>>
>>>> ceph tell mds.<fs_name>:0 perf dump
>>>>
>>>> over a period of time to identify what's going on. You can also look
>>>> at the objecter_ops (another tell command) for the MDS.
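>>>>
>>>> (Presumably that means commands along these lines, addressed at rank 0;
>>>> adjust the filesystem name:)
>>>>
>>>> ceph tell mds.<fs_name>:0 ops
>>>> ceph tell mds.<fs_name>:0 objecter_requests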
>>>>
>>>> --
>>>> Patrick Donnelly, Ph.D.
>>>> He / Him / His
>>>> Principal Software Engineer
>>>> Red Hat, Inc.
>>>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>>>>
>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


