Re: CephFS constant high write I/O to the metadata pool

This was common in the NFS days, and some Linux distributions deliberately slewed the execution time. A find over an NFS mount was a sure-fire way to horque the server (e.g. a Convex C1).

IMHO, since the tool relies on a static index it isn't very useful, and I routinely remove any variant of it from my systems.
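
If one does keep a locate variant around, the usual safeguard is to tell updatedb to prune network filesystems. A minimal sketch of the relevant config - the file location, variable names and filesystem type strings differ between the findutils, mlocate and plocate packages, so treat the values below as illustrative and check your package's documentation:

  # /etc/updatedb.conf (example values only)
  PRUNEFS="ceph fuse.ceph nfs nfs4 cifs"   # filesystem types updatedb should skip
  PRUNEPATHS="/mnt/cephfs"                 # mount points to skip explicitly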

ymmv

> On Jul 2, 2024, at 10:20, Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> 
> Hi - mostly as a note to future me and to anyone else looking for the same
> issue...
> 
> I finally solved this a couple of months ago. No idea what is wrong with
> Ceph, but the root cause triggering this MDS issue was that I had several
> workstations and a couple of servers where the updatedb of "locate" was
> being run by daily cron at exactly the same time every night, causing a
> high momentary strain on the MDS, which then somehow screwed up the
> metadata caching and flushing, creating this cumulative write I/O.
> 
> The thing to note here is that there's a difference between the "locate"
> and "mlocate" packages. The default updatedb config (on Ubuntu at least)
> for "mlocate" skips scanning cephfs filesystems, but the one for "locate"
> happily ventures onto all of your cephfs mounts :|
> 
> ---------------------------
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---------------------------
> 
> 
> On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> 
>> Hi,
>> 
>> One thing I now noticed in the mds logs is that there's a ton of entries
>> like this:
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
>> [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
>> [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
>> [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
>> [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
>> 694=484+210)
>> 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
>> [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
>> 695=484+211)
>> 
>> ...and after dropping the caches there are considerably fewer of those -
>> normal, abnormal, typical, atypical? ...or is that just something that
>> starts happening after the cache gets filled?
>> 
>> Tnx,
>> ---------------------------
>> Olli Rajala - Lead TD
>> Anima Vitae Ltd.
>> www.anima.fi
>> ---------------------------
>> 
>> 
>> On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>> 
>>> Hi,
>>> 
>>> I'm still totally lost with this issue. And lately I've had a couple of
>>> incidents where the write bandwidth has suddenly jumped to even crazier
>>> levels.
>>> See the graph here:
>>> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
>>> 
>>> The points where it drops to something manageable again are when I have
>>> dropped the mds caches. Usually after the drop there is a steady rise,
>>> but now these sudden jumps are something new and even more scary :E
>>> 
>>> Here's a fresh 2sec level 20 mds log:
>>> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
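>>> 
>>> (For reference, roughly how a short capture like that can be taken - a
>>> sketch only, assuming the mds here is named after the host pve-core-1;
>>> adjust the id and timings for your own setup:)
>>> 
>>>   # crank up mds debug logging for a couple of seconds, then revert
>>>   ceph config set mds.pve-core-1 debug_mds 20
>>>   sleep 2
>>>   ceph config set mds.pve-core-1 debug_mds 1
>>> 
>>>   # the cache drop mentioned above
>>>   ceph tell mds.pve-core-1 cache drop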
>>> 
>>> Any help and ideas are greatly appreciated. Is there any tool or
>>> procedure to safely check or rebuild the mds data, in case this
>>> behaviour is caused by some hidden issue with the data itself?
>>> 
>>> Tnx,
>>> ---------------------------
>>> Olli Rajala - Lead TD
>>> Anima Vitae Ltd.
>>> www.anima.fi
>>> ---------------------------
>>> 
>>> 
>>> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar <vshankar@xxxxxxxxxx>
>>> wrote:
>>> 
>>>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala <olli.rajala@xxxxxxxx>
>>>> wrote:
>>>>> 
>>>>> Hi Venky,
>>>>> 
>>>>> I have indeed observed the output of the different sections of perf
>>>>> dump like so:
>>>>> watch -n 1 ceph tell mds.`hostname` perf dump objecter
>>>>> watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
>>>>> ...etc...
>>>>> 
>>>>> ...but without a proper understanding of what a normal rate is for some
>>>>> number to go up, it's really difficult to make anything of that.
>>>>> 
>>>>> btw - is there some convenient way to capture this kind of temporal
>>>>> output for others to view? Sure, I could just dump once a second to a
>>>>> file or to sequential files, but is there some tool or convention that
>>>>> is easy to look at and analyze?
>>>> 
>>>> Not really - you'd have to do it yourself.
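>>>> 
>>>> A do-it-yourself capture can be as simple as a shell loop along these
>>>> lines (just a sketch - timestamped raw JSON appended to one file, with
>>>> whatever section and interval you need):
>>>> 
>>>>   #!/bin/sh
>>>>   # poll the mds perf counters once a second for later analysis
>>>>   while true; do
>>>>       echo "=== $(date -Is) ===" >> mds_perf_dump.log
>>>>       ceph tell mds.`hostname` perf dump >> mds_perf_dump.log
>>>>       sleep 1
>>>>   done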
>>>> 
>>>>> 
>>>>> Tnx,
>>>>> ---------------------------
>>>>> Olli Rajala - Lead TD
>>>>> Anima Vitae Ltd.
>>>>> www.anima.fi
>>>>> ---------------------------
>>>>> 
>>>>> 
>>>>> On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar <vshankar@xxxxxxxxxx>
>>>> wrote:
>>>>>> 
>>>>>> Hi Olli,
>>>>>> 
>>>>>> On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala <olli.rajala@xxxxxxxx>
>>>> wrote:
>>>>>>> 
>>>>>>> Hi Patrick,
>>>>>>> 
>>>>>>> With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops"
>>>>>>> and/or "ceph tell mds.pve-core-1 objecter_requests"? Both of these
>>>>>>> show very few requests/ops - many times they just return empty lists.
>>>>>>> I'm pretty sure this I/O isn't generated by any clients - I earlier
>>>>>>> tried to isolate this by shutting down all cephfs clients and it
>>>>>>> didn't have any noticeable effect.
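>>>>>>> (To double-check that, client sessions can be listed with something
>>>>>>> like:
>>>>>>>   ceph tell mds.pve-core-1 session ls
>>>>>>> ...which should come back more or less empty when no cephfs clients
>>>>>>> are mounted.)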
>>>>>>> 
>>>>>>> I tried to watch what is going on with that "perf dump", but to be
>>>>>>> honest all I can see is some numbers going up in the different
>>>>>>> sections :) ...I don't have a clue what to focus on or how to
>>>>>>> interpret it.
>>>>>>> 
>>>>>>> Here's a perf dump if you or anyone else could make something out of
>>>>>>> it:
>>>>>>> https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
>>>>>> 
>>>>>> You'd need to capture this over a period of time to see what ops might
>>>>>> be going through and what the mds is doing.
>>>>>> 
>>>>>>> 
>>>>>>> Tnx!
>>>>>>> o.
>>>>>>> 
>>>>>>> ---------------------------
>>>>>>> Olli Rajala - Lead TD
>>>>>>> Anima Vitae Ltd.
>>>>>>> www.anima.fi
>>>>>>> ---------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <
>>>> pdonnell@xxxxxxxxxx>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hello Olli,
>>>>>>>> 
>>>>>>>> On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.rajala@xxxxxxxx>
>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I'm seeing constant 25-50 MB/s writes to the metadata pool even
>>>>>>>>> when all clients and the cluster are idle and in a clean state.
>>>>>>>>> This surely can't be normal?
>>>>>>>>> 
>>>>>>>>> There are no apparent issues with the performance of the cluster,
>>>>>>>>> but this write rate seems excessive and I don't know where to look
>>>>>>>>> for the culprit.
>>>>>>>>> 
>>>>>>>>> The setup is Ceph 16.2.9 running on a hyperconverged 3-node core
>>>>>>>>> cluster and 6 HDD OSD nodes.
>>>>>>>>> 
>>>>>>>>> Here's a typical status when pretty much all clients are idling.
>>>>>>>>> Most of that write bandwidth, and maybe a fifth of the write IOPS,
>>>>>>>>> is hitting the metadata pool.
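>>>>>>>>> (The per-pool split can be confirmed with something along the
>>>>>>>>> lines of:
>>>>>>>>>   watch -n 1 ceph osd pool stats <your cephfs metadata pool>
>>>>>>>>> ...with the pool name depending on your setup.)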
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>>> root@pve-core-1:~# ceph -s
>>>>>>>>>  cluster:
>>>>>>>>>    id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
>>>>>>>>>    health: HEALTH_OK
>>>>>>>>> 
>>>>>>>>>  services:
>>>>>>>>>    mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
>>>>>>>>>    mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
>>>>>>>>>    mds: 1/1 daemons up, 2 standby
>>>>>>>>>    osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
>>>>>>>>> 
>>>>>>>>>  data:
>>>>>>>>>    volumes: 1/1 healthy
>>>>>>>>>    pools:   10 pools, 625 pgs
>>>>>>>>>    objects: 70.06M objects, 46 TiB
>>>>>>>>>    usage:   95 TiB used, 182 TiB / 278 TiB avail
>>>>>>>>>    pgs:     625 active+clean
>>>>>>>>> 
>>>>>>>>>  io:
>>>>>>>>>    client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
>>>>>>>>> 
>>>>>>>> 
>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>>> 
>>>>>>>>> Here's some daemonperf dump:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>>> root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
>>>>>>>>> 
>>>>>>>> 
>>>> ----------------------------------------mds-----------------------------------------
>>>>>>>>> --mds_cache--- ------mds_log------ -mds_mem-
>>>> -------mds_server-------
>>>>>>>> mds_
>>>>>>>>> -----objecter------ purg
>>>>>>>>> req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa
>>>> cfa  hcc
>>>>>>>> hccd
>>>>>>>>> hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr
>>>> hcs  hsr  cre
>>>>>>>>> cat |sess|actv rd   wr   rdwr|purg|
>>>>>>>>> 40    0    0  767k  78k   0    0    0    1    6    1    0
>>>> 0    5    5
>>>>>>>>> 3    7 |1.1k   0    0 | 17  3.7k 134    0 |767k 767k| 40    5
>>>>  0    0
>>>>>>>>> 0 |110 |  4    2   21    0 |  2
>>>>>>>>> 57    2    0  767k  78k   0    0    0    3   16    3    0
>>>> 0   11   11
>>>>>>>>> 0   17 |1.1k   0    0 | 45  3.7k 137    0 |767k 767k| 57    8
>>>>  0    0
>>>>>>>>> 0 |110 |  0    2   28    0 |  4
>>>>>>>>> 57    4    0  767k  78k   0    0    0    4   34    4    0
>>>> 0   34   33
>>>>>>>>> 2   26 |1.0k   0    0 |134  3.9k 139    0 |767k 767k| 57   13
>>>>  0    0
>>>>>>>>> 0 |110 |  0    2  112    0 | 19
>>>>>>>>> 67    3    0  767k  78k   0    0    0    6   32    6    0
>>>> 0   22   22
>>>>>>>>> 0   32 |1.1k   0    0 | 78  3.9k 141    0 |767k 768k| 67    4
>>>>  0    0
>>>>>>>>> 0 |110 |  0    2   56    0 |  2
>>>>>>>>> 
>>>>>>>> 
>>>> ---------------------------------------------------------------------------------------------------
>>>>>>>>> Any ideas where to look?
>>>>>>>> 
>>>>>>>> Check the perf dump output of the mds:
>>>>>>>> 
>>>>>>>> ceph tell mds.<fs_name>:0 perf dump
>>>>>>>> 
>>>>>>>> over a period of time to identify what's going on. You can also
>>>>>>>> look at the objecter_ops (another tell command) for the MDS.
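>>>>>>>> 
>>>>>>>> A simple way to do that (a sketch - the interval is arbitrary and
>>>>>>>> bash process substitution is assumed) is to snapshot the counters
>>>>>>>> twice and diff them to see which ones are moving:
>>>>>>>> 
>>>>>>>>   ceph tell mds.<fs_name>:0 perf dump > perf_a.json
>>>>>>>>   sleep 60
>>>>>>>>   ceph tell mds.<fs_name>:0 perf dump > perf_b.json
>>>>>>>>   diff <(jq -S . perf_a.json) <(jq -S . perf_b.json)
>>>>>>>> 
>>>>>>>>   # the other tell command mentioned above:
>>>>>>>>   ceph tell mds.<fs_name>:0 objecter_requests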
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Patrick Donnelly, Ph.D.
>>>>>>>> He / Him / His
>>>>>>>> Principal Software Engineer
>>>>>>>> Red Hat, Inc.
>>>>>>>> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>>>>>>>> 
>>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Cheers,
>>>>>> Venky
>>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Cheers,
>>>> Venky
>>>> 
>>>> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



