Re: CephFS constant high write I/O to the metadata pool

Hi Venky,

Yep, that's what I figured out too - but it must also have triggered some
underlying issue where the mds gets into a state where each of these
updatedb runs accumulates more and more of that constant write io to the
metadata pool, and it never settles. It feels like the mds is constantly
flushing its cache even though there is no equivalent amount of read io
to fill the cache up in the first place.
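
For anyone else hitting this, the workaround boils down to keeping
updatedb off your cephfs mounts. A minimal sketch of what that can look
like in /etc/updatedb.conf - the exact file and the default values vary
by distro and by locate implementation, so treat these values as
examples rather than our literal config:

# keep updatedb out of network/cluster filesystems
# ("ceph" = kernel cephfs mounts, "fuse.ceph-fuse" = ceph-fuse mounts)
PRUNEFS="ceph fuse.ceph-fuse nfs nfs4 cifs"
# and/or prune the mount points themselves
PRUNEPATHS="/mnt/cephfs"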

In this graph that I posted earlier:
https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d

...you can see those updatedb runs as the spikes on the read side - but
other than those the read io is almost zero. And the points where the
write io drops are the times when I dropped the mds caches.
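
(For reference, the mds cache can be dropped via the tell interface -
something along these lines, with the mds name adjusted to your setup:

ceph tell mds.`hostname` cache drop
ceph tell mds.`hostname` cache status   # rough before/after check of cache usage

...nothing fancier than that.)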

---------------------------
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---------------------------


On Wed, Jul 3, 2024 at 7:49 PM Venky Shankar <vshankar@xxxxxxxxxx> wrote:

> Hi Olli,
>
> On Tue, Jul 2, 2024 at 7:51 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> >
> > Hi - mostly a note to future me and to anyone else looking into the
> > same issue...
> >
> > I finally solved this a couple of months ago. No idea what is wrong with
> > Ceph, but the root cause triggering this MDS issue was that I had several
> > workstations and a couple of servers where the updatedb of "locate" was
> > being run by the daily cron at exactly the same time every night, causing
> > a high momentary strain on the MDS which then somehow screwed up the
> > metadata caching and flushing, creating this cumulative write io.
>
> That's probably due to updatedb walking the ceph file system and also
> possibly triggering cap recalls from other clients.
>
> >
> > The thing to note here is that there's a difference between the "locate"
> > and "mlocate" packages. The default updatedb config (on Ubuntu at least)
> > for "mlocate" skips scanning cephfs filesystems, but not so for "locate",
> > which happily ventures onto all of your cephfs mounts :|
> >
> > ---------------------------
> > Olli Rajala - Lead TD
> > Anima Vitae Ltd.
> > www.anima.fi
> > ---------------------------
> >
> >
> > On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala <olli.rajala@xxxxxxxx>
> wrote:
> >
> > > Hi,
> > >
> > > One thing I noticed now in the mds logs is that there's a ton of
> > > entries like this:
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591 694=484+210)
> > > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591 695=484+211)
> > >
> > > ...and after dropping the caches there are considerably fewer of those
> > > - normal, abnormal, typical, atypical? ...or is that just something
> > > that starts happening after the cache gets filled?
> > >
> > > Tnx,
> > > ---------------------------
> > > Olli Rajala - Lead TD
> > > Anima Vitae Ltd.
> > > www.anima.fi
> > > ---------------------------
> > >
> > >
> > > On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala <olli.rajala@xxxxxxxx>
> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm still totally lost with this issue. And lately I've had a couple
> > >> of incidents where the write bw has suddenly jumped to even crazier
> > >> levels. See the graph here:
> > >> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
> > >>
> > >> The points where it drops to something manageable again are when I
> > >> have dropped the mds caches. Usually after the drop there is a steady
> > >> rise, but now these sudden jumps are something new and even more scary :E
> > >>
> > >> Here's a fresh 2sec level 20 mds log:
> > >> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
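
(For reference, a short level-20 capture like that can be taken by
bumping the mds debug level temporarily - a sketch, double-check the
option name and default value for your release before relying on it:

ceph tell mds.`hostname` config set debug_mds 20
sleep 2
ceph tell mds.`hostname` config set debug_mds 1/5

...the output lands in the normal mds log file on the active mds host.)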
> > >>
> > >> Any help and ideas greatly appreciated. Is there any tool or
> > >> procedure to safely check or rebuild the mds data, in case this
> > >> behaviour is caused by some hidden issue with the data itself?
> > >>
> > >> Tnx,
> > >> ---------------------------
> > >> Olli Rajala - Lead TD
> > >> Anima Vitae Ltd.
> > >> www.anima.fi
> > >> ---------------------------
> > >>
> > >>
> > >> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar <vshankar@xxxxxxxxxx>
> > >> wrote:
> > >>
> > >>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala <olli.rajala@xxxxxxxx>
> > >>> wrote:
> > >>> >
> > >>> > Hi Venky,
> > >>> >
> > >>> > I have indeed observed the output of the different sections of
> > >>> > perf dump like so:
> > >>> > watch -n 1 ceph tell mds.`hostname` perf dump objecter
> > >>> > watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
> > >>> > ...etc...
> > >>> >
> > >>> > ...but without any proper understanding of what a normal rate is
> > >>> > for some number to go up, it's really difficult to make anything
> > >>> > of it.
> > >>> >
> > >>> > btw - is there some convenient way to capture this kind of temporal
> > >>> > output for others to view? Sure, I could just dump once a second to
> > >>> > a file or to sequential files, but is there some tool or convention
> > >>> > that is easy to look at and analyze?
> > >>>
> > >>> Not really - you'd have to do it yourself.
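
(In case it helps anyone reading this later: "doing it yourself" can be
as simple as a small shell loop dumping the counters into timestamped
files - a rough sketch, adjust the mds name and interval as needed:

while true; do
  ceph tell mds.`hostname` perf dump > /tmp/mds-perf-$(date +%s).json
  sleep 1
done

...and then diff or graph the interesting counters afterwards.)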
> > >>>
> > >>> >
> > >>> > Tnx,
> > >>> > ---------------------------
> > >>> > Olli Rajala - Lead TD
> > >>> > Anima Vitae Ltd.
> > >>> > www.anima.fi
> > >>> > ---------------------------
> > >>> >
> > >>> >
> > >>> > On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
> > >>> >>
> > >>> >> Hi Olli,
> > >>> >>
> > >>> >> On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> > >>> >> >
> > >>> >> > Hi Patrick,
> > >>> >> >
> > >>> >> > With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops"
> > >>> >> > and/or "ceph tell mds.pve-core-1 objecter_requests"? Both of these
> > >>> >> > show very few requests/ops - many times just returning empty lists.
> > >>> >> > I'm pretty sure that this I/O isn't generated by any clients - I
> > >>> >> > earlier tried to isolate this by shutting down all cephfs clients
> > >>> >> > and it didn't have any noticeable effect.
> > >>> >> >
> > >>> >> > I tried to watch what is going on with that "perf dump", but to
> > >>> >> > be honest all I can see is some numbers going up in the different
> > >>> >> > sections :) ...I don't have a clue what to focus on or how to
> > >>> >> > interpret it.
> > >>> >> >
> > >>> >> > Here's a perf dump if you or anyone else can make something out
> > >>> >> > of it:
> > >>> >> > https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
> > >>> >>
> > >>> >> You'd need to capture this over a period of time to see what ops
> > >>> >> might be going through and what the mds is doing.
> > >>> >>
> > >>> >> >
> > >>> >> > Tnx!
> > >>> >> > o.
> > >>> >> >
> > >>> >> > ---------------------------
> > >>> >> > Olli Rajala - Lead TD
> > >>> >> > Anima Vitae Ltd.
> > >>> >> > www.anima.fi
> > >>> >> > ---------------------------
> > >>> >> >
> > >>> >> >
> > >>> >> > On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> > >>> >> >
> > >>> >> > > Hello Olli,
> > >>> >> > >
> > >>> >> > > On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> > >>> >> > > >
> > >>> >> > > > Hi,
> > >>> >> > > >
> > >>> >> > > > I'm seeing constant 25-50MB/s writes to the metadata pool even
> > >>> >> > > > when all clients and the cluster are idle and in a clean state.
> > >>> >> > > > This surely can't be normal?
> > >>> >> > > >
> > >>> >> > > > There are no apparent issues with the performance of the
> > >>> >> > > > cluster, but this write rate seems excessive and I don't know
> > >>> >> > > > where to look for the culprit.
> > >>> >> > > >
> > >>> >> > > > The setup is Ceph 16.2.9 running on a hyperconverged 3-node
> > >>> >> > > > core cluster plus 6 hdd osd nodes.
> > >>> >> > > >
> > >>> >> > > > Here's a typical status when pretty much all clients are
> > >>> >> > > > idling. Most of that write bandwidth, and maybe a fifth of the
> > >>> >> > > > write iops, is hitting the metadata pool.
> > >>> >> > > >
> > >>> >> > > >
> > >>> >> > > > ---------------------------------------------------------------------------------------------------
> > >>> >> > > > root@pve-core-1:~# ceph -s
> > >>> >> > > >   cluster:
> > >>> >> > > >     id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
> > >>> >> > > >     health: HEALTH_OK
> > >>> >> > > >
> > >>> >> > > >   services:
> > >>> >> > > >     mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
> > >>> >> > > >     mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
> > >>> >> > > >     mds: 1/1 daemons up, 2 standby
> > >>> >> > > >     osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
> > >>> >> > > >
> > >>> >> > > >   data:
> > >>> >> > > >     volumes: 1/1 healthy
> > >>> >> > > >     pools:   10 pools, 625 pgs
> > >>> >> > > >     objects: 70.06M objects, 46 TiB
> > >>> >> > > >     usage:   95 TiB used, 182 TiB / 278 TiB avail
> > >>> >> > > >     pgs:     625 active+clean
> > >>> >> > > >
> > >>> >> > > >   io:
> > >>> >> > > >     client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
> > >>> >> > > >
> > >>> >> > > > ---------------------------------------------------------------------------------------------------
> > >>> >> > > >
> > >>> >> > > > Here's some daemonperf dump:
> > >>> >> > > >
> > >>> >> > > > ---------------------------------------------------------------------------------------------------
> > >>> >> > > > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
> > >>> >> > > > ----------------------------------------mds-----------------------------------------
> > >>> >> > > > --mds_cache--- ------mds_log------ -mds_mem- -------mds_server------- mds_ -----objecter------ purg
> > >>> >> > > > req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr hcs  hsr  cre  cat |sess|actv rd   wr   rdwr|purg|
> > >>> >> > > >  40    0    0  767k  78k   0    0    0    1    6    1    0    0    5    5    3    7 |1.1k   0    0 | 17  3.7k 134    0 |767k 767k| 40    5    0    0    0 |110 |  4    2   21    0 |  2
> > >>> >> > > >  57    2    0  767k  78k   0    0    0    3   16    3    0    0   11   11    0   17 |1.1k   0    0 | 45  3.7k 137    0 |767k 767k| 57    8    0    0    0 |110 |  0    2   28    0 |  4
> > >>> >> > > >  57    4    0  767k  78k   0    0    0    4   34    4    0    0   34   33    2   26 |1.0k   0    0 |134  3.9k 139    0 |767k 767k| 57   13    0    0    0 |110 |  0    2  112    0 | 19
> > >>> >> > > >  67    3    0  767k  78k   0    0    0    6   32    6    0    0   22   22    0   32 |1.1k   0    0 | 78  3.9k 141    0 |767k 768k| 67    4    0    0    0 |110 |  0    2   56    0 |  2
> > >>> >> > > >
> > >>> >> > > > ---------------------------------------------------------------------------------------------------
> > >>> >> > > > Any ideas on where to look?
> > >>> >> > >
> > >>> >> > > Check the perf dump output of the mds:
> > >>> >> > >
> > >>> >> > > ceph tell mds.<fs_name>:0 perf dump
> > >>> >> > >
> > >>> >> > > over a period of time to identify what's going on. You can
> > >>> >> > > also look at the objecter_ops (another tell command) for the
> > >>> >> > > MDS.
> > >>> >> > >
> > >>> >> > > --
> > >>> >> > > Patrick Donnelly, Ph.D.
> > >>> >> > > He / Him / His
> > >>> >> > > Principal Software Engineer
> > >>> >> > > Red Hat, Inc.
> > >>> >> > > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> > >>> >> > >
> > >>> >> > >
> > >>> >> > _______________________________________________
> > >>> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> > >>> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>> >> >
> > >>> >>
> > >>> >>
> > >>> >> --
> > >>> >> Cheers,
> > >>> >> Venky
> > >>> >>
> > >>>
> > >>>
> > >>> --
> > >>> Cheers,
> > >>> Venky
> > >>>
> > >>>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>
> --
> Cheers,
> Venky
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



