Re: CephFS constant high write I/O to the metadata pool

Hi Olli,

On Tue, Jul 2, 2024 at 7:51 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>
> Hi - mostly as a note to future me, and in case anyone else is looking for
> the same issue...
>
> I finally solved this a couple of months ago. No idea what is wrong with
> Ceph, but the root cause triggering this MDS issue was that I had several
> workstations and a couple of servers where the updatedb of "locate" was
> being run by the daily cron at exactly the same time every night, causing a
> high momentary strain on the MDS, which then somehow screwed up the
> metadata caching and flushing, creating this cumulative write I/O.

That's probably due to updatedb walking the ceph file system and also
possibly triggering cap recalls from other clients.

>
> The thing to note here is that there's a difference between the "locate"
> and "mlocate" packages. The default updatedb config (on Ubuntu at least)
> for "mlocate" does skip scanning cephfs filesystems, but not so for
> "locate", which happily ventures onto all of your cephfs mounts :|
>
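
For anyone else hitting this: the usual fix is to make updatedb prune the
cephfs filesystem type (or the mount points). The exact config file and
syntax depend on the distro and on which locate implementation is installed,
but for mlocate/plocate it would be something along these lines - the fstype
names and the path below are assumptions, check what /proc/mounts reports
for your mounts:

    # /etc/updatedb.conf - append the cephfs types to the existing PRUNEFS list
    PRUNEFS="... ceph fuse.ceph fuse.ceph-fuse"
    # or, alternatively, list the mount points explicitly (example path)
    PRUNEPATHS="... /mnt/cephfs"

The findutils-based "locate" ships its own updatedb which doesn't read
mlocate's updatedb.conf, so the equivalent pruning has to be configured
separately for it (e.g. via the PRUNEFS environment variable for its cron
job).
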
> ---------------------------
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---------------------------
>
>
> On Wed, Dec 14, 2022 at 7:41 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
>
> > Hi,
> >
> > One thing I now noticed in the mds logs is that there's a ton of entries
> > like this:
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> > [d345,d346] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> > 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
> > [d345,d346] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> > 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> > [d343,d344] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> > 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
> > [d343,d344] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> > 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> > [d341,d342] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> > 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
> > [d341,d342] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> > 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> > [d33f,d340] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> > 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
> > [d33f,d340] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> > 695=484+211)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache  projecting to
> > [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.317400+0200 b787972591
> > 694=484+210)
> > 2022-12-11T18:20:49.321+0200 7fdd0edde700 20 mds.0.cache         result
> > [d33d,d33e] n(v1638 rc2022-12-11T18:20:49.321400+0200 b787972591
> > 695=484+211)
> >
> > ...and after dropping the caches there are considerably fewer of those -
> > normal, abnormal, typical, atypical? ...or is that just something that
> > starts happening after the cache gets filled?
> >
> > Tnx,
> > ---------------------------
> > Olli Rajala - Lead TD
> > Anima Vitae Ltd.
> > www.anima.fi
> > ---------------------------
> >
> >
> > On Sun, Dec 11, 2022 at 9:07 PM Olli Rajala <olli.rajala@xxxxxxxx> wrote:
> >
> >> Hi,
> >>
> >> I'm still totally lost with this issue. Lately I've had a couple of
> >> incidents where the write bandwidth has suddenly jumped to even crazier
> >> levels.
> >> See the graph here:
> >> https://gist.github.com/olliRJL/3e97e15a37e8e801a785a1bd5358120d
> >>
> >> The points where it drops to something manageable again are when I have
> >> dropped the mds caches. Usually after the drop there is a steady rise,
> >> but now these sudden jumps are something new and even more scary :E
> >>
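
(Side note for anyone following along: I'm assuming the cache drop referred
to above is the MDS admin command, i.e. something like

    ceph tell mds.<name> cache drop <timeout>

which asks clients to release caps and then trims the MDS cache.)
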
> >> Here's a fresh 2sec level 20 mds log:
> >> https://gist.github.com/olliRJL/074bec65787085e70db8af0ec35f8148
> >>
> >> Any help and ideas are greatly appreciated. Is there any tool or
> >> procedure to safely check or rebuild the mds data, in case this
> >> behaviour is caused by some hidden issue with the data itself?
> >>
> >> Tnx,
> >> ---------------------------
> >> Olli Rajala - Lead TD
> >> Anima Vitae Ltd.
> >> www.anima.fi
> >> ---------------------------
> >>
> >>
> >> On Fri, Nov 11, 2022 at 9:14 AM Venky Shankar <vshankar@xxxxxxxxxx>
> >> wrote:
> >>
> >>> On Fri, Nov 11, 2022 at 3:06 AM Olli Rajala <olli.rajala@xxxxxxxx>
> >>> wrote:
> >>> >
> >>> > Hi Venky,
> >>> >
> >>> > I have indeed observed the output of the different sections of perf
> >>> > dump like so:
> >>> > watch -n 1 ceph tell mds.`hostname` perf dump objecter
> >>> > watch -n 1 ceph tell mds.`hostname` perf dump mds_cache
> >>> > ...etc...
> >>> >
> >>> > ...but without a proper understanding of what a normal rate of
> >>> > increase is for these numbers, it's really difficult to make anything
> >>> > of that.
> >>> >
> >>> > btw - is there some convenient way to capture this kind of temporal
> >>> > output for others to view? Sure, I could just dump it once a second
> >>> > to a file or to sequential files, but is there some tool or
> >>> > convention that is easier to look at and analyze?
> >>>
> >>> Not really - you'd have to do it yourself.
> >>>
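
(For the archives: "doing it yourself" can be as simple as a shell loop that
dumps the counters to timestamped files - the mds name, interval and output
directory here are just examples:

    mkdir -p /tmp/mds-perf
    while true; do
        ceph tell mds.$(hostname -s) perf dump \
            > /tmp/mds-perf/$(date +%Y%m%dT%H%M%S).json
        sleep 1
    done

The resulting JSON snapshots can then be diffed, or filtered with something
like jq, to see which counters are actually moving between samples.)
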
> >>> >
> >>> > Tnx,
> >>> > ---------------------------
> >>> > Olli Rajala - Lead TD
> >>> > Anima Vitae Ltd.
> >>> > www.anima.fi
> >>> > ---------------------------
> >>> >
> >>> >
> >>> > On Thu, Nov 10, 2022 at 8:18 AM Venky Shankar <vshankar@xxxxxxxxxx>
> >>> wrote:
> >>> >>
> >>> >> Hi Olli,
> >>> >>
> >>> >> On Mon, Oct 17, 2022 at 1:08 PM Olli Rajala <olli.rajala@xxxxxxxx>
> >>> wrote:
> >>> >> >
> >>> >> > Hi Patrick,
> >>> >> >
> >>> >> > With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops"
> >>> >> > and/or "ceph tell mds.pve-core-1 objecter_requests"? Both these
> >>> >> > show very few requests/ops - many times just returning empty
> >>> >> > lists. I'm pretty sure that this I/O isn't generated by any
> >>> >> > clients - I've earlier tried to isolate this by shutting down all
> >>> >> > cephfs clients and this didn't have any noticeable effect.
> >>> >> >
> >>> >> > I tried to watch what is going on with that "perf dump" but to be
> >>> >> > honest all I can see is some numbers going up in the different
> >>> >> > sections :) ...don't have a clue what to focus on and how to
> >>> >> > interpret that.
> >>> >> >
> >>> >> > Here's a perf dump if you or anyone could make something out of
> >>> >> > that:
> >>> >> > https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
> >>> >>
> >>> >> You'd need to capture this over a period of time to see what ops might
> >>> >> be going through and what the mds is doing.
> >>> >>
> >>> >> >
> >>> >> > Tnx!
> >>> >> > o.
> >>> >> >
> >>> >> > ---------------------------
> >>> >> > Olli Rajala - Lead TD
> >>> >> > Anima Vitae Ltd.
> >>> >> > www.anima.fi
> >>> >> > ---------------------------
> >>> >> >
> >>> >> >
> >>> >> > On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly <
> >>> pdonnell@xxxxxxxxxx>
> >>> >> > wrote:
> >>> >> >
> >>> >> > > Hello Olli,
> >>> >> > >
> >>> >> > > On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala <olli.rajala@xxxxxxxx>
> >>> wrote:
> >>> >> > > >
> >>> >> > > > Hi,
> >>> >> > > >
> >>> >> > > > I'm seeing constant 25-50MB/s writes to the metadata pool even
> >>> >> > > > when all clients and the cluster are idling and in a clean
> >>> >> > > > state. This surely can't be normal?
> >>> >> > > >
> >>> >> > > > There are no apparent issues with the performance of the
> >>> >> > > > cluster, but this write rate seems excessive and I don't know
> >>> >> > > > where to look for the culprit.
> >>> >> > > >
> >>> >> > > > The setup is Ceph 16.2.9 running in a hyperconverged 3-node
> >>> >> > > > core cluster and 6 hdd osd nodes.
> >>> >> > > >
> >>> >> > > > Here's a typical status when pretty much all clients are
> >>> >> > > > idling. Most of that write bandwidth and maybe a fifth of the
> >>> >> > > > write iops is hitting the metadata pool.
> >>> >> > > >
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > > root@pve-core-1:~# ceph -s
> >>> >> > > >   cluster:
> >>> >> > > >     id:     2088b4b1-8de1-44d4-956e-aa3d3afff77f
> >>> >> > > >     health: HEALTH_OK
> >>> >> > > >
> >>> >> > > >   services:
> >>> >> > > >     mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3
> >>> (age 2w)
> >>> >> > > >     mgr: pve-core-1(active, since 4w), standbys: pve-core-2,
> >>> pve-core-3
> >>> >> > > >     mds: 1/1 daemons up, 2 standby
> >>> >> > > >     osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
> >>> >> > > >
> >>> >> > > >   data:
> >>> >> > > >     volumes: 1/1 healthy
> >>> >> > > >     pools:   10 pools, 625 pgs
> >>> >> > > >     objects: 70.06M objects, 46 TiB
> >>> >> > > >     usage:   95 TiB used, 182 TiB / 278 TiB avail
> >>> >> > > >     pgs:     625 active+clean
> >>> >> > > >
> >>> >> > > >   io:
> >>> >> > > >     client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > >
> >>> >> > > > Here's some daemonperf dump:
> >>> >> > > >
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
> >>> >> > > >
> >>> >> > > > ----------------------------------------mds----------------------------------------- --mds_cache--- ------mds_log------ -mds_mem- -------mds_server------- mds_ -----objecter------ purg
> >>> >> > > > req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc  hccd hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr  cre   cat |sess|actv rd   wr   rdwr|purg|
> >>> >> > > >  40    0    0  767k  78k   0    0    0    1    6    1    0    0    5    5    3    7 |1.1k   0    0 | 17  3.7k 134    0 |767k 767k| 40    5    0    0    0 |110 |  4    2   21    0 |  2
> >>> >> > > >  57    2    0  767k  78k   0    0    0    3   16    3    0    0   11   11    0   17 |1.1k   0    0 | 45  3.7k 137    0 |767k 767k| 57    8    0    0    0 |110 |  0    2   28    0 |  4
> >>> >> > > >  57    4    0  767k  78k   0    0    0    4   34    4    0    0   34   33    2   26 |1.0k   0    0 |134  3.9k 139    0 |767k 767k| 57   13    0    0    0 |110 |  0    2  112    0 | 19
> >>> >> > > >  67    3    0  767k  78k   0    0    0    6   32    6    0    0   22   22    0   32 |1.1k   0    0 | 78  3.9k 141    0 |767k 768k| 67    4    0    0    0 |110 |  0    2   56    0 |  2
> >>> >> > > >
> >>> >> > > > ---------------------------------------------------------------------------------------------------
> >>> >> > > > Any ideas where to look at?
> >>> >> > >
> >>> >> > > Check the perf dump output of the mds:
> >>> >> > >
> >>> >> > > ceph tell mds.<fs_name>:0 perf dump
> >>> >> > >
> >>> >> > > over a period of time to identify what's going on. You can also
> >>> look
> >>> >> > > at the objecter_ops (another tell command) for the MDS.
> >>> >> > >
> >>> >> > > --
> >>> >> > > Patrick Donnelly, Ph.D.
> >>> >> > > He / Him / His
> >>> >> > > Principal Software Engineer
> >>> >> > > Red Hat, Inc.
> >>> >> > > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >>> >> > >
> >>> >> > >
> >>> >> > _______________________________________________
> >>> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >>> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>> >> >
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Cheers,
> >>> >> Venky
> >>> >>
> >>>
> >>>
> >>> --
> >>> Cheers,
> >>> Venky
> >>>
> >>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx



-- 
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



