Hi all, here is an update. The MDS got stuck again doing nothing. Could it be blocklisting? The MDS IP address is in the blocklist together with a bunch of others (see blocklist below). Could this have anything to do with my observation of the MDS coming up but not doing anything?

Anyway, following the suggestions I got, I did another restart with these recovery options:

global advanced mds_beacon_grace          600000.000000
mon    advanced mds_beacon_grace          600000.000000
mon    advanced mds_heartbeat_reset_grace 14400
mds    advanced mds_heartbeat_reset_grace 14400

All of these were already set in my previous attempt except for mds_heartbeat_reset_grace on the MONs. The options preventing client (re-)connections are not present in pacific. How can I prevent clients from connecting?

Well, to the result: the MDS gets stuck again. Here is the log snippet from the moment when it all goes bad:

2025-01-11T02:12:18.930+0100 7f35d0b8c700  1 mds.ceph-12 Updating MDS map to version 1103287 from mon.1
2025-01-11T02:12:31.061+0100 7f35d0b8c700  1 mds.ceph-12 Updating MDS map to version 1103288 from mon.1
2025-01-11T02:12:50.282+0100 7f35ceb88700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2025-01-11T02:12:50.282+0100 7f35ceb88700  0 mds.beacon.ceph-12 Skipping beacon heartbeat to monitors (last acked 3.99997s ago); MDS internal heartbeat is not healthy!
2025-01-11T02:12:50.782+0100 7f35ceb88700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2025-01-11T02:12:50.782+0100 7f35ceb88700  0 mds.beacon.ceph-12 Skipping beacon heartbeat to monitors (last acked 4.49996s ago); MDS internal heartbeat is not healthy!

This may or may not coincide with the start of swap usage. Swap is on a really fast enterprise SSD though, so this seems very unlikely to be the cause.

How can I change this internal heartbeat timeout? It seems to be the source of all evil here. After changing the grace parameters (before the restart), why is the MDS still not sending beacons to the MONs? What timeout is relevant here, or is it indeed blocklisting? There has to be something. Can anyone explain the meaning of these log messages, please?

I installed perf in the container, but I get the error "No permission to enable cycles:u event." I can't install the ceph symbols on the host either; it's CentOS 7.
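For completeness, this is how I set and verify these options (just a sketch with the standard config commands; the who/option/value combinations are exactly the ones from the dump above, and the last command only works while the daemon still answers tell):

# ceph config set global mds_beacon_grace 600000
# ceph config set mon mds_beacon_grace 600000
# ceph config set mon mds_heartbeat_reset_grace 14400
# ceph config set mds mds_heartbeat_reset_grace 14400
# ceph config get mds.ceph-12 mds_beacon_grace
# ceph config get mds.ceph-12 mds_heartbeat_reset_grace
# ceph tell mds.ceph-12 config get mds_beacon_grace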
Here is the blocklist; all addresses are in the public network, the *.76 address is the host with the stubborn MDS. All of these addresses are ceph-osd/mds hosts.

# ceph osd blocklist ls
192.168.32.76:6801/1430498156 2025-01-12T01:31:20.754963+0100
192.168.32.76:6800/132383718 2025-01-11T23:27:19.299845+0100
192.168.32.76:6801/132383718 2025-01-11T23:27:19.299845+0100
192.168.32.76:6801/2655583055 2025-01-11T18:47:45.777618+0100
192.168.32.76:6800/2655583055 2025-01-11T18:47:45.777618+0100
192.168.32.76:6801/26503869 2025-01-11T18:36:15.545825+0100
192.168.32.81:6801/1860839812 2025-01-11T19:01:30.106032+0100
192.168.32.81:6800/1860839812 2025-01-11T19:01:30.106032+0100
192.168.32.76:6800/26503869 2025-01-11T18:36:15.545825+0100
192.168.32.81:6801/3624695074 2025-01-11T18:20:10.720157+0100
192.168.32.76:6800/169617095 2025-01-11T18:04:24.331676+0100
192.168.32.76:6801/64253026 2025-01-11T19:57:05.362481+0100
192.168.32.77:6800/4046036080 2025-01-11T14:21:01.619782+0100
192.168.32.72:6800/1597941661 2025-01-11T17:34:26.953025+0100
192.168.32.77:6801/4046036080 2025-01-11T14:21:01.619782+0100
192.168.32.76:6800/237383530 2025-01-11T17:44:22.000558+0100
192.168.32.88:6800/3234284305 2025-01-11T17:30:21.984212+0100
192.168.32.88:6801/3234284305 2025-01-11T17:30:21.984212+0100
192.168.32.79:6800/2508491163 2025-01-11T17:42:51.567239+0100
192.168.32.80:6800/1795253695 2025-01-11T17:32:01.876755+0100
192.168.32.72:6801/2852823438 2025-01-11T17:36:06.126374+0100
192.168.32.81:6800/2603013093 2025-01-11T18:40:25.487782+0100
192.168.32.80:6801/1795253695 2025-01-11T17:32:01.876755+0100
192.168.32.78:6800/436883098 2025-01-11T17:37:19.465988+0100
192.168.32.81:6800/1214560246 2025-01-11T20:08:00.161280+0100
192.168.32.75:6800/2981193183 2025-01-11T17:33:21.493087+0100
192.168.32.75:6801/2981193183 2025-01-11T17:33:21.493087+0100
192.168.32.81:6801/2790245426 2025-01-11T17:35:33.591919+0100
192.168.32.81:6801/803708946 2025-01-11T17:52:45.777544+0100
192.168.32.76:6800/64253026 2025-01-11T19:57:05.362481+0100
192.168.32.72:6801/1597941661 2025-01-11T17:34:26.953025+0100
192.168.32.76:6801/169617095 2025-01-11T18:04:24.331676+0100
192.168.32.81:6800/2790245426 2025-01-11T17:35:33.591919+0100
192.168.32.76:6800/1430498156 2025-01-12T01:31:20.754963+0100
192.168.32.81:6801/2603013093 2025-01-11T18:40:25.487782+0100
192.168.32.72:6800/2852823438 2025-01-11T17:36:06.126374+0100
192.168.32.76:6801/871505229 2025-01-11T21:58:22.567016+0100
192.168.32.79:6801/2508491163 2025-01-11T17:42:51.567239+0100
192.168.32.81:6800/3624695074 2025-01-11T18:20:10.720157+0100
192.168.32.78:6801/436883098 2025-01-11T17:37:19.465988+0100
192.168.32.76:6800/871505229 2025-01-11T21:58:22.567016+0100
192.168.32.81:6801/1214560246 2025-01-11T20:08:00.161280+0100
192.168.32.76:6801/237383530 2025-01-11T17:44:22.000558+0100
192.168.32.81:6800/803708946 2025-01-11T17:52:45.777544+0100
listed 44 entries

The current instance of ceph-12 does not seem to be listed though (if I interpret the nonces right); it is:

# ceph tell mds.ceph-12 status
{
    "cluster_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
    "whoami": 2,
    "id": 436483146,
    "want_state": "up:active",
    "state": "up:active",
    "fs_name": "con-fs2",
    "rank_uptime": 2408.8316334639999,
    "mdsmap_epoch": 1103281,
    "osdmap_epoch": 3235353,
    "osdmap_epoch_barrier": 3235353,
    "uptime": 2409.181400937
}

This status output was taken before the MDS stopped responding. The only change from restart to restart is that the MDS seems to load about 100,000 fewer DNS/INOS into cache. So maybe there is some progress with trimming the stray items? However, I can't do 850 restarts in this fashion; there has to be another way.
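In the meantime, these are the two things I can keep checking between restarts without the MDS being responsive (just a sketch; it assumes the stray directories are the usual ten objects 600.00000000 through 609.00000000 in the metadata pool con-fs2-meta1, as Spencer described further down, and that the number after the slash in the blocklist entries is the same nonce the fs dump shows for the MDS instance). First, cross-check whether the current ceph-12 instance is blocklisted; second, count the stray dentries per stray bucket directly in the metadata pool:

# ceph fs dump | grep ceph-12
# ceph osd blocklist ls | grep 192.168.32.76
# for i in $(seq 0 9); do echo -n "60$i.00000000: "; rados -p con-fs2-meta1 listomapkeys 60$i.00000000 | wc -l; done

If those counts drop between restarts by roughly the same amount as the DNS/INOS numbers, then the restarts are at least not completely wasted.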
I would be really grateful for any help with getting the system into a stable state for further troubleshooting. I would really like to block all client access to the fs. In addition, any hints on how to get the MDS to stay in the system and trim the stray items are dearly needed. Alternatively, is there a way to do off-line trimming?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Friday, January 10, 2025 11:32 PM
To: Frank Schilder
Cc: Bailey Allison; ceph-users@xxxxxxx
Subject: Re: Re: Help needed, ceph fs down due to large stray dir

Hi Frank,

Can you try `perf top` to find out what the ceph-mds process is doing with that CPU time?

Also Mark's profiler is super useful to find those busy loops: https://github.com/markhpc/uwpmp

Cheers, Dan

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx

On Fri, Jan 10, 2025 at 2:06 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Bailey,
>
> I already set that value very high:
>
> # ceph config get mds.ceph-12 mds_beacon_grace
> 600000.000000
>
> To no avail. The 15s heartbeat timeout comes from somewhere else. What I observe is that the MDS loads the stray buckets (up to 87 million DNS/INOS) and as soon as that has happened it seems to start doing something (RAM usage grows without the DNS/INOS changing any more). However, shortly after, the timeout happens and everything comes to a standstill. I think the MONs keep the MDS assigned but it's no longer part of the file system, or the actual MDS worker thread terminates with this timeout.
>
> It's reported as up and active, but this report seems just outdated, as all status queries to the MDS just hang. My suspicion is that the MONs don't kick it out yet (no fail-over triggered), but the rank is actually not really active. The report just doesn't update.
>
> I'm stuck here and am out of ideas what to do about it. Increasing the thread timeout would probably help, but I can't find a config option for that.
>
> I'm afraid I need to take a break. I will be looking at my e-mail in about 4h again. Would be great if there are some further ideas for how to proceed.
>
> Thanks so far and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Bailey Allison <ballison@xxxxxxxxxxxx>
> Sent: Friday, January 10, 2025 10:23 PM
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: Help needed, ceph fs down due to large stray dir
>
> Hi Frank,
>
> The value for that is mds_beacon_grace. Default is 15 but you can jack it up. Apply it to the monitor or global to take effect.
>
> Just to clarify too, does the MDS daemon come into up:active? If it does, are you able to also access that portion of the filesystem in that time?
>
> If you can access the filesystem, try running a stat on that portion with something like 'find . -ls' in a directory and see if the strays decrease.
>
> Regards,
>
> Bailey Allison
> Service Team Lead
> 45Drives, Ltd.
> 866-594-7199 x868
>
> On 1/10/25 17:18, Frank Schilder wrote:
> > Hi Bailey,
> >
> > thanks for your response. The MDS was actually unresponsive and I had to restart it (ceph tell and ceph daemon commands were hanging, except for "help"). It's currently in clientreplay and loading all the stuff again.
> > I'm really worried that this here is the rescue killer:
> >
> > heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
> >
> > Do you have any idea how to deal with this timeout? Somewhere in the process the MDS seems to become unresponsive for too long and stays unresponsive after that.
> >
> > I have 4T swap now and the MDS comes up to the point where it actually reports back a number for the stray items. However, some time after that it becomes unresponsive and the heartbeat messages start showing up. I don't know how to get past this point.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Bailey Allison <ballison@xxxxxxxxxxxx>
> > Sent: Friday, January 10, 2025 10:05 PM
> > To: ceph-users@xxxxxxx; Frank Schilder
> > Subject: Re: Re: Help needed, ceph fs down due to large stray dir
> >
> > Frank,
> >
> > You mentioned previously a large number of strays on the mds rank. Are you able to check the rank again to see how many strays there are? We've previously had a similar issue, and once the MDS came back up we had to stat the filesystem to decrease the number of strays, and in doing so everything returned to normal.
> >
> > ceph tell mds.X perf dump | jq .mds_cache
> >
> > Bailey Allison
> > Service Team Lead
> > 45Drives, Ltd.
> > 866-594-7199 x868
> >
> > On 1/10/25 16:42, Frank Schilder wrote:
> >> Hi all,
> >>
> >> I got the MDS up. However, after quite some time it's sitting with almost no CPU load:
> >>
> >> top - 21:40:02 up 2:49, 1 user, load average: 0.00, 0.02, 0.34
> >> Tasks: 606 total, 1 running, 247 sleeping, 0 stopped, 0 zombie
> >> %Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 99.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> >> GiB Mem : 503.7 total, 12.3 free, 490.3 used, 1.1 buff/cache
> >> GiB Swap: 3577.0 total, 3367.0 free, 210.0 used. 2.9 avail Mem
> >>
> >>   PID USER PR NI   VIRT    RES   SHR S %CPU %MEM    TIME+ COMMAND
> >> 59495 ceph 20  0 685.8g 477.9g  0.0g S  1.0 94.9 53:47.57 ceph-mds
> >>
> >> I'm not sure if it's doing anything at all. Only messages like these keep showing up in the log:
> >>
> >> 2025-01-10T21:38:08.459+0100 7f87ccd5f700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
> >> 2025-01-10T21:38:08.459+0100 7f87ccd5f700  0 mds.beacon.ceph-12 Skipping beacon heartbeat to monitors (last acked 3019.23s ago); MDS internal heartbeat is not healthy!
> >>
> >> The MDS cluster looks healthy from this output:
> >>
> >> # ceph fs status
> >> con-fs2 - 1554 clients
> >> =======
> >> RANK  STATE    MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
> >>  0    active  ceph-15  Reqs:  0 /s   255k   248k  5434   1678
> >>  1    active  ceph-14  Reqs:  2 /s   402k   396k  26.7k  144k
> >>  2    active  ceph-12  Reqs:  0 /s  86.9M  86.9M  46.2k  3909
> >>  3    active  ceph-08  Reqs:  0 /s   637k   630k  2663   7457
> >>  4    active  ceph-11  Reqs:  0 /s  1496k  1492k  113k   103k
> >>  5    active  ceph-16  Reqs:  2 /s   775k   769k  65.3k  12.9k
> >>  6    active  ceph-24  Reqs:  0 /s   130k   113k  7294   8670
> >>  7    active  ceph-13  Reqs: 65 /s  3619k  3609k  469k   47.2k
> >>        POOL           TYPE     USED  AVAIL
> >>   con-fs2-meta1      metadata  4078G  7269G
> >>   con-fs2-meta2        data       0   7258G
> >>   con-fs2-data         data    1225T  2476T
> >> con-fs2-data-ec-ssd    data     794G  22.6T
> >>   con-fs2-data2        data    5747T  2253T
> >> STANDBY MDS
> >>  ceph-09
> >>  ceph-10
> >>  ceph-23
> >>  ceph-17
> >> MDS version: ceph version 16.2.15 (618f440892089921c3e944a991122ddc44e60516) pacific (stable)
> >>
> >> Did it mark itself out of the cluster and is waiting for the MON to fail it??
> >> Please help.
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: Frank Schilder <frans@xxxxxx>
> >> Sent: Friday, January 10, 2025 8:51 PM
> >> To: Spencer Macphee
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: Help needed, ceph fs down due to large stray dir
> >>
> >> Hi all,
> >>
> >> I seem to have gotten the MDS up to the point that it reports stats. Does this mean anything:
> >>
> >> 2025-01-10T20:50:25.256+0100 7f87ccd5f700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
> >> 2025-01-10T20:50:25.256+0100 7f87ccd5f700  0 mds.beacon.ceph-12 Skipping beacon heartbeat to monitors (last acked 156.027s ago); MDS internal heartbeat is not healthy!
> >>
> >> I hope it doesn't get failed by some kind of timeout now.
> >>
> >> Best regards,
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> ________________________________________
> >> From: Spencer Macphee <spencerofsydney@xxxxxxxxx>
> >> Sent: Friday, January 10, 2025 7:16 PM
> >> To: Frank Schilder
> >> Cc: ceph-users@xxxxxxx
> >> Subject: Re: Help needed, ceph fs down due to large stray dir
> >>
> >> I had a similar issue some months ago that ended up using around 300 gigabytes of RAM for a similar number of strays.
> >>
> >> You can get an idea of the strays kicking around by checking the omapkeys of the stray objects in the cephfs metadata pool. Strays are tracked in objects: 600.00000000, 601.00000000, 602.00000000, etc. That would also give you an indication if it's progressing at each restart.
> >>
> >> On Fri, Jan 10, 2025 at 1:30 PM Frank Schilder <frans@xxxxxx> wrote:
> >> Hi all,
> >>
> >> we seem to have a serious issue with our file system; the ceph version is latest pacific. After a large cleanup operation we had an MDS rank with 100 million stray entries (yes, one hundred million). Today we restarted this daemon, which starts cleaning up the stray entries. It seems that this leads to a restart loop due to OOM. The rank becomes active and then starts pulling in DNS and INOS entries until all memory is exhausted.
> >>
> >> I have no idea if there is at least progress removing the stray items or if it starts from scratch every time. If it needs to pull as many DNS/INOS into cache as there are stray items, we don't have a server at hand with enough RAM.
> >>
> >> Q1: Is the MDS at least making progress in every restart iteration?
> >> Q2: If not, how do we get this rank up again?
> >> Q3: If we can't get this rank up soon, can we at least move directories away from this rank by pinning it to another rank?
> >>
> >> Currently, the rank in question reports .mds_cache.num_strays=0 in perf dump.
> >>
> >> =================
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx