Re: MDS stuck ops

Hi Reed,

Forget what I wrote about pinning; you use only 1 MDS, so it won't change anything. I think the problem you are facing is with the standby-replay daemon mode. I used that in the past too, but found that it didn't actually help with fail-over speed to begin with. On top of that, the replay does not seem to be rock-solid and ops got stuck.

In the end I reverted to simple active+standby daemons and never had problems again. My impression is that fail-over to a normal standby is actually faster than to a standby-replay daemon. I'm not sure in which scenario standby-replay improves things; I just never saw a benefit on our cluster.
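If you want to try that, it is basically a one-liner (a sketch only; the file system name cephfs is taken from your output below):

# go back to plain active+standby
ceph fs set cephfs allow_standby_replay false
# the former standby-replay daemon should now show up as a regular standby
ceph fs status cephfs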

During out-of-office hours I usually go straight for an MDS fail in case of problems. During work hours I make an attempt to be nice before failing an MDS. On our cluster, though, we have 8 active MDS daemons and everything pinned to ranks, so if I fail an MDS, only 1/8th of the users notice (except maybe for rank 0). The fail-over is usually fast enough that I don't get complaints. We have ca. 1700 kernel clients, and it takes a few minutes for the new MDS to become active.
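For reference, failing a single rank is just one command (the rank number below is only an example; with everything pinned, only the users on that rank notice):

# fail rank 5; a standby takes over that rank
ceph mds fail 5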

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Reed Dier <reed.dier@xxxxxxxxxxx>
Sent: 28 November 2022 22:43:12
To: ceph-users
Cc: Venky Shankar; Frank Schilder
Subject: Re:  MDS stuck ops

So, ironically, I did try some of these approaches here.

I first moved the nearfull goalpost to see if that made a difference; it did for client writes, but it didn't unstick the metadata ops.

I did some hunting for hung/waiting processes on some of the client nodes and was able to whack a few of those.
Then I found the stuck ops in flight, took the client IDs, and looped through with client evict, followed by 3 blocklist clears with a 1s sleep between each clear.
It got through about 6 or 7 of the clients, which appeared to handle reconnecting with the quick blocklist clear, before the MDS died and failed over to the standby-replay.
The good and bad part here is that at this point, everything unstuck.
All of the slow/stuck ops in flight disappeared, and a few stuck processes appeared to spring back to life now that I/O was flowing.
Both MDS started trimming, and all was well.
The bad part is that the “solution” appears to have been to just bounce the MDS, which didn’t instinctively feel like the right hammer to swing, but alas.
And of course I reverted the nearfull ratio afterwards.
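For completeness, the loop was roughly the following (a sketch from memory; mds1 and the two client IDs from the health warnings are only examples, and on Octopus the command is still spelled blacklist rather than blocklist):

# client IDs pulled from the stuck ops in flight
clients="2825526519 2825533964"
for id in $clients; do
    ceph tell mds.mds1 client evict id=$id
    # clear the blacklist a few times so the evicted client can reconnect quickly
    for i in 1 2 3; do
        ceph osd blacklist clear
        sleep 1
    done
done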

That said, I did upload the crash report: "2022-11-28T21:02:12.655542Z_c1fcfca7-bd08-4da8-abcd-f350cc59fb80"

Appreciate everyone’s input.

Thanks,
Reed

On Nov 28, 2022, at 1:02 PM, Frank Schilder <frans@xxxxxx> wrote:

Hi Reed,

I sometimes had stuck MDS ops as well, which makes journal trimming stop and the metadata pool slowly fill up. It's usually a race condition in the MDS op queue, and re-scheduling the ops in the MDS queue resolves it. To achieve that, I usually try the following, in escalating order (see the command sketch after the list):

- Find the client causing the oldest stuck OP. Try dropping caches on that client and/or a mount -o remount. This is the least disruptive option but does not work often. If the blocked-ops count goes down, proceed with the next client if necessary.

- Kill the process that submitted the stuck OP (a process in D-state on the client; it can be difficult to get it to die, and I usually succeed by killing its parent). If this works, it usually helps, but it does terminate a user process.

- Try to evict the client on the MDS side, but allow it to rejoin. This may require clearing the OSD blacklist quickly after eviction. It tends to help, but it might leave the client unable to rejoin, which in turn means a reboot.

- Fail the MDS with the oldest stuck OP / the dirfrag OP. This has resolved it for me in 100% of cases, but it causes a short period of FS unavailability. The newly started MDS will have to replay the entire MDS journal, which in your case is a lot. I also have the metadata pool on SSD, but I once had the pool run full and it took about 20 minutes to replay the journal (it was way over 1-2 TB by that point). In my case that didn't matter any more, as the FS was unavailable anyway.
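Roughly, in commands (just a sketch; mds.mds1 is from your status output, and the mount point, client ID and PID are placeholders you need to fill in from the blocked ops):

# 1) on the client: drop caches and/or remount the cephfs mount
echo 3 > /proc/sys/vm/drop_caches
mount -o remount /mnt/cephfs          # replace with your actual mount point

# 2) on the client: kill the D-state process owning the stuck OP (or its parent)
kill -9 <pid>

# 3) on an admin node: evict the client, then clear the blacklist quickly so it can rejoin
ceph tell mds.mds1 client evict id=<client_id>
ceph osd blacklist clear

# 4) last resort: fail the active MDS, the standby takes over after journal replay
ceph mds fail mds1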

I used to have a lot of problems with dirfrags all the time as well. They seem to cause race conditions. I got out of this by pinning directories to MDS ranks. You can find my experience in the recent thread "MDS internal op exportdir despite ephemeral pinning". Since I pinned everything, all problems are gone and performance has improved. We are also on Octopus.
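For reference, the pinning itself is just an extended attribute on the directory (the path below is a placeholder for one of your top-level directories):

# pin everything below this directory to rank 0
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects
# or spread its immediate children across ranks (ephemeral distributed pinning, as in the thread above)
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/projects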

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Reed Dier <reed.dier@xxxxxxxxxxx>
Sent: 28 November 2022 19:14:55
To: Venky Shankar
Cc: ceph-users
Subject:  Re: MDS stuck ops

Hi Venky,

Thanks for responding.

A good chunk of those are waiting for the directory to finish
fragmentation (split). I think those ops are not progressing since
fragmentation involves creating more objects in the metadata pool.

Update ops will involve appending to the mds journal consuming disk
space which you are already running out of.

So the metadata pool is on SSDs, which are not nearfull.
So I don’t believe that space should be an issue.

POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
fs-metadata            16    32   84 GiB   10.63M  251 GiB   0.74     11 TiB


But in the past I feel like all OSDs got implicated in the nearfull penalty.
Assuming that to be true, could the dirfrag split be slowed down by the nearfull sync writes?
If so, maybe moving the nearfull needle temporarily could get the dirfrag split across the finish line, and then I could retreat back to nearfull safety?
Is there a way to monitor dirfrag progress?
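Concretely, what I have in mind is something like this (0.85 being the default I would retreat back to; 0.90 is just an example value):

# temporarily raise the nearfull ratio ...
ceph osd set-nearfull-ratio 0.90
# ... re-check the stuck ops (ceph tell mds.mds1 ops), and once they clear:
ceph osd set-nearfull-ratio 0.85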

If you have snapshots that are no longer required, maybe consider
deleting those?

There are actually no snapshots on cephfs, so that shouldn’t be an issue either.

# ceph fs get cephfs
Filesystem 'cephfs' (1)
fs_name cephfs
epoch   1081642
flags   30
created 2016-12-01T12:02:37.528559-0500
modified        2022-11-28T13:03:52.630590-0500
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       0 (unknown)
last_failure    0
last_failure_osd_epoch  0
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=2824746206}
failed
damaged
stopped
data_pools      [17,37,40]
metadata_pool   16
inline_data     disabled
balancer
standby_count_wanted    1

Including the fs info in case there is a compat issue that stands out?
Only a single rank, with active/standby-replay MDS.

I also don’t have any MDS-specific configs set, outside of mds_cache_memory_limit and mds_standby_replay,
so all of the mds_bal_* values should be at their defaults.
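For what it’s worth, this is how I spot-checked that (option names as in Octopus):

# settings the MDS is actually running with (show-with-defaults lists everything)
ceph config show mds.mds1
# spot-check a couple of the balancer/fragmentation knobs
ceph config get mds mds_bal_split_size
ceph config get mds mds_bal_fragment_size_max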

Again, appreciate the pointers.

Thanks,
Reed


On Nov 28, 2022, at 11:41 AM, Venky Shankar <vshankar@xxxxxxxxxx> wrote:

On Mon, Nov 28, 2022 at 10:19 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

Hopefully someone will be able to point me in the right direction here:

Cluster is Octopus/15.2.17 on Ubuntu 20.04.
All are kernel cephfs clients, either 5.4.0-131-generic or 5.15.0-52-generic.
Cluster is nearful, and more storage is coming, but still 2-4 weeks out from delivery.

HEALTH_WARN 1 clients failing to respond to capability release; 1 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests; 2 MDSs behind on trimming; 28 nearfull osd(s); 8 pool(s) nearfull; (muted: MDS_CLIENT_RECALL POOL_TOO_FEW_PGS POOL_TOO_MANY_PGS)
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
  mds.mds1(mds.0): Client $client1 failing to respond to capability release client_id: 2825526519
[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid
  mds.mds1(mds.0): Client $client2 failing to advance its oldest client/flush tid.  client_id: 2825533964
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
  mds.mds1(mds.0): 4 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 2 MDSs behind on trimming
  mds.mds1(mds.0): Behind on trimming (13258/128) max_segments: 128, num_segments: 13258
  mds.mds2(mds.0): Behind on trimming (13260/128) max_segments: 128, num_segments: 13260
[WRN] OSD_NEARFULL: 28 nearfull osd(s)

cephfs - 121 clients
======
RANK      STATE       MDS      ACTIVITY     DNS    INOS
0        active       mds1   Reqs: 4303 /s  5905k  5880k
0-s   standby-replay   mds2   Evts:  244 /s  1483k   586k
  POOL       TYPE     USED  AVAIL
fs-metadata  metadata   243G  11.0T
 fs-hd3      data    3191G  12.0T
 fs-ec73     data     169T  25.3T
 fs-ec82     data     211T  28.9T
MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)

Pastebin of mds ops-in-flight: https://pastebin.com/5DqBDynj

A good chunk of those are waiting for the directory to finish
fragmentation (split). I think those ops are not progressing since
fragmentation involves creating more objects in the metadata pool.


I seem to have about 43 MDS ops that are just stuck and not progressing, and I’m unsure how to unstick them and get everything back to a healthy state.
Comparing the client IDs for the stuck ops against ceph tell mds.$mds client ls, I don’t see any pattern pointing to a specific problematic client or kernel version.
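(That comparison was roughly the following; the jq field names are from memory, so treat them as approximate:)

# client IDs referenced by the stuck ops
ceph tell mds.mds1 ops | jq -r '.ops[].description' | grep -o 'client\.[0-9]*' | sort -u
# connected clients with hostname and kernel version from their metadata
ceph tell mds.mds1 client ls | jq -r '.[] | "\(.id) \(.client_metadata.hostname) \(.client_metadata.kernel_version)"'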
The fs-metadata pool is on SSDs, while the data pools are on HDDs in various replication/EC configs.

I decreased mds_cache_trim_decay_rate to 0.9, but num_segments just continues to climb.
I suspect that trimming may be queued behind some operation that is stuck.
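(For reference, the change I made and how I’m watching the segment count:)

ceph config set mds mds_cache_trim_decay_rate 0.9
# the MDS_TRIM warning in ceph health detail shows whether num_segments actually shrinks
watch -n 30 'ceph health detail | grep -i trimming'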

Update ops will involve appending to the mds journal consuming disk
space which you are already running out of.


I’ve considered bumping up the nearfull ratio to see if getting out of the synchronous-writes penalty makes any difference, but I assume something may be more deeply unhappy than just that.

Appreciate any pointers anyone can give.

If you have snapshots that are no longer required, maybe consider
deleting those?


Thanks,
Reed
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



--
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



