That’s great to hear, Frank! I’m glad your CephFS is back up! Thanks
for the update!
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
with the help of Croit we got back on our feet. I will post a
detailed post-mortem later this month, including information on how
to check whether a cluster is in the same situation.
Long story short, we hit a deadlock due to competition between MDS
cache trimming and the purging of stale strays. "Disabling" cache
trimming by setting a ridiculously high mds_cache_memory_limit on the
bad rank did the trick. Purging 100 million strays is actually no
problem and doesn't require much, if any, RAM by itself (I mean the
purge that happens on MDS restart; I don't know whether the
forward-scrub purge behaves the same). Our cluster managed to purge
about 10K items/s, and after a few hours everything was cleaned out.
While purging, it was serving client IO, so the FS was usable right
away.
A big thank you to everyone who helped with this case.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Sunday, January 19, 2025 5:35 PM
To: ceph-users@xxxxxxx
Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Spencer Macphee
Subject: Re: Help needed, ceph fs down due to large stray dir
Hi all,
we took a log with debug_journaler=20 and managed to track the
deadlock down to the line
https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
in Journaler::append_entry(bufferlist& bl):
  // append
  size_t delta = bl.length() + journal_stream.get_envelope_size();
  // write_buf space is nearly full
  if (!write_buf_throttle.get_or_fail(delta)) {
    l.unlock();
    ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
    write_buf_throttle.get(delta); // <<<<<<<<< The MDS is stuck here <<<<<<<<<
    l.lock();
  }
  ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
This is indicated by the last message in the log before the lock-up,
which reads
mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
and is generated by the ldout line directly above the call to
write_buf_throttle.get(delta). All earlier log messages start with
"write_buf_throttle get, delta", which means those calls did not
enter the if-statement.
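For anyone who doesn't have the throttle semantics in their head, here
is a minimal sketch of the behaviour assumed above (illustrative only,
not the actual Ceph Throttle class): get_or_fail() takes budget without
blocking and fails if the maximum would be exceeded, while get() blocks
until enough budget has been returned via put(), which on this path
presumably only happens once buffered journal entries are flushed, and
that never happens here.

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified byte-budget throttle (illustrative only, not ceph's Throttle).
class SimpleThrottle {
  std::mutex m;
  std::condition_variable cv;
  const uint64_t max;
  uint64_t in_use = 0;

public:
  explicit SimpleThrottle(uint64_t max_bytes) : max(max_bytes) {}

  // Non-blocking: take 'delta' only if it still fits under the budget.
  bool get_or_fail(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    if (in_use + delta > max)
      return false;
    in_use += delta;
    return true;
  }

  // Blocking: wait until 'delta' fits. This is where the MDS thread in
  // the log above is parked -- nothing returns budget, so it waits forever.
  void get(uint64_t delta) {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return in_use + delta <= max; });
    in_use += delta;
  }

  // Return budget, e.g. after buffered journal entries have been flushed.
  void put(uint64_t delta) {
    std::lock_guard<std::mutex> l(m);
    in_use -= delta;
    cv.notify_all();
  }
};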
The obvious question is: which parameter influences the maximum
capacity of Journaler::write_buf
(https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306)
in the Journaler class definition? Increasing this limit should get
us past the deadlock.
Note that all the relevant code is identical to the main branch,
which means that all versions since Pacific are affected.
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Saturday, January 18, 2025 2:21 PM
To: Frédéric Nass; ceph-users@xxxxxxx
Cc: Dan van der Ster; Patrick Donnelly; Bailey Allison; Spencer Macphee
Subject: Re: Help needed, ceph fs down due to large stray dir
Hi all,
looking at the log data (see the snippet at the end), we suspect a
classic "producer–consumer" deadlock: the same thread that fills the
purge queue at PurgeQueue.cc:L335 (journaler.append_entry(bl) in
PurgeQueue::push) seems to also be responsible for processing it, but
the call to scan_stray_dir() at the end of MDCache::populate_mydir()
is atomic and blocks processing until it completes.
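To illustrate the pattern we suspect, here is a deliberately
simplified sketch in plain C++ (hypothetical names, not the actual
Ceph code): a single thread fills a size-limited queue, but the step
that would drain the queue only runs after the fill loop has
finished, so once the queue's capacity is exhausted, push() blocks
waiting for space that nothing can ever free.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded queue standing in for the purge queue (hypothetical sketch).
class BoundedQueue {
  std::mutex m;
  std::condition_variable not_full;
  std::queue<int> q;
  const size_t capacity;

public:
  explicit BoundedQueue(size_t cap) : capacity(cap) {}

  void push(int item) {
    std::unique_lock<std::mutex> l(m);
    // Blocks until someone makes room -- the analogue of the journaler's
    // write buffer filling up while entries keep being appended.
    not_full.wait(l, [this] { return q.size() < capacity; });
    q.push(item);
  }

  void drain() {
    std::lock_guard<std::mutex> l(m);
    while (!q.empty())
      q.pop();
    not_full.notify_all();
  }
};

// Stand-in for scan_stray_dir(): enqueues every stray before the drain
// step below ever gets a chance to run on this same thread.
void scan_and_enqueue(BoundedQueue &pq, size_t nstrays) {
  for (size_t i = 0; i < nstrays; ++i)
    pq.push(static_cast<int>(i)); // hangs once capacity is reached
  pq.drain();                     // never reached if nstrays > capacity
}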
I'm looking through the code right now to find which parameter is
used when constructing this queue and would allow adjusting its size.
I need to at least triple its size to get scan_stray_dir() to
complete and see what happens then.
This would, of course, go much faster if someone familiar with the
code could tell me what that parameter is.
Here is the log snippet from when the purge queue runs full:
2025-01-16T14:06:00.490+0100 7f25cfa31700 10
mds.2.cache.ino(0x20028c29e69) purge_stale_snap_data
2025-01-16T14:06:00.490+0100 7f25cfa31700 10
mds.2.cache.ino(0x20028c29e69) purging old_inode [c7d,dae]
2025-01-16T14:06:00.490+0100 7f25cfa31700 10 mds.2.cache
clear_dirty_bits_for_stray [inode 0x20028c29e69 [...daf,head]
~mds2/stray7/20028c29e69/ auth v2199183125 snaprealm=0x562c9fb6e600
f(v0 m2024-12-28T20:17:59.487395+0100) n(v0
rc2024-12-28T20:17:59.487395+0100 1=0+1) (iversion lock)
0x562c9fb70b00]
2025-01-16T14:06:00.490+0100 7f25cfa31700 20 mds.2.cache.strays
enqueue: purging dn: [dentry #0x102/stray7/20028c29e69 [daf,head]
auth (dversion lock) pv=0 v=2199200546 ino=0x20028c29e69
state=1073741844 0x562c9fb6db80]
2025-01-16T14:06:00.490+0100 7f25cfa31700 10
mds.2.cache.dir(0x61b.000100011011*) auth_pin by 0x55e0f256ff80 on
[dir 0x61b.000100011011* ~mds2/stray7/ [2,head] auth v=2199200547
cv=2199200546/2199200546 ap=1928+0 state=1610620929|complete|sticky
f(v23 m2025-01-10T13:30:41.594215+0100 2568=0+2568)/f(v23
m2025-01-10T13:30:41.594215+0100 2569=1+2568) n(v2110
rc2025-01-10T13:30:41.594215+0100 2568=0+2568)/n(v2110
rc2025-01-10T13:30:41.594215+0100 b7068 2569=1+2568)
hs=2568+0,ss=0+0 | child=1 sticky=1 dirty=1 waiter=0 authpin=1
0x562c98bf0880] count now 1928
2025-01-16T14:06:00.490+0100 7f25cfa31700 10 mds.2.cache.strays
purge [dentry #0x102/stray7/20028c29e69 [daf,head] auth (dversion
lock) pv=0 v=2199200546 ino=0x20028c29e69 state=1073741876 |
purging=1 0x562c9fb6db80] [inode 0x20028c29e69 [...daf,head]
~mds2/stray7/20028c29e69/ auth v2199183125 snaprealm=0x562c9fb6e600
f(v0 m2024-12-28T20:17:59.487395+0100) n(v0
rc2024-12-28T20:17:59.487395+0100 1=0+1) (iversion lock)
0x562c9fb70b00]
2025-01-16T14:06:00.490+0100 7f25cfa31700 4 mds.2.purge_queue push:
pushing inode 0x20028c29e69
2025-01-16T14:06:00.537+0100 7f25d7240700 20 --
[v2:192.168.32.85:6800/845135215,v1:192.168.32.85:6801/845135215] >>
v1:192.168.48.132:0/2119497078 conn(0x55e0f3d3e400
legacy=0x55e0f3d34000 unknown :6801 s=STATE_CONNECTION_ESTABLISHED
l=0).process
2025-01-16T14:06:00.537+0100 7f25d7240700 20 --
[v2:192.168.32.85:6800/845135215,v1:192.168.32.85:6801/845135215] >>
v1:192.168.48.132:0/2119497078 conn(0x55e0f3d3e400
legacy=0x55e0f3d34000 unknown :6801 s=STATE_CONNECTION_ESTABLISHED
l=0).read continue len=1
2025-01-16T14:06:00.537+0100 7f25d7240700 20 --1-
[v2:192.168.32.85:6800/845135215,v1:192.168.32.85:6801/845135215] >>
v1:192.168.48.132:0/2119497078 conn(0x55e0f3d3e400 0x55e0f3d34000
:6801 s=OPENED pgs=828965 cs=1 l=0).handle_message r=0
2025-01-16T14:06:00.537+0100 7f25d7240700 20 --1-
[v2:192.168.32.85:6800/845135215,v1:192.168.32.85:6801/845135215] >>
v1:192.168.48.132:0/2119497078 conn(0x55e0f3d3e400 0x55e0f3d34000
:6801 s=OPENED pgs=828965 cs=1 l=0).handle_message process tag 14
Best regards and have a good weekend.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx