On 4/8/24 12:32, Erich Weiler wrote:
Ah, I see. Yes, we are already running version 18.2.1 on the server side (we just installed this cluster a few weeks ago from scratch). So I guess if the fix has already been backported to that version, then we still have a problem.
Does that mean it could be the locker order bug (https://tracker.ceph.com/issues/62123) as Xiubo suggested?
I have raised a PR to fix the lock order issue; if possible, please give it
a try and see whether it resolves this issue.
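If your cluster is deployed with cephadm, one way to try such a fix is to redeploy the MDS daemons from the CI container image built for that PR once it is available. A rough sketch only (the image tag is just a placeholder for whatever build the PR's CI produces, and please double-check the exact cephadm syntax for your version; the same step would be repeated for the standby MDS daemons):

# ceph orch daemon redeploy mds.slugfs.pr-md-01.xdtppo --image quay.ceph.io/ceph-ci/ceph:<build-sha>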
Thanks
- Xiubo
Thanks again,
Erich
On Apr 7, 2024, at 9:00 PM, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
Hi Erich,
On Mon, Apr 8, 2024 at 11:51 AM Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
Hi Xiubo,
Thanks for your logs; it should be the same issue as
https://tracker.ceph.com/issues/62052. Could you try testing with this
fix again?
This sounds good - but I'm not clear on what I should do. I see a patch
on that tracker page; is that what you are referring to? If so, how
would I apply such a patch? Or is there simply a binary update I can
apply somehow to the MDS server software?
The backport of this patch (https://github.com/ceph/ceph/pull/53241)
was merged on October 18, 2023, and Ceph 18.2.1 was released on
December 18, 2023. Therefore, if you are running Ceph 18.2.1 on the
server side, you already have the fix. If you are already on version
18.2.1 or 18.2.2 (to which you should upgrade anyway) and still hit
the problem, please complain, as the purported fix is then ineffective.
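To confirm which version the daemons are actually running (the container image in use can differ from what is installed on the host), something like this should show it; the MDS daemon name below is taken from your health output:

# ceph versions
# ceph tell mds.slugfs.pr-md-01.xdtppo version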
Thanks for helping!
-erich
Please let me know if you still see this bug; if so, it should be the
locker order bug tracked at https://tracker.ceph.com/issues/62123.
Thanks
- Xiubo
On 3/28/24 04:03, Erich Weiler wrote:
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning as follows:
# ceph -s
  cluster:
    id:     58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_WARN
            1 MDSs report slow requests
            1 MDSs behind on trimming

  services:
    mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
    mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
    mds: 1/1 daemons up, 2 standby
    osd: 46 osds: 46 up (since 9h), 46 in (since 2w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1313 pgs
    objects: 260.72M objects, 466 TiB
    usage:   704 TiB used, 424 TiB / 1.1 PiB avail
    pgs:     1306 active+clean
             4    active+clean+scrubbing+deep
             3    active+clean+scrubbing

  io:
    client:   123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
And the specifics are:
# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250) max_segments: 250, num_segments: 13884
That "num_segments" number slowly keeps increasing. I suspect I just
need to tell the MDS servers to trim faster but after hours of
googling around I just can't figure out the best way to do it. The
best I could come up with was to decrease "mds_cache_trim_decay_rate"
from 1.0 to .8 (to start), based on this page:
https://www.suse.com/support/kb/doc/?id=000019740
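For reference, a minimal sketch of how the change can be applied and verified through the config database (the mds_log_max_segments line is only there because 250 is the max_segments limit shown in the warning; I haven't touched it):

# ceph config set mds mds_cache_trim_decay_rate 0.8
# ceph config show mds.slugfs.pr-md-01.xdtppo mds_cache_trim_decay_rate
# ceph config get mds mds_log_max_segments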
But it doesn't seem to help; maybe I should decrease it further? I am
guessing this must be a common issue? I am running Reef on the MDS
servers, but most clients are on Quincy.
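I can also dump the blocked requests from the active MDS if that would help; a sketch of what I believe the commands are on a cephadm deployment, run on the MDS host (pr-md-01 here):

# cephadm enter --name mds.slugfs.pr-md-01.xdtppo
# ceph daemon mds.slugfs.pr-md-01.xdtppo dump_blocked_ops
# ceph daemon mds.slugfs.pr-md-01.xdtppo dump_ops_in_flight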
Thanks for any advice!
cheers,
erich
--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx