Re: MDS Upgrade from 17.2.5 to 17.2.6 not possible

Henning Achterrath <achhen@xxxxxxxxxxx> · Wed, 24 May 2023 14:21:25 +0200

Hello again,

In two days, the stray counters (strays_created and strays_enqueued)
have each increased by about one and a half million, and the RAM usage
of the MDS remains high at about 50G. We are unsure whether this is
normal behavior.

Today:
         "num_strays": 53695,
         "num_strays_delayed": 4,
         "num_strays_enqueuing": 0,
         "strays_created": 3618390,
         "strays_enqueued": 3943542,
         "strays_reintegrated": 144545,
         "strays_migrated": 38,

On 22.05.23

ceph daemon mds.0 perf dump | grep stray
         "num_strays": 49846,
         "num_strays_delayed": 21,
         "num_strays_enqueuing": 0,
         "strays_created": 2042124,
         "strays_enqueued": 2396076,
         "strays_reintegrated": 44207,
         "strays_migrated": 38,

Maybe someone can explain to us what these counters mean in detail. The 
perf schema is not very revealing.
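
To make the comparison concrete, here is a minimal Python sketch that
computes the change between the two snapshots above. It only parses the
grep output pasted in this mail; the helper names parse_stray_counters
and counter_deltas are our own and not part of any Ceph tooling.

```python
import re

# Grep output from 22.05.23, copied from this mail.
SNAPSHOT_22_MAY = '''
"num_strays": 49846,
"num_strays_delayed": 21,
"num_strays_enqueuing": 0,
"strays_created": 2042124,
"strays_enqueued": 2396076,
"strays_reintegrated": 44207,
"strays_migrated": 38,
'''

# Grep output from 24.05.23, copied from this mail.
SNAPSHOT_24_MAY = '''
"num_strays": 53695,
"num_strays_delayed": 4,
"num_strays_enqueuing": 0,
"strays_created": 3618390,
"strays_enqueued": 3943542,
"strays_reintegrated": 144545,
"strays_migrated": 38,
'''


def parse_stray_counters(text):
    """Parse lines of the form '"name": 123,' into a dict of ints."""
    return {name: int(value)
            for name, value in re.findall(r'"(\w+)":\s*(\d+)', text)}


def counter_deltas(before, after):
    """Return after - before for every counter present in both snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


deltas = counter_deltas(parse_stray_counters(SNAPSHOT_22_MAY),
                        parse_stray_counters(SNAPSHOT_24_MAY))

# Print one signed difference per counter.
for name, delta in deltas.items():
    print(f"{name}: {delta:+d}")
```

Run as-is, it prints one signed difference per counter, which is where
the figure of about one and a half million for strays_created and
strays_enqueued comes from.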

Our idea is to temporarily add a standby-replay (hot-standby) MDS to
ensure the journal is replayable before we resume the upgrade.

I would be grateful for any advice.

best regards
Henning

On 23.05.23 17:24, Henning Achterrath wrote:
In addition, I would like to mention that "strays_created" also keeps
increasing after this action, but num_strays is lower now. If desired,
we can provide debug logs from the MDS for the time when it was in the
stopping state and we did a systemctl restart mds1.

The only active MDS has a RAM usage of about 50G. The memory limit is
32G, but we get no warnings about that. Maybe the separate purge_queue
is consuming a lot of RAM and does not count toward the limit? Usually
we get notified when the MDS exceeds the memory limit.

thank you

On 22.05.23 15:23, T.Kulschewski@xxxxxxxxxxx wrote:
Hi Venky,

thank you for your help. We managed to shut down mds.1:
We set "ceph fs set max_mds 1" and waited for about 30 minutes. In the
first couple of minutes, strays were migrated from mds.1 to mds.0. After
this, the stray export hung and mds.1 remained in the stopping state.
After about 30 minutes, we restarted mds.1. This resulted in one
active MDS and two standby MDSs. However, we are not sure whether the
remaining strays were migrated.

When we had a closer look at the perf counters of the MDS, we realized
that strays_enqueued is quite high and constantly increasing. Is this
to be expected? What does the counter "strays_enqueued" mean in detail?

ceph daemon mds.0 perf dump | grep stray
         "num_strays": 49846,
         "num_strays_delayed": 21,
         "num_strays_enqueuing": 0,
         "strays_created": 2042124,
         "strays_enqueued": 2396076,
         "strays_reintegrated": 44207,
         "strays_migrated": 38,

Would it be safe to perform "ceph orch upgrade resume" at this point? 
At the moment, the MONs and OSDs are running 17.2.6, while the MDSs 
and RGWs are running 17.2.5. So we have to upgrade the MDS and RGW 
eventually.
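
A minimal Python sketch of how the mixed-version state could be checked
before resuming. It assumes the JSON layout printed by "ceph versions"
(daemon type mapped to version string mapped to daemon count); the
helper name daemons_behind is our own, and the counts and shortened
version strings in the example are illustrative placeholders, not taken
from our cluster.

```python
def daemons_behind(versions, target):
    """Return {daemon_type: count} for daemons not yet on `target`.

    `versions` is assumed to have the shape of parsed `ceph versions`
    output; the aggregated "overall" entry is skipped.
    """
    behind = {}
    for daemon_type, by_version in versions.items():
        if daemon_type == "overall":
            continue
        # Count daemons whose version string does not mention the target.
        stale = sum(count for version, count in by_version.items()
                    if target not in version)
        if stale:
            behind[daemon_type] = stale
    return behind


# Illustrative placeholder data (assumption): MONs and OSDs on 17.2.6,
# MDSs and RGWs still on 17.2.5. Counts are made up for the example.
example = {
    "mon": {"ceph version 17.2.6": 3},
    "osd": {"ceph version 17.2.6": 12},
    "mds": {"ceph version 17.2.5": 3},
    "rgw": {"ceph version 17.2.5": 2},
    "overall": {"ceph version 17.2.6": 15, "ceph version 17.2.5": 5},
}

remaining = daemons_behind(example, "17.2.6")
print(remaining)
```

With the placeholder data, only the MDS and RGW entries are reported as
still pending, which matches the state described above.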

Best, Tobias
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
