Hi Eugen, thanks! I think this explains our observation. Thanks and merry Christmas!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 21 December 2022 14:03:06
To: ceph-users@xxxxxxx
Subject: Re: libceph: osdXXX up/down all the time

Hi Frank,

I asked the same question 4 years ago [1]. Basically, Greg's response was:

> So, this is actually just noisy logging from the client processing
> an OSDMap. That should probably be turned down, as it's not really
> an indicator of...anything...as far as I can tell.

IIRC, clients sometimes notice changes in the osdmap with some delay (if they didn't need to update the osdmap earlier), so I just ignore these messages if the cluster is otherwise healthy and the clients work as expected. My conclusion back then is in [2].

[1] https://www.spinics.net/lists/ceph-users/msg47279.html
[2] https://www.spinics.net/lists/ceph-users/msg47502.html

Quoting Frank Schilder <frans@xxxxxx>:

> Hi all,
>
> on ceph fs kernel clients we see a lot of messages of this kind, in bursts:
>
> ...
> [Mon Dec 19 09:43:15 2022] libceph: osd1258 weight 0x10000 (in)
> [Mon Dec 19 09:43:15 2022] libceph: osd1258 up
> [Mon Dec 19 09:43:15 2022] libceph: osd1259 weight 0x10000 (in)
> [Mon Dec 19 09:43:15 2022] libceph: osd1259 up
> [Mon Dec 19 09:43:16 2022] libceph: osd0 down
> [Mon Dec 19 09:43:16 2022] libceph: osd0 up
> [Mon Dec 19 09:43:16 2022] libceph: osd0 down
> [Mon Dec 19 09:43:16 2022] libceph: osd0 up
> ...
>
> However, no OSDs have actually gone up or down since Wednesday last
> week. What is libceph actually reporting here?
>
> The cluster has been rebalancing since last Wednesday, when we added
> new disks.
> There has not been any daemon down since then:
>
> # ceph status
>   cluster:
>     id:     ###
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 12d)
>     mgr: ceph-25(active, since 11w), standbys: ceph-03, ceph-02,
>          ceph-01, ceph-26
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1260 osds: 1260 up (since 6d), 1260 in (since 6d); 2342 remapped pgs
>
>   task status:
>
>   data:
>     pools:   14 pools, 25065 pgs
>     objects: 1.53G objects, 2.8 PiB
>     usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
>     pgs:     1158282360/13309135349 objects misplaced (8.703%)
>              22704 active+clean
>              2261  active+remapped+backfill_wait
>              81    active+remapped+backfilling
>              16    active+clean+snaptrim
>              3     active+clean+scrubbing+deep
>
>   io:
>     client:   120 MiB/s rd, 175 MiB/s wr, 1.46k op/s rd, 2.23k op/s wr
>     recovery: 2.3 GiB/s, 850 objects/s
>
> We are investigating why some jobs on our HPC cluster get stuck after
> the job ends. These messages look somewhat suspicious, and we wonder
> whether they have anything to do with the ceph client/fs.
>
> The cluster has been healthy the whole time.
>
> Best regards and thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
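[Editor's note: when triaging bursts like the ones quoted above, it can help to summarize how often each OSD toggled state in the client's kernel log before concluding nothing actually flapped. A minimal sketch: on a real client you would pipe `dmesg | grep libceph:` into the awk filter; here the sample lines from the thread are inlined so the snippet is self-contained.]

```shell
# Count libceph up/down transitions per OSD as seen by this client.
# The printf block stands in for `dmesg | grep 'libceph:'` output.
printf '%s\n' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 down' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 up' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 down' \
  '[Mon Dec 19 09:43:16 2022] libceph: osd0 up' |
awk '/libceph: osd[0-9]+ (up|down)$/ {
       # key on "osdN up" / "osdN down" and tally occurrences
       n[$(NF-1) " " $NF]++
     }
     END { for (k in n) print k, n[k] }' | sort
```

If the counts here are large while `ceph osd dump` shows no recent `down_at` epochs for those OSDs, that supports the explanation above: the messages are client-side osdmap processing noise, not real daemon flaps.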