Hi Frank,
I asked the same question 4 years ago [1]. Basically, Greg's response was:
So, this is actually just noisy logging from the client processing
an OSDMap. That should probably be turned down, as it's not really
an indicator of...anything...as far as I can tell.
IIRC, clients sometimes notice changes in the osdmap with some delay
(if they didn't need to update the osdmap earlier), so I just ignore
these messages as long as the cluster is otherwise healthy and the
clients work as expected. My conclusion back then is in [2].
[1] https://www.spinics.net/lists/ceph-users/msg47279.html
[2] https://www.spinics.net/lists/ceph-users/msg47502.html
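If you want to verify that a client is merely catching up on old
osdmap epochs, you can compare its current epoch with the cluster's.
A minimal Python sketch, assuming debugfs is mounted at
/sys/kernel/debug, the ceph CLI is available on the host, and the
first line of the client's debugfs osdmap file starts with
"epoch <N>" (the exact format varies with the kernel version):

#!/usr/bin/env python3
# Compare each kernel client's osdmap epoch (debugfs) with the
# cluster's current epoch. Requires root for debugfs access.
import glob
import json
import subprocess

def client_epochs():
    for path in glob.glob("/sys/kernel/debug/ceph/*/osdmap"):
        with open(path) as f:
            first = f.readline().split()
        # expected first line: "epoch <N> ..." (may vary by kernel)
        if first and first[0] == "epoch":
            yield path, int(first[1])

def cluster_epoch():
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    return json.loads(out)["epoch"]

cur = cluster_epoch()
for path, ep in client_epochs():
    print(f"{path}: client epoch {ep}, cluster epoch {cur}, lag {cur - ep}")

A large lag that shrinks over time would fit the "client replaying
old map epochs" explanation above.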
Quoting Frank Schilder <frans@xxxxxx>:
Hi all,
on our ceph fs kernel clients we see bursts of messages like these:
...
[Mon Dec 19 09:43:15 2022] libceph: osd1258 weight 0x10000 (in)
[Mon Dec 19 09:43:15 2022] libceph: osd1258 up
[Mon Dec 19 09:43:15 2022] libceph: osd1259 weight 0x10000 (in)
[Mon Dec 19 09:43:15 2022] libceph: osd1259 up
[Mon Dec 19 09:43:16 2022] libceph: osd0 down
[Mon Dec 19 09:43:16 2022] libceph: osd0 up
[Mon Dec 19 09:43:16 2022] libceph: osd0 down
[Mon Dec 19 09:43:16 2022] libceph: osd0 up
...
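For reference, the weight that libceph prints is the raw 16.16
fixed-point value from the osdmap, so 0x10000 is simply weight 1.0,
i.e. the OSD is fully "in". A quick check in Python:

# osdmap weights are 16.16 fixed point: 0x10000 == 1.0 (fully "in")
def osd_weight(raw: int) -> float:
    return raw / 0x10000

print(osd_weight(0x10000))  # -> 1.0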
However, no OSDs have actually gone up or down since Wednesday last
week. What is libceph reporting here?
The cluster has been rebalancing since last Wednesday, when we added
new disks. No daemon has gone down since then:
# ceph status
  cluster:
    id:     ###
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 12d)
    mgr: ceph-25 (active, since 11w), standbys: ceph-03, ceph-02, ceph-01, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1260 up (since 6d), 1260 in (since 6d); 2342 remapped pgs

  task status:

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.53G objects, 2.8 PiB
    usage:   3.4 PiB used, 9.7 PiB / 13 PiB avail
    pgs:     1158282360/13309135349 objects misplaced (8.703%)
             22704 active+clean
             2261  active+remapped+backfill_wait
             81    active+remapped+backfilling
             16    active+clean+snaptrim
             3     active+clean+scrubbing+deep

  io:
    client:   120 MiB/s rd, 175 MiB/s wr, 1.46k op/s rd, 2.23k op/s wr
    recovery: 2.3 GiB/s, 850 objects/s
We are investigating why some jobs on our HPC cluster get stuck
after the job ends. These messages look somewhat suspicious, and we
wonder whether they have anything to do with the ceph client/fs. The
cluster has been healthy the whole time.
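If the hangs are caused by stuck ceph I/O, the kernel client's
debugfs files are a good first place to look. A minimal sketch,
assuming debugfs is mounted at /sys/kernel/debug and run as root;
the file format differs between kernel versions, but non-empty
request sections while a job is stuck point at hung client requests:

#!/usr/bin/env python3
# Dump the pending OSD (osdc) and MDS (mdsc) request lists of all
# kernel ceph clients from debugfs. Requires root; the exact file
# format varies by kernel version.
import glob

for name in ("osdc", "mdsc"):
    for path in sorted(glob.glob(f"/sys/kernel/debug/ceph/*/{name}")):
        with open(path) as f:
            content = f.read().strip()
        print(f"== {path} ==")
        print(content if content else "(no pending requests)")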
Best regards and thanks for pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx