Update on "ceph -s". A machine was in the process of crashing when I took the
original snapshot. Here it is after the reboot:

[root@dell02 ~]# ceph -s
  cluster:
    id:     278fcd86-0861-11ee-a7df-9c5c8e86cf8f
    health: HEALTH_WARN
            1 filesystem is degraded
            25 client(s) laggy due to laggy OSDs

  services:
    mon: 3 daemons, quorum dell02,www7,ceph03 (age 8m)
    mgr: ceph08.tlocfi(active, since 81m), standbys: www7.rxagfn, dell02.odtbqw
    mds: 1/1 daemons up, 2 standby
    osd: 7 osds: 7 up (since 12h), 7 in (since 19h); 308 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   22 pools, 681 pgs
    objects: 125.10k objects, 36 GiB
    usage:   91 GiB used, 759 GiB / 850 GiB avail
    pgs:     47772/369076 objects misplaced (12.944%)
             373 active+clean
             308 active+clean+remapped

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
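
For anyone who lands in the same spot after a clock-drift incident, the checks
below are the sort of thing I'd run; they are all stock chrony and ceph CLI.
The only piece specific to my setup is the filesystem name "ceefs", and the MDS
option name at the end is from my reading of the docs, so verify it exists on
your release before relying on it:

    # is chrony actually synchronized on each host, now that classic NTP is gone?
    chronyc tracking
    chronyc sources -v

    # how far apart do the monitors think the clocks are?
    ceph time-sync-status

    # watch OSD latency, the degraded filesystem, and the misplaced objects recover
    ceph osd perf
    ceph fs status ceefs
    ceph pg stat

    # if the laggy-client warning outlives the laggy OSDs, this is the MDS
    # setting that defers eviction (name per recent docs; check your release)
    ceph config get mds defer_client_eviction_on_laggy_osds
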
On Sat, 2024-07-27 at 11:31 -0400, Tim Holloway wrote:
> I was in the middle of tuning my OSDs when lightning blew me off the
> Internet. Had to wait 5 days for my ISP to send a tech and replace a
> fried cable. In the meantime, among other things, I had some serious
> time drift between servers thanks to the OS upgrades replacing NTP
> with chrony and me not having thought to re-establish a master
> in-house timeserver.
>
> Ceph tried really hard to keep up with all that, but eventually it
> was just too much. Now I've got an offline filesystem and apparently
> it's stuck trying to get back online again.
>
> The forensics:
> [ceph: root@www7 /]# ceph -s
>   cluster:
>     id:     278fcd86-0861-11ee-a7df-9c5c8e86cf8f
>     health: HEALTH_WARN
>             failed to probe daemons or devices
>             1 filesystem is degraded
>             1/3 mons down, quorum www7,ceph03
>
>   services:
>     mon: 3 daemons, quorum www7,ceph03 (age 2m), out of quorum: dell02
>     mgr: ceph08.tlocfi(active, since 58m), standbys: dell02.odtbqw, www7.rxagfn
>     mds: 1/1 daemons up, 1 standby
>     osd: 7 osds: 7 up (since 12h), 7 in (since 18h); 308 remapped pgs
>     rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
>     volumes: 0/1 healthy, 1 recovering
>     pools:   22 pools, 681 pgs
>     objects: 125.10k objects, 36 GiB
>     usage:   91 GiB used, 759 GiB / 850 GiB avail
>     pgs:     47772/369076 objects misplaced (12.944%)
>              373 active+clean
>              308 active+clean+remapped
>
>   io:
>     client:   170 B/s rd, 0 op/s rd, 0 op/s wr
>
> [ceph: root@www7 /]# ceph health detail
> HEALTH_WARN 1 filesystem is degraded; 25 client(s) laggy due to laggy OSDs
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs ceefs is degraded
> [WRN] MDS_CLIENTS_LAGGY: 25 client(s) laggy due to laggy OSDs
>     mds.ceefs.www7.drnuyi(mds.0): Client 14019719 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14124385 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14144243 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14144375 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14224103 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14224523 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14234194 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14234545 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14236841 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14237837 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14238536 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14244124 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14264236 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14266870 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14294170 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14294434 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14296012 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14304212 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14316057 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14318379 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14325518 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14328956 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14334283 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14336104 is laggy; not evicted because some OSD(s) is/are laggy
>     mds.ceefs.www7.drnuyi(mds.0): Client 14374237 is laggy; not evicted because some OSD(s) is/are laggy
>
> [ceph: root@www7 /]# ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
>  -1         2.79994  root default
> -25         0.15999      host ceph01
>   1    hdd  0.15999          osd.1        up   0.15999  1.00000
> -28         1.15999      host ceph03
>   3    hdd  0.15999          osd.3        up   0.15999  1.00000
>   5    hdd  1.00000          osd.5        up   1.00000  1.00000
>  -9         0.15999      host ceph06
>   2    hdd  0.15999          osd.2        up   0.15999  1.00000
>  -3         0.15999      host ceph07
>   6    hdd  0.15999          osd.6        up   0.15999  1.00000
>  -6         1.00000      host ceph08
>   4    hdd  1.00000          osd.4        up   1.00000  1.00000
>  -7         0.15999      host www7
>   0    hdd  0.15999          osd.0        up   0.15999  1.00000
>
> [ceph: root@www7 /]# ceph pg stat
> 681 pgs: 373 active+clean, 308 active+clean+remapped; 36 GiB data,
> 91 GiB used, 759 GiB / 850 GiB avail; 255 B/s rd, 0 op/s;
> 47772/369073 objects misplaced (12.944%)

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx