Have you been able to start the VMs now?
Are you using RBD, or are the VMs hosted on CephFS?
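If you are not sure, something along these lines should show it (pool and
filesystem names are taken from your earlier mails, adjust as needed):

ceph osd pool ls detail
rbd ls -p hdd
ceph fs ls

RBD-backed VM disks show up as images in the pool, while CephFS-backed disks
live on the pve_cephfs data pool.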
---- On Mon, 23 Sep 2019 14:16:47 +0800 Thomas Schneider <74cmonty@xxxxxxxxx> wrote ----
Hi,
Currently ceph -s is not reporting any unknown PGs.
The following flags are set: nobackfill, norebalance, norecover.
There are no PGs stuck in peering either, and there is very little traffic
on the Ceph network.
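For reference, stuck or inactive PGs can be double-checked with commands
along these lines (output will obviously vary):

ceph health detail
ceph pg dump_stuck inactive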
This is the output of today:
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem has a failed mds daemon
1 filesystem is offline
insufficient standby MDS daemons available
nobackfill,norebalance,norecover flag(s) set
83 nearfull osd(s)
1 pool(s) nearfull
Degraded data redundancy: 360047/153249771 objects degraded (0.235%), 78 pgs degraded, 81 pgs undersized
Degraded data redundancy (low space): 265 pgs backfill_toofull
3 pools have too many placement groups
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 109m)
mgr: ld5505(active, since 2d), standbys: ld5506, ld5507
mds: pve_cephfs:0/1, 1 failed
osd: 368 osds: 368 up, 367 in; 398 remapped pgs
flags nobackfill,norebalance,norecover
data:
pools: 5 pools, 8868 pgs
objects: 51.08M objects, 195 TiB
usage: 590 TiB used, 562 TiB / 1.1 PiB avail
pgs: 360047/153249771 objects degraded (0.235%)
1998603/153249771 objects misplaced (1.304%)
8469 active+clean
124 active+remapped+backfill_toofull
83 active+remapped+backfill_wait+backfill_toofull
77 active+remapped+backfill_wait
45 active+undersized+degraded+remapped+backfill_toofull
33 active+remapped+backfilling
13 active+undersized+degraded+remapped+backfill_wait+backfill_toofull
11 active+undersized+degraded+remapped+backfilling
4 active+undersized+degraded+remapped+backfill_wait
4 active+recovery_wait+undersized+degraded+remapped
3 active+recovering+undersized+remapped
1 active+recovering+undersized+degraded+remapped
1 active+recovering
io:
client: 5.6 KiB/s wr, 0 op/s rd, 0 op/s wr
progress:
Rebalancing after osd.9 marked out
[====================..........]
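(Side note on the offline filesystem: with "mds: pve_cephfs:0/1, 1 failed"
and no standbys, CephFS will stay down until an MDS daemon is running again.
A rough sketch, assuming the MDS lives on ld3955 as in earlier output and the
cluster uses systemd units:

systemctl status ceph-mds@ld3955
systemctl restart ceph-mds@ld3955
ceph fs status pve_cephfs

Once a standby registers, it should take over rank 0 automatically.)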
Two days earlier, the output was:
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem has a failed mds daemon
1 filesystem is offline
insufficient standby MDS daemons available
nobackfill,norebalance,norecover flag(s) set
2 backfillfull osd(s)
86 nearfull osd(s)
1 pool(s) backfillfull
Reduced data availability: 75 pgs inactive, 74 pgs peering
Degraded data redundancy: 364117/154942251 objects degraded (0.235%), 76 pgs degraded, 76 pgs undersized
Degraded data redundancy (low space): 309 pgs backfill_toofull
3 pools have too many placement groups
105 slow requests are blocked > 32 sec
91 stuck requests are blocked > 4096 sec
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 9h)
mgr: ld5505(active, since 9h), standbys: ld5506, ld5507
mds: pve_cephfs:0/1, 1 failed
osd: 368 osds: 368 up, 367 in; 400 remapped pgs
flags nobackfill,norebalance,norecover
data:
pools: 5 pools, 8868 pgs
objects: 51.65M objects, 197 TiB
usage: 596 TiB used, 554 TiB / 1.1 PiB avail
pgs: 0.011% pgs unknown
0.834% pgs not active
364117/154942251 objects degraded (0.235%)
2003579/154942251 objects misplaced (1.293%)
8395 active+clean
209 active+remapped+backfill_toofull
69 active+undersized+degraded+remapped+backfill_toofull
49 active+remapped+backfill_wait
37 peering
37 remapped+peering
31 active+remapped+backfill_wait+backfill_toofull
17 active+clean+scrubbing+deep
14 active+clean+scrubbing
4 active+undersized+degraded
2 active+undersized+degraded+remapped+backfill_wait
1 unknown
1 active+clean+remapped
1 active+remapped+backfilling
1 active+undersized+degraded+remapped+backfilling
io:
client: 256 MiB/s wr, 0 op/s rd, 99 op/s wr
progress:
Rebalancing after osd.9 marked out
[==............................]
As you can see, there is very little progress in "Rebalancing after osd.9
marked out": the number of degraded objects dropped only to 360047/153249771,
and the number of misplaced objects to 1998603/153249771.
That is very little movement over two days with no client activity (over
the weekend) on the cluster.
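One thing worth noting: with nobackfill, norebalance and norecover set,
recovery and backfill are deliberately paused, so little movement is expected
until the flags are cleared. A minimal sketch for when backfill should
actually be allowed to resume:

ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nobackfill

The backfill_toofull PGs will also stay put while so many OSDs are nearfull;
freeing or adding capacity, or carefully and temporarily raising the ratios
via ceph osd set-backfillfull-ratio / ceph osd set-nearfull-ratio, is usually
needed as well.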
On 21.09.2019 at 20:39, Paul Emmerich wrote:
> On Sat, Sep 21, 2019 at 6:47 PM Thomas <74cmonty@xxxxxxxxx> wrote:
>> Hello,
>>
>> I have re-created the OSDs using these disks.
>> Can I still export the affected PGs manually?
> No, the data probably cannot be recovered in this case :(
> (It might still be somewhere on the disk if it hasn't been overwritten
> yet, but it's virtually impossible to recover it: the metadata has
> almost certainly long been overwritten)
>
> But that's only for 17 PGs showing up as unknown. The ones stuck in
> peering can probably be revived but without the latest writes. Can you
> run "ceph pg X.YX query" on one of the PGs stuck in peering?
> It might tell you what's wrong and how to proceed.
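> As a concrete (hypothetical) example, with a PG id taken from the health
> output:
>
> ceph pg 1.2f3 query | jq '.recovery_state'
>
> The "blocked_by" and "down_osds_we_would_probe" entries in there usually
> name the OSDs the PG is waiting for (jq is optional, it just trims the
> output).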
>
> But even 17 PGs will probably affect almost all of your VMs/containers...
>
> Paul
>
>> Regards
>> Thomas
>>
>>
>> On 20.09.19 at 21:15, Paul Emmerich wrote:
>>> On Fri, Sep 20, 2019 at 1:31 PM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> I cannot get rid of
>>>> pgs unknown
>>>> because there were 3 disks that couldn't be started.
>>>> Therefore I destroyed the relevant OSDs and re-created them on those
>>>> disks.
>>> and you had it configured to run with replica 3? Well, I guess the
>>> down PGs were located on these three disks that you wiped.
>>>
>>> Do you still have the disks? Use ceph-objectstore-tool to export the
>>> affected PGs manually and inject them into another OSD.
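>>> A rough sketch of that procedure (OSD ids, PG id and paths are
>>> placeholders; the source and target OSD daemons must be stopped while
>>> ceph-objectstore-tool runs):
>>>
>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<old-id> \
>>>     --pgid <pg-id> --op export --file /tmp/<pg-id>.export
>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target-id> \
>>>     --op import --file /tmp/<pg-id>.export
>>>
>>> After restarting the target OSD, the cluster should pick the copy up and
>>> peer it.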
>>>
>>>
>>> Paul
>>>
>>>> Then I added the 3 OSDs to crushmap.
>>>>
>>>> Regards
>>>> Thomas
>>>>
>>>> On 20.09.2019 at 08:19, Ashley Merrick wrote:
>>>>> You need to fix this first.
>>>>>
>>>>> pgs: 0.056% pgs unknown
>>>>> 0.553% pgs not active
>>>>>
>>>>> The backfilling will cause slow I/O, but having PGs unknown and not
>>>>> active will cause I/O blocking, which is what you're seeing with the VMs booting.
>>>>>
>>>>> It seems you have 4 OSDs down; if you get them back online you should be
>>>>> able to get all the PGs online.
>>>>>
>>>>>
---- On Fri, 20 Sep 2019 14:14:01 +0800 Thomas <74cmonty@xxxxxxxxx> wrote ----
>>>>>
>>>>> Hi,
>>>>>
>>>>> here I describe one of the two major issues I'm currently facing in my
>>>>> 8-node Ceph cluster (2x MDS, 6x OSD).
>>>>>
>>>>> The issue is that I cannot start any virtual machine (KVM) or container
>>>>> (LXC); the boot process just hangs after a few seconds.
>>>>> All these KVMs and LXCs have in common that their virtual disks reside
>>>>> in the same pool: hdd.
>>>>>
>>>>> This pool hdd is relatively small compared to the largest pool:
>>>>> hdb_backup
>>>>> root@ld3955:~# rados df
>>>>> POOL_NAME            USED     OBJECTS   CLONES  COPIES     MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD        WR_OPS     WR       USED COMPR  UNDER COMPR
>>>>> backup               0 B      0         0       0          0                   0        0         0          0 B       0          0 B      0 B         0 B
>>>>> hdb_backup           589 TiB  51262212  0       153786636  0                   0        124895    12266095   4.3 TiB   247132863  463 TiB  0 B         0 B
>>>>> hdd                  3.2 TiB  281884    6568    845652     0                   0        1658      275277357  16 TiB    208213922  10 TiB   0 B         0 B
>>>>> pve_cephfs_data      955 GiB  91832     0       275496     0                   0        3038      2103       1021 MiB  102170     318 GiB  0 B         0 B
>>>>> pve_cephfs_metadata  486 MiB  62        0       186        0                   0        7         860        1.4 GiB   12393      166 MiB  0 B         0 B
>>>>>
>>>>> total_objects 51635990
>>>>> total_used 597 TiB
>>>>> total_avail 522 TiB
>>>>> total_space 1.1 PiB
>>>>>
>>>>> This is the current health status of the ceph cluster:
>>>>> cluster:
>>>>> id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>>>>> health: HEALTH_ERR
>>>>> 1 filesystem is degraded
>>>>> 1 MDSs report slow metadata IOs
>>>>> 1 backfillfull osd(s)
>>>>> 87 nearfull osd(s)
>>>>> 1 pool(s) backfillfull
>>>>> Reduced data availability: 54 pgs inactive, 47 pgs peering, 1 pg stale
>>>>> Degraded data redundancy: 129598/154907946 objects degraded (0.084%), 33 pgs degraded, 33 pgs undersized
>>>>> Degraded data redundancy (low space): 322 pgs backfill_toofull
>>>>> 1 subtrees have overcommitted pool target_size_bytes
>>>>> 1 subtrees have overcommitted pool target_size_ratio
>>>>> 1 pools have too many placement groups
>>>>> 21 slow requests are blocked > 32 sec
>>>>>
>>>>> services:
>>>>> mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 14h)
>>>>> mgr: ld5507(active, since 16h), standbys: ld5506, ld5505
>>>>> mds: pve_cephfs:1/1 {0=ld3955=up:replay} 1 up:standby
>>>>> osd: 360 osds: 356 up, 356 in; 382 remapped pgs
>>>>>
>>>>> data:
>>>>> pools: 5 pools, 8868 pgs
>>>>> objects: 51.64M objects, 197 TiB
>>>>> usage: 597 TiB used, 522 TiB / 1.1 PiB avail
>>>>> pgs: 0.056% pgs unknown
>>>>> 0.553% pgs not active
>>>>> 129598/154907946 objects degraded (0.084%)
>>>>> 2211119/154907946 objects misplaced (1.427%)
>>>>> 8458 active+clean
>>>>> 298 active+remapped+backfill_toofull
>>>>> 29 remapped+peering
>>>>> 24 active+undersized+degraded+remapped+backfill_toofull
>>>>> 22 active+remapped+backfill_wait
>>>>> 17 peering
>>>>> 5 unknown
>>>>> 5 active+recovery_wait+undersized+degraded+remapped
>>>>> 3 active+undersized+degraded+remapped+backfill_wait
>>>>> 2 activating+remapped
>>>>> 1 active+clean+remapped
>>>>> 1 stale+peering
>>>>> 1 active+remapped+backfilling
>>>>> 1 active+recovering+undersized+remapped
>>>>> 1 active+recovery_wait+degraded
>>>>>
>>>>> io:
>>>>> client: 9.2 KiB/s wr, 0 op/s rd, 1 op/s wr
>>>>>
>>>>> I believe the cluster is busy rebalancing pool hdb_backup.
>>>>> I set the balancer mode to upmap recently, after the 589 TiB of data had
>>>>> been written.
>>>>> root@ld3955:~# ceph balancer status
>>>>> {
>>>>> "active": true,
>>>>> "plans": [],
>>>>> "mode": "upmap"
>>>>> }
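>>>>> (For reference, how uneven the distribution currently looks from the
>>>>> balancer's point of view can be checked with e.g.:
>>>>>
>>>>> ceph balancer eval
>>>>> ceph osd df tree
>>>>>
>>>>> Note that upmap only optimizes placement; the resulting moves still go
>>>>> through normal backfill.)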
>>>>>
>>>>>
>>>>> In order to resolve the issue with pool hdd I started some investigation.
>>>>> The first step was to install the drivers for the NIC provided by Mellanox.
>>>>> Then I configured some kernel parameters recommended by Mellanox
>>>>> <https://community.mellanox.com/s/article/linux-sysctl-tuning>.
>>>>>
>>>>> However, this didn't fix the issue.
>>>>> In my opinion I must get rid of all the "slow requests are blocked" warnings.
>>>>>
>>>>> When I check the output of ceph health detail, every OSD listed under
>>>>> REQUEST_SLOW belongs to pool hdd.
>>>>> This means none of the disks belonging to pool hdb_backup is showing
>>>>> comparable behaviour.
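>>>>> (One way to double-check that mapping, with osd.37 just as an example
>>>>> id:
>>>>>
>>>>> ceph health detail | grep -i slow
>>>>> ceph pg ls-by-osd osd.37
>>>>>
>>>>> The number before the dot in each PG id is the pool id, which
>>>>> "ceph osd pool ls detail" maps back to a pool name.)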
>>>>>
>>>>> Then I checked the running processes on the different OSD nodes; I use
>>>>> the tool "glances" here.
>>>>> Here I can see individual ceph-osd processes that have been running for
>>>>> hours and are consuming a lot of CPU, e.g.
>>>>> 66.8 0.2 2.13G 1.17G 1192756 ceph 17h8:33 58 0 S 41M 2K /usr/bin/ceph-osd -f --cluster ceph --id 37 --setuser ceph --setgroup ceph
>>>>> 34.2 0.2 4.31G 1.20G 971267 ceph 15h38:46 58 0 S 14M 3K /usr/bin/ceph-osd -f --cluster ceph --id 73 --setuser ceph --setgroup ceph
>>>>>
>>>>> Similar processes are running on 4 OSD nodes.
>>>>> All processes have in common that the relevant OSD belongs to pool hdd.
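>>>>> (What such an OSD is actually busy with can be inspected via its admin
>>>>> socket on the node, e.g. for osd.37 from the process list above:
>>>>>
>>>>> ceph daemon osd.37 dump_ops_in_flight
>>>>> ceph daemon osd.37 dump_historic_ops
>>>>> )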
>>>>>
>>>>> Furthermore, glances gives me this alert:
>>>>> CRITICAL on CPU_IOWAIT (Min:1.9 Mean:2.3 Max:2.6): ceph-osd, ceph-osd, ceph-osd
>>>>>
>>>>> What can / should I do now?
>>>>> Kill the long-running processes?
>>>>> Stop the relevant OSDs?
>>>>>
>>>>> Please advise.
>>>>>
>>>>> THX
>>>>> Thomas
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx