Re: Minimize data loss with PG incomplete

I am not sure about the "incomplete" part off the top of my head, but you can try setting min_size to 1 for the pools to reactivate some PGs, if they are down/inactive due to missing replicas.
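
For example (the pool name here is just a placeholder):

ceph osd pool set <pool-name> min_size 1

and set it back to its previous value (usually 2) once the PGs have recovered:

ceph osd pool set <pool-name> min_size 2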

On 17-01-31 10:24, José M. Martín wrote:
# ceph -s
     cluster 29a91870-2ed2-40dc-969e-07b22f37928b
      health HEALTH_ERR
             clock skew detected on mon.loki04
             155 pgs are stuck inactive for more than 300 seconds
             7 pgs backfill_toofull
             1028 pgs backfill_wait
             48 pgs backfilling
             892 pgs degraded
             20 pgs down
             153 pgs incomplete
             2 pgs peering
             155 pgs stuck inactive
             1077 pgs stuck unclean
             892 pgs undersized
             1471 requests are blocked > 32 sec
             recovery 3195781/36460868 objects degraded (8.765%)
             recovery 5079026/36460868 objects misplaced (13.930%)
             mds0: Behind on trimming (175/30)
             noscrub,nodeep-scrub flag(s) set
             Monitor clock skew detected
      monmap e5: 5 mons at
{loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
             election epoch 4028, quorum 0,1,2,3,4
loki01,loki02,loki03,loki04,loki05
       fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
      osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
             flags noscrub,nodeep-scrub
       pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
             45892 GB used, 34024 GB / 79916 GB avail
             3195781/36460868 objects degraded (8.765%)
             5079026/36460868 objects misplaced (13.930%)
                 3640 active+clean
                  838 active+undersized+degraded+remapped+wait_backfill
                  184 active+remapped+wait_backfill
                  134 incomplete
                   48 active+undersized+degraded+remapped+backfilling
                   19 down+incomplete
                     6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
                    1 active+remapped+backfill_toofull
                    1 peering
                    1 down+peering
recovery io 93909 kB/s, 10 keys/s, 67 objects/s



# ceph osd tree
ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
  -1 77.22777 root default
  -9 27.14778     rack sala1
  -2  5.41974         host loki01
  14  0.90329             osd.14       up  1.00000          1.00000
  15  0.90329             osd.15       up  1.00000          1.00000
  16  0.90329             osd.16       up  1.00000          1.00000
  17  0.90329             osd.17       up  1.00000          1.00000
  18  0.90329             osd.18       up  1.00000          1.00000
  25  0.90329             osd.25       up  1.00000          1.00000
  -4  3.61316         host loki03
   0  0.90329             osd.0        up  1.00000          1.00000
   2  0.90329             osd.2        up  1.00000          1.00000
  20  0.90329             osd.20       up  1.00000          1.00000
  24  0.90329             osd.24       up  1.00000          1.00000
  -3  9.05714         host loki02
   1  0.90300             osd.1        up  0.90002          1.00000
  31  2.72198             osd.31       up  1.00000          1.00000
  29  0.90329             osd.29       up  1.00000          1.00000
  30  0.90329             osd.30       up  1.00000          1.00000
  33  0.90329             osd.33       up  1.00000          1.00000
  32  2.72229             osd.32       up  1.00000          1.00000
  -5  9.05774         host loki04
   3  0.90329             osd.3        up  1.00000          1.00000
  19  0.90329             osd.19       up  1.00000          1.00000
  21  0.90329             osd.21       up  1.00000          1.00000
  22  0.90329             osd.22       up  1.00000          1.00000
  23  2.72229             osd.23       up  1.00000          1.00000
  28  2.72229             osd.28       up  1.00000          1.00000
-10 24.61000     rack sala2.2
  -6 24.61000         host loki05
   5  2.73000             osd.5        up  1.00000          1.00000
   6  2.73000             osd.6        up  1.00000          1.00000
   9  2.73000             osd.9        up  1.00000          1.00000
  10  2.73000             osd.10       up  1.00000          1.00000
  11  2.73000             osd.11       up  1.00000          1.00000
  12  2.73000             osd.12       up  1.00000          1.00000
  13  2.73000             osd.13       up  1.00000          1.00000
   4  2.73000             osd.4        up  1.00000          1.00000
   8  2.73000             osd.8        up  1.00000          1.00000
   7  0.03999             osd.7        up  1.00000          1.00000
-12 25.46999     rack sala2.1
-11 25.46999         host loki06
  34  2.73000             osd.34       up  1.00000          1.00000
  35  2.73000             osd.35       up  1.00000          1.00000
  36  2.73000             osd.36       up  1.00000          1.00000
  37  2.73000             osd.37       up  1.00000          1.00000
  38  2.73000             osd.38       up  1.00000          1.00000
  39  2.73000             osd.39       up  1.00000          1.00000
  40  2.73000             osd.40       up  1.00000          1.00000
  43  2.73000             osd.43       up  1.00000          1.00000
  42  0.90999             osd.42       up  1.00000          1.00000
  41  2.71999             osd.41       up  1.00000          1.00000


# ceph pg dump
You can find it at this link:
http://ergodic.ugr.es/pgdumpoutput.txt


What I did:
My cluster is heterogeneous, with old OSD nodes using 1TB disks and
new ones using 3TB disks. I was having balance problems: some 1TB OSDs
got nearly full while there was plenty of space on others. My plan was to
replace some disks with bigger ones. I started the process with no
problems, changing one disk: reweight it to 0.0, wait for the rebalance,
then remove it.
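
(The reweight step was roughly the following; the OSD id is just an example:)

ceph osd crush reweight osd.14 0
# wait until "ceph -s" shows all PGs active+clean again before removing the OSD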
After that, while researching my problem, I read about straw2. I then
changed the bucket algorithm by editing the crush map, which caused some
data movement.
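
(For reference, switching buckets to straw2 by editing the crush map
usually goes roughly like this; the file names are just examples:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# change "alg straw" to "alg straw2" in the bucket definitions of crushmap.txt
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin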
My setup was also not optimal: I had the journals on the XFS filesystem,
so I decided to change that as well. At first I did it slowly, disk by
disk, but since rebalancing takes a long time and my group was pushing me
to finish quickly, I ran
ceph osd out osd.<id>
ceph osd crush remove osd.<id>
ceph auth del osd.<id>
ceph osd rm osd.<id>

Then I unmounted the disks and added them again using ceph-deploy:
ceph-deploy disk zap loki01:/dev/sda
ceph-deploy osd create loki01:/dev/sda

I did this for every disk in rack "sala1". First, I finished loki02.
Then, I did these steps on loki04, loki01 and loki03 at the same time.

Thanks,
--
José M. Martín


On 31/01/17 at 00:43, Shinobu Kinjo wrote:
First off, please provide the following:

  * ceph -s
  * ceph osd tree
  * ceph pg dump

and

  * what you actually did with exact commands.

Regards,

On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín <jmartin@xxxxxxxxxxxxxx> wrote:
Dear list,

I'm having some big problems with my setup.

I was trying to increase the overall capacity by replacing some OSDs with
bigger ones. I replaced them without waiting for the rebalance process to
finish, thinking the replicas were stored in other buckets, but I found a
lot of incomplete PGs, so replicas of the same PG must have been placed in
the same bucket. I assume I have lost data, because I zapped the disks and
reused them for other tasks.

My question is: what should I do to recover as much data as possible?
I'm using the filesystem (CephFS) and RBD.

Thank you so much,

--

Jose M. Martín


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





