# ceph -s
    cluster 29a91870-2ed2-40dc-969e-07b22f37928b
     health HEALTH_ERR
            clock skew detected on mon.loki04
            155 pgs are stuck inactive for more than 300 seconds
            7 pgs backfill_toofull
            1028 pgs backfill_wait
            48 pgs backfilling
            892 pgs degraded
            20 pgs down
            153 pgs incomplete
            2 pgs peering
            155 pgs stuck inactive
            1077 pgs stuck unclean
            892 pgs undersized
            1471 requests are blocked > 32 sec
            recovery 3195781/36460868 objects degraded (8.765%)
            recovery 5079026/36460868 objects misplaced (13.930%)
            mds0: Behind on trimming (175/30)
            noscrub,nodeep-scrub flag(s) set
            Monitor clock skew detected
     monmap e5: 5 mons at {loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
            election epoch 4028, quorum 0,1,2,3,4 loki01,loki02,loki03,loki04,loki05
      fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
     osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
            45892 GB used, 34024 GB / 79916 GB avail
            3195781/36460868 objects degraded (8.765%)
            5079026/36460868 objects misplaced (13.930%)
                3640 active+clean
                 838 active+undersized+degraded+remapped+wait_backfill
                 184 active+remapped+wait_backfill
                 134 incomplete
                  48 active+undersized+degraded+remapped+backfilling
                  19 down+incomplete
                   6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
                   1 active+remapped+backfill_toofull
                   1 peering
                   1 down+peering
recovery io 93909 kB/s, 10 keys/s, 67 objects/s

# ceph osd tree
ID  WEIGHT   TYPE NAME            UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
 -1 77.22777 root default
 -9 27.14778     rack sala1
 -2  5.41974         host loki01
 14  0.90329             osd.14        up   1.00000           1.00000
 15  0.90329             osd.15        up   1.00000           1.00000
 16  0.90329             osd.16        up   1.00000           1.00000
 17  0.90329             osd.17        up   1.00000           1.00000
 18  0.90329             osd.18        up   1.00000           1.00000
 25  0.90329             osd.25        up   1.00000           1.00000
 -4  3.61316         host loki03
  0  0.90329             osd.0         up   1.00000           1.00000
  2  0.90329             osd.2         up   1.00000           1.00000
 20  0.90329             osd.20        up   1.00000           1.00000
 24  0.90329             osd.24        up   1.00000           1.00000
 -3  9.05714         host loki02
  1  0.90300             osd.1         up   0.90002           1.00000
 31  2.72198             osd.31        up   1.00000           1.00000
 29  0.90329             osd.29        up   1.00000           1.00000
 30  0.90329             osd.30        up   1.00000           1.00000
 33  0.90329             osd.33        up   1.00000           1.00000
 32  2.72229             osd.32        up   1.00000           1.00000
 -5  9.05774         host loki04
  3  0.90329             osd.3         up   1.00000           1.00000
 19  0.90329             osd.19        up   1.00000           1.00000
 21  0.90329             osd.21        up   1.00000           1.00000
 22  0.90329             osd.22        up   1.00000           1.00000
 23  2.72229             osd.23        up   1.00000           1.00000
 28  2.72229             osd.28        up   1.00000           1.00000
-10 24.61000     rack sala2.2
 -6 24.61000         host loki05
  5  2.73000             osd.5         up   1.00000           1.00000
  6  2.73000             osd.6         up   1.00000           1.00000
  9  2.73000             osd.9         up   1.00000           1.00000
 10  2.73000             osd.10        up   1.00000           1.00000
 11  2.73000             osd.11        up   1.00000           1.00000
 12  2.73000             osd.12        up   1.00000           1.00000
 13  2.73000             osd.13        up   1.00000           1.00000
  4  2.73000             osd.4         up   1.00000           1.00000
  8  2.73000             osd.8         up   1.00000           1.00000
  7  0.03999             osd.7         up   1.00000           1.00000
-12 25.46999     rack sala2.1
-11 25.46999         host loki06
 34  2.73000             osd.34        up   1.00000           1.00000
 35  2.73000             osd.35        up   1.00000           1.00000
 36  2.73000             osd.36        up   1.00000           1.00000
 37  2.73000             osd.37        up   1.00000           1.00000
 38  2.73000             osd.38        up   1.00000           1.00000
 39  2.73000             osd.39        up   1.00000           1.00000
 40  2.73000             osd.40        up   1.00000           1.00000
 43  2.73000             osd.43        up   1.00000           1.00000
 42  0.90999             osd.42        up   1.00000           1.00000
 41  2.71999             osd.41        up   1.00000           1.00000

# ceph pg dump
You can find it in this link: http://ergodic.ugr.es/pgdumpoutput.txt

What I did:

My cluster is heterogeneous: the old OSD nodes have 1 TB disks and the new ones have 3 TB disks. I was having balance problems; some 1 TB OSDs got nearly full while there was plenty of space on others. My plan was to replace some of the disks with bigger ones.

I started the process with no problems, changing one disk: reweight it to 0.0, wait for the rebalance to finish, and then remove it. After that, while looking into my balance problem, I read about straw2, so I changed the bucket algorithm by editing the CRUSH map, which caused some data movement.
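For reference, those first two steps were along these lines; I don't have the exact shell history at hand, so the OSD id and file names here are only examples:

    # drain and remove the first disk (osd.14 is just an example id)
    ceph osd crush reweight osd.14 0
    # wait for the rebalance to finish, then:
    ceph osd out 14
    ceph osd crush remove osd.14
    ceph auth del osd.14
    ceph osd rm 14

    # switch the CRUSH buckets from straw to straw2
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt, replacing "alg straw" with "alg straw2" in each bucket
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin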
My setup was not optimal either: I had the journals on the XFS filesystem, so I decided to change that as well. At first I did it slowly, disk by disk, but since the rebalance was taking a long time and my group was pushing me to finish quickly, for each disk I ran:

    ceph osd out osd.<id>
    ceph osd crush remove osd.<id>
    ceph auth del osd.<id>
    ceph osd rm <id>

Then I unmounted the disks and added them again using ceph-deploy:

    ceph-deploy disk zap loki01:/dev/sda
    ceph-deploy osd create loki01:/dev/sda

I did this for every disk in rack "sala1". First I finished loki02, then I did these steps on loki04, loki01 and loki03 at the same time.

Thanks,

--
José M. Martín


On 31/01/17 at 00:43, Shinobu Kinjo wrote:
> First off, the following, please.
>
>  * ceph -s
>  * ceph osd tree
>  * ceph pg dump
>
> and
>
>  * what you actually did, with exact commands.
>
> Regards,
>
> On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín <jmartin@xxxxxxxxxxxxxx> wrote:
>> Dear list,
>>
>> I'm having some big problems with my setup.
>>
>> I was trying to increase the global capacity by replacing some OSDs with
>> bigger ones. I replaced them without waiting for the rebalance to finish,
>> thinking the replicas were stored in other buckets, but I found a lot of
>> incomplete PGs, so replicas of the same PG must have been placed in the
>> same bucket. I assume I have lost that data, because I zapped the disks
>> and used them for other tasks.
>>
>> My question is: what should I do to recover as much data as possible?
>> I'm using the filesystem and RBD.
>>
>> Thank you so much,
>>
>> --
>>
>> Jose M. Martín

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com