Re: Minimize data loss with PG incomplete

Thanks.
I just realized I kept some of the original OSDs. If they contain some of
the incomplete PGs, would it be possible to add them back into the new disks?
Maybe following these steps? http://ceph.com/community/incomplete-pgs-oh-my/
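
As far as I understand, the procedure in that article boils down to exporting
the PG from the old disk with ceph-objectstore-tool and importing it into an
OSD that is currently acting for that PG. Just a sketch of my understanding
(the OSD ids, PG id and paths below are placeholders, and the OSDs involved
have to be stopped first):

# on the old disk (OSD stopped): export one incomplete PG, e.g. 1.28 from osd.3
ceph-objectstore-tool --op export --pgid 1.28 \
  --data-path /var/lib/ceph/osd/ceph-3 \
  --journal-path /var/lib/ceph/osd/ceph-3/journal \
  --file /tmp/pg.1.28.export

# on one of the current OSDs acting for 1.28 (also stopped): import it
ceph-objectstore-tool --op import \
  --data-path /var/lib/ceph/osd/ceph-40 \
  --journal-path /var/lib/ceph/osd/ceph-40/journal \
  --file /tmp/pg.1.28.export

Then start the OSDs again and let peering/recovery run. Is that roughly the
right approach here?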

On 31/01/17 at 10:44, Maxime Guyot wrote:
> Hi José,
>
> Too late now, but you could have updated the CRUSH map *before* moving the disks. Something like “ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05” would move osd.0 to loki05 and trigger the appropriate PG movements before any physical move. Then the physical move is done as usual: set noout, stop the osd, physically move it, start the osd, unset noout.
>
> It’s a way to trigger the data movement overnight (maybe with a cron job) and do the physical move at your convenience in the morning.
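>
> For example, the full sequence could look roughly like this (just a sketch: the weight and bucket names are taken from your tree, and it assumes systemd-managed OSDs, so adjust as needed):
>
> # evening: trigger the PG movements while the disk stays where it is
> ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05
> # ...wait for the backfill to finish...
> # morning: do the physical move
> ceph osd set noout
> systemctl stop ceph-osd@0        # on the old host
> # physically move the disk, then on loki05:
> systemctl start ceph-osd@0
> ceph osd unset noout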
>
> Cheers, 
> Maxime 
>
> On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:
>
>     Already min_size = 1
>     
>     Thanks,
>     Jose M. Martín
>     
>     On 31/01/17 at 09:44, Henrik Korkuc wrote:
>     > I am not sure about the "incomplete" part off the top of my head, but you
>     > can try setting min_size to 1 on the pools to reactivate some PGs if they
>     > are down/inactive due to missing replicas.
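>     > For example (the pool name here is just a placeholder):
>     >   ceph osd pool set rbd min_size 1
>     > and remember to set it back to its original value once the PGs recover.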
>     >
>     > On 17-01-31 10:24, José M. Martín wrote:
>     >> # ceph -s
>     >>      cluster 29a91870-2ed2-40dc-969e-07b22f37928b
>     >>       health HEALTH_ERR
>     >>              clock skew detected on mon.loki04
>     >>              155 pgs are stuck inactive for more than 300 seconds
>     >>              7 pgs backfill_toofull
>     >>              1028 pgs backfill_wait
>     >>              48 pgs backfilling
>     >>              892 pgs degraded
>     >>              20 pgs down
>     >>              153 pgs incomplete
>     >>              2 pgs peering
>     >>              155 pgs stuck inactive
>     >>              1077 pgs stuck unclean
>     >>              892 pgs undersized
>     >>              1471 requests are blocked > 32 sec
>     >>              recovery 3195781/36460868 objects degraded (8.765%)
>     >>              recovery 5079026/36460868 objects misplaced (13.930%)
>     >>              mds0: Behind on trimming (175/30)
>     >>              noscrub,nodeep-scrub flag(s) set
>     >>              Monitor clock skew detected
>     >>       monmap e5: 5 mons at
>     >> {loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
>     >>
>     >>              election epoch 4028, quorum 0,1,2,3,4
>     >> loki01,loki02,loki03,loki04,loki05
>     >>        fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
>     >>       osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
>     >>              flags noscrub,nodeep-scrub
>     >>        pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
>     >>              45892 GB used, 34024 GB / 79916 GB avail
>     >>              3195781/36460868 objects degraded (8.765%)
>     >>              5079026/36460868 objects misplaced (13.930%)
>     >>                  3640 active+clean
>     >>                   838 active+undersized+degraded+remapped+wait_backfill
>     >>                   184 active+remapped+wait_backfill
>     >>                   134 incomplete
>     >>                    48 active+undersized+degraded+remapped+backfilling
>     >>                    19 down+incomplete
>     >>                     6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>     >>                     1 active+remapped+backfill_toofull
>     >>                     1 peering
>     >>                     1 down+peering
>     >> recovery io 93909 kB/s, 10 keys/s, 67 objects/s
>     >>
>     >>
>     >>
>     >> # ceph osd tree
>     >> ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
>     >>   -1 77.22777 root default
>     >>   -9 27.14778     rack sala1
>     >>   -2  5.41974         host loki01
>     >>   14  0.90329             osd.14       up  1.00000          1.00000
>     >>   15  0.90329             osd.15       up  1.00000          1.00000
>     >>   16  0.90329             osd.16       up  1.00000          1.00000
>     >>   17  0.90329             osd.17       up  1.00000          1.00000
>     >>   18  0.90329             osd.18       up  1.00000          1.00000
>     >>   25  0.90329             osd.25       up  1.00000          1.00000
>     >>   -4  3.61316         host loki03
>     >>    0  0.90329             osd.0        up  1.00000          1.00000
>     >>    2  0.90329             osd.2        up  1.00000          1.00000
>     >>   20  0.90329             osd.20       up  1.00000          1.00000
>     >>   24  0.90329             osd.24       up  1.00000          1.00000
>     >>   -3  9.05714         host loki02
>     >>    1  0.90300             osd.1        up  0.90002          1.00000
>     >>   31  2.72198             osd.31       up  1.00000          1.00000
>     >>   29  0.90329             osd.29       up  1.00000          1.00000
>     >>   30  0.90329             osd.30       up  1.00000          1.00000
>     >>   33  0.90329             osd.33       up  1.00000          1.00000
>     >>   32  2.72229             osd.32       up  1.00000          1.00000
>     >>   -5  9.05774         host loki04
>     >>    3  0.90329             osd.3        up  1.00000          1.00000
>     >>   19  0.90329             osd.19       up  1.00000          1.00000
>     >>   21  0.90329             osd.21       up  1.00000          1.00000
>     >>   22  0.90329             osd.22       up  1.00000          1.00000
>     >>   23  2.72229             osd.23       up  1.00000          1.00000
>     >>   28  2.72229             osd.28       up  1.00000          1.00000
>     >> -10 24.61000     rack sala2.2
>     >>   -6 24.61000         host loki05
>     >>    5  2.73000             osd.5        up  1.00000          1.00000
>     >>    6  2.73000             osd.6        up  1.00000          1.00000
>     >>    9  2.73000             osd.9        up  1.00000          1.00000
>     >>   10  2.73000             osd.10       up  1.00000          1.00000
>     >>   11  2.73000             osd.11       up  1.00000          1.00000
>     >>   12  2.73000             osd.12       up  1.00000          1.00000
>     >>   13  2.73000             osd.13       up  1.00000          1.00000
>     >>    4  2.73000             osd.4        up  1.00000          1.00000
>     >>    8  2.73000             osd.8        up  1.00000          1.00000
>     >>    7  0.03999             osd.7        up  1.00000          1.00000
>     >> -12 25.46999     rack sala2.1
>     >> -11 25.46999         host loki06
>     >>   34  2.73000             osd.34       up  1.00000          1.00000
>     >>   35  2.73000             osd.35       up  1.00000          1.00000
>     >>   36  2.73000             osd.36       up  1.00000          1.00000
>     >>   37  2.73000             osd.37       up  1.00000          1.00000
>     >>   38  2.73000             osd.38       up  1.00000          1.00000
>     >>   39  2.73000             osd.39       up  1.00000          1.00000
>     >>   40  2.73000             osd.40       up  1.00000          1.00000
>     >>   43  2.73000             osd.43       up  1.00000          1.00000
>     >>   42  0.90999             osd.42       up  1.00000          1.00000
>     >>   41  2.71999             osd.41       up  1.00000          1.00000
>     >>
>     >>
>     >> # ceph pg dump
>     >> You can find it in this link:
>     >> http://ergodic.ugr.es/pgdumpoutput.txt
>     >>
>     >>
>     >> What I did:
>     >> My cluster is heterogeneous, with old OSD nodes holding 1TB disks and
>     >> new ones holding 3TB disks. I was having balance problems: some 1TB OSDs
>     >> got nearly full while there was plenty of space on others. My plan was to
>     >> replace some disks with bigger ones. I started the process with no
>     >> problems, one disk at a time: reweight to 0.0, wait for the rebalance,
>     >> then remove it.
>     >> After that, while researching my problem, I read about straw2. I changed
>     >> the bucket algorithm by editing the CRUSH map, and some data movement followed.
>     >> My setup was not optimal: I had the journal on the XFS filesystem, so I
>     >> decided to change that as well. At first I did it slowly, disk by disk,
>     >> but since rebalancing takes a long time and my group was pushing me to
>     >> finish quickly, I ran:
>     >> ceph osd out osd.id
>     >> ceph osd crush remove osd.id
>     >> ceph auth del osd.id
>     >> ceph osd rm id
>     >>
>     >> Then I unmounted the disks and, using ceph-deploy, added them again:
>     >> ceph-deploy disk zap loki01:/dev/sda
>     >> ceph-deploy osd create loki01:/dev/sda
>     >>
>     >> I did this for every disk in rack "sala1". First I finished loki02, then I
>     >> ran these steps on loki04, loki01 and loki03 at the same time.
>     >>
>     >> Thanks,
>     >> -- 
>     >> José M. Martín
>     >>
>     >>
>     >> On 31/01/17 at 00:43, Shinobu Kinjo wrote:
>     >>> First off, please provide the following:
>     >>>
>     >>>   * ceph -s
>     >>>   * ceph osd tree
>     >>>   * ceph pg dump
>     >>>
>     >>> and
>     >>>
>     >>>   * what you actually did with exact commands.
>     >>>
>     >>> Regards,
>     >>>
>     >>> On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín
>     >>> <jmartin@xxxxxxxxxxxxxx> wrote:
>     >>>> Dear list,
>     >>>>
>     >>>> I'm having some big problems with my setup.
>     >>>>
>     >>>> I was trying to increase the overall capacity by replacing some OSDs
>     >>>> with bigger ones. I swapped them without waiting for the rebalance
>     >>>> process to finish, thinking the replicas were stored in other buckets,
>     >>>> but I found a lot of incomplete PGs, so replicas of the same PG must
>     >>>> have been placed in the same bucket. I assume I have lost data, because
>     >>>> I zapped the disks and reused them for other tasks.
>     >>>>
>     >>>> My question is: what should I do to recover as much data as possible?
>     >>>> I'm using the filesystem and RBD.
>     >>>>
>     >>>> Thank you so much,
>     >>>>
>     >>>> -- 
>     >>>>
>     >>>> Jose M. Martín
>     >>>>
>     >>>>
>     
>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



