Re: Minimize data lost with PG incomplete

Hi Maxime

I have 3 of the original disks, but I don't know which OSD each one
corresponds to. Besides, I don't think I have enough technical skill to
do that, and I don't want to make things worse...
I'm trying to write a script that copies files from the damaged CephFS
to a new location, skipping anything that hangs; a rough sketch of what
I have so far is below.
Any help would be greatly appreciated.
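
Roughly this (SRC and DST are just placeholders for the damaged CephFS
mount and the rescue destination; it assumes GNU coreutils "timeout" is
available):

    #!/bin/bash
    SRC=/mnt/cephfs
    DST=/mnt/rescue

    # Walk the damaged filesystem and copy file by file; a damaged file
    # makes cp hang, so each copy gets a 30s timeout and failures are
    # only logged instead of blocking the whole run.
    find "$SRC" -type f -print0 | while IFS= read -r -d '' f; do
        rel="${f#"$SRC"/}"
        mkdir -p "$DST/$(dirname "$rel")"
        if ! timeout 30 cp -p -- "$f" "$DST/$rel"; then
            echo "SKIPPED: $f" >> /tmp/damaged_files.log
        fi
    done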

José


On 01/02/17 at 07:56, Maxime Guyot wrote:
> Hi José
>
> If you still have some of the original OSDs (not zapped or erased), then you might be able to simply re-add them and end up with a healthy cluster.
> If you attempt the ceph-objectstore-tool --op export & import route, make sure to do it on a temporary OSD of weight 0, as recommended in the link provided.
>
> Either way, and from what I can see in the pg dump you provided, restoring osd.0, osd.3, osd.20, osd.21 and osd.22 should be enough to bring back the PGs that are down.
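>
> As a sketch only (the pg id, the temporary OSD id and the paths below are examples, and both OSDs must be stopped while the tool runs):
>
>     # export the missing PG from the old, stopped OSD
>     ceph-objectstore-tool --op export --pgid 1.28 \
>         --data-path /var/lib/ceph/osd/ceph-0 \
>         --journal-path /var/lib/ceph/osd/ceph-0/journal \
>         --file /tmp/recover.1.28
>
>     # import it into the temporary, also stopped OSD kept at CRUSH weight 0
>     ceph osd crush reweight osd.50 0
>     ceph-objectstore-tool --op import \
>         --data-path /var/lib/ceph/osd/ceph-50 \
>         --journal-path /var/lib/ceph/osd/ceph-50/journal \
>         --file /tmp/recover.1.28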
>
> Cheers,
>  
> On 31/01/17 11:48, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:
>
>     Any idea how I could recover files from the filesystem mount?
>     Doing a cp, it hangs when it finds a damaged file/folder. I would be
>     happy just getting the undamaged files.
>     
>     Thanks
>     
>     On 31/01/17 at 11:19, José M. Martín wrote:
>     > Thanks.
>     > I just realized I still have some of the original OSDs. If they contain some of
>     > the incomplete PGs, would it be possible to add them back alongside the new disks?
>     > Maybe following these steps? http://ceph.com/community/incomplete-pgs-oh-my/
>     >
>     > On 31/01/17 at 10:44, Maxime Guyot wrote:
>     >> Hi José,
>     >>
>     >> Too late now, but you could have updated the CRUSH map *before* moving the disks. Something like “ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05” would move osd.0 to loki05 and trigger the appropriate PG movements before any physical move. Then the physical move is done as usual: set noout, stop the osd, physically move it, activate the osd, unset noout.
>     >>
>     >> It’s a way to trigger the data movement overnight (maybe with a cron) and do the physical move at your own convenience in the morning.
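>     >>
>     >> For example, roughly (the weight is taken from your tree, and the service commands depend on your init system, so treat this as a sketch):
>     >>
>     >>     # overnight: re-place osd.0 under its future host in CRUSH, triggering the PG movement
>     >>     ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05
>     >>
>     >>     # next morning, once the rebalance has settled:
>     >>     ceph osd set noout
>     >>     systemctl stop ceph-osd@0        # or the equivalent for your init system
>     >>     # physically move the disk to loki05, then:
>     >>     systemctl start ceph-osd@0
>     >>     ceph osd unset noout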
>     >>
>     >> Cheers, 
>     >> Maxime 
>     >>
>     >> On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:
>     >>
>     >>     Already min_size = 1
>     >>     
>     >>     Thanks,
>     >>     Jose M. Martín
>     >>     
>     >>     On 31/01/17 at 09:44, Henrik Korkuc wrote:
>     >>     > I am not sure about the "incomplete" part off the top of my head, but you can try
>     >>     > setting min_size to 1 for the pools to reactivate some PGs, if they are
>     >>     > down/inactive due to missing replicas.
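>     >>     >
>     >>     > For example (the pool name is just a placeholder; repeat for each affected pool):
>     >>     >
>     >>     >     ceph osd pool get <pool-name> min_size
>     >>     >     ceph osd pool set <pool-name> min_size 1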
>     >>     >
>     >>     > On 17-01-31 10:24, José M. Martín wrote:
>     >>     >> # ceph -s
>     >>     >>      cluster 29a91870-2ed2-40dc-969e-07b22f37928b
>     >>     >>       health HEALTH_ERR
>     >>     >>              clock skew detected on mon.loki04
>     >>     >>              155 pgs are stuck inactive for more than 300 seconds
>     >>     >>              7 pgs backfill_toofull
>     >>     >>              1028 pgs backfill_wait
>     >>     >>              48 pgs backfilling
>     >>     >>              892 pgs degraded
>     >>     >>              20 pgs down
>     >>     >>              153 pgs incomplete
>     >>     >>              2 pgs peering
>     >>     >>              155 pgs stuck inactive
>     >>     >>              1077 pgs stuck unclean
>     >>     >>              892 pgs undersized
>     >>     >>              1471 requests are blocked > 32 sec
>     >>     >>              recovery 3195781/36460868 objects degraded (8.765%)
>     >>     >>              recovery 5079026/36460868 objects misplaced (13.930%)
>     >>     >>              mds0: Behind on trimming (175/30)
>     >>     >>              noscrub,nodeep-scrub flag(s) set
>     >>     >>              Monitor clock skew detected
>     >>     >>       monmap e5: 5 mons at
>     >>     >> {loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
>     >>     >>
>     >>     >>              election epoch 4028, quorum 0,1,2,3,4
>     >>     >> loki01,loki02,loki03,loki04,loki05
>     >>     >>        fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
>     >>     >>       osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
>     >>     >>              flags noscrub,nodeep-scrub
>     >>     >>        pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
>     >>     >>              45892 GB used, 34024 GB / 79916 GB avail
>     >>     >>              3195781/36460868 objects degraded (8.765%)
>     >>     >>              5079026/36460868 objects misplaced (13.930%)
>     >>     >>                  3640 active+clean
>     >>     >>                   838 active+undersized+degraded+remapped+wait_backfill
>     >>     >>                   184 active+remapped+wait_backfill
>     >>     >>                   134 incomplete
>     >>     >>                    48 active+undersized+degraded+remapped+backfilling
>     >>     >>                    19 down+incomplete
>     >>     >>                     6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>     >>     >>                     1 active+remapped+backfill_toofull
>     >>     >>                     1 peering
>     >>     >>                     1 down+peering
>     >>     >> recovery io 93909 kB/s, 10 keys/s, 67 objects/s
>     >>     >>
>     >>     >>
>     >>     >>
>     >>     >> # ceph osd tree
>     >>     >> ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
>     >>     >>   -1 77.22777 root default
>     >>     >>   -9 27.14778     rack sala1
>     >>     >>   -2  5.41974         host loki01
>     >>     >>   14  0.90329             osd.14       up  1.00000          1.00000
>     >>     >>   15  0.90329             osd.15       up  1.00000          1.00000
>     >>     >>   16  0.90329             osd.16       up  1.00000          1.00000
>     >>     >>   17  0.90329             osd.17       up  1.00000          1.00000
>     >>     >>   18  0.90329             osd.18       up  1.00000          1.00000
>     >>     >>   25  0.90329             osd.25       up  1.00000          1.00000
>     >>     >>   -4  3.61316         host loki03
>     >>     >>    0  0.90329             osd.0        up  1.00000          1.00000
>     >>     >>    2  0.90329             osd.2        up  1.00000          1.00000
>     >>     >>   20  0.90329             osd.20       up  1.00000          1.00000
>     >>     >>   24  0.90329             osd.24       up  1.00000          1.00000
>     >>     >>   -3  9.05714         host loki02
>     >>     >>    1  0.90300             osd.1        up  0.90002          1.00000
>     >>     >>   31  2.72198             osd.31       up  1.00000          1.00000
>     >>     >>   29  0.90329             osd.29       up  1.00000          1.00000
>     >>     >>   30  0.90329             osd.30       up  1.00000          1.00000
>     >>     >>   33  0.90329             osd.33       up  1.00000          1.00000
>     >>     >>   32  2.72229             osd.32       up  1.00000          1.00000
>     >>     >>   -5  9.05774         host loki04
>     >>     >>    3  0.90329             osd.3        up  1.00000          1.00000
>     >>     >>   19  0.90329             osd.19       up  1.00000          1.00000
>     >>     >>   21  0.90329             osd.21       up  1.00000          1.00000
>     >>     >>   22  0.90329             osd.22       up  1.00000          1.00000
>     >>     >>   23  2.72229             osd.23       up  1.00000          1.00000
>     >>     >>   28  2.72229             osd.28       up  1.00000          1.00000
>     >>     >> -10 24.61000     rack sala2.2
>     >>     >>   -6 24.61000         host loki05
>     >>     >>    5  2.73000             osd.5        up  1.00000          1.00000
>     >>     >>    6  2.73000             osd.6        up  1.00000          1.00000
>     >>     >>    9  2.73000             osd.9        up  1.00000          1.00000
>     >>     >>   10  2.73000             osd.10       up  1.00000          1.00000
>     >>     >>   11  2.73000             osd.11       up  1.00000          1.00000
>     >>     >>   12  2.73000             osd.12       up  1.00000          1.00000
>     >>     >>   13  2.73000             osd.13       up  1.00000          1.00000
>     >>     >>    4  2.73000             osd.4        up  1.00000          1.00000
>     >>     >>    8  2.73000             osd.8        up  1.00000          1.00000
>     >>     >>    7  0.03999             osd.7        up  1.00000          1.00000
>     >>     >> -12 25.46999     rack sala2.1
>     >>     >> -11 25.46999         host loki06
>     >>     >>   34  2.73000             osd.34       up  1.00000          1.00000
>     >>     >>   35  2.73000             osd.35       up  1.00000          1.00000
>     >>     >>   36  2.73000             osd.36       up  1.00000          1.00000
>     >>     >>   37  2.73000             osd.37       up  1.00000          1.00000
>     >>     >>   38  2.73000             osd.38       up  1.00000          1.00000
>     >>     >>   39  2.73000             osd.39       up  1.00000          1.00000
>     >>     >>   40  2.73000             osd.40       up  1.00000          1.00000
>     >>     >>   43  2.73000             osd.43       up  1.00000          1.00000
>     >>     >>   42  0.90999             osd.42       up  1.00000          1.00000
>     >>     >>   41  2.71999             osd.41       up  1.00000          1.00000
>     >>     >>
>     >>     >>
>     >>     >> # ceph pg dump
>     >>     >> You can find it in this link:
>     >>     >> http://ergodic.ugr.es/pgdumpoutput.txt
>     >>     >>
>     >>     >>
>     >>     >> What I did:
>     >>     >> My cluster is heterogeneous, with old OSD nodes holding 1TB disks and
>     >>     >> new ones holding 3TB disks. I was having balance problems: some 1TB OSDs
>     >>     >> got nearly full while there was plenty of space on others. My plan was to
>     >>     >> replace some disks with bigger ones. I started the process with
>     >>     >> no problems, changing one disk: reweight to 0.0, wait for the rebalance,
>     >>     >> and remove it.
>     >>     >> After that, while researching my problem, I read about straw2. So I
>     >>     >> changed the algorithm by editing the crush map, and some data movement happened.
>     >>     >> My setup was not optimal (I had the journal on the XFS filesystem), so I
>     >>     >> decided to change that as well. At first I did it slowly, disk by disk, but as
>     >>     >> the rebalance takes a long time and my group was pushing me to finish quickly,
>     >>     >> I did:
>     >>     >> ceph osd out osd.id
>     >>     >> ceph osd crush remove osd.id
>     >>     >> ceph auth del osd.id
>     >>     >> ceph osd rm id
>     >>     >>
>     >>     >> Then I unmounted the disks and, using ceph-deploy, added them again:
>     >>     >> ceph-deploy disk zap loki01:/dev/sda
>     >>     >> ceph-deploy osd create loki01:/dev/sda
>     >>     >>
>     >>     >> I did this for every disk in rack "sala1". First I finished loki02; then I
>     >>     >> did these steps on loki04, loki01 and loki03 at the same time.
>     >>     >>
>     >>     >> Thanks,
>     >>     >> -- 
>     >>     >> José M. Martín
>     >>     >>
>     >>     >>
>     >>     >> On 31/01/17 at 00:43, Shinobu Kinjo wrote:
>     >>     >>> First off, the followings, please.
>     >>     >>>
>     >>     >>>   * ceph -s
>     >>     >>>   * ceph osd tree
>     >>     >>>   * ceph pg dump
>     >>     >>>
>     >>     >>> and
>     >>     >>>
>     >>     >>>   * what you actually did with exact commands.
>     >>     >>>
>     >>     >>> Regards,
>     >>     >>>
>     >>     >>> On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín
>     >>     >>> <jmartin@xxxxxxxxxxxxxx> wrote:
>     >>     >>>> Dear list,
>     >>     >>>>
>     >>     >>>> I'm having some big problems with my setup.
>     >>     >>>>
>     >>     >>>> I was trying to increase the overall capacity by replacing some OSDs with
>     >>     >>>> bigger ones. I replaced them without waiting for the rebalance process to
>     >>     >>>> finish, thinking the replicas were stored in other buckets, but I found a
>     >>     >>>> lot of incomplete PGs, so replicas of the same PG must have been placed in
>     >>     >>>> the same bucket. I assume I have lost that data, because I zapped the disks
>     >>     >>>> and reused them for other tasks.
>     >>     >>>>
>     >>     >>>> My question is: what should I do to recover as much data as possible?
>     >>     >>>> I'm using the filesystem and RBD.
>     >>     >>>>
>     >>     >>>> Thank you so much,
>     >>     >>>>
>     >>     >>>> -- 
>     >>     >>>>
>     >>     >>>> Jose M. Martín
>     >>     >>>>
>     >>     >>>>
>     >>     >>
>     >>     >>
>     >>     >
>     >>     >
>     >>     
>     >>     
>     >>     
>     >>
>     >
>     
>     
>     
>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



