Re: Minimize data lost with PG incomplete


 



Hi José

If you have some of the original OSDs (not zapped or erased), then you might be able to just re-add them to the cluster and end up with a happy cluster.
If you attempt the ceph-objectstore-tool --op export & import route, make sure to do it on a temporary OSD weighted 0, as recommended in the link provided.

Either way, from what I can see in the pg dump you provided, restoring osd.0, osd.3, osd.20, osd.21 and osd.22 should be enough to bring back the PGs that are down.
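
In case it helps, here is a rough sketch of that export/import workflow, assuming one of the surviving original disks (say osd.0) still holds a copy of an incomplete PG. The PG id (3.7a), file paths and temporary OSD id (osd.50) below are placeholders, adjust them to your cluster:

# with the source OSD daemon stopped, export the PG from its data directory
ceph-objectstore-tool --op export --pgid 3.7a \
    --data-path /var/lib/ceph/osd/ceph-0 \
    --journal-path /var/lib/ceph/osd/ceph-0/journal \
    --file /root/3.7a.export

# prepare a fresh temporary OSD (here osd.50) and keep it at CRUSH weight 0
ceph osd crush reweight osd.50 0

# with the temporary OSD daemon stopped, import the PG, then start the daemon
ceph-objectstore-tool --op import \
    --data-path /var/lib/ceph/osd/ceph-50 \
    --journal-path /var/lib/ceph/osd/ceph-50/journal \
    --file /root/3.7a.export

Once the temporary OSD comes back up, the cluster should peer that PG again and backfill it onto the proper acting set, after which the temporary OSD can be removed.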

Cheers,
 
On 31/01/17 11:48, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:

    Any idea how I could recover files from the filesystem mount?
    Doing a cp, it hangs when it finds a damaged file/folder. I would be happy
    just getting the undamaged files.
    
    Thanks
    
    On 31/01/17 at 11:19, José M. Martín wrote:
    > Thanks.
    > I just realized I kept some of the original OSDs. If they contain some of
    > the incomplete PGs, would it be possible to add them back in alongside the new disks?
    > Maybe following this steps? http://ceph.com/community/incomplete-pgs-oh-my/
    >
    > On 31/01/17 at 10:44, Maxime Guyot wrote:
    >> Hi José,
    >>
    >> Too late now, but you could have updated the CRUSH map *before* moving the disks. Something like “ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05” would move osd.0 to loki05 and trigger the appropriate PG movements before any physical move. Then the physical move is done as usual: set noout, stop the OSD, move it physically, start the OSD, unset noout.
    >>
    >> It’s a way to trigger the data movement overnight (maybe with a cron) and do the physical move at your own convenience in the morning.
    >>
    >> Cheers, 
    >> Maxime 
    >>
    >> On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:
    >>
    >>     Already min_size = 1
    >>     
    >>     Thanks,
    >>     Jose M. Martín
    >>     
    >>     On 31/01/17 at 09:44, Henrik Korkuc wrote:
    >>     > I am not sure about the "incomplete" part off the top of my head, but you can try
    >>     > setting min_size to 1 for the pools to reactivate some PGs, if they are
    >>     > down/inactive due to missing replicas.
    >>     >
    >>     > On 17-01-31 10:24, José M. Martín wrote:
    >>     >> # ceph -s
    >>     >>      cluster 29a91870-2ed2-40dc-969e-07b22f37928b
    >>     >>       health HEALTH_ERR
    >>     >>              clock skew detected on mon.loki04
    >>     >>              155 pgs are stuck inactive for more than 300 seconds
    >>     >>              7 pgs backfill_toofull
    >>     >>              1028 pgs backfill_wait
    >>     >>              48 pgs backfilling
    >>     >>              892 pgs degraded
    >>     >>              20 pgs down
    >>     >>              153 pgs incomplete
    >>     >>              2 pgs peering
    >>     >>              155 pgs stuck inactive
    >>     >>              1077 pgs stuck unclean
    >>     >>              892 pgs undersized
    >>     >>              1471 requests are blocked > 32 sec
    >>     >>              recovery 3195781/36460868 objects degraded (8.765%)
    >>     >>              recovery 5079026/36460868 objects misplaced (13.930%)
    >>     >>              mds0: Behind on trimming (175/30)
    >>     >>              noscrub,nodeep-scrub flag(s) set
    >>     >>              Monitor clock skew detected
    >>     >>       monmap e5: 5 mons at
    >>     >> {loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
    >>     >>
    >>     >>              election epoch 4028, quorum 0,1,2,3,4
    >>     >> loki01,loki02,loki03,loki04,loki05
    >>     >>        fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
    >>     >>       osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
    >>     >>              flags noscrub,nodeep-scrub
    >>     >>        pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
    >>     >>              45892 GB used, 34024 GB / 79916 GB avail
    >>     >>              3195781/36460868 objects degraded (8.765%)
    >>     >>              5079026/36460868 objects misplaced (13.930%)
    >>     >>                  3640 active+clean
    >>     >>                   838 active+undersized+degraded+remapped+wait_backfill
    >>     >>                   184 active+remapped+wait_backfill
    >>     >>                   134 incomplete
    >>     >>                    48 active+undersized+degraded+remapped+backfilling
    >>     >>                    19 down+incomplete
    >>     >>                     6
    >>     >> active+undersized+degraded+remapped+wait_backfill+backfill_toofull
    >>     >>                     1 active+remapped+backfill_toofull
    >>     >>                     1 peering
    >>     >>                     1 down+peering
    >>     >> recovery io 93909 kB/s, 10 keys/s, 67 objects/s
    >>     >>
    >>     >>
    >>     >>
    >>     >> # ceph osd tree
    >>     >> ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
    >>     >>   -1 77.22777 root default
    >>     >>   -9 27.14778     rack sala1
    >>     >>   -2  5.41974         host loki01
    >>     >>   14  0.90329             osd.14       up  1.00000          1.00000
    >>     >>   15  0.90329             osd.15       up  1.00000          1.00000
    >>     >>   16  0.90329             osd.16       up  1.00000          1.00000
    >>     >>   17  0.90329             osd.17       up  1.00000          1.00000
    >>     >>   18  0.90329             osd.18       up  1.00000          1.00000
    >>     >>   25  0.90329             osd.25       up  1.00000          1.00000
    >>     >>   -4  3.61316         host loki03
    >>     >>    0  0.90329             osd.0        up  1.00000          1.00000
    >>     >>    2  0.90329             osd.2        up  1.00000          1.00000
    >>     >>   20  0.90329             osd.20       up  1.00000          1.00000
    >>     >>   24  0.90329             osd.24       up  1.00000          1.00000
    >>     >>   -3  9.05714         host loki02
    >>     >>    1  0.90300             osd.1        up  0.90002          1.00000
    >>     >>   31  2.72198             osd.31       up  1.00000          1.00000
    >>     >>   29  0.90329             osd.29       up  1.00000          1.00000
    >>     >>   30  0.90329             osd.30       up  1.00000          1.00000
    >>     >>   33  0.90329             osd.33       up  1.00000          1.00000
    >>     >>   32  2.72229             osd.32       up  1.00000          1.00000
    >>     >>   -5  9.05774         host loki04
    >>     >>    3  0.90329             osd.3        up  1.00000          1.00000
    >>     >>   19  0.90329             osd.19       up  1.00000          1.00000
    >>     >>   21  0.90329             osd.21       up  1.00000          1.00000
    >>     >>   22  0.90329             osd.22       up  1.00000          1.00000
    >>     >>   23  2.72229             osd.23       up  1.00000          1.00000
    >>     >>   28  2.72229             osd.28       up  1.00000          1.00000
    >>     >> -10 24.61000     rack sala2.2
    >>     >>   -6 24.61000         host loki05
    >>     >>    5  2.73000             osd.5        up  1.00000          1.00000
    >>     >>    6  2.73000             osd.6        up  1.00000          1.00000
    >>     >>    9  2.73000             osd.9        up  1.00000          1.00000
    >>     >>   10  2.73000             osd.10       up  1.00000          1.00000
    >>     >>   11  2.73000             osd.11       up  1.00000          1.00000
    >>     >>   12  2.73000             osd.12       up  1.00000          1.00000
    >>     >>   13  2.73000             osd.13       up  1.00000          1.00000
    >>     >>    4  2.73000             osd.4        up  1.00000          1.00000
    >>     >>    8  2.73000             osd.8        up  1.00000          1.00000
    >>     >>    7  0.03999             osd.7        up  1.00000          1.00000
    >>     >> -12 25.46999     rack sala2.1
    >>     >> -11 25.46999         host loki06
    >>     >>   34  2.73000             osd.34       up  1.00000          1.00000
    >>     >>   35  2.73000             osd.35       up  1.00000          1.00000
    >>     >>   36  2.73000             osd.36       up  1.00000          1.00000
    >>     >>   37  2.73000             osd.37       up  1.00000          1.00000
    >>     >>   38  2.73000             osd.38       up  1.00000          1.00000
    >>     >>   39  2.73000             osd.39       up  1.00000          1.00000
    >>     >>   40  2.73000             osd.40       up  1.00000          1.00000
    >>     >>   43  2.73000             osd.43       up  1.00000          1.00000
    >>     >>   42  0.90999             osd.42       up  1.00000          1.00000
    >>     >>   41  2.71999             osd.41       up  1.00000          1.00000
    >>     >>
    >>     >>
    >>     >> # ceph pg dump
    >>     >> You can find it in this link:
    >>     >> http://ergodic.ugr.es/pgdumpoutput.txt
    >>     >>
    >>     >>
    >>     >> What I did:
    >>     >> My cluster is heterogeneous, with old OSD nodes holding 1TB disks and
    >>     >> new ones with 3TB disks. I was having balance problems: some 1TB OSDs got
    >>     >> nearly full while there was plenty of space on others. My plan was to
    >>     >> replace some disks with bigger ones. I started the process with
    >>     >> no problems, changing one disk: reweight to 0.0, wait for the rebalance,
    >>     >> then remove it.
    >>     >> After that, while searching for my problem, I read about straw2. I then
    >>     >> changed the algorithm by editing the CRUSH map, and some data movement followed.
    >>     >> My setup was not optimal (I had the journal on the XFS filesystem), so I
    >>     >> decided to change that as well. At first I did it slowly, disk by disk, but since
    >>     >> rebalancing takes a long time and my group was pushing me to finish quickly,
    >>     >> I did
    >>     >> ceph osd out osd.id
    >>     >> ceph osd crush remove osd.id
    >>     >> ceph auth del osd.id
    >>     >> ceph osd rm id
    >>     >>
    >>     >> Then I unmounted the disks and used ceph-deploy to add them again:
    >>     >> ceph-deploy disk zap loki01:/dev/sda
    >>     >> ceph-deploy osd create loki01:/dev/sda
    >>     >>
    >>     >> I did this for every disk in rack "sala1". First I finished loki02, then I did
    >>     >> these steps on loki04, loki01 and loki03 at the same time.
    >>     >>
    >>     >> Thanks,
    >>     >> -- 
    >>     >> José M. Martín
    >>     >>
    >>     >>
    >>     >> On 31/01/17 at 00:43, Shinobu Kinjo wrote:
    >>     >>> First off, the followings, please.
    >>     >>>
    >>     >>>   * ceph -s
    >>     >>>   * ceph osd tree
    >>     >>>   * ceph pg dump
    >>     >>>
    >>     >>> and
    >>     >>>
    >>     >>>   * what you actually did with exact commands.
    >>     >>>
    >>     >>> Regards,
    >>     >>>
    >>     >>> On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín
    >>     >>> <jmartin@xxxxxxxxxxxxxx> wrote:
    >>     >>>> Dear list,
    >>     >>>>
    >>     >>>> I'm having some big problems with my setup.
    >>     >>>>
    >>     >>>> I was trying to increase the global capacity by replacing some OSDs with
    >>     >>>> bigger ones. I replaced them without waiting for the rebalance process to
    >>     >>>> finish, thinking the replicas were stored in other buckets, but I found a
    >>     >>>> lot of incomplete PGs, so replicas of a PG must have been placed in the
    >>     >>>> same bucket. I assume I have lost data, because I zapped the disks and
    >>     >>>> used them for other tasks.
    >>     >>>>
    >>     >>>> My question is: what should I do to recover as much data as possible?
    >>     >>>> I'm using the filesystem and RBD.
    >>     >>>>
    >>     >>>> Thank you so much,
    >>     >>>>
    >>     >>>> -- 
    >>     >>>>
    >>     >>>> Jose M. Martín
    >>     >>>>
    >>     >>>>
    >>     >>>> _______________________________________________
    >>     >>>> ceph-users mailing list
    >>     >>>> ceph-users@xxxxxxxxxxxxxx
    >>     >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >>     >>
    >>     >>
    >>     >> _______________________________________________
    >>     >> ceph-users mailing list
    >>     >> ceph-users@xxxxxxxxxxxxxx
    >>     >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >>     >
    >>     >
    >>     > _______________________________________________
    >>     > ceph-users mailing list
    >>     > ceph-users@xxxxxxxxxxxxxx
    >>     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >>     
    >>     
    >>     _______________________________________________
    >>     ceph-users mailing list
    >>     ceph-users@xxxxxxxxxxxxxx
    >>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >>     
    >>
    >
    > _______________________________________________
    > ceph-users mailing list
    > ceph-users@xxxxxxxxxxxxxx
    > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    
    
    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



