Re: Minimize data lost with PG incomplete

Maxime Guyot <Maxime.Guyot@xxxxxxxxx> · Tue, 31 Jan 2017 09:44:24 +0000

Hi José,

Too late, but you could have updated the CRUSH map *before* moving the disks. Something like "ceph osd crush set osd.0 0.90329 root=default rack=sala2.2 host=loki05" would move osd.0 to loki05 and trigger the appropriate PG movements before any physical move. Then the physical move is done as usual: set noout, stop the OSD, physically move the disk, start the OSD, unset noout.

It’s a way to trigger the data movement overnight (maybe with a cron job) and do the physical move at your own convenience in the morning.
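
A minimal sketch of that sequence, written as a checklist printer: it only prints the commands and runs nothing. The function names are hypothetical, the OSD, weight, rack and host values are the ones from the example above, and the systemctl unit name assumes systemd-managed OSDs, which may not match every installation.

```shell
#!/bin/sh
# Checklist printer: every function only echoes commands, nothing is executed.

# Step 1 (can be scheduled, for example from cron): change the CRUSH
# location so the PGs start moving while the disk is still in its old host.
crush_relocate() {
    # $1 = OSD name, $2 = CRUSH weight, $3 = rack, $4 = host
    echo "ceph osd crush set $1 $2 root=default rack=$3 host=$4"
}

# Step 2 (later, by hand): the usual physical move.
physical_move() {
    # $1 = numeric OSD id
    echo "ceph osd set noout"
    echo "systemctl stop ceph-osd@$1     # on the old host (assumes systemd)"
    echo "# physically move the disk to the new host"
    echo "systemctl start ceph-osd@$1    # on the new host (assumes systemd)"
    echo "ceph osd unset noout"
}

crush_relocate osd.0 0.90329 sala2.2 loki05
physical_move 0
```

Splitting the two steps keeps the part that can run unattended (the CRUSH change) separate from the part that needs someone in the machine room.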

Cheers, 
Maxime 

On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" <ceph-users-bounces@xxxxxxxxxxxxxx on behalf of jmartin@xxxxxxxxxxxxxx> wrote:

    Already min_size = 1

    Thanks,
    Jose M. Martín

    On 31/01/17 at 09:44, Henrik Korkuc wrote:
    > I am not sure about the "incomplete" part off the top of my head, but you
    > can try setting min_size to 1 for the pools to reactivate some PGs, if
    > they are down/inactive due to missing replicas.
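
A small sketch of the min_size suggestion, again as a printer that runs nothing. The pool names rbd and cephfs_data are placeholders, not the names of the four pools in this cluster, and the function name is hypothetical. Lowering min_size to 1 lets a PG serve I/O with a single replica, so it is normally treated as a temporary measure.

```shell
#!/bin/sh
# Prints (does not run) the commands to inspect and then lower min_size
# for each pool named on the command line.
min_size_plan() {
    for pool in "$@"; do
        echo "ceph osd pool get $pool min_size"
        echo "ceph osd pool set $pool min_size 1"
    done
}

# Placeholder pool names; substitute the real ones.
min_size_plan rbd cephfs_data
```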
    >
    > On 17-01-31 10:24, José M. Martín wrote:
    >> # ceph -s
    >>      cluster 29a91870-2ed2-40dc-969e-07b22f37928b
    >>       health HEALTH_ERR
    >>              clock skew detected on mon.loki04
    >>              155 pgs are stuck inactive for more than 300 seconds
    >>              7 pgs backfill_toofull
    >>              1028 pgs backfill_wait
    >>              48 pgs backfilling
    >>              892 pgs degraded
    >>              20 pgs down
    >>              153 pgs incomplete
    >>              2 pgs peering
    >>              155 pgs stuck inactive
    >>              1077 pgs stuck unclean
    >>              892 pgs undersized
    >>              1471 requests are blocked > 32 sec
    >>              recovery 3195781/36460868 objects degraded (8.765%)
    >>              recovery 5079026/36460868 objects misplaced (13.930%)
    >>              mds0: Behind on trimming (175/30)
    >>              noscrub,nodeep-scrub flag(s) set
    >>              Monitor clock skew detected
    >>       monmap e5: 5 mons at {loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
    >>              election epoch 4028, quorum 0,1,2,3,4 loki01,loki02,loki03,loki04,loki05
    >>        fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
    >>       osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
    >>              flags noscrub,nodeep-scrub
    >>        pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
    >>              45892 GB used, 34024 GB / 79916 GB avail
    >>              3195781/36460868 objects degraded (8.765%)
    >>              5079026/36460868 objects misplaced (13.930%)
    >>                  3640 active+clean
    >>                   838 active+undersized+degraded+remapped+wait_backfill
    >>                   184 active+remapped+wait_backfill
    >>                   134 incomplete
    >>                    48 active+undersized+degraded+remapped+backfilling
    >>                    19 down+incomplete
    >>                     6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
    >>                     1 active+remapped+backfill_toofull
    >>                     1 peering
    >>                     1 down+peering
    >> recovery io 93909 kB/s, 10 keys/s, 67 objects/s
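
As a cross-check of the status output above, the per-state PG counts can be summed to confirm the headline figures. The sketch below copies the counts from that output; the helper name count_pgs is hypothetical.

```shell
#!/bin/sh
# PG state counts copied from the "ceph -s" output above.
pg_states='3640 active+clean
838 active+undersized+degraded+remapped+wait_backfill
184 active+remapped+wait_backfill
134 incomplete
48 active+undersized+degraded+remapped+backfilling
19 down+incomplete
6 active+undersized+degraded+remapped+wait_backfill+backfill_toofull
1 active+remapped+backfill_toofull
1 peering
1 down+peering'

# Sum the counts of all states whose name matches an extended regex.
count_pgs() {
    printf '%s\n' "$pg_states" |
        awk -v re="$1" '$2 ~ re { n += $1 } END { print n + 0 }'
}

total=$(count_pgs '.')            # every PG
active=$(count_pgs '^active')     # states that start with "active"
echo "total=$total not_active=$((total - active)) incomplete=$(count_pgs 'incomplete') down=$(count_pgs 'down')"
# prints: total=4872 not_active=155 incomplete=153 down=20
```

These agree with the summary lines: 4872 PGs in total, 155 stuck inactive, 153 incomplete and 20 down.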
    >>
    >>
    >>
    >> # ceph osd tree
    >> ID  WEIGHT   TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
    >>   -1 77.22777 root default
    >>   -9 27.14778     rack sala1
    >>   -2  5.41974         host loki01
    >>   14  0.90329             osd.14       up  1.00000          1.00000
    >>   15  0.90329             osd.15       up  1.00000          1.00000
    >>   16  0.90329             osd.16       up  1.00000          1.00000
    >>   17  0.90329             osd.17       up  1.00000          1.00000
    >>   18  0.90329             osd.18       up  1.00000          1.00000
    >>   25  0.90329             osd.25       up  1.00000          1.00000
    >>   -4  3.61316         host loki03
    >>    0  0.90329             osd.0        up  1.00000          1.00000
    >>    2  0.90329             osd.2        up  1.00000          1.00000
    >>   20  0.90329             osd.20       up  1.00000          1.00000
    >>   24  0.90329             osd.24       up  1.00000          1.00000
    >>   -3  9.05714         host loki02
    >>    1  0.90300             osd.1        up  0.90002          1.00000
    >>   31  2.72198             osd.31       up  1.00000          1.00000
    >>   29  0.90329             osd.29       up  1.00000          1.00000
    >>   30  0.90329             osd.30       up  1.00000          1.00000
    >>   33  0.90329             osd.33       up  1.00000          1.00000
    >>   32  2.72229             osd.32       up  1.00000          1.00000
    >>   -5  9.05774         host loki04
    >>    3  0.90329             osd.3        up  1.00000          1.00000
    >>   19  0.90329             osd.19       up  1.00000          1.00000
    >>   21  0.90329             osd.21       up  1.00000          1.00000
    >>   22  0.90329             osd.22       up  1.00000          1.00000
    >>   23  2.72229             osd.23       up  1.00000          1.00000
    >>   28  2.72229             osd.28       up  1.00000          1.00000
    >> -10 24.61000     rack sala2.2
    >>   -6 24.61000         host loki05
    >>    5  2.73000             osd.5        up  1.00000          1.00000
    >>    6  2.73000             osd.6        up  1.00000          1.00000
    >>    9  2.73000             osd.9        up  1.00000          1.00000
    >>   10  2.73000             osd.10       up  1.00000          1.00000
    >>   11  2.73000             osd.11       up  1.00000          1.00000
    >>   12  2.73000             osd.12       up  1.00000          1.00000
    >>   13  2.73000             osd.13       up  1.00000          1.00000
    >>    4  2.73000             osd.4        up  1.00000          1.00000
    >>    8  2.73000             osd.8        up  1.00000          1.00000
    >>    7  0.03999             osd.7        up  1.00000          1.00000
    >> -12 25.46999     rack sala2.1
    >> -11 25.46999         host loki06
    >>   34  2.73000             osd.34       up  1.00000          1.00000
    >>   35  2.73000             osd.35       up  1.00000          1.00000
    >>   36  2.73000             osd.36       up  1.00000          1.00000
    >>   37  2.73000             osd.37       up  1.00000          1.00000
    >>   38  2.73000             osd.38       up  1.00000          1.00000
    >>   39  2.73000             osd.39       up  1.00000          1.00000
    >>   40  2.73000             osd.40       up  1.00000          1.00000
    >>   43  2.73000             osd.43       up  1.00000          1.00000
    >>   42  0.90999             osd.42       up  1.00000          1.00000
    >>   41  2.71999             osd.41       up  1.00000          1.00000
    >>
    >>
    >> # ceph pg dump
    >> You can find it in this link:
    >> http://ergodic.ugr.es/pgdumpoutput.txt
    >>
    >>
    >> What I did:
    >> My cluster is heterogeneous, with old OSD nodes that have 1TB disks and
    >> new ones with 3TB. I was having balance problems: some 1TB OSDs got
    >> nearly full while there was plenty of space on others. My plan was to
    >> replace some disks with bigger ones. I started the process with no
    >> problems, replacing one disk: reweight to 0.0, wait for the rebalance,
    >> and remove it.
    >> After that, while researching my problem, I read about straw2. I then
    >> changed the algorithm by editing the CRUSH map, which caused some data
    >> movement.
    >> My setup was not optimal: I had the journal on the XFS filesystem, so I
    >> decided to change that as well. At first I did it slowly, disk by disk,
    >> but as the rebalance takes a long time and my group was pushing me to
    >> finish quickly, I did:
    >> ceph osd out osd.id
    >> ceph osd crush remove osd.id
    >> ceph auth del osd.id
    >> ceph osd rm id
    >>
    >> Then I unmounted the disks and added them again using ceph-deploy:
    >> ceph-deploy disk zap loki01:/dev/sda
    >> ceph-deploy osd create loki01:/dev/sda
    >>
    >> I did this for every disk in rack "sala1". First, I finished loki02. Then
    >> I did these steps on loki04, loki01 and loki03 at the same time.
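
For contrast with the parallel removal described above, here is a sketch of the slower one-OSD-at-a-time sequence (drain, wait, then remove). Both function names are hypothetical. all_clean expects "COUNT STATE" lines in the format of the PG summary in the "ceph -s" output, and removal_plan only prints commands. Using "ceph osd crush reweight" for the "reweight to 0.0" step is an assumption, since the message does not say which reweight command was used.

```shell
#!/bin/sh
# all_clean: reads "COUNT STATE" lines on stdin and succeeds only if every
# PG is active+clean (any other state with a non-zero count fails).
all_clean() {
    awk '$2 != "active+clean" { bad += $1 } END { exit (bad > 0) }'
}

# removal_plan: prints (does not run) the sequence for a single OSD.
removal_plan() {
    # $1 = numeric OSD id
    echo "ceph osd crush reweight osd.$1 0    # assumption: CRUSH reweight"
    echo "# wait until all_clean succeeds before continuing"
    echo "ceph osd out $1"
    echo "# stop the ceph-osd daemon for osd.$1 on its host"
    echo "ceph osd crush remove osd.$1"
    echo "ceph auth del osd.$1"
    echo "ceph osd rm $1"
}

# Example: with incomplete PGs present, the next OSD should not be removed.
printf '3640 active+clean\n134 incomplete\n' | all_clean ||
    echo "not clean yet: do not remove the next OSD"
removal_plan 14
```

Gating each removal on all_clean is what prevents several copies of the same PG from being destroyed before the cluster has re-replicated them.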
    >>
    >> Thanks,
    >> -- 
    >> José M. Martín
    >>
    >>
    >> On 31/01/17 at 00:43, Shinobu Kinjo wrote:
    >>> First off, please provide the following:
    >>>
    >>>   * ceph -s
    >>>   * ceph osd tree
    >>>   * ceph pg dump
    >>>
    >>> and
    >>>
    >>>   * what you actually did with exact commands.
    >>>
    >>> Regards,
    >>>
    >>> On Tue, Jan 31, 2017 at 6:10 AM, José M. Martín
    >>> <jmartin@xxxxxxxxxxxxxx> wrote:
    >>>> Dear list,
    >>>>
    >>>> I'm having some big problems with my setup.
    >>>>
    >>>> I was trying to increase the overall capacity by replacing some OSDs
    >>>> with bigger ones. I replaced them without waiting for the rebalance
    >>>> to finish, thinking the replicas were stored in other buckets, but I
    >>>> found a lot of incomplete PGs, so the replicas of a PG must have been
    >>>> placed in the same bucket. I assume I have lost data, because I
    >>>> zapped the disks and used them for other tasks.
    >>>>
    >>>> My question is: what should I do to recover as much data as possible?
    >>>> I'm using the filesystem and RBD.
    >>>>
    >>>> Thank you so much,
    >>>>
    >>>> -- 
    >>>>
    >>>> Jose M. Martín
    >>>>
    >>>>
    >>>> _______________________________________________
    >>>> ceph-users mailing list
    >>>> ceph-users@xxxxxxxxxxxxxx
    >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com