Okay, now I changed the crush rule also on a pool with the real data, and it seems all the client I/O on that pool has stopped. The recovery continues, but things like QEMU I/O, "rbd ls", and so on are just stuck, doing nothing. Can I unstick it somehow, faster than waiting for all the recovery to finish? (See the very end of this mail, below the quoted thread, for a rough sketch of what I am considering trying.) Thanks.

# ceph -s
  cluster:
    id:     ... my-uuid ...
    health: HEALTH_ERR
            3308311/3803892 objects misplaced (86.972%)
            Reduced data availability: 1721 pgs inactive
            Degraded data redundancy: 85361/3803892 objects degraded (2.244%), 139 pgs degraded, 139 pgs undersized
            Degraded data redundancy (low space): 25 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3
    mgr: mon2(active), standbys: mon1, mon3
    osd: 80 osds: 80 up, 80 in; 1868 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   13 pools, 5056 pgs
    objects: 1.27 M objects, 4.8 TiB
    usage:   15 TiB used, 208 TiB / 224 TiB avail
    pgs:     34.039% pgs not active
             85361/3803892 objects degraded (2.244%)
             3308311/3803892 objects misplaced (86.972%)
             3188 active+clean
             1582 activating+remapped
             139  activating+undersized+degraded+remapped
             93   active+remapped+backfill_wait
             29   active+remapped+backfilling
             25   active+remapped+backfill_wait+backfill_toofull

  io:
    recovery: 174 MiB/s, 43 objects/s

-Yenya

Jan Kasprzak wrote:
: : ----- Original Message -----
: : From: "Caspar Smit" <casparsmit@xxxxxxxxxxx>
: : To: "Jan Kasprzak" <kas@xxxxxxxxxx>
: : Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
: : Sent: Thursday, 31 January, 2019 15:43:07
: : Subject: Re: backfill_toofull after adding new OSDs
: :
: : Hi Jan,
: :
: : You might be hitting the same issue as Wido here:
: :
: : https://www.spinics.net/lists/ceph-users/msg50603.html
: :
: : Kind regards,
: : Caspar
: :
: : On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <kas@xxxxxxxxxx> wrote:
: :
: : Hello, ceph users,
: :
: : I see the following HEALTH_ERR during cluster rebalance:
: :
: : Degraded data redundancy (low space): 8 pgs backfill_toofull
: :
: : Detailed description:
: : I have upgraded my cluster to Mimic and added 16 new BlueStore OSDs
: : on 4 hosts. The hosts are in a separate region in my crush map, and the
: : crush rules prevented data from being moved onto the new OSDs. Now I want
: : to move all data to the new OSDs (and possibly decommission the old
: : FileStore OSDs). I have created the following rule:
: :
: : # ceph osd crush rule create-replicated on-newhosts newhostsroot host
: :
: : After this, I am slowly moving the pools one by one to this new rule:
: :
: : # ceph osd pool set test-hdd-pool crush_rule on-newhosts
: :
: : When I do this, I get the above error. This is misleading, because
: : "ceph osd df" does not suggest the OSDs are getting full (the most full
: : OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
: : disappears. Why am I getting this error?
: :
: : # ceph -s
: :   cluster:
: :     id:     ...my UUID...
: :     health: HEALTH_ERR
: :             1271/3803223 objects misplaced (0.033%)
: :             Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
: :             Degraded data redundancy (low space): 8 pgs backfill_toofull
: :
: :   services:
: :     mon: 3 daemons, quorum mon1,mon2,mon3
: :     mgr: mon2(active), standbys: mon1, mon3
: :     osd: 80 osds: 80 up, 80 in; 90 remapped pgs
: :     rgw: 1 daemon active
: :
: :   data:
: :     pools:   13 pools, 5056 pgs
: :     objects: 1.27 M objects, 4.8 TiB
: :     usage:   15 TiB used, 208 TiB / 224 TiB avail
: :     pgs:     40124/3803223 objects degraded (1.055%)
: :              1271/3803223 objects misplaced (0.033%)
: :              4963 active+clean
: :              41   active+recovery_wait+undersized+degraded+remapped
: :              21   active+recovery_wait+undersized+degraded
: :              17   active+remapped+backfill_wait
: :              5    active+remapped+backfill_wait+backfill_toofull
: :              3    active+remapped+backfill_toofull
: :              2    active+recovering+undersized+remapped
: :              2    active+recovering+undersized+degraded+remapped
: :              1    active+clean+remapped
: :              1    active+recovering+undersized+degraded
: :
: :   io:
: :     client:   6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
: :     recovery: 2.0 MiB/s, 92 objects/s
: :
: : Thanks for any hint,
: :
: : -Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
This is the world we live in: the way to deal with computers is to google
the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
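
PS (the sketch mentioned at the top): this is roughly what I am considering
trying to get the client I/O unstuck, in case anybody can confirm or veto it
before I make things worse. The flags and options below are standard ones as
far as I know, but whether they actually help with the PGs stuck in
"activating" is only my guess, and the mon_max_pg_per_osd value is just an
illustrative number, not a recommendation:

# ceph pg dump_stuck inactive
    (list the PGs stuck in activating, and which OSDs they map to)
# ceph osd set norebalance
# ceph osd set nobackfill
    (pause further data movement so that peering/activation can catch up)
# ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
    (throttle backfill and recovery on all OSDs)
# ceph config set global mon_max_pg_per_osd 400
    (only if the stuck PGs turn out to be hitting the per-OSD PG limit on the
    16 new OSDs; 400 is just an example value)
# ceph osd unset nobackfill
# ceph osd unset norebalance
    (re-enable data movement once the PGs are active again)

My (unverified) reasoning: pausing backfill should let peering and activation
finish without competing for OSD resources, and if squeezing the pools onto
the 16 new OSDs has pushed them over the per-OSD PG limit, raising
mon_max_pg_per_osd should let the stuck PGs activate. Corrections welcome.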