On Wed, Jun 21, 2017 at 6:57 AM Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi cephers,
I noticed something I don't understand about Ceph's behavior when adding an OSD. When I start with a clean cluster (all PGs active+clean) and add an OSD (via ceph-deploy, for example), the CRUSH map gets updated, PGs get reassigned to different OSDs, and the new OSD starts getting filled with data. As the new OSD gets filled, I start seeing PGs in degraded states. Here is an example:
pgmap v52068792: 42496 pgs, 6 pools, 1305 TB data, 390 Mobjects
3164 TB used, 781 TB / 3946 TB avail
8017/994261437 objects degraded (0.001%)
2220581/994261437 objects misplaced (0.223%)
42393 active+clean
91 active+remapped+wait_backfill
9 active+clean+scrubbing+deep
1 active+recovery_wait+degraded
1 active+clean+scrubbing
1 active+remapped+backfilling
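(In case it's useful, this is roughly how I pick out the PGs that aren't clean - just a sketch of what I run by hand, the grep patterns are only illustrative:)

  # list PGs that are not plain active+clean, with their up/acting sets
  ceph pg dump pgs_brief 2>/dev/null | grep -v ' active+clean '
  # or have ceph summarize the degraded/backfilling PGs directly
  ceph health detail | grep -Ei 'degraded|backfill'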
Any ideas why there would be persistent degradation in the cluster while the newly added drive is being filled? It takes perhaps a day or two to fill the drive, and during all that time the cluster seems to be running degraded. As data is written to the cluster, the number of degraded objects increases over time. Once the newly added OSD is filled, the cluster comes back to clean again.
Here is the PG that is degraded in this picture:
7.87c 1 0 2 0 0 4194304 7 7 active+recovery_wait+degraded 2017-06-20 14:12:44.119921 344610'7 583572:2797 [402,521] 402 [402,521] 402 344610'7 2017-06-16 06:04:55.822503 344610'7 2017-06-16 06:04:55.822503
The newly added OSD here is 521. Before it was added, this PG had two clean replicas, but one seems to have been forgotten somehow?
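I can also send the full query output for that PG if it helps, i.e. something along the lines of:

  # detailed peering/recovery state of the degraded PG
  ceph pg 7.87c query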
This sounds a bit concerning at first glance. Can you provide the exact commands you're invoking, and the "ceph -s" output as it changes in response?
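Even something very simple captured while the backfill runs would do; a rough sketch (the interval and log file name are arbitrary):

  # snapshot the cluster status once a minute while the new OSD backfills
  while true; do date; ceph -s; sleep 60; done >> ceph-status.log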
I really don't see how adding a new OSD can result in it "forgetting" about existing valid copies — it's definitely not supposed to — so I wonder if there's a collision in how it's deciding to remove old locations.
Are you running with only two copies of your data? It shouldn't matter, but there could also be bugs that result in different behavior between two and three copies.
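You can confirm the replica counts with something like the following (replace <pool> with your pool names):

  # replication factor of one pool
  ceph osd pool get <pool> size
  # or list it for all pools at once
  ceph osd dump | grep 'replicated size'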
-Greg
Other remapped PGs have 521 in their "up" set but still have the two existing copies in their "acting" set - and no degradation is shown. Examples:
2.f24 14282 0 16 28564 0 51014850801 3102 3102 active+remapped+wait_backfill 2017-06-20 14:12:42.650308 583553'2033479 583573:2033266 [467,521] 467 [467,499] 467 582430'2033337 2017-06-16 09:08:51.055131 582036'2030837 2017-05-31 20:37:54.831178
6.2b7d 10499 0 140 20998 0 37242874687 3673 3673 active+remapped+wait_backfill 2017-06-20 14:12:42.070019 583569'165163 583572:342128 [541,37,521] 541 [541,37,532] 541 582430'161890 2017-06-18 09:42:49.148402 582430'161890 2017-06-18 09:42:49.148402
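(For reference, I'm picking these out with something along these lines - just a sketch, the exact filtering doesn't matter much:)

  # show remapped PGs together with their up and acting sets
  ceph pg dump pgs_brief 2>/dev/null | grep remapped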
We are running the latest Jewel patch level everywhere (10.2.7). Any insights would be appreciated.
Andras