After doing some testing, I'm even more confused.
What I'm trying to achieve is minimal data movement when I have to service a node to replace a failed drive. Since these nodes don't have hot-swap bays, I'll need to power down the box to replace the failed drive. I don't want Ceph to shuffle data until the new drive comes up and is ready.
My thought was to set norecover and nobackfill, take down the host, replace the drive, start the host, remove the old OSD from the cluster, ceph-disk prepare the new disk, and then unset norecover and nobackfill.
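Roughly, the sequence I have in mind looks like this (the OSD id and the device path are placeholders for whichever drive actually failed):

    ceph osd set norecover
    ceph osd set nobackfill
    # power down the host, swap the failed drive, power the host back up
    ceph osd crush remove osd.<id>   # drop the dead OSD from the CRUSH map
    ceph auth del osd.<id>
    ceph osd rm osd.<id>
    ceph-disk prepare /dev/sdX       # prepare the replacement disk
    ceph osd unset nobackfill
    ceph osd unset norecover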
However, in my testing with a 4-node cluster (v0.94.0, 10 OSDs each, replication 3, min_size 2, chooseleaf firstn on host), if I take down a host, I/O becomes blocked even though only one copy goes offline, which should still satisfy min_size. When I unset norecover, I/O proceeds and some backfill activity happens. At some point the backfill stops and everything seems to be "happy" in the degraded state.
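To be concrete about the pool and CRUSH setup I'm describing (the pool name "rbd" here is just an example), this is how I verify it:

    ceph osd pool get rbd size        # reports: size: 3
    ceph osd pool get rbd min_size    # reports: min_size: 2
    ceph osd crush rule dump          # the rule has a "chooseleaf firstn 0 type host" step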
I'm really interested to know what is going on with "norecover", as the cluster seems to break while it is set. Unsetting the "norecover" flag causes some degraded objects to recover, but not all of them. Writing to new blocks in an RBD increases the number of degraded objects, but the I/O itself works just fine. Here is an example after taking down one host and removing its OSDs from the CRUSH map (I'm currently reformatting all the drives in that host).
# ceph status
    cluster 146c4fe8-7c85-46dc-b8b3-69072d658287
     health HEALTH_WARN
            1345 pgs backfill
            10 pgs backfilling
            2016 pgs degraded
            661 pgs recovery_wait
            2016 pgs stuck degraded
            2016 pgs stuck unclean
            1356 pgs stuck undersized
            1356 pgs undersized
            recovery 40642/167785 objects degraded (24.223%)
            recovery 31481/167785 objects misplaced (18.763%)
            too many PGs per OSD (665 > max 300)
            nobackfill flag(s) set
     monmap e5: 3 mons at {nodea=10.8.6.227:6789/0,nodeb=10.8.6.228:6789/0,nodec=10.8.6.229:6789/0}
            election epoch 2576, quorum 0,1,2 nodea,nodeb,nodec
     osdmap e59031: 30 osds: 30 up, 30 in; 1356 remapped pgs
            flags nobackfill
      pgmap v4723208: 6656 pgs, 4 pools, 330 GB data, 53235 objects
            863 GB used, 55000 GB / 55863 GB avail
            40642/167785 objects degraded (24.223%)
            31481/167785 objects misplaced (18.763%)
                4640 active+clean
                1345 active+undersized+degraded+remapped+wait_backfill
                 660 active+recovery_wait+degraded
                  10 active+undersized+degraded+remapped+backfilling
                   1 active+recovery_wait+undersized+degraded+remapped
  client io 1864 kB/s rd, 8853 kB/s wr, 65 op/s
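In case it helps with the diagnosis, this is how I'm watching the state (just the standard tooling, nothing exotic):

    ceph osd dump | grep flags        # cluster-wide flags, e.g. "flags nobackfill"
    ceph health detail | head -20     # lists the individual degraded/undersized PGs
    ceph pg dump_stuck unclean        # the PGs counted as "stuck unclean" above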
Any help understanding these flags would be very helpful.
Thanks,
Robert
On Mon, Apr 13, 2015 at 1:40 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
I'm looking for documentation about what exactly each of these do and
I can't find it. Can someone point me in the right direction?
The names seem too ambiguous to come to any conclusion about what
exactly they do.
Thanks,
Robert