Re: Sanity check on unexpected data movement

Now that I've dug into this, I can see in the exported crush map that the choose_args weight_set for this bucket id is zero for its 9th member (which I assume corresponds to the evacuated node-98).
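
(For anyone following along, this is roughly how I pulled the map out - filenames here are just placeholders:

    ceph osd getcrushmap -o crush.bin     # fetch the compiled crush map from the cluster
    crushtool -d crush.bin -o crush.txt   # decompile it into the text form excerpted below
)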

rack even01 {
        id -10          # do not change unnecessarily
        id -14 class ssd                # do not change unnecessarily
        id -18 class hdd                # do not change unnecessarily
        # weight 132.502
        alg straw2
        hash 0  # rjenkins1
        item node-08 weight 12.912
        item node-02 weight 25.619
        item node-04 weight 12.912
        item node-06 weight 12.912
        item node-10 weight 12.912
        item node-12 weight 12.912
        item node-14 weight 12.912
        item node-16 weight 12.912
        item node-98 weight 16.500
}
...
# choose_args
choose_args 18446744073709551615 {
  {
...
  {
    bucket_id -18
    weight_set [
      [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 0.000 ]
    ]
  }
...

I assume it wasn't set to zero until recently, since the node was holding data until now... I wonder what caused it to change?

Presumably I can edit the crush map manually to correct this, at least approximately, and get things working better.

Changing the value from 0.000 to a guessed 12.000, recompiling the map, and testing with "crushtool --test -i crush.map --show-utilization-all ..." does show things being stored again on the affected devices...
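
(For the record, the edit/test cycle was along these lines - filenames and the replica count are just illustrative:

    crushtool -c crush.txt -o crush-new.bin                    # recompile the edited text map
    crushtool --test -i crush-new.bin --show-utilization-all --num-rep 3
    ceph osd setcrushmap -i crush-new.bin                      # how the edited map would go back into the cluster
)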

Even more mysterious, though: I rebooted node-98 (why not, it was no longer hosting any data), and after it came back up, I saw that its choose_args value had magically changed:

  {
    bucket_id -18
    weight_set [
      [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 16.450 ]
    ]
  }

and data is moving back. I love it when things "fix themselves" without apparent cause!
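
(Side note: if I need to keep an eye on this in future, I believe the quicker way to look at that value, without decompiling the whole map, is something along the lines of:

    ceph osd crush weight-set dump        # should dump the compat weight-set values

though I haven't checked that its output lines up exactly with what crushtool shows.)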

Graham

On 4/29/19 12:12 PM, Graham Allan wrote:
I think I need a second set of eyes to understand some unexpected data movement when adding new OSDs to a cluster (Luminous 12.2.11).

Our cluster ran low on space sooner than expected, so as a stopgap I recommissioned a couple of older storage nodes while we get new hardware purchases under way.

I spent a little time running drive tests to weed out any weaklings before creating the new OSDs... because of this, one node was ready before the other. Each has 30 HDDs/OSDs.

So for the first node I introduced the new OSDs by increasing their crush weight gradually to the final value (0.55, in steps of 0.1 - the values don't make much sense relative to HDD capacity, but that's historic). We never had more than ~2% of pgs misplaced at any one time. All went well: the new OSDs acquired pgs in the expected proportions and the space crunch was mitigated.
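
(Concretely that was just repeated rounds of something like

    ceph osd crush reweight osd.<id> 0.1   # then 0.2, 0.3, ... up to the final 0.55

for each new OSD, letting recovery settle between steps - OSD ids omitted here.)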

Then I started adding the second node, first setting its OSDs to crush weight 0.1. All of a sudden, bam, 14-15% of pgs were misplaced! This didn't make any sense to me - what seems to have happened is that Ceph evacuated almost all of the data from the previous new node. I just don't understand this, given the OSD crush weights...

What might cause this? The output of "ceph df tree" is below; the first new node is "node-98", the second is "node-99". Is there anything obvious I could be missing?

One note: we almost certainly need more pgs to improve the data distribution, but it seems too risky to change that until more space is available.

--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


