Re: Sanity check on unexpected data movement

Now that I've dug into this, I can see in the exported crush map that the choose_args weight_set for this bucket id is zero for its 9th member (which I assume corresponds to the evacuated node-98).
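
(For anyone following along, this is roughly how I pulled the map out - filenames here are just placeholders:

    ceph osd getcrushmap -o crush.bin     # fetch the compiled crush map from the cluster
    crushtool -d crush.bin -o crush.txt   # decompile it into the text form excerpted below
)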

rack even01 {
        id -10          # do not change unnecessarily
        id -14 class ssd                # do not change unnecessarily
        id -18 class hdd                # do not change unnecessarily
        # weight 132.502
        alg straw2
        hash 0  # rjenkins1
        item node-08 weight 12.912
        item node-02 weight 25.619
        item node-04 weight 12.912
        item node-06 weight 12.912
        item node-10 weight 12.912
        item node-12 weight 12.912
        item node-14 weight 12.912
        item node-16 weight 12.912
        item node-98 weight 16.500
}
...
# choose_args
choose_args 18446744073709551615 {
  {
...
  {
    bucket_id -18
    weight_set [
      [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 0.000 ]
    ]
  }
...

I assume it wasn't set to zero until recently, since the node was holding data until now... I wonder what caused it to change?

Presumably I can edit the crush map manually to correct this, at least approximately, and get things working better.

Changing the value from 0.000 to a guessed 12.000, recompiling the map, and testing with "crushtool --test -i crush.map --show-utilization-all ..." does show things being stored again on the affected devices...
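
(For the record, the edit/test cycle was along these lines - filenames and the replica count are just illustrative:

    crushtool -c crush.txt -o crush-new.bin                    # recompile the edited text map
    crushtool --test -i crush-new.bin --show-utilization-all --num-rep 3
    ceph osd setcrushmap -i crush-new.bin                      # how the edited map would go back into the cluster
)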

Even more mysterious, though: I rebooted node-98 (why not, it was no longer hosting any data), and after it came back up, I saw that its choose_args value had magically changed:

  {
    bucket_id -18
    weight_set [
      [ 9.902 27.027 10.661 10.344 10.558 10.766 10.622 9.728 16.450 ]
    ]
  }

and data is moving back. I love it when things "fix themselves" without apparent cause!
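
(Side note: if I need to keep an eye on this in future, I believe the quicker way to look at that value, without decompiling the whole map, is something along the lines of:

    ceph osd crush weight-set dump        # should dump the compat weight-set values

though I haven't checked that its output lines up exactly with what crushtool shows.)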

Graham

On 4/29/19 12:12 PM, Graham Allan wrote:
I think I need a second set of eyes to understand some unexpected data movement when adding new OSDs to a cluster (Luminous 12.2.11).

Our cluster ran low on space sooner than expected, so as a stopgap I recommissioned a couple of older storage nodes while we get new hardware purchases under way.

I spent a little time running drive tests to weed out any weaklings before creating the new OSDs... because of this, one node was ready before the other. Each has 30 HDDs/OSDs.

So for the first node I introduced the new OSDs by increasing their crush weight gradually to the final value (0.55, in steps of 0.1 - the values don't make much sense relative to HDD capacity, but that's historic). We never had more than ~2% of pgs misplaced at any one time. All went well: the new OSDs acquired pgs in the expected proportions and the space crunch was mitigated.
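
(Concretely that was just repeated rounds of something like

    ceph osd crush reweight osd.<id> 0.1   # then 0.2, 0.3, ... up to the final 0.55

for each new OSD, letting recovery settle between steps - OSD ids omitted here.)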

Then I started adding the second node, first setting its OSDs to crush weight 0.1. All of a sudden, bam, 14-15% of pgs were misplaced! This didn't make any sense to me - what seems to have happened is that Ceph evacuated almost all of the data from the previous new node. I just don't understand this, given the OSD crush weights...

What might cause this? The output of "ceph df tree" is below; the first new node is "node-98", the second is "node-99". Is there anything obvious I could be missing?

One note: we almost certainly need more pgs to improve the data distribution, but it seems too risky to change that until more space is available.

--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


