Re: Write freeze when writing to rbd image and rebooting one of the nodes

Hi, Robert,

Here is my CRUSH map.

# begin crush map
tunable choose_local_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9

# types
type 0 osd
type 1 host
type 2 zone
type 3 storage_group
type 4 root

# buckets
host  controller_performance_zone_one {
        id -1           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
host  controller_capacity_zone_one {
        id -2           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
host  compute2_performance_zone_one {
        id -3           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.5 weight 1.000
}
host  compute2_capacity_zone_one {
        id -4           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
        item osd.4 weight 1.000
}
host  compute3_performance_zone_one {
        id -5           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
}
host  compute3_capacity_zone_one {
        id -6           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 1.000
        item osd.7 weight 1.000
}
zone zone_one_performance {
        id -7           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item  controller_performance_zone_one weight 1.000
        item  compute2_performance_zone_one weight 1.000
        item  compute3_performance_zone_one weight 0.100
}
host  compute4_capacity_zone_one {
        id -12          # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item osd.8 weight 1.000
        item osd.9 weight 1.000
}
zone zone_one_capacity {
        id -8           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item  controller_capacity_zone_one weight 2.000
        item  compute2_capacity_zone_one weight 2.000
        item  compute3_capacity_zone_one weight 2.000
        item  compute4_capacity_zone_one weight 2.000
}
storage_group performance {
        id -9           # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item zone_one_performance weight 2.100
}
storage_group capacity {
        id -10          # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item zone_one_capacity weight 8.000
}
root vsm {
        id -11          # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item performance weight 2.100
        item capacity weight 8.000
}

# rules
rule capacity {
        ruleset 0
        type replicated
        min_size 0
        max_size 10
        step take capacity
        step chooseleaf firstn 0 type host
        step emit
}
rule performance {
        ruleset 1
        type replicated
        min_size 0
        max_size 10
        step take performance
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map
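
(For anyone who wants to reproduce this: a CRUSH map like the one above can be dumped and decompiled with the standard tools. A minimal sketch, with the /tmp paths just as placeholders:

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
)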

Here is the corresponding output of ceph osd tree:

ID  WEIGHT   TYPE NAME                                           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-11 10.09999 root vsm
 -9  2.09999     storage_group performance
 -7  2.09999         zone zone_one_performance
 -1  1.00000             host controller_performance_zone_one
  2  1.00000                 osd.2                                    up  1.00000          1.00000
 -3  1.00000             host compute2_performance_zone_one
  5  1.00000                 osd.5                                    up  1.00000          1.00000
 -5  0.09999             host compute3_performance_zone_one
-10  8.00000     storage_group capacity
 -8  8.00000         zone zone_one_capacity
 -2  2.00000             host controller_capacity_zone_one
  0  1.00000                 osd.0                                    up  1.00000          1.00000
  1  1.00000                 osd.1                                    up  1.00000          1.00000
 -4  2.00000             host compute2_capacity_zone_one
  3  1.00000                 osd.3                                    up  1.00000          1.00000
  4  1.00000                 osd.4                                    up  1.00000          1.00000
 -6  2.00000             host compute3_capacity_zone_one
  6  1.00000                 osd.6                                    up  1.00000          1.00000
  7  1.00000                 osd.7                                    up  1.00000          1.00000
-12  2.00000             host compute4_capacity_zone_one
  8  1.00000                 osd.8                                    up  1.00000          1.00000
  9  1.00000                 osd.9                                    up  1.00000          1.00000

And here are the settings of the pool I was using for tests:

root@iclcompute4:# ceph osd pool get Gold crush_ruleset
crush_ruleset: 0
root@iclcompute4:# ceph osd pool get Gold size
size: 3
root@iclcompute4:# ceph osd pool get Gold min_size
min_size: 1
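
(For completeness, these values are set per pool with ceph osd pool set, e.g.:

ceph osd pool set Gold size 3
ceph osd pool set Gold min_size 1
)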

The IO freeze happens whether I add or remove a host with 2 OSDs. I just did it with the standard manual Ceph procedure at http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ to be sure that no scripting mistakes are involved (the removal sequence is sketched below). The freeze lasts until the cluster reports HEALTH_OK, then IO resumes.
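
(For reference, the removal sequence from that page, roughly, using osd.8 as an example:

ceph osd out 8
# wait for rebalancing, stop the ceph-osd daemon on its host, then:
ceph osd crush remove osd.8
ceph auth del osd.8
ceph osd rm 8
)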

Regards, Vasily.



On Thu, May 14, 2015 at 6:44 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

Can you provide the output of the CRUSH map and a copy of the script that you are using to add the OSDs? Can you also provide the pool size and pool min_size?

----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, May 14, 2015 at 6:33 AM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
Thanks, Robert, for sharing so much experience! I feel like I don't deserve it :)

I have another, very similar situation which I don't understand.
Last time I tried to hard-kill the OSD daemons.
This time I added a new node with 2 OSDs to my cluster and again monitored the IO. I wrote a script which adds a node with OSDs fully automatically (it essentially automates the manual steps sketched below). It seems that when I start the script, IO is blocked until the cluster shows HEALTH_OK, which takes quite some time. After the Ceph status is OK, copying resumes.
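
(A rough sketch of the manual steps such a script automates; {id} and {host} are placeholders, and this is not my exact script:

ceph osd create                     # allocates the next free OSD id
ceph-osd -i {id} --mkfs --mkkey     # initialize the OSD data directory
ceph auth add osd.{id} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{id}/keyring
ceph osd crush add osd.{id} 1.0 host={host}
service ceph start osd.{id}
)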

What should I tune this time to avoid a long IO interruption?

Thanks in advance again :)


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
