Just a quick guess - is it possible you ran out of file descriptors/connections on the nodes or on a firewall on the way? I’ve seen this behaviour the other way around - when too many RBD devices were connected to one client. It would explain why it seems to work but hangs when the device is used.
Jan
I would make sure that your CRUSH rules are designed to tolerate such a failure. We currently have two racks and can suffer the loss of one rack without blocking I/O. Here is what we do:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
}
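In case it helps anyone following along, getting a rule like that into the cluster is the usual decompile/edit/recompile round-trip; the file names here are just placeholders:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt to add/adjust the rule
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new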
All pools are size=4 and min_size=2
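(For reference, that is just the normal pool settings, e.g. for a pool named "rbd":
ceph osd pool set rbd size 4
ceph osd pool set rbd min_size 2
)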
This puts only two copies in each rack, so a rack loss can take down at most half of the copies of each object. We also configure Ceph with "mon_osd_downout_subtree_limit = host" so that it won't automatically mark a whole rack out (not that it would do a whole lot in our current 2-rack config).
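In ceph.conf that is roughly a one-liner on the monitors (we keep monitor options under [mon]; [global] works too):
[mon]
mon_osd_downout_subtree_limit = host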
Our network failure domain (dual Ethernet switches) spans two racks, so our next failure domain is what we call a PUD, i.e. 2 racks. The 3-4 rack configuration is similar to the above, with the choose step changed to type pud. Once we get to our 5th rack of storage, our config changes to:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type pud
        step emit
}
All pools are size=3 and min_size=2
In this configuration, only one copy is kept per PUD and we can lose two racks in a PUD without blocking I/O in our cluster.
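A quick way to sanity-check that a rule spreads copies the way you expect is crushtool's test mode against a compiled map (crushmap.new from the round-trip above), e.g.:
crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings
That prints the set of OSDs each test input maps to, so you can eyeball whether any two copies land in the same failure domain.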
Under the default CRUSH rules, it is possible to end up with two copies of an object in one rack. What does `ceph osd crush rule dump` show?