Hello Ceph community,

Last week I was increasing the number of PGs in a pool used for RBD, in an attempt to reach 1024 PGs (from 128 PGs). The increments were of 32 PGs each time, and after creating the new placement groups I triggered the rebalancing of data using the pgp_num parameter.

Everything was fine until the pool reached ~400 PGs. Before 414 PGs, the cluster interrupted client IO for approximately 10 seconds while creating the 32 new PGs, which was fine for the SLA we are trying to meet. After 414 PGs the interruption was longer, reaching 40 seconds, with some downtime in our virtual machines lasting about 1 minute and hundreds of blocked ops in the ceph log.

I would like to understand why the client IO interruption took longer when I had more PGs. I've been unable to figure that out from the documentation and the mailing list.

Some info about the cluster:
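
The stepwise procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the pool name `rbd` is a placeholder (the actual pool name is not given), and `pg_increase_plan` is a hypothetical helper that only prints the `ceph osd pool set` commands instead of running them. It leaves out any waiting between steps.

```python
def pg_increase_plan(pool, start, target, step):
    """Build the list of ceph commands for raising pg_num in fixed steps.

    Each step first raises pg_num (creates the new placement groups) and
    then raises pgp_num to the same value (triggers the data rebalance).
    """
    commands = []
    current = start
    while current < target:
        # Never overshoot the target on the last step.
        current = min(current + step, target)
        commands.append(f"ceph osd pool set {pool} pg_num {current}")
        commands.append(f"ceph osd pool set {pool} pgp_num {current}")
    return commands


# "rbd" is a placeholder pool name, not taken from the cluster in question.
plan = pg_increase_plan("rbd", start=128, target=1024, step=32)
for command in plan:
    print(command)
```

Going from 128 to 1024 in steps of 32 takes 28 steps, so the plan holds 56 commands, one pg_num and one pgp_num change per step.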
Let me know if you need further information, and thanks in advance.

Kind regards to you all.

--
Fernando Cid O.
Operations Engineer
AltaVoz S.A.
http://www.altavoz.net
Viña del Mar, Valparaiso: 2 Poniente 355 of 53, +56 32 276 8060
Providencia, Santiago: Antonio Bellet 292 of 701, +56 2 585 4264