Long interruption when increasing placement groups

fcid <fcid@xxxxxxxxxxx> · Tue, 3 Jul 2018 18:19:20 -0400



    Hello ceph community,
    Last week I was increasing the number of PGs in a pool used for RBD,
      in an attempt to reach 1024 PGs (from 128 PGs). I raised pg_num in
      increments of 32, and after each batch of new placement groups was
      created I triggered the rebalance of data by setting the pgp_num
      parameter.
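
    The stepping procedure described above can be sketched as follows.
    This is only an illustration, not the exact commands that were run:
    the pool name "rbd" is a placeholder, and the script just prints the
    "ceph osd pool set" commands for each step instead of executing them.

```python
def pg_steps(current, target, increment):
    """Yield successive pg_num values from `current` up to `target`."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    value = current
    while value < target:
        # Never overshoot the target on the last step.
        value = min(value + increment, target)
        yield value


def plan_commands(pool, current, target, increment):
    """Return the pg_num / pgp_num commands for each step, in order."""
    commands = []
    for pg_num in pg_steps(current, target, increment):
        # Create the new placement groups first...
        commands.append(f"ceph osd pool set {pool} pg_num {pg_num}")
        # ...then let data rebalance onto them.
        commands.append(f"ceph osd pool set {pool} pgp_num {pg_num}")
    return commands


if __name__ == "__main__":
    # "rbd" is a hypothetical pool name; 128 -> 1024 in steps of 32
    # matches the numbers in this post (28 steps in total).
    for line in plan_commands("rbd", 128, 1024, 32):
        print(line)
```

    In practice one would wait for the cluster to settle between steps
    before issuing the next pair of commands.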
    Everything was fine until the pool reached ~400 PGs. Before
      414 PGs, the cluster interrupted client I/O for approximately
      10 seconds while creating each batch of 32 new PGs, which was
      acceptable for the SLA we are trying to meet. After 414 PGs the
      interruption was longer, reaching 40 seconds, with roughly
      1 minute of downtime in our virtual machines and hundreds of
      blocked ops in the ceph log.

    
    I would like to understand why the client I/O interruption took
      longer when I had more PGs. I've been unable to figure that out
      from the documentation and the mailing list.
    Some information about the cluster:
    
      Number of OSDs: 24. The cluster started with 6 OSDs.
      OSD nodes: 3
      Monitors: 3
      Version: Jewel 10.2.10
      OSD backend disks: HDD
      OSD journal disks: SSD
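
    For context, the PG counts above can be related to the OSD count
    with a small calculation. This is a sketch under an assumption: the
    pool's replica size is not stated in this post, so the value 3 used
    below is a guess, and only this one pool is counted.

```python
def pg_replicas_per_osd(pg_num, replica_size, num_osds):
    """Average number of PG replicas each OSD holds for one pool."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return pg_num * replica_size / num_osds


if __name__ == "__main__":
    ASSUMED_REPLICA_SIZE = 3  # assumption, not stated in the post
    NUM_OSDS = 24             # from the cluster info above
    for pg_num in (128, 414, 1024):
        per_osd = pg_replicas_per_osd(pg_num, ASSUMED_REPLICA_SIZE, NUM_OSDS)
        print(f"pg_num={pg_num}: about {per_osd:.1f} PG replicas per OSD")
```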
    
    Let me know if you need further information and thanks in
      advance.
    Kind regards to you all.

    
    -- 
Fernando Cid O.
Ingeniero de Operaciones
AltaVoz S.A.
 http://www.altavoz.net
Viña del Mar, Valparaiso:
 2 Poniente 355 of 53
 +56 32 276 8060
Providencia, Santiago:
 Antonio Bellet 292 of 701
 +56 2 585 4264 
  

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com