Re: Blocked requests activating+remapped after extending pg(p)_num

Paul Emmerich <paul.emmerich@xxxxxxxx> · Thu, 17 May 2018 16:45:36 +0200

Check ceph pg query, it will (usually) tell you why something is stuck inactive.
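A minimal sketch of that check, assuming the stuck PG ids are already known (for example from `ceph pg dump_stuck inactive`). The helper name `query_pgs`, the `CEPH` override variable and the PG ids in the comments are placeholders made up for illustration; only `ceph pg <pgid> query` and `ceph pg dump_stuck` are actual CLI commands.

```shell
#!/bin/sh
# Query each PG id given as an argument.
# CEPH is a hypothetical override (e.g. for a dry run); it defaults to the ceph CLI.
query_pgs() {
    for pgid in "$@"; do
        "${CEPH:-ceph}" pg "$pgid" query
    done
}

# Intended use on a cluster node (PG ids are placeholders):
#   ceph pg dump_stuck inactive     # list the stuck PGs
#   query_pgs 1.2f 1.30             # inspect the ones of interest
```

In the JSON that the query prints, the "recovery_state" section is usually the place to look for what the PG is waiting on.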

Also: never do min_size 1.

Paul

2018-05-17 15:48 GMT+02:00 Kevin Olbrich <ko@xxxxxxx>:
I was able to obtain another NVMe to get the HDDs in node1004 into the cluster. The number of disks (all 1TB) is now balanced between racks, but some PGs are still inactive:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5167 GB used, 14133 GB / 19300 GB avail
    pgs:     1.562% pgs not active
             1183/1309952 objects degraded (0.090%)
             199660/1309952 objects misplaced (15.242%)
             1072 active+clean
             405  active+remapped+backfill_wait
             35   active+remapped+backfilling
             21   activating+remapped
             3    activating+undersized+degraded+remapped

ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1       18.85289 root default
-16       18.85289     datacenter dc01                                   
-19       18.85289         pod dc01-agg01                                
-10        8.98700             rack dc01-rack02                          
 -4        4.03899                 host node1001                         
  0   hdd  0.90999                     osd.0         up  1.00000 1.00000 
  1   hdd  0.90999                     osd.1         up  1.00000 1.00000 
  5   hdd  0.90999                     osd.5         up  1.00000 1.00000 
  2   ssd  0.43700                     osd.2         up  1.00000 1.00000 
  3   ssd  0.43700                     osd.3         up  1.00000 1.00000 
  4   ssd  0.43700                     osd.4         up  1.00000 1.00000 
 -7        4.94899                 host node1002                         
  9   hdd  0.90999                     osd.9         up  1.00000 1.00000 
 10   hdd  0.90999                     osd.10        up  1.00000 1.00000 
 11   hdd  0.90999                     osd.11        up  1.00000 1.00000 
 12   hdd  0.90999                     osd.12        up  1.00000 1.00000 
  6   ssd  0.43700                     osd.6         up  1.00000 1.00000 
  7   ssd  0.43700                     osd.7         up  1.00000 1.00000 
  8   ssd  0.43700                     osd.8         up  1.00000 1.00000
-11        9.86589             rack dc01-rack03
-22        5.38794                 host node1003                         
 17   hdd  0.90999                     osd.17        up  1.00000 1.00000 
 18   hdd  0.90999                     osd.18        up  1.00000 1.00000 
 24   hdd  0.90999                     osd.24        up  1.00000 1.00000 
 26   hdd  0.90999                     osd.26        up  1.00000 1.00000 
 13   ssd  0.43700                     osd.13        up  1.00000 1.00000 
 14   ssd  0.43700                     osd.14        up  1.00000 1.00000 
 15   ssd  0.43700                     osd.15        up  1.00000 1.00000 
 16   ssd  0.43700                     osd.16        up  1.00000 1.00000
-25        4.47795                 host node1004
 23   hdd  0.90999                     osd.23        up  1.00000 1.00000 
 25   hdd  0.90999                     osd.25        up  1.00000 1.00000 
 27   hdd  0.90999                     osd.27        up  1.00000 1.00000 
 19   ssd  0.43700                     osd.19        up  1.00000 1.00000 
 20   ssd  0.43700                     osd.20        up  1.00000 1.00000 
 21   ssd  0.43700                     osd.21        up  1.00000 1.00000 
 22   ssd  0.43700                     osd.22        up  1.00000 1.00000

Pools are size 2, min_size 1 during setup.

The number of PGs in the activating state seems to be related to the weight of the OSDs, but why do they fail to proceed to active+clean or active+remapped?

Kind regards,
Kevin

2018-05-17 14:05 GMT+02:00 Kevin Olbrich <ko@xxxxxxx>:
Ok, I just waited some time but I still got some "activating" issues:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5194 GB used, 11312 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             195386/1309948 objects misplaced (14.916%)
             1147 active+clean
             235  active+remapped+backfill_wait
             107  activating+remapped
             32   active+remapped+backfilling
             15   activating+undersized+degraded+remapped

I set these settings during runtime:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Sure, mon_max_pg_per_osd is oversized but this is just temporary. Calculated PGs per OSD is 200.
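As a rough sketch of what these values imply: per the Ceph documentation for osd_max_pg_per_osd_hard_ratio, an OSD stops creating new PGs once the number it serves exceeds mon_max_pg_per_osd multiplied by that ratio. The helper name `hard_pg_limit` is made up for illustration, and the 200 in the second call is the per-OSD threshold mentioned elsewhere in this thread, assumed here to be the mon_max_pg_per_osd value in effect before it was raised.

```shell
#!/bin/sh
# Hard per-OSD PG limit:
# mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio.
hard_pg_limit() {
    echo $(( $1 * $2 ))
}

# Values injected above.
echo "mon_max_pg_per_osd=800, ratio=32: $(hard_pg_limit 800 32) PGs per OSD"
# Assumed situation when only the ratio had been raised.
echo "mon_max_pg_per_osd=200, ratio=32: $(hard_pg_limit 200 32) PGs per OSD"
```

Either product is far above the roughly 200 PGs per OSD calculated for this cluster, so if the injected values are really in effect on the OSDs, this limit should no longer be what holds the PGs in activating.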

I searched the net and the bug tracker. Most posts suggest osd_max_pg_per_osd_hard_ratio = 32 to fix this issue, but this time I got more stuck PGs.

Any more hints?

Kind regards,
Kevin

2018-05-17 13:37 GMT+02:00 Kevin Olbrich <ko@xxxxxxx>:
PS: The cluster is currently size 2. I used PGCalc on the Ceph website, which by default places 200 PGs on each OSD.
I read about the protection in the docs and later realized that I should have targeted only 100 PGs per OSD.

2018-05-17 13:35 GMT+02:00 Kevin Olbrich <ko@xxxxxxx>:
Hi!

Thanks for your quick reply.
Before I read your mail, I applied the following setting to my OSDs:
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Status is now:
  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5211 GB used, 11295 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             252327/1309948 objects misplaced (19.262%)
             1030 active+clean
             351  active+remapped+backfill_wait
             107  activating+remapped
             33   active+remapped+backfilling
             15   activating+undersized+degraded+remapped

A little bit better but still some non-active PGs.
I will investigate your other hints!

Thanks
Kevin

2018-05-17 13:30 GMT+02:00 Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de>:
Hi,

On 05/17/2018 01:09 PM, Kevin Olbrich wrote:

Hi!

Today I added some new OSDs (nearly doubled) to my luminous cluster.
I then changed pg(p)_num from 256 to 1024 for that pool because it was
complaining about too few PGs. (I later realized this should have been done
in small steps.)

This is the current status:

     health: HEALTH_ERR
             336568/1307562 objects misplaced (25.740%)
             Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
             Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
             107 slow requests are blocked > 32 sec
             218 stuck requests are blocked > 4096 sec

   data:
     pools:   2 pools, 1536 pgs
     objects: 638k objects, 2549 GB
     usage:   5210 GB used, 11295 GB / 16506 GB avail
     pgs:     0.195% pgs unknown
              8.138% pgs not active
              6985/1307562 objects degraded (0.534%)
              336568/1307562 objects misplaced (25.740%)
              855 active+clean
              517 active+remapped+backfill_wait
              107 activating+remapped
              31  active+remapped+backfilling
              15  activating+undersized+degraded+remapped
              4   active+undersized+degraded+remapped+backfilling
              3   unknown
              3   peering
              1   stale+active+clean

You need to resolve the unknown/peering/activating PGs first. You have 1536 PGs; assuming replication size 3, this makes 4608 PG copies. Given 25 OSDs and the heterogeneous host sizes, I assume that some OSDs hold more than 200 PGs. There's a threshold for the number of PGs per OSD; reaching this threshold keeps the OSDs from accepting new PGs.
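The arithmetic behind that estimate, as a small sketch. The function names are made up for illustration, and the per-OSD figure is a plain average that ignores CRUSH weights, device classes and uneven host sizes, so individual OSDs can sit well above it.

```shell
#!/bin/sh
# Total PG copies = number of PGs * pool replication size.
pg_copies() {
    echo $(( $1 * $2 ))
}

# Average PG copies per OSD (integer division; ignores CRUSH weights).
avg_pgs_per_osd() {
    echo $(( $1 * $2 / $3 ))
}

echo "size 3: $(pg_copies 1536 3) copies, about $(avg_pgs_per_osd 1536 3 25) per OSD"
echo "size 2: $(pg_copies 1536 2) copies, about $(avg_pgs_per_osd 1536 2 25) per OSD"
```

If a pool maps only to a subset of the OSDs (for example only the hdd class), its copies are spread over fewer OSDs and the per-OSD count on those rises accordingly.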

Try to increase the threshold (mon_max_pg_per_osd / max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure about the exact one, consult the documentation) to allow more PGs on the OSDs. If this is the cause of the problem, the peering and activating states should be resolved within a short time.

You can also check the number of PGs per OSD with 'ceph osd df'; the last column is the current number of PGs.
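One way to filter that output, sketched under the assumption that the PG count really is the last column of each OSD row, as described above. Other releases may append further columns, so check the header line first. The function name `osds_over_limit` and the default limit of 200 are illustrative choices.

```shell
#!/bin/sh
# Print OSDs whose PG count (assumed to be the last column of
# 'ceph osd df') exceeds a limit.
# Reads the 'ceph osd df' text from stdin; $1 is the limit (default 200).
osds_over_limit() {
    awk -v limit="${1:-200}" '
        # Data rows start with a numeric OSD id; header and summary rows do not.
        $1 ~ /^[0-9]+$/ && $NF + 0 > limit { print "osd." $1, $NF }
    '
}

# Intended use on a cluster node:
#   ceph osd df | osds_over_limit 200
```

An empty result means no OSD row had a last-column value above the limit, which is also what happens if the last column is not the PG count.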

OSD tree:

ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1       16.12177 root default
-16       16.12177     datacenter dc01
-19       16.12177         pod dc01-agg01
-10        8.98700             rack dc01-rack02
 -4        4.03899                 host node1001
  0   hdd  0.90999                     osd.0         up  1.00000 1.00000
  1   hdd  0.90999                     osd.1         up  1.00000 1.00000
  5   hdd  0.90999                     osd.5         up  1.00000 1.00000
  2   ssd  0.43700                     osd.2         up  1.00000 1.00000
  3   ssd  0.43700                     osd.3         up  1.00000 1.00000
  4   ssd  0.43700                     osd.4         up  1.00000 1.00000
 -7        4.94899                 host node1002
  9   hdd  0.90999                     osd.9         up  1.00000 1.00000
 10   hdd  0.90999                     osd.10        up  1.00000 1.00000
 11   hdd  0.90999                     osd.11        up  1.00000 1.00000
 12   hdd  0.90999                     osd.12        up  1.00000 1.00000
  6   ssd  0.43700                     osd.6         up  1.00000 1.00000
  7   ssd  0.43700                     osd.7         up  1.00000 1.00000
  8   ssd  0.43700                     osd.8         up  1.00000 1.00000
-11        7.13477             rack dc01-rack03
-22        5.38678                 host node1003
 17   hdd  0.90970                     osd.17        up  1.00000 1.00000
 18   hdd  0.90970                     osd.18        up  1.00000 1.00000
 24   hdd  0.90970                     osd.24        up  1.00000 1.00000
 26   hdd  0.90970                     osd.26        up  1.00000 1.00000
 13   ssd  0.43700                     osd.13        up  1.00000 1.00000
 14   ssd  0.43700                     osd.14        up  1.00000 1.00000
 15   ssd  0.43700                     osd.15        up  1.00000 1.00000
 16   ssd  0.43700                     osd.16        up  1.00000 1.00000
-25        1.74799                 host node1004
 19   ssd  0.43700                     osd.19        up  1.00000 1.00000
 20   ssd  0.43700                     osd.20        up  1.00000 1.00000
 21   ssd  0.43700                     osd.21        up  1.00000 1.00000
 22   ssd  0.43700                     osd.22        up  1.00000 1.00000

Crush rule is set to chooseleaf rack and (temporary!) to size 2.
Why are PGs stuck in peering and activating?
"ceph df" shows that only 1.5 TB are used on the pool, residing on the HDDs,
which would perfectly fit the crush rule....(?)

Size 2 within the crush rule or size 2 for the two pools?

Regards,
Burkhard

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
