{
"ops": [
{
"description": "pg_notify((query:213266 sent:213266 2.46ecs8( v 58049'556 (0'0,58049'556] local-lis/les=60851/60862 n=256 ec=1008/1008 lis/c 60828/60181 les/c/f 60833/60182/0 212117/212129/212129) 8->0)=([60181,212128] intervals=([60411,60413] acting 45(3),134(2),149(4),152(6),167(0),189(8),271(7),445(1),471(5)),([60822,60827] acting 45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1),471(5)),([60828,60839] acting 6(5),45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1)),([60851,60865] acting 6(5),45(3),85(2),149(4),152(6),167(0),189(8),271(7),445(1)),([60880,60891] acting 6(5),45(3),85(2),113(6),149(4),167(0),370(8),445(1),538(7)),([60933,61522] acting 6(5),45(3),86(0),113(6),122(2),149(4),370(8),379(7),445(1)),([77024,80134] acting 45(3),86(0),113(6),134(2),149(4),189(8),271(7),471(5)),([181819,186559] acting 6(5),86(0),113(6),134(2),149(4),189(8),271(7))) epoch 213266)",
"initiated_at": "2019-01-12 18:14:17.922479",
"age": 301.306727,
"duration": 301.306757,
"type_data": {
"flag_point": "started",
"events": [
{
"time": "2019-01-12 18:14:17.922479",
"event": "initiated"
},
{
"time": "2019-01-12 18:19:13.521073",
"event": "started"
}
]
}
}
],
"num_ops": 1
}
root@data5:/# ceph daemon osd.86 ops
{
"ops": [
{
"description": "pg_notify((query:213291 sent:213291 2.46ecs8( v 58049'556 (0'0,58049'556] local-lis/les=60851/60862 n=256 ec=1008/1008 lis/c 60828/60181 les/c/f 60833/60182/0 212117/212129/212129) 8->0)=([60181,212128] intervals=([60411,60413] acting 45(3),134(2),149(4),152(6),167(0),189(8),271(7),445(1),471(5)),([60822,60827] acting 45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1),471(5)),([60828,60839] acting 6(5),45(3),149(4),152(6),167(0),189(8),271(7),326(2),445(1)),([60851,60865] acting 6(5),45(3),85(2),149(4),152(6),167(0),189(8),271(7),445(1)),([60880,60891] acting 6(5),45(3),85(2),113(6),149(4),167(0),370(8),445(1),538(7)),([60933,61522] acting 6(5),45(3),86(0),113(6),122(2),149(4),370(8),379(7),445(1)),([77024,80134] acting 45(3),86(0),113(6),134(2),149(4),189(8),271(7),471(5)),([181819,186559] acting 6(5),86(0),113(6),134(2),149(4),189(8),271(7))) epoch 213291)",
"initiated_at": "2019-01-12 18:19:23.596664",
"age": 231.202152,
"duration": 231.202176,
"type_data": {
"flag_point": "started",
"events": [
{
"time": "2019-01-12 18:19:23.596664",
"event": "initiated"
},
{
"time": "2019-01-12 18:23:10.334700",
"event": "started"
}
]
}
}
],
"num_ops": 1
}
Thanks
Chandra
This seems like a case of the OSDs having accumulated lots of new osd maps.
What might help is also setting the noup and nodown flags and waiting for
the OSDs to start up. Use the "status" daemon command to check the
current OSD state even if it can't come up in the cluster map due to
noup (somewhere in its output it also shows whether it's behind on osd
maps, IIRC). Once they are all running, you should be able to take them
up again.
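A rough sketch of that sequence (osd.86 here is only an example ID; the
"daemon" commands go through the local admin socket, so run them on the
host, or inside the container, that carries each OSD):

  ceph osd set noup            # restarted OSDs won't be marked up yet
  ceph osd set nodown          # keep flapping OSDs from being marked down
  # restart the OSDs, then watch each one catch up on maps:
  ceph daemon osd.86 status    # boot state plus oldest_map/newest_map
  ceph osd dump | head -1      # current cluster osdmap epoch to compare against
  # once newest_map has caught up and the daemons look healthy:
  ceph osd unset noup
  ceph osd unset nodown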
This behavior got better with recent Mimic versions -- so I'd also
recommend upgrading *after* everything is back to healthy.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Sat, Jan 12, 2019 at 3:56 AM Subhachandra Chandra
<schandra@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> We have a cluster with 9 hosts and 540 HDDs using Bluestore and containerized OSDs running Luminous 12.2.4. While trying to add new nodes, the cluster collapsed because it could not keep up with establishing enough TCP connections. We tuned sysctl to handle more connections and to recycle TIME_WAIT sockets faster. Currently, as we try to restart the cluster by bringing up a few OSDs at a time, some of the OSDs get very busy after around 360 of them come up. iostat shows that the busy OSDs are constantly reading from the Bluestore partition. The number of busy OSDs per node varies, and norecover is set with no active clients. The OSD logs don't show anything other than cephx: verify_authorizer errors, which happen on both busy and idle OSDs and don't seem to be related to the drive reads.
>
> How can we figure out why the OSDs are busy reading from the drives? If it is some kind of recovery, is there a way to track progress? The output of ceph -s and logs from a busy and an idle OSD are copied below.
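A rough sketch of where one might look for that, via the admin socket on
each OSD's host (osd.86 simply stands in for any busy OSD):

  ceph daemon osd.86 dump_ops_in_flight   # ops the OSD is working on right now
  ceph daemon osd.86 dump_historic_ops    # recently completed slow ops
  ceph daemon osd.86 status               # boot state and oldest_map/newest_map
  ceph daemon osd.86 perf dump            # internal performance counters
  iostat -x 5                             # confirm which devices the reads hit

For recovery progress itself, "ceph -s" shows degraded/misplaced object
counts and "ceph pg dump pgs_brief" shows per-PG state.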
>
> Thanks
> Chandra
>
> Uptime stats with load averages show the variance across the 9 older nodes.
>
> 02:43:44 up 19:21, 0 users, load average: 0.88, 1.03, 1.06
>
> 02:43:44 up 7:58, 0 users, load average: 16.91, 13.49, 12.43
>
> 02:43:44 up 1 day, 14 min, 0 users, load average: 7.67, 6.70, 6.35
>
> 02:43:45 up 7:01, 0 users, load average: 84.40, 84.20, 83.73
>
> 02:43:45 up 6:40, 1 user, load average: 17.08, 17.40, 20.05
>
> 02:43:45 up 19:46, 0 users, load average: 15.58, 11.93, 11.44
>
> 02:43:45 up 20:39, 0 users, load average: 7.88, 6.50, 5.69
>
> 02:43:46 up 1 day, 1:20, 0 users, load average: 5.03, 3.81, 3.49
>
> 02:43:46 up 1 day, 58 min, 0 users, load average: 0.62, 1.00, 1.38
>
>
> Ceph Config
>
> --------------
>
> [global]
>
> cluster network = 192.168.13.0/24
>
> fsid = <>
>
> mon host = 172.16.13.101,172.16.13.102,172.16.13.103
>
> mon initial members = ctrl1,ctrl2,ctrl3
>
> mon_max_pg_per_osd = 750
>
> mon_osd_backfillfull_ratio = 0.92
>
> mon_osd_down_out_interval = 900
>
> mon_osd_full_ratio = 0.95
>
> mon_osd_nearfull_ratio = 0.85
>
> osd_crush_chooseleaf_type = 3
>
> osd_heartbeat_grace = 900
>
> mon_osd_laggy_max_interval = 900
>
> osd_max_pg_per_osd_hard_ratio = 1.0
>
> public network = 172.16.13.0/24
>
>
> [mon]
>
> mon_compact_on_start = true
>
>
> [osd]
>
> osd_deep_scrub_interval = 2419200
>
> osd_deep_scrub_stride = 4194304
>
> osd_max_backfills = 10
>
> osd_max_object_size = 276824064
>
> osd_max_scrubs = 1
>
> osd_max_write_size = 264
>
> osd_pool_erasure_code_stripe_unit = 2097152
>
> osd_recovery_max_active = 10
>
> osd_heartbeat_interval = 15
>
>
> Data nodes Sysctl params
>
> -----------------------------
>
> fs.aio-max-nr=1048576
>
> kernel.pid_max=4194303
>
> kernel.threads-max=2097152
>
> net.core.netdev_max_backlog=65536
>
> net.core.optmem_max=1048576
>
> net.core.rmem_max=8388608
>
> net.core.rmem_default=8388608
>
> net.core.somaxconn=2048
>
> net.core.wmem_max=8388608
>
> net.core.wmem_default=8388608
>
> vm.max_map_count=524288
>
> vm.min_free_kbytes=262144
>
>
> net.ipv4.tcp_tw_reuse=1
>
> net.ipv4.tcp_max_syn_backlog=16384
>
> net.ipv4.tcp_fin_timeout=10
>
> net.ipv4.tcp_slow_start_after_idle=0
>
>
>
> Ceph -s output
>
> ---------------
>
>
> root@ctrl1:/# ceph -s
>
> cluster:
>
> id: 06126476-6deb-4baa-b7ca-50f5ccfacb68
>
> health: HEALTH_ERR
>
> noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
>
> 704 osds down
>
> 9 hosts (540 osds) down
>
> 71 nearfull osd(s)
>
> 2 pool(s) nearfull
>
> 780664/74163111 objects misplaced (1.053%)
>
> 7724/8242239 objects unfound (0.094%)
>
> 396 PGs pending on creation
>
> Reduced data availability: 32597 pgs inactive, 29764 pgs down, 820 pgs peering, 74 pgs incomplete, 1 pg stale
>
> Degraded data redundancy: 679158/74163111 objects degraded (0.916%), 1250 pgs degraded, 1106 pgs undersized
>
> 33 slow requests are blocked > 32 sec
>
> 9 stuck requests are blocked > 4096 sec
>
> mons ctrl1,ctrl2,ctrl3 are using a lot of disk space
>
>
>
> services:
>
> mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3
>
> mgr: ctrl1(active), standbys: ctrl2, ctrl3
>
> osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs
>
> flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
>
>
>
> data:
>
> pools: 2 pools, 33280 pgs
>
> objects: 8049k objects, 2073 TB
>
> usage: 2277 TB used, 458 TB / 2736 TB avail
>
> pgs: 3.585% pgs unknown
>
> 94.363% pgs not active
>
> 679158/74163111 objects degraded (0.916%)
>
> 780664/74163111 objects misplaced (1.053%)
>
> 7724/8242239 objects unfound (0.094%)
>
> 29754 down
>
> 1193 unknown
>
> 535 peering
>
> 496 activating+undersized+degraded+remapped
>
> 284 remapped+peering
>
> 258 active+undersized+degraded+remapped
>
> 161 activating+degraded+remapped
>
> 143 active+recovering+undersized+degraded+remapped
>
> 89 active+undersized+degraded
>
> 76 active+clean+remapped
>
> 71 incomplete
>
> 48 active+undersized+remapped
>
> 46 undersized+degraded+peered
>
> 34 active+recovering+degraded+remapped
>
> 26 active+clean
>
> 21 activating+remapped
>
> 13 activating+undersized+degraded
>
> 10 down+remapped
>
> 9 active+recovery_wait+undersized+degraded+remapped
>
> 4 activating
>
> 3 remapped+incomplete
>
> 2 activating+undersized+remapped
>
> 1 stale+peering
>
> 1 undersized+peered
>
> 1 undersized+remapped+peered
>
> 1 activating+degraded
>
>
>
> root@ctrl1:/# ceph version
>
> ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
>
>
> osd.86 - busy
> -------------------
>
> 2019-01-12 02:30:02.582363 7f00ea0f4700 0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
>
> 2019-01-12 02:30:02.582383 7f00ea0f4700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6868/770039 conn(0x55c4a8d89000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
> 2019-01-12 02:30:04.005541 7f00ea8f5700 0 auth: could not find secret_id=7544
>
> 2019-01-12 02:30:04.005554 7f00ea8f5700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
>
> 2019-01-12 02:30:04.005557 7f00ea8f5700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.3:6836/405613 conn(0x55c4f3617800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
> 2019-01-12 02:30:07.910816 7f00e98f3700 0 auth: could not find secret_id=7550
>
> 2019-01-12 02:30:07.910864 7f00e98f3700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7550
>
> 2019-01-12 02:30:07.910884 7f00e98f3700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6824/767660 conn(0x55c4ec462800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
> 2019-01-12 02:30:16.982636 7f00ea8f5700 0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
>
> 2019-01-12 02:30:16.982640 7f00ea8f5700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.6:6816/349322 conn(0x55c4aee6f000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
>
> osd.132 - idle
>
> ---------------
>
> 2019-01-12 02:31:38.370478 7fa508450700 0 auth: could not find secret_id=7551
>
> 2019-01-12 02:31:38.370487 7fa508450700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7551
>
> 2019-01-12 02:31:38.370489 7fa508450700 0 -- 192.168.13.5:6854/356774 >> 192.168.13.8:6872/1201589 conn(0x563b3e46f000 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
> 2019-01-12 02:31:41.121603 7fa509452700 0 auth: could not find secret_id=7544
>
> 2019-01-12 02:31:41.121672 7fa509452700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
>
> 2019-01-12 02:31:41.121707 7fa509452700 0 -- 192.168.13.5:6854/356774 >> 192.168.13.9:6808/515991 conn(0x563b53997800 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
>
>
This email message, including attachments, may contain private, proprietary, or privileged information and is the confidential information and/or property of GRAIL, Inc., and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com