Re: Interpreting reason for blocked request

On Sat, May 12, 2018 at 3:22 PM Bryan Henderson <bryanh@xxxxxxxxxxxxxxxx> wrote:
I recently had some requests blocked indefinitely; I eventually cleared it
up by recycling the OSDs, but I'd like some help interpreting the log
messages that supposedly give a clue as to what caused the blockage:

(I reformatted for easy email reading)

2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
  cluster [WRN] 7 slow requests, 2 included below;
  oldest blocked for > 961.596517 secs

2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
  cluster [WRN] slow request 961.557151 seconds old,
  received at 2018-05-03 01:40:33.689191:
    pg_query(4.f epoch 490) currently wait for new map

2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
  cluster [WRN] slow request 961.556655 seconds old,
  received at 2018-05-03 01:40:33.689686:
    pg_query(1.d epoch 490) currently wait for new map

2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
  cluster [WRN] 2 slow requests, 2 included below;
  oldest blocked for > 960.677480 secs

2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
  cluster [WRN] slow request 960.677480 seconds old,
  received at 2018-05-03 01:40:31.238642:
    osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
      RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
  cluster [WRN] slow request 960.663817 seconds old,
  received at 2018-05-03 01:40:31.252305:
    osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
      RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

"wait for new map": what map would that be, and where is the OSD expecting it
to come from?

The OSD is waiting for a new OSD map, which it will get from one of its peers or from the monitor (by request). This tends to happen when the client has seen a newer version of the map than the OSD has.
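If you want to check this the next time it happens, you can compare the cluster's current map epoch against what the OSD itself has; this assumes you can reach the OSD admin sockets on the OSD hosts:

    # current cluster epoch, as the monitors see it (first line of the dump)
    ceph osd dump | head -1

    # the range of map epochs this OSD holds; run on the OSD's own host
    ceph daemon osd.0 status

If "newest_map" in the daemon's output is behind the cluster epoch, the OSD is still catching up on maps.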
 

"reached_pg"?

The request has been delivered into a queue for the PG to process, but it hasn't been picked up and worked on yet. Nothing about the request itself is blocking it at this point; some other kind of back pressure is going on. Either the PG is working on or waiting for another request that prevents it from picking up new ones, a throttler is preventing it from taking on new work, or there's no CPU time available for some reason.
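You can usually see what a request is stuck behind by dumping the in-flight ops on the OSD while the requests are still blocked (again via the admin socket on the OSD's host):

    # blocked/in-flight ops, with their current state and event history
    ceph daemon osd.1 dump_ops_in_flight

    # recently completed slow ops, if the blockage has already cleared
    ceph daemon osd.1 dump_historic_ops

Each op in that output carries the same "currently ..." state you see in the cluster log, plus timestamps for each stage it has passed through.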

From what you've shown here, it looks like either your cluster is dramatically overloaded, or something is going on with the MDS tables that is killing the OSD whenever it tries to access them. I think there were some past issues with those tables growing too large in older releases?
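If you suspect the tables, it's worth checking how big those objects actually are. Something like the following, assuming your CephFS metadata pool is named "metadata" (substitute whatever pool 2 actually is in your cluster; "ceph osd lspools" will tell you):

    # map pool ids to pool names
    ceph osd lspools

    # object size and mtime of the MDS tables from your log excerpt
    rados -p metadata stat mds0_inotable
    rados -p metadata stat mds_snaptable

If those objects are enormous, that would point at the table problem rather than general overload.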
-Greg
 

You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
pair.

Thanks.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
