slow requests, hunting for new mon

Chris Dunlop <chris@xxxxxxxxxxxx> · Tue, 12 Feb 2013 18:28:15 +1100

Hi,

What are likely causes for "slow requests" and "monclient: hunting for new
mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.000000000120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.000000000120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.000000000122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.000000000122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

I have a ceph 0.56.2 system with 3 boxes: two boxes each have a mon and a
single osd, and the 3rd box has just a mon - see ceph.conf below. The boxes
are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4
is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g.
190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in
waves, with 30-50 of the associated individual 'slow request' messages
coming in just after the initial 'slow requests' message. Each box saw
around 30 waves yesterday, with no obvious time correlation between the two.
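
For anyone wanting to produce the same per-day counts, here is a minimal
sketch of how they could be tallied from an osd log. It assumes the line
layout seen in the excerpts above; the function name summarise_osd_log and
the regular expressions are my own for illustration, not anything shipped
with ceph, and other versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns are derived only from the log excerpts quoted in this message.
HUNTING = re.compile(r"monclient: hunting for new mon")
WAVE = re.compile(r"log \[WRN\] : (\d+) slow requests, (\d+) included below")
SINGLE = re.compile(r"log \[WRN\] : slow request ([\d.]+) seconds old")


def summarise_osd_log(lines):
    """Count hunting messages, slow-request waves and individual
    slow-request lines, keyed by (day, kind)."""
    counts = Counter()
    for line in lines:
        day = line[:10]  # leading "YYYY-MM-DD" of the timestamp
        if HUNTING.search(line):
            counts[(day, "hunting")] += 1
        elif WAVE.search(line):
            counts[(day, "waves")] += 1
        elif SINGLE.search(line):
            counts[(day, "slow_requests")] += 1
    return counts


# Three lines copied from the excerpt above, used as a small self-check.
sample = [
    "2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon",
    "2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, "
    "6 included below; oldest blocked for > 30.383883 secs",
    "2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request "
    "30.383883 seconds old, received at 2013-02-12 16:27:15.508374: "
    "osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 "
    "[write 921600~4096] 2.981cf6bc) v4 currently no flag points reached",
]

if __name__ == "__main__":
    for (day, kind), n in sorted(summarise_osd_log(sample).items()):
        print(day, kind, n)
```

To run it over a real log, pass an open file instead of the sample list,
e.g. summarise_osd_log(open("/var/log/ceph/ceph-osd.1.log")). That path is
an assumption based on the default log location and may differ on your
system.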

The osd disks are generally cruising along at around 400-800 KB/s, with
occasional spikes up to 1.2-2 MB/s, with a mostly write load.

The gigabit network interfaces (2 per box for public vs cluster) are
also cruising, with the busiest interface at less than 5% utilisation.

CPU utilisation is likewise small, with 90% or more idle and less than 3%
wait for io. There's plenty of free memory, 19 GB on b4 and 6 GB on b5.

Cheers,

Chris

----
ceph.conf
----
[global]
        auth supported = cephx
[mon]
[mon.b2]
        host = b2
        mon addr = 10.200.63.130:6789
[mon.b4]
        host = b4
        mon addr = 10.200.63.132:6789
[mon.b5]
        host = b5
        mon addr = 10.200.63.133:6789
[osd]
        osd journal size = 1000
        public network = 10.200.63.0/24
        cluster network = 192.168.254.0/24
[osd.0]
        host = b4
        public addr = 10.200.63.132
        cluster addr = 192.168.254.132
[osd.1]
        host = b5
        public addr = 10.200.63.133
        cluster addr = 192.168.254.133
----
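
As a quick sanity check on the config above, the following sketch verifies
that each osd's "public addr" and "cluster addr" fall inside the declared
"public network" and "cluster network". It uses only the Python standard
library (configparser, ipaddress); the helper name check_osd_addrs is
hypothetical, and treating ceph.conf as a plain INI file is an assumption
that holds for this simple config but may not for every ceph.conf.

```python
import configparser
import ipaddress


def check_osd_addrs(conf_text):
    """Return (section, key, addr, ok) tuples for every [osd.N] address,
    where ok is True if the address lies in the matching network."""
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(conf_text)
    # Networks are declared once in the shared [osd] section.
    nets = {
        "public addr": ipaddress.ip_network(cp["osd"]["public network"]),
        "cluster addr": ipaddress.ip_network(cp["osd"]["cluster network"]),
    }
    results = []
    for section in cp.sections():
        if not section.startswith("osd."):
            continue
        for key, net in nets.items():
            addr = cp[section].get(key)
            if addr is not None:
                ok = ipaddress.ip_address(addr) in net
                results.append((section, key, addr, ok))
    return results
```

Calling check_osd_addrs(open("/etc/ceph/ceph.conf").read()) (the path is
assumed to be the usual default) returns one tuple per osd address; any
tuple whose last element is False points at an address outside its network.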