Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

On 02/10/2012 05:05 PM, sridhar basam wrote:
>>  But the server never ACKed that packet.  Too busy?
>>
>>  I was collecting vmstat data during the run; here's the important bits:
>>
>>  Fri Feb 10 11:56:51 MST 2012
>>  vmstat -w 8 16
>>  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
>>    r  b       swpd       free       buff      cache   si   so   bi      bo     in     cs  us sy  id wa st
>>   13 10          0     250272        944   37859080    0    0    7    5346   1098    444   2  5  92  1  0
>>   88  8          0     260472        944   36728776    0    0    0 1329838 257602  68861  19 73   5  4  0
>>  100 10          0     241952        944   36066536    0    0    0 1635891 340724  85570  22 68   6  4  0
>>  105  9          0     250288        944   34750820    0    0    0 1584816 433223 111462  21 73   4  3  0
>>  126  3          0     259908        944   33841696    0    0    0  749648 225707  86716   9 83   4  3  0
>>  157  2          0     245032        944   31572536    0    0    0  736841 252406  99083   9 81   5  5  0
>>   45 17          0     246720        944   28877640    0    0    1  755085 282177 116551   8 77   9  5  0
> Holy crap! That might explain why you aren't seeing anything. You are
> writing out over 1.6 million blocks/sec, and that's averaged over an
> 8-second interval. I bet the missed ACKs happen while this is going on.
> What sort of I/O load is going through this system during those times?
> What sort of filesystem and Linux system are these OSDs on?

Dual socket Nehalem EP @ 3 GHz, 24 ea. 7200RPM SAS drives w/ 64 MB cache,
3 LSI SAS HBAs w/8 drives per HBA, btrfs, 3.2.0 kernel.  Each OSD
has a ceph journal and a ceph data store on a single drive.

I'm running 24 OSDs on such a box; all that write load is the result
of dd from 166 Linux ceph clients.
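
As a quick sanity check on that 1.6 million blocks/sec figure, here's a
back-of-the-envelope sketch in Python (it assumes vmstat's usual 1024-byte
unit for bi/bo, and the 24-drive layout described above):

    # Rough conversion of the peak "bo" sample above into bandwidth.
    # Assumes vmstat counts bi/bo in 1024-byte blocks (the usual default).
    BLOCK_SIZE = 1024          # bytes per vmstat block (assumption)
    OSDS_PER_BOX = 24          # OSDs (and drives) per server, as above

    bo_blocks_per_sec = 1635891    # peak "bo" sample from the vmstat output

    aggregate_mb_s = bo_blocks_per_sec * BLOCK_SIZE / 1e6
    per_drive_mb_s = aggregate_mb_s / OSDS_PER_BOX

    print("aggregate: %.0f MB/s" % aggregate_mb_s)   # ~1675 MB/s
    print("per drive: %.0f MB/s" % per_drive_mb_s)   # ~70 MB/s

And since each drive holds both a ceph journal and a ceph data store, each
spindle presumably sees roughly twice that in raw writes.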

FWIW, I've seen these boxes sustain > 2 GB/s for 60 sec or so under
this load, when I have TSO/GSO/GRO turned on, and am writing to
a freshly created ceph filesystem.
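
In case it helps anyone reproduce this, here's a minimal sketch of how the
offload state can be checked (the interface name "eth0" is a placeholder
for whatever NIC carries the ceph traffic, and it assumes ethtool is
installed; the exact feature-name format varies between ethtool versions):

    # Print the TSO/GSO/GRO state reported by "ethtool -k" for one NIC.
    # "eth0" is a placeholder interface name; substitute the real one.
    import subprocess

    IFACE = "eth0"

    out = subprocess.run(["ethtool", "-k", IFACE],
                         capture_output=True, text=True, check=True).stdout

    for line in out.splitlines():
        # Older ethtool prints "tcp segmentation offload: on";
        # newer versions hyphenate the feature names.
        key = line.split(":")[0].strip().replace("-", " ")
        if key in ("tcp segmentation offload",
                   "generic segmentation offload",
                   "generic receive offload"):
            print(line.strip())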

That lasts until my OSDs get stalled reading from a socket, as
documented by those packet traces I posted.

If you compare the timestamps on the retransmits to the times
that vmstat is dumping reports, at least some of the retransmits
hit the system when it is ~80% idle.
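
If anyone wants to do that comparison programmatically, here's a rough
sketch of the idea (the retransmit timestamps below are placeholders, not
values from my trace; the idle figures are illustrative, taken from the
vmstat output above):

    # Bucket TCP retransmit times into the 8-second vmstat sample windows
    # and show the idle% for the window each one landed in.
    from datetime import datetime, timedelta

    INTERVAL = timedelta(seconds=8)

    # (start of vmstat sample window, cpu idle %) -- illustrative values
    vmstat_samples = [
        (datetime(2012, 2, 10, 11, 56, 51), 92),
        (datetime(2012, 2, 10, 11, 56, 59), 5),
        (datetime(2012, 2, 10, 11, 57, 7), 6),
    ]

    # retransmit times pulled from the packet trace -- placeholders
    retransmits = [
        datetime(2012, 2, 10, 11, 56, 53),
        datetime(2012, 2, 10, 11, 57, 2),
    ]

    for t in retransmits:
        for start, idle in vmstat_samples:
            if start <= t < start + INTERVAL:
                print("%s: retransmit landed in a window with %d%% idle"
                      % (t.strftime("%H:%M:%S"), idle))
                break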

-- Jim



