On Thursday, 4 August 2011, Jens Axboe wrote:
> On 2011-08-04 10:51, Martin Steigerwald wrote:
> > On Wednesday, 3 August 2011, Martin Steigerwald wrote:
> >> On Wednesday, 3 August 2011, Martin Steigerwald wrote:
> >>> On Wednesday, 3 August 2011, you wrote:
> >>>> Martin Steigerwald <Martin@xxxxxxxxxxxx> writes:
> >> [...]
> >> 
> >>> Does using iodepth > 1 need ioengine=libaio? Let's see the manpage:
> >>> 
> >>> iodepth=int
> >>> 
> >>>     Number of I/O units to keep in flight against the
> >>>     file. Note that increasing iodepth beyond 1 will
> >>>     not affect synchronous ioengines (except for small
> >>>     degrees when verify_async is in use). Even async
> >>>     engines may impose OS restrictions causing the
> >>>     desired depth not to be achieved. This may happen
> >>>     on Linux when using libaio and not setting
> >>>     direct=1, since buffered IO is not async on that
> >>>     OS. Keep an eye on the IO depth distribution in
> >>>     the fio output to verify that the achieved depth
> >>>     is as expected. Default: 1.
> >>> 
> >>> Okay, yes, it does. I am starting to get the hang of it. It's a bit
> >>> puzzling to have two concepts of synchronous I/O around:
> >>> 
> >>> 1) synchronous system call interfaces aka fio I/O engine
> >>> 
> >>> 2) synchronous I/O requests aka O_SYNC
> >> 
> >> But isn't this a case for iodepth=1 if buffered I/O on Linux is
> >> synchronous? I bet most regular applications except some databases
> >> use buffered I/O.
> > 
> > Thanks a lot for your answers, Jens, Jeff, DongJin.
> > 
> > Now what about the above one?
> > 
> > In what cases is iodepth > 1 relevant, when Linux buffered I/O is
> > synchronous? For multiple threads or processes?
> 
> iodepth controls what depth fio operates at, not the OS. You are right
> in that with iodepth=1, for buffered writes you could be seeing a much
> higher depth on the device side.
> 
> So think of iodepth as how many IO units fio can have in flight,
> nothing else.

Ah, okay. So when using iodepth=64 and ioengine=libaio, fio issues 64 I/O
requests at once before it waits for any of them to complete. And as the
block layer completes I/O requests, fio fills the queue back up to 64.
Right?

Now when I have two jobs running at once with iodepth=64, will each
process submit 64 I/O requests before waiting, giving at most 128 I/O
requests in flight? Or will each process use 32? My bet is that iodepth
is per job, per process.
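For reference, something like this minimal job file is what I have in
mind - the paths, block size and file size are just arbitrary examples I
made up for illustration:

  [global]
  ; async submission with direct I/O, per the manpage note above
  ioengine=libaio
  direct=1
  rw=randread
  bs=4k
  size=1g
  iodepth=64

  [job1]
  filename=/tmp/fio-depth-test-1

  [job2]
  filename=/tmp/fio-depth-test-2

If iodepth is indeed per job, the I/O depth distribution in the fio
output should show each of the two jobs driving up to 64 requests, so up
to 128 in flight in total.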
> > One process / thread can only submit one I/O at a time with
> > synchronous system call I/O, but the function returns when the stuff
> > is in the page cache. So first, why can't Linux use iodepth > 1 when
> > there is lots of stuff in the page cache to be written out? That
> > should help the single process case.
> 
> Since the IO unit is done when the system call returns, you can never
> have more than the one in flight for a sync engine. So iodepth > 1
> makes no sense for a sync engine.

That makes perfect sense now that I understand the iodepth option relates
to what the fio processes do.

> > In the multiple process / thread case Linux gets several I/O requests
> > from multiple processes / threads, and thus iodepth > 1 does make
> > sense?
> 
> No.

Since each fio job doing synchronous system call I/O still submits one
I/O at a time...

> > Maybe it helps getting clear where in the stack iodepth is located.
> > Is it
> > 
> > process / thread
> > system call
> > page cache
> > block layer
> > iodepth
> > device driver
> > device
> > 
> > ?
> > If so, why can't Linux make use of iodepth > 1 with synchronous
> > system call I/O? Or is it further up, on the system call level? But
> > then
> 
> Because it is sync. The very nature of the sync system calls is that
> submission and completion are one event. For libaio, you could submit a
> bunch of requests before retrieving or waiting for completion of any
> one of them.
> 
> The only example where a sync engine could drive a higher queue depth
> on the device side is buffered writes. For any other case (reads,
> direct writes), you need async submission to build up a higher queue
> depth.

Great! I think that makes it pretty clear.

So when I want to read consecutive blocks 1, 2, 3, 4, 5, 6, 7, 8, 9 and
10 from a file at once and only then wait, I need async I/O. The blocks
may be of arbitrary size.

What if I use 10 processes, each reading one of these blocks at once?
Couldn't this fill up the queue at the device level? But then different
processes usually read different files...

... my question hints at how I/O depths might accumulate at the device
level when several processes are issuing read and/or write requests at
once.

> > what sense would it make there, when using system calls that are
> > asynchronous already?
> > 
> > (Is that ordering above correct at all?)
> 
> Your ordering looks OK. Now consider where and how you end up waiting
> for issued IO, that should tell you where queue depth could build up or
> not.

So we have several levels of queue depth:

- queue depth at the system call level
- queue depth at the device level

=== sync I/O engines ===

queue depth at the system call level = 1

== reads ==

queue depth at the device level = 1

since read() returns only when the data is in RAM, it is synchronous I/O
on the lower level by nature.

The page cache will be used unless direct=1, so one might be measuring
RAM / read-ahead performance, especially when several read jobs are
running concurrently. Writes might not hit the device unless direct=1,
and thus one should use a larger-than-RAM file size.

== writes ==

queue depth at the device level = depending on the workload, up to what
the device supports

unless direct=1, because then write() is doing synchronous I/O on the
lower level and only returns when the data is at least in the drive
cache.

=== libaio ===

queue depth at the system call level = the iodepth option of fio

as long as direct=1, since libaio falls back to synchronous behaviour
with buffered writes.

queue depth at the device level = the same

fio submits as many I/Os as specified by iodepth and only then waits. As
the block layer completes I/Os, fio fills the queue back up.

Conclusion: when I want to measure higher I/O depths for reads, I need
libaio and direct=1. But then I am measuring something that has no
practical effect on processes that use synchronous system call I/O. So
for regular applications ioengine=sync + iodepth=64 gives more realistic
results - even when that then just means I/O depth 1 for reads - and for
databases that use direct I/O, ioengine=libaio makes sense and will cause
higher I/O depths on the device side if the device supports it.

Anything without direct=1 (or the slower sync=1) is potentially measuring
RAM performance. direct=1 bypasses the page cache. sync=1 basically
disables caching on the device / controller side as well.
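To make that conclusion concrete for myself, these are the two kinds of
job definitions I would compare - block size, file size and workload are
arbitrary examples, and the file size should be larger than RAM for the
buffered case:

  ; "regular application" style: sync engine, buffered I/O
  ; iodepth has no real effect here, reads run at an effective depth of 1
  [buffered-sync-read]
  ioengine=sync
  rw=read
  bs=4k
  size=8g
  iodepth=64

  ; "database" style: libaio with direct=1
  ; up to iodepth requests in flight at the device, if it supports that
  [direct-libaio-randread]
  stonewall
  ioengine=libaio
  direct=1
  rw=randread
  bs=4k
  size=8g
  iodepth=64

(stonewall just makes the second job wait until the first one has
finished, so the two runs do not interfere with each other.)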
Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7