I was thinking a little color commentary might be helpful on what the functionality is that's driving the need for fallocate. I think I mentioned somewhere in this thread that the application is OpenStack Swift, a highly scalable cloud object store. If you're not familiar with it, it doesn't do successive sequential writes to a preallocated file; rather, it writes out a full object in one shot. In other words, object = file. The whole purpose of the preallocation, at least as I understand it, is to make sure there is enough room when the time comes to write the actual object, so that if there isn't, a redundant server elsewhere can do it instead. That makes the notion of speculative preallocation for future sequential writes moot; the ideal is to preallocate only the object size, with minimal extra I/O. Does that help?
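To make that concrete, the pattern I have in mind looks roughly like this - just my own minimal Python sketch, not actual Swift code, using os.posix_fallocate as a stand-in for whatever Swift really calls:

    import os, tempfile

    def store_object(obj_dir, data):
        fd, tmp_path = tempfile.mkstemp(dir=obj_dir)
        try:
            try:
                # reserve exactly the object's size up front
                os.posix_fallocate(fd, 0, len(data))
            except OSError:
                # typically ENOSPC - give up here so a redundant server
                # elsewhere can take the object instead
                os.unlink(tmp_path)
                raise
            os.write(fd, data)   # the whole object, written in one shot
            os.fsync(fd)
        finally:
            os.close(fd)
        return tmp_path

No speculative growth is ever needed - the object will never be extended after this.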
-mark
On Sat, Jun 15, 2013 at 6:35 AM, Mark Seger <mjseger@xxxxxxxxx> wrote:
Basically I do everything with collectl, a tool I wrote and open-sourced almost 10 years ago. Its numbers are very accurate - I've compared them with iostat on numerous occasions whenever I had doubts and they always agree. Since both tools get their data from the same place, /proc/diskstats, it's hard for them not to agree, AND its numbers also agree with /proc/fs/xfs. Here's an example comparing the two on a short run, leaving off the -m since collectl reports its output in KB.

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 0.00 494.00 0.00 126464.00 512.00 0.11 0.22 0.00 0.22 0.22 11.00
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
10:18:32 sdc1 0 0 0 0 127488 0 498 256 256 1 0 0 7
10:18:33 sdc1 0 0 0 0 118784 0 464 256 256 1 0 0 4
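And since both tools are just reading /proc/diskstats, here's a trivial way to sanity-check them against the raw counters themselves - a quick Python sketch of my own (field positions per Documentation/iostats.txt, sectors are 512 bytes; 'sdc1' is just the device from my runs):

    import time

    def writes(dev):
        # /proc/diskstats: field 2 is the device name, field 7 is writes
        # completed, field 9 is sectors written (512-byte sectors)
        with open('/proc/diskstats') as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[7]), int(fields[9])
        raise ValueError(dev + ' not found')

    w0, s0 = writes('sdc1')
    time.sleep(1)
    w1, s1 = writes('sdc1')
    print('write IOs/sec:', w1 - w0, ' write KB/sec:', (s1 - s0) // 2)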
For grins I also ran a set of numbers at a monitoring interval of 0.2 seconds just to see if they were steady, and they are:

# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
10:19:50.601 sdc1 0 0 0 0 768 0 3 256 256 0 0 0 0
10:19:50.801 sdc1 0 0 0 0 23296 0 91 256 256 1 0 0 19
10:19:51.001 sdc1 0 0 0 0 32256 0 126 256 256 1 0 0 14
10:19:51.201 sdc1 0 0 0 0 29696 0 116 256 256 1 0 0 19
10:19:51.401 sdc1 0 0 0 0 30464 0 119 256 256 1 0 0 4
10:19:51.601 sdc1 0 0 0 0 32768 0 128 256 256 1 0 0 14

But back to the problem at hand: why is this happening? To restate what's going on, I have a very simple script with which I'm duplicating what OpenStack Swift is doing, namely creating a file with mkstemp and then running fallocate against it (there's a rough sketch of it in the PS below). The files are being created with a size of zero, but it seems that xfs is generating a ton of logging activity. I had read your post from back in 2011 about speculative preallocation and can't help but wonder if that's what's hitting me here. I also saw that system memory can come into play; this box has 192GB and 12 hyperthreaded cores.

I also tried one more run without fallocate, this time creating 10000 1K files, which should be about 10MB, and it looks like it's still doing 140MB of I/O, which still feels like a lot but at least it's less than the fallocate case:

# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
10:29:20 sdc1 0 0 0 0 89608 0 351 255 255 1 0 0 11
10:29:21 sdc1 0 0 0 0 55296 0 216 256 256 1 0 0 5

and to repeat the full run with fallocate:

# DISK STATISTICS (/sec)
# <---------reads---------><---------writes---------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size KBytes Merged IOs Size RWSize QLen Wait SvcTim Util
10:30:50 sdc1 0 0 0 0 56064 0 219 256 256 1 0 0 2
10:30:51 sdc1 0 0 0 0 409720 148 1622 253 252 1 0 0 26
10:30:52 sdc1 0 0 0 0 453240 144 1796 252 252 1 0 0 36
10:30:53 sdc1 0 0 0 0 441768 298 1800 245 245 1 0 0 37
10:30:54 sdc1 0 0 0 0 455576 144 1813 251 251 1 0 0 25
10:30:55 sdc1 0 0 0 0 453532 145 1805 251 251 1 0 0 35
10:30:56 sdc1 0 0 0 0 307352 145 1233 249 249 1 0 0 17
10:30:57 sdc1 0 0 0 0 0 0 0 0 0 0 0 0 0

If there is anything more I can provide I'll be happy to do so. Actually, I should point out I can easily generate graphs, and if you'd like to see some examples I can provide those too. Also, if there is anything I can report from /proc/fs/xfs I can relatively easily do that as well and display it side by side with the disk I/O.

-mark
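PS - in case it makes the test clearer, this is roughly what my script boils down to; a from-memory Python sketch rather than the exact script, with a made-up target directory. I believe Swift itself calls fallocate(2) with FALLOC_FL_KEEP_SIZE via ctypes, which is why the files show up zero-length; os.posix_fallocate is just the closest stdlib call and will bump the file size instead:

    import os, tempfile

    TARGET = '/srv/node/sdc1/test'   # made-up path - point at the filesystem under test
    NFILES = 10000
    SIZE = 1024                      # 1K per file

    for _ in range(NFILES):
        fd, path = tempfile.mkstemp(dir=TARGET)
        try:
            # fallocate case: no data is ever written, just the reservation
            os.posix_fallocate(fd, 0, SIZE)
            # the no-fallocate variant writes 1K of real data instead:
            # os.write(fd, b'\0' * SIZE)
        finally:
            os.close(fd)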
On Fri, Jun 14, 2013 at 10:04 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Fri, Jun 14, 2013 at 09:55:17PM -0400, Mark Seger wrote:
> > Where are you getting your IO throughput numbers from?
> I'm doing 1 second samples and the rates are very steady. The reason I
> ended up at this level of testing was I had done a sustained test for 2
> minutes at about 5MB/sec and was seeing over 500MB/sec going to the disk,
> again sampling at 1-second intervals. I'd be happy to provide detailed
> output and can even sample more frequently if you like.
How do they compare to, say, the output of `iostat -d -x -m 1`?
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx