Re: xfs and swift

Since getting your last reply I've been doing a lot more work to understand the behavior I'm seeing, by writing some non-swift code that does roughly what swift does with respect to the directory structure. In my case I have 1024 top-level dirs with 4096 dirs under each, and each 1K file I create gets its own directory under those, so there are clearly a lot of directories.
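
In case it helps to see it concretely, here's roughly the shape of the test loop - a minimal sketch, not the exact script (the root path is hypothetical):

#!/usr/bin/env python
# Minimal sketch of the create loop described above: 1024 top-level
# dirs x 4096 subdirs, each 1K file in its own directory, timings
# reported per 10K creates, 10 entries per line.
import os, time

TOP, SUB = 1024, 4096
ROOT = '/srv/node/sdb1/test'        # hypothetical mount point
PAYLOAD = b'x' * 1024               # 1K of data per file

t0, times = time.time(), []
for n in range(25 * 1000 * 1000):   # total objects to create
    d = os.path.join(ROOT, '%04d' % (n % TOP),
                     '%04d' % ((n // TOP) % SUB), str(n))
    os.makedirs(d)                  # one new directory per file
    with open(os.path.join(d, 'obj'), 'wb') as f:
        f.write(PAYLOAD)
    if (n + 1) % 10000 == 0:        # report every 10K creates
        t1 = time.time()
        times.append(t1 - t0)
        t0 = t1
        if len(times) == 10:        # 10 entries per line
            print(' '.join('%10.6f' % t for t in times))
            times = []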

XFS writes out about 25M objects and then performance goes into the toilet. I'm sure what you said before about having to flush data and causing big delays applies, but would it be continuous? Each entry in the following table shows the time to write 10K files, so the two blocks below are 1M files each:

Sat Jan 23 12:15:09 2016
 16.114386  14.656736  14.789760  17.418389  14.613157  15.938176  14.865369  14.962058  17.297193  15.953590
 14.895471  15.560252  14.789937  14.308618  16.390057  16.561789  15.713806  14.843791  15.940992  16.466924
 15.842781  15.611230  17.102329  15.006291  14.454088  17.923662  13.378340  16.084664  15.996794  13.736398
 18.125125  14.462063  18.101833  15.355139  16.603660  14.205896  16.474111  16.212237  15.072443  14.217581
 16.273899  14.905624  17.285019  14.955722  13.769731  18.308619  15.601386  15.832661  14.342416  16.516657
 14.697575  15.719496  16.723135  16.808668  15.443325  14.608358  17.031334  16.426377  13.900535  13.528603
 16.197697  16.839241  14.802707  15.507915  14.864337  15.836943  15.660089  15.998911  13.956739  14.337318
 16.416974  17.729661  14.936045  13.450859  15.943900  15.106077  15.541450  16.523752  16.555945  14.440305
 14.937772  16.486544  13.780310  16.944841  14.867400  18.214934  14.142108  15.931952  14.424949  15.533156
 16.010153  16.323108  14.423508  15.970071  15.277186  15.561362  14.978766  15.855935  16.953906  14.247016
Sat Jan 23 12:41:09 2016
 15.908483  15.638943  17.681281  15.188704  15.721495  13.359225  15.999421  15.858876  16.402176  16.416312
 15.443946  14.675751  15.470643  15.573755  15.422241  16.336590  17.220916  13.974890  15.877780  62.650921
 62.667990  46.334603  53.546195  69.465447  65.006016  68.761229  70.754684  97.571669 104.811261 104.229302
105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378
104.239131 104.268931 103.852929 103.549319 103.516169 103.007015 103.724020 104.519983 105.839203 105.324985
104.328205 104.932713 103.051548 104.938652 102.769383 102.851609 101.432277 102.269842 100.937972 103.450103
103.477628 103.636130 103.444242 103.023145 102.565047 102.853115 101.402610  98.928230  99.310677  99.669667
101.140554  99.628664 102.093801 100.580659 101.762283 101.369349 102.637014 102.240950 101.778506 101.144526
100.899476 102.294952 102.029285 100.871166 102.763222 102.910690 104.892447 104.748194 105.403636 106.159345
106.413154 104.626632 105.775004 104.579775 104.778526 104.634778 106.233381 104.063642 106.635481 104.314503

If I look at the disk load at the same time, I see a dramatic increase in disk reads corresponding to the slow writes, so I'm guessing at least some writes are waiting in the queue, as you can see below - thanks to Laurence for the patch to show disk read wait times ;)

# DISK STATISTICS (/sec)
#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:45:30 sdb              0      0    0    0     0  270040    105 2276  119     4     118    16     4      0   62
12:45:31 sdb              0      0    0    0     0  273776    120 2262  121     4     121    18     4      0   57
12:45:32 sdb              4      0    1    4     0  100164     57  909  110     4     110     6     4      0   84
12:45:33 sdb              0      0    0    0     0  229992     87 1924  120     1     119     2     1      0   68
12:45:34 sdb              4      0    1    4     4  153528     59 1304  118     0     117     1     0      0   78
12:45:35 sdb              0      0    0    0     0  220896     97 1895  117     1     116     1     1      0   62
12:45:36 sdb              0      0    0    0     0  419084    197 3504  120     0     119     1     0      0   32
12:45:37 sdb              0      0    0    0     0  428076    193 3662  117     0     116     1     0      0   32
12:45:38 sdb              0      0    0    0     0  428492    181 3560  120     0     120     1     0      0   30
12:45:39 sdb              0      0    0    0     0  426024    199 3641  117     0     117     1     0      0   32
12:45:40 sdb              0      0    0    0     0  429764    200 3589  120     0     119     1     0      0   28
12:45:41 sdb              0      0    0    0     0  410204    165 3430  120     0     119     3     0      0   36
12:45:42 sdb              0      0    0    0     0  406192    196 3437  118     0     118     5     0      0   39
12:45:43 sdb              0      0    0    0     0  420952    175 3552  119     0     118     1     0      0   34
12:45:44 sdb              0      0    0    0     0  428424    197 3645  118     0     117     1     0      0   31
12:45:45 sdb              0      0    0    0     0  192464     76 1599  120     8     120    18     8      0   75
12:45:46 sdb              0      0    0    0     0  340522    205 2951  115     2     115    16     2      0   41
12:45:47 sdb              0      0    0    0     0  429128    193 3664  117     0     117     1     0      0   28
12:45:48 sdb              0      0    0    0     0  402600    164 3311  122     0     121     3     0      0   39
12:45:49 sdb              0      0    0    0     0  435316    195 3701  118     0     117     1     0      0   36
12:45:50 sdb              0      0    0    0     0  367976    162 3152  117     1     116     7     1      0   46
12:45:51 sdb              0      0    0    0     0  255716    125 2153  119     4     118    16     4      0   60

# DISK STATISTICS (/sec)
#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:45:52 sdb              0      0    0    0     0  360144    149 3006  120     1     119     9     1      0   46
12:45:53 sdb              0      0    0    0     0  343500    162 2909  118     1     118    11     1      0   43
12:45:54 sdb              0      0    0    0     0  256636    119 2188  117     2     117    11     2      0   54
12:45:55 sdb              0      0    0    0     0  149000     47 1260  118    14     118    22    14      0   79
12:45:56 sdb              0      0    0    0     0  198544     88 1654  120     7     120    19     7      0   67
12:45:57 sdb              0      0    0    0     0  320688    151 2731  117     1     117     8     1      0   53
12:45:58 sdb              0      0    0    0     0  422176    190 3532  120     0     119     1     0      0   32
12:45:59 sdb              0      0    0    0     0  266540    115 2233  119     5     119    13     5      0   93
12:46:00 sdb              8      0    2    4   690  291116    129 2463  118     3     118     9     3      0   82
12:46:01 sdb              0      0    0    0     0  249964    118 2160  116     4     115    15     4      0   60
12:46:02 sdb           4736      0   37  128     0  424680    167 3522  121     0     120     1     0      0   28
12:46:03 sdb           5016      0   42  119     0  391364    196 3344  117     0     117     6     0      0   34
12:46:04 sdb              0      0    0    0     0  415436    172 3501  119     0     118     2     0      0   33
12:46:05 sdb              0      0    0    0     0  398736    192 3373  118     0     118     3     0      0   39
12:46:06 sdb              0      0    0    0     0  367292    155 3015  122     0     121     6     0      0   39
12:46:07 sdb              0      0    0    0     0  420392    201 3614  116     0     116     1     0      0   30
12:46:08 sdb              0      0    0    0     0  424828    172 3547  120     0     119     1     0      0   32
12:46:09 sdb              0      0    0    0     0  500380    234 4277  117     0     116     2     0      0   34
12:46:10 sdb              0      0    0    0     0  104500      7  698  150     0     149     1     0      1   87
12:46:11 sdb              8      0    1    8  1260   77252     45  647  119     0     119     1     2      1   92
12:46:12 sdb              8      0    1    8  1244   73956     31  615  120     0     120     1     2      1   94
12:46:13 sdb              8      0    1    8   228  149552     64 1256  119     0     118     1     0      0   85

# DISK STATISTICS (/sec)
#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:46:14 sdb              8      0    1    8  1232   37124     28  319  116     0     116     1     3      3   99
12:46:15 sdb             16      0    2    8   720    2776     23  120   23     1      22     1    13      8   99
12:46:16 sdb              0      0    0    0     0  108180     16  823  131     0     131     1     0      1   90
12:46:17 sdb              8      0    1    8  1260   37136     28  322  115     0     114     1     3      2   94
12:46:18 sdb              8      0    1    8  1252  108680     57  875  124     0     124     1     1      1   88
12:46:19 sdb              0      0    0    0     0       0      0    0    0     0       0     1     0      0  100
12:46:20 sdb             16      0    2    8   618   81516     49  685  119     0     118     1     1      1   94
12:46:21 sdb             16      0    2    8   640  225788    106 1907  118     0     118     1     0      0   75
12:46:22 sdb             32      0    4    8    95   73892     17  627  118     0     117     1     0      1   93
12:46:23 sdb             24      0    3    8   408  257012    119 2171  118     0     118     1     0      0   65
12:46:24 sdb             12      0    3    4     5    3608      0   20  180     0     157     1     0     43  100
12:46:25 sdb             44      0    7    6   210   74072     41  625  119     0     117     1     2      1   97
12:46:26 sdb             48      0    6    8   216  202852    112 1819  112     0     111     1     0      0   92
12:46:27 sdb             52      0    7    7   233  307156    137 2648  116     0     115     1     0      0   95
12:46:28 sdb             16      0    2    8   100   93168      7  638  146     0     145     1     0      1   97
12:46:29 sdb             16      0    2    8   642   37028     16  319  116     0     115     1     4      3   99
12:46:30 sdb             16      0    2    8   624   39068     36  342  114     0     113     1     3      2   99
12:46:31 sdb             80      0   10    8    94  253892    105 2169  117     0     116     1     0      0   84
12:46:32 sdb              0      0    0    0     0    5676      0   33  172     0     172     1     0     30  100
12:46:33 sdb             16      0    2    8   642   69236     28  583  119     0     118     1     2      1   96
12:46:34 sdb              8      0    1    8  1032   37132     30  315  118     0     117     1     3      3  100
12:46:35 sdb             16      0    2    8   822   56292     15  515  109     0     108     1     3      1  100

# DISK STATISTICS (/sec)
#                   <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:46:36 sdb              8      0    1    8    44   58768     15  452  130     0     129     1     0      2   96
12:46:37 sdb             28      0    4    7   390  114944     89 1100  104     0     104     1     1      0   88
12:46:38 sdb              0      0    0    0     0   29668      0  172  172    12     172     1    12      5   98
12:46:39 sdb             80      0   10    8    90  100084     31  882  113     0     112     1     1      1   91
12:46:40 sdb              0      0    0    0     0   24244      0  139  174     0     174     1     0      7  100
12:46:41 sdb              8      0    1    8  1224       0      0    0    0     0       8     1  1224   1000  100
12:46:42 sdb              8      0    1    8  1244   42368     29  354  120     0     119     1     3      2   96
12:46:43 sdb             36      0    5    7   251   51428     32  507  101     0     100     1     2      1   94
12:46:44 sdb             24      0    3    8    70    5732     31  147   39    15      38     2    16      6   99
12:46:45 sdb             32      0    4    8     4  213056     53 1647  129     0     129     1     0      0   74
12:46:46 sdb              8      0    1    8  1220   37416     28  328  114     0     113     1     3      2   96
12:46:47 sdb              8      0    1    8  1248   58572     67  607   96     0      96     1     2      1   93
12:46:48 sdb             40      0    5    8    84  274808     82 2173  126     0     126     1     0      0   70
12:46:49 sdb              0      0    0    0     0       0      0    0    0     0       0     1     0      0  100
12:46:50 sdb              8      0    1    8  1248       0      0    0    0     0       8     1  1248   1000  100
12:46:51 sdb              8      0    1    8  1272       0      0    0    0     0       8     1  1272   1000  100
12:46:52 sdb             24      0    3    8   414  205240    113 1798  114     0     113     1     0      0   75
12:46:53 sdb              8      0    1    8   876   92476     48  839  110     0     110     1     1      1   89
12:46:54 sdb              0      0    0    0     0   38700      0  225  172     0     172     1     0      4   99
12:46:55 sdb             16      0    2    8   582  150680     73 1262  119     0     119     1     1      0   87
12:46:56 sdb              8      0    1    8  1228       0      0    0    0     0       8     1  1228   1000  100
12:46:57 sdb              8      0    1    8  1244       0      0    0    0     0       8     1  1244   1000  100

Next I played back the collectl process data, sorted it by disk reads, and found that the top process, corresponding to the long disk reads, was xfsaild. BTW - I also see the xfs_inode slab using about 60GB.
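
That slab figure comes from /proc/slabinfo - roughly this calculation, which matches what slabtop reports (needs root):

#!/usr/bin/env python
# Estimate memory held by the xfs_inode slab cache.
# /proc/slabinfo fields: name active_objs num_objs objsize ...
with open('/proc/slabinfo') as f:
    for line in f:
        fields = line.split()
        if fields and fields[0] == 'xfs_inode':
            num_objs, objsize = int(fields[2]), int(fields[3])
            print('xfs_inode: %d objects, ~%.1f GB' %
                  (num_objs, num_objs * objsize / 1e9))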

It's also worth noting that I'm only doing 1-2MB/sec of file writes; the rest of the traffic looks like it's coming from xfs journaling, because when I look at the xfs stats I'm seeing on the order of 200-400MB/sec of xfs log writes - clearly not all of that is going to disk. Once the read waits increase, everything slows down, including the xfs logging (since less work is getting done).
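
This is roughly how I'm deriving that log rate - note I'm assuming the blocks field of the 'log' line in /proc/fs/xfs/stat is in 512-byte units, so treat the absolute numbers with some suspicion:

#!/usr/bin/env python
# Sample the xfs log write rate once per second (ctrl-c to stop);
# the 'log' line is: writes blocks noiclogs force force_sleep.
import time

def log_blocks():
    with open('/proc/fs/xfs/stat') as f:
        for line in f:
            if line.startswith('log '):
                return int(line.split()[2])   # cumulative log blocks
    return 0

while True:
    b0 = log_blocks()
    time.sleep(1)
    b1 = log_blocks()
    print('xfs log writes: %.1f MB/sec' % ((b1 - b0) * 512 / 1e6))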

I'm sure the simple answer may be that it is what it is, but I'm also wondering whether, without changes to swift itself, there might be ways to improve the situation by adding more memory or making other tuning changes. The system I'm currently running my tests on has 128GB.
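
For what it's worth, based on the suggestions in your reply quoted below, I'm planning to rebuild the filesystem with something along these lines - log size and AG count picked somewhat arbitrarily from the ranges you gave, so correct me if I've got it wrong:

mkfs.xfs -m crc=1,finobt=1 -l size=64m -d agcount=4 /dev/sdb

and then re-run with a single-level directory structure sized to the expected production concurrency.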

-mark 

On Wed, Jan 6, 2016 at 6:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> dave, thanks for getting back to me and the pointer to the config doc.
>  lots to absorb and play with.
>
> the real challenge for me is that I'm doing testing at different levels.
> While I realize running 100 parallel swift PUT threads on a small system is
> not the ideal way to do things, it's the only easy way to get massive
> numbers of objects into the filesystem, and once there, the performance of
> a single stream is pretty poor. By instrumenting the swift code I can
> clearly see excess time being spent in creating/writing the objects, and so
> that's led us to believe the problem lies in the way xfs is configured.
> Creating a new directory structure on that same mount point immediately
> results in high levels of performance.
>
> As an attempt to try to reproduce the problems w/o swift, I wrote a little
> python script that simply creates files in a 2-tier structure, the first
> tier consisting of 1024 directories and each directory contains 4096
> subdirectories into which 1K files are created.

So you created something with even greater fan-out than what your
swift app is using?

> I'm doing this for 10000
> objects at a time and then timing them, reporting the times, 10 per line, so
> each line represents 100 thousand file creates.
>
> Here too I'm seeing degradation and if I look at what happens when there
> are already 3M files and I write 1M more, I see these creation times/10
> thousand:
>
>  1.004236  0.961419  0.996514  1.012150  1.101794  0.999422  0.994796
>  1.214535  0.997276  1.306736
>  2.793429  1.201471  1.133576  1.069682  1.030985  1.096341  1.052602
>  1.391364  0.999480  1.914125
>  1.193892  0.967206  1.263310  0.890472  1.051962  4.253694  1.145573
>  1.528848 13.586892  4.925790
>  3.975442  8.896552  1.197005  3.904226  7.503806  1.294842  1.816422
>  9.329792  7.270323  5.936545
>  7.058685  5.516841  4.527271  1.956592  1.382551  1.510339  1.318341
> 13.255939  6.938845  4.106066
>  2.612064  2.028795  4.647980  7.371628  5.473423  5.823201 14.229120
>  0.899348  3.539658  8.501498
>  4.662593  6.423530  7.980757  6.367012  3.414239  7.364857  4.143751
>  6.317348 11.393067  1.273371
> 146.067300  1.317814  1.176529  1.177830 52.206605  1.112854  2.087990
> 42.328220  1.178436  1.335202
> 49.118140  1.368696  1.515826 44.690431  0.927428  0.920801  0.985965
>  1.000591  1.027458 60.650443
>  1.771318  2.690499  2.262868  1.061343  0.932998 64.064210 37.726213
>  1.245129  0.743771  0.996683
>
> note that one set of 10K took almost 3 minutes!

Which is no surprise because you have slow disks and a *lot* of
memory. At some point the journal and/or memory is going to fill up
with dirty objects and have to block waiting for writeback. At that
point there's going to be several hundred thousand dirty inodes that
need to be flushed to disk before progress can be made again.  That
metadata writeback will be seek bound, and that's where all the
delay comes from.

We've been through this problem several times now with different
swift users over the past couple of years. Please go and search the
list archives, because every time the solution has been the same:

        - reduce the directory hierarchy to a single level with, at
          most, the number of directories matching the expected
          *production* concurrency level
        - reduce the XFS log size down to 32-128MB to limit dirty
          metadata object buildup in memory
        - reduce the number of AGs to as small as necessary to
          maintain /allocation/ concurrency to limit the number of
          different locations XFS writes to the disks (typically
          10-20x less than the application level concurrency)
        - use a 3.16+ kernel with the free inode btree on-disk
          format feature to keep inode allocation CPU overhead low
          and consistent regardless of the number of inodes already
          allocated in the filesystem.

> my main questions at this point are: is this performance expected, and/or
> might a newer kernel help?  and might it be possible to significantly
> improve things via tuning, or is it what it is?  I do realize I'm starting
> with an empty directory tree whose performance degrades as it fills, but if
> I wanted to tune for say 10M or maybe 100M files might I be able to expect

The mkfs defaults will work just fine with that many files in the
filesystem. Your application configuration and data store layout are
likely to be your biggest problems here.

Cheers,

Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
