Since getting your last reply I've been doing a lot more work trying to understand the behavior I'm seeing, by writing some non-swift code that mimics what swift does with respect to the directory structure. In my case I have 1024 top-level dirs with 4096 dirs under each, and each 1k file I create gets its own directory under those, so there are clearly a lot of directories.
XFS writes out about 25M objects and then the performance goes into the toilet. I'm sure this is the flushing behavior you described before, which causes big delays, but would it be continuous? Each entry in the following tables shows the time to write 10K files, so each of the two blocks covers 1M files.
Sat Jan 23 12:15:09 2016
16.114386 14.656736 14.789760 17.418389 14.613157 15.938176 14.865369 14.962058 17.297193 15.953590
14.895471 15.560252 14.789937 14.308618 16.390057 16.561789 15.713806 14.843791 15.940992 16.466924
15.842781 15.611230 17.102329 15.006291 14.454088 17.923662 13.378340 16.084664 15.996794 13.736398
18.125125 14.462063 18.101833 15.355139 16.603660 14.205896 16.474111 16.212237 15.072443 14.217581
16.273899 14.905624 17.285019 14.955722 13.769731 18.308619 15.601386 15.832661 14.342416 16.516657
14.697575 15.719496 16.723135 16.808668 15.443325 14.608358 17.031334 16.426377 13.900535 13.528603
16.197697 16.839241 14.802707 15.507915 14.864337 15.836943 15.660089 15.998911 13.956739 14.337318
16.416974 17.729661 14.936045 13.450859 15.943900 15.106077 15.541450 16.523752 16.555945 14.440305
14.937772 16.486544 13.780310 16.944841 14.867400 18.214934 14.142108 15.931952 14.424949 15.533156
16.010153 16.323108 14.423508 15.970071 15.277186 15.561362 14.978766 15.855935 16.953906 14.247016
Sat Jan 23 12:41:09 2016
15.908483 15.638943 17.681281 15.188704 15.721495 13.359225 15.999421 15.858876 16.402176 16.416312
15.443946 14.675751 15.470643 15.573755 15.422241 16.336590 17.220916 13.974890 15.877780 62.650921
62.667990 46.334603 53.546195 69.465447 65.006016 68.761229 70.754684 97.571669 104.811261 104.229302
105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378
104.239131 104.268931 103.852929 103.549319 103.516169 103.007015 103.724020 104.519983 105.839203 105.324985
104.328205 104.932713 103.051548 104.938652 102.769383 102.851609 101.432277 102.269842 100.937972 103.450103
103.477628 103.636130 103.444242 103.023145 102.565047 102.853115 101.402610 98.928230 99.310677 99.669667
101.140554 99.628664 102.093801 100.580659 101.762283 101.369349 102.637014 102.240950 101.778506 101.144526
100.899476 102.294952 102.029285 100.871166 102.763222 102.910690 104.892447 104.748194 105.403636 106.159345
106.413154 104.626632 105.775004 104.579775 104.778526 104.634778 106.233381 104.063642 106.635481 104.314503
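In case it helps to see exactly what the test is doing, here's the gist of the script that produced the numbers above - a simplified sketch rather than the actual code, with the mount point and naming made up:

# rough sketch of the test harness: 1024 top-level dirs, 4096 subdirs under
# each, 1k files scattered across them, elapsed time reported per 10,000
# creates, 10 entries per output line (simplified; paths/names are made up)
import os, time, random

TOP = "/srv/node/test"        # assumed mount point for the test filesystem
NTOP, NSUB = 1024, 4096       # fan-out described above
PAYLOAD = b"x" * 1024         # 1k of data per file
BATCH = 10000                 # report elapsed time per 10k creates

def create_files(start, count):
    times, t0 = [], time.time()
    for i in range(start, start + count):
        # scatter files across the 1024x4096 tree
        leaf = os.path.join(TOP, "%04d" % random.randrange(NTOP),
                            "%04d" % random.randrange(NSUB))
        # each file gets its own directory under the leaf, swift-style
        fdir = os.path.join(leaf, "f%09d" % i)
        os.makedirs(fdir)
        with open(os.path.join(fdir, "obj.data"), "wb") as f:
            f.write(PAYLOAD)
        if (i + 1 - start) % BATCH == 0:
            times.append(time.time() - t0)
            t0 = time.time()
            if len(times) % 10 == 0:      # 10 entries per output line
                print(" ".join("%.6f" % t for t in times[-10:]))

if __name__ == "__main__":
    done = 0
    while True:
        print(time.ctime())               # one timestamp per 1M-file block
        create_files(done, 1000000)
        done += 1000000
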
If I look at the disk load at the same time, I see a dramatic increase in disk reads that corresponds to the slow writes, so I'm guessing at least some writes are waiting in the queue, as you can see below - thanks to Laurence for the patch to show disk read wait times ;)
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:45:30 sdb 0 0 0 0 0 270040 105 2276 119 4 118 16 4 0 62
12:45:31 sdb 0 0 0 0 0 273776 120 2262 121 4 121 18 4 0 57
12:45:32 sdb 4 0 1 4 0 100164 57 909 110 4 110 6 4 0 84
12:45:33 sdb 0 0 0 0 0 229992 87 1924 120 1 119 2 1 0 68
12:45:34 sdb 4 0 1 4 4 153528 59 1304 118 0 117 1 0 0 78
12:45:35 sdb 0 0 0 0 0 220896 97 1895 117 1 116 1 1 0 62
12:45:36 sdb 0 0 0 0 0 419084 197 3504 120 0 119 1 0 0 32
12:45:37 sdb 0 0 0 0 0 428076 193 3662 117 0 116 1 0 0 32
12:45:38 sdb 0 0 0 0 0 428492 181 3560 120 0 120 1 0 0 30
12:45:39 sdb 0 0 0 0 0 426024 199 3641 117 0 117 1 0 0 32
12:45:40 sdb 0 0 0 0 0 429764 200 3589 120 0 119 1 0 0 28
12:45:41 sdb 0 0 0 0 0 410204 165 3430 120 0 119 3 0 0 36
12:45:42 sdb 0 0 0 0 0 406192 196 3437 118 0 118 5 0 0 39
12:45:43 sdb 0 0 0 0 0 420952 175 3552 119 0 118 1 0 0 34
12:45:44 sdb 0 0 0 0 0 428424 197 3645 118 0 117 1 0 0 31
12:45:45 sdb 0 0 0 0 0 192464 76 1599 120 8 120 18 8 0 75
12:45:46 sdb 0 0 0 0 0 340522 205 2951 115 2 115 16 2 0 41
12:45:47 sdb 0 0 0 0 0 429128 193 3664 117 0 117 1 0 0 28
12:45:48 sdb 0 0 0 0 0 402600 164 3311 122 0 121 3 0 0 39
12:45:49 sdb 0 0 0 0 0 435316 195 3701 118 0 117 1 0 0 36
12:45:50 sdb 0 0 0 0 0 367976 162 3152 117 1 116 7 1 0 46
12:45:51 sdb 0 0 0 0 0 255716 125 2153 119 4 118 16 4 0 60
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:45:52 sdb 0 0 0 0 0 360144 149 3006 120 1 119 9 1 0 46
12:45:53 sdb 0 0 0 0 0 343500 162 2909 118 1 118 11 1 0 43
12:45:54 sdb 0 0 0 0 0 256636 119 2188 117 2 117 11 2 0 54
12:45:55 sdb 0 0 0 0 0 149000 47 1260 118 14 118 22 14 0 79
12:45:56 sdb 0 0 0 0 0 198544 88 1654 120 7 120 19 7 0 67
12:45:57 sdb 0 0 0 0 0 320688 151 2731 117 1 117 8 1 0 53
12:45:58 sdb 0 0 0 0 0 422176 190 3532 120 0 119 1 0 0 32
12:45:59 sdb 0 0 0 0 0 266540 115 2233 119 5 119 13 5 0 93
12:46:00 sdb 8 0 2 4 690 291116 129 2463 118 3 118 9 3 0 82
12:46:01 sdb 0 0 0 0 0 249964 118 2160 116 4 115 15 4 0 60
12:46:02 sdb 4736 0 37 128 0 424680 167 3522 121 0 120 1 0 0 28
12:46:03 sdb 5016 0 42 119 0 391364 196 3344 117 0 117 6 0 0 34
12:46:04 sdb 0 0 0 0 0 415436 172 3501 119 0 118 2 0 0 33
12:46:05 sdb 0 0 0 0 0 398736 192 3373 118 0 118 3 0 0 39
12:46:06 sdb 0 0 0 0 0 367292 155 3015 122 0 121 6 0 0 39
12:46:07 sdb 0 0 0 0 0 420392 201 3614 116 0 116 1 0 0 30
12:46:08 sdb 0 0 0 0 0 424828 172 3547 120 0 119 1 0 0 32
12:46:09 sdb 0 0 0 0 0 500380 234 4277 117 0 116 2 0 0 34
12:46:10 sdb 0 0 0 0 0 104500 7 698 150 0 149 1 0 1 87
12:46:11 sdb 8 0 1 8 1260 77252 45 647 119 0 119 1 2 1 92
12:46:12 sdb 8 0 1 8 1244 73956 31 615 120 0 120 1 2 1 94
12:46:13 sdb 8 0 1 8 228 149552 64 1256 119 0 118 1 0 0 85
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:46:14 sdb 8 0 1 8 1232 37124 28 319 116 0 116 1 3 3 99
12:46:15 sdb 16 0 2 8 720 2776 23 120 23 1 22 1 13 8 99
12:46:16 sdb 0 0 0 0 0 108180 16 823 131 0 131 1 0 1 90
12:46:17 sdb 8 0 1 8 1260 37136 28 322 115 0 114 1 3 2 94
12:46:18 sdb 8 0 1 8 1252 108680 57 875 124 0 124 1 1 1 88
12:46:19 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:20 sdb 16 0 2 8 618 81516 49 685 119 0 118 1 1 1 94
12:46:21 sdb 16 0 2 8 640 225788 106 1907 118 0 118 1 0 0 75
12:46:22 sdb 32 0 4 8 95 73892 17 627 118 0 117 1 0 1 93
12:46:23 sdb 24 0 3 8 408 257012 119 2171 118 0 118 1 0 0 65
12:46:24 sdb 12 0 3 4 5 3608 0 20 180 0 157 1 0 43 100
12:46:25 sdb 44 0 7 6 210 74072 41 625 119 0 117 1 2 1 97
12:46:26 sdb 48 0 6 8 216 202852 112 1819 112 0 111 1 0 0 92
12:46:27 sdb 52 0 7 7 233 307156 137 2648 116 0 115 1 0 0 95
12:46:28 sdb 16 0 2 8 100 93168 7 638 146 0 145 1 0 1 97
12:46:29 sdb 16 0 2 8 642 37028 16 319 116 0 115 1 4 3 99
12:46:30 sdb 16 0 2 8 624 39068 36 342 114 0 113 1 3 2 99
12:46:31 sdb 80 0 10 8 94 253892 105 2169 117 0 116 1 0 0 84
12:46:32 sdb 0 0 0 0 0 5676 0 33 172 0 172 1 0 30 100
12:46:33 sdb 16 0 2 8 642 69236 28 583 119 0 118 1 2 1 96
12:46:34 sdb 8 0 1 8 1032 37132 30 315 118 0 117 1 3 3 100
12:46:35 sdb 16 0 2 8 822 56292 15 515 109 0 108 1 3 1 100
# DISK STATISTICS (/sec)
# <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:46:36 sdb 8 0 1 8 44 58768 15 452 130 0 129 1 0 2 96
12:46:37 sdb 28 0 4 7 390 114944 89 1100 104 0 104 1 1 0 88
12:46:38 sdb 0 0 0 0 0 29668 0 172 172 12 172 1 12 5 98
12:46:39 sdb 80 0 10 8 90 100084 31 882 113 0 112 1 1 1 91
12:46:40 sdb 0 0 0 0 0 24244 0 139 174 0 174 1 0 7 100
12:46:41 sdb 8 0 1 8 1224 0 0 0 0 0 8 1 1224 1000 100
12:46:42 sdb 8 0 1 8 1244 42368 29 354 120 0 119 1 3 2 96
12:46:43 sdb 36 0 5 7 251 51428 32 507 101 0 100 1 2 1 94
12:46:44 sdb 24 0 3 8 70 5732 31 147 39 15 38 2 16 6 99
12:46:45 sdb 32 0 4 8 4 213056 53 1647 129 0 129 1 0 0 74
12:46:46 sdb 8 0 1 8 1220 37416 28 328 114 0 113 1 3 2 96
12:46:47 sdb 8 0 1 8 1248 58572 67 607 96 0 96 1 2 1 93
12:46:48 sdb 40 0 5 8 84 274808 82 2173 126 0 126 1 0 0 70
12:46:49 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:50 sdb 8 0 1 8 1248 0 0 0 0 0 8 1 1248 1000 100
12:46:51 sdb 8 0 1 8 1272 0 0 0 0 0 8 1 1272 1000 100
12:46:52 sdb 24 0 3 8 414 205240 113 1798 114 0 113 1 0 0 75
12:46:53 sdb 8 0 1 8 876 92476 48 839 110 0 110 1 1 1 89
12:46:54 sdb 0 0 0 0 0 38700 0 225 172 0 172 1 0 4 99
12:46:55 sdb 16 0 2 8 582 150680 73 1262 119 0 119 1 1 0 87
12:46:56 sdb 8 0 1 8 1228 0 0 0 0 0 8 1 1228 1000 100
12:46:57 sdb 8 0 1 8 1244 0 0 0 0 0 8 1 1244 1000 100
Next I played back the collectl process data, sorted by disk reads, and discovered that the top process, the one corresponding to the long disk reads, was xfsaild. BTW, I also see the xfs_inode slab using about 60GB.
It's also worth noting that I'm only writing 1-2MB/sec of actual file data; the rest of the traffic looks like it's coming from XFS journaling, because when I look at the xfs stats I see on the order of 200-400MB/sec of xfs log writes - clearly they're not all going to disk. Once the read waits increase, everything slows down, including the xfs logging (since it's doing less work).
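For what it's worth, those two numbers (the xfs_inode slab size and the log write rate) can be watched with something like the following rough sketch - the field positions in /proc/slabinfo and /proc/fs/xfs/stat are my assumptions, so treat the parsing as approximate:

# rough sketch: report xfs_inode slab memory and the xfs log write rate
# (assumes the usual /proc/slabinfo and /proc/fs/xfs/stat layouts; reading
# /proc/slabinfo needs root)
import time

def xfs_inode_slab_bytes():
    with open("/proc/slabinfo") as f:
        for line in f:
            if line.startswith("xfs_inode "):
                fields = line.split()
                num_objs, objsize = int(fields[2]), int(fields[3])
                return num_objs * objsize
    return 0

def xfs_log_blocks():
    # the "log" line is: log <writes> <blocks> <noiclogs> <force> <force_sleep>
    with open("/proc/fs/xfs/stat") as f:
        for line in f:
            if line.startswith("log "):
                return int(line.split()[2])
    return 0

if __name__ == "__main__":
    prev = xfs_log_blocks()
    while True:
        time.sleep(1)
        cur = xfs_log_blocks()
        # log blocks appear to be 512-byte units (assumption), so MB/sec is:
        mb_per_sec = (cur - prev) * 512 / 1e6
        prev = cur
        print("xfs_inode slab: %.1f GB   log writes: %.1f MB/sec"
              % (xfs_inode_slab_bytes() / 1e9, mb_per_sec))
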
I'm sure the simple answer may be that it is what it is, but I'm also wondering whether, without changes to swift itself, there might be some ways to improve the situation by adding more memory or making other tuning changes? The system I'm currently running my tests on has 128GB.
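And in case the current geometry matters for any suggestions, a quick way to pull the log size and AG count out of xfs_info looks something like this (just a sketch; the output parsing is approximate and the mount point is an example):

# rough sketch: grab agcount and log size for a mounted xfs filesystem by
# parsing xfs_info output (regexes are approximations of the output format)
import re
import subprocess

def xfs_geometry(mountpoint):
    out = subprocess.check_output(["xfs_info", mountpoint]).decode()
    agcount = int(re.search(r"agcount=(\d+)", out).group(1))
    # the log stanza looks like: "log =internal ... bsize=4096 blocks=..., version=2"
    log = re.search(r"log\s+=.*?bsize=(\d+)\s+blocks=(\d+)", out)
    log_bytes = int(log.group(1)) * int(log.group(2))
    return agcount, log_bytes

if __name__ == "__main__":
    ags, log_bytes = xfs_geometry("/srv/node")   # example mount point
    print("agcount=%d  log=%.0f MB" % (ags, log_bytes / 1e6))
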
-mark
On Wed, Jan 6, 2016 at 6:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> dave, thanks for getting back to me and the pointer to the config doc.
> lots to absorb and play with.
>
> the real challenge for me is that I'm doing testing at different levels.
> While I realize running 100 parallel swift PUT threads on a small system is
> not the ideal way to do things, it's the only easy way to get massive
> numbers of objects into the filesystem, and once there, the performance of
> a single stream is pretty poor. By instrumenting the swift code I can
> clearly see excess time being spent in creating/writing the objects, and so
> that's led us to believe the problem lies in the way xfs is configured.
> Creating a new directory structure on that same mount point immediately
> results in high levels of performance.
>
> In an attempt to reproduce the problems w/o swift, I wrote a little
> python script that simply creates files in a 2-tier structure, the first
> tier consisting of 1024 directories, each containing 4096
> subdirectories into which 1K files are created.
So you created something with even greater fan-out than what your
swift app is using?
> I'm doing this for 10000
> objects at a time and then timing them, reporting the times, 10 per line so
> each line represents 100 thousand file creates.
>
> Here too I'm seeing degradation, and if I look at what happens when there
> are already 3M files and I write 1M more, I see these creation times per 10
> thousand:
>
> 1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796 1.214535 0.997276 1.306736
> 2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602 1.391364 0.999480 1.914125
> 1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573 1.528848 13.586892 4.925790
> 3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422 9.329792 7.270323 5.936545
> 7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341 13.255939 6.938845 4.106066
> 2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120 0.899348 3.539658 8.501498
> 4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751 6.317348 11.393067 1.273371
> 146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990 42.328220 1.178436 1.335202
> 49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965 1.000591 1.027458 60.650443
> 1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213 1.245129 0.743771 0.996683
>
> note that one set of 10K took almost 3 minutes!
Which is no surprise because you have slow disks and a *lot* of
memory. At some point the journal and/or memory is going to fill up
with dirty objects and have to block waiting for writeback. At that
point there's going to be several hundred thousand dirty inodes that
need to be flushed to disk before progress can be made again. That
metadata writeback will be seek bound, and that's where all the
delay comes from.
We've been through this problem several times now with different
swift users over the past couple of years. Please go and search the
list archives, because every time the solution has been the same:
- reduce the directory hierarchy to a single level with, at
most, the number of directories matching the expected
*production* concurrency level
- reduce the XFS log size down to 32-128MB to limit dirty
metadata object buildup in memory
- reduce the number of AGs to as small as necessary to
maintain /allocation/ concurrency to limit the number of
different locations XFS writes to the disks (typically
10-20x less than the application level concurrency)
- use a 3.16+ kernel with the free inode btree on-disk
format feature to keep inode allocation CPU overhead low
and consistent regardless of the number of inodes already
allocated in the filesystem.
> my main questions at this point are: is this performance expected, and/or
> might a newer kernel help? And might it be possible to significantly
> improve things via tuning, or is it what it is? I do realize I'm starting
> with an empty directory tree whose performance degrades as it fills, but if
> I wanted to tune for, say, 10M or maybe 100M files, might I be able to expect
The mkfs defaults will work just fine with that many files in the
filesystem. Your application configuration and data store layout are
likely to be your biggest problem here.