On Wed, Jun 03, 2015 at 05:58:07PM -0700, Shrinand Javadekar wrote:
> Thanks Dave. Please see my responses inline.
>
> On Wed, Jun 3, 2015 at 5:35 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Wed, Jun 03, 2015 at 04:18:20PM -0700, Shrinand Javadekar wrote:
> >> Here you go!
> >
> > Thanks!
> >
> >> /dev/mapper/35000c50062e6a12b-part2 /srv/node/r1 xfs
> >> rw,nosuid,nodev,noexec,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,noquota
> >> 0 0
> > .....
> >> meta-data=/dev/mapper/35000c50062e6a7eb-part2 isize=256    agcount=64, agsize=11446344 blks
> >>          =                       sectsz=512   attr=2
> >> data     =                       bsize=4096   blocks=732566016, imaxpct=5
> >>          =                       sunit=0      swidth=0 blks
> >> naming   =version 2              bsize=4096   ascii-ci=0
> >> log      =internal               bsize=4096   blocks=357698, version=2
> >>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> >> realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > Ok, so agcount=64 is unusual, especially for a single disk
> > filesystem. What was the reason for doing this?
>
> I read a few articles that recommend using an increased number of AGs,
> especially when there are large disks. I can use the default # of AGs
> (4?) and try again.

<sigh>

The Google Fallacy strikes again.

http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

> >> Openstack Swift. This is what it's doing:
> >>
> >> 1. A path like /srv/node/r1/objects/1024/eef/tmp already exists.
> >> /srv/node/r1 is the mount point.
> >> 2. Creates a tmp file, say tmpfoo in the path above. Path:
> >> /srv/node/r1/objects/1024/eef/tmp/tmpfoo.
> >> 3. Issues a 256KB write into this file.
> >> 4. Issues an fsync on the file.
> >> 5. Closes this file.
> >> 6. Creates another directory named "deadbeef" inside "eef" if it
> >> doesn't exist. Path /srv/node/r1/objects/1024/eef/deadbeef.
> >> 7. Moves file tmpfoo into the deadbeef directory using rename().
> >> /srv/node/r1/objects/1023/eef/tmp/tmpfoo -->
> >> /srv/node/r1/objects/1024/eef/deadbeef/foo.data
> >> 8. Does a readdir on /srv/node/r1/objects/1024/eef/deadbeef/
> >> 9. Iterates over all files obtained in #8 above. Usually #8 gives only one file.
> >
> > Oh. We've already discussed this problem in a previous thread:
> >
> > http://oss.sgi.com/archives/xfs/2015-04/msg00256.html
>
> Yes, we touched upon this earlier and found that all files were
> getting created in the same AG. We fixed that, and my current
> testing includes that fix.

Right, I noticed that looking at the inode allocation distribution.
It's pretty good (output is count, agno):

$ awk '/xfs_ialloc_read_agi:/ {print $8}' trace_report.txt | sort -n | uniq -c
   1362 0
   1351 1
   1359 2
   1354 3
   1374 4
   1345 5
   1380 6
   1371 7
   1356 8
   1354 9
   1373 10
   1364 11
   1357 12
   1363 13
   1368 14
   1386 15
   1355 16
   1384 17
   1352 18
   1377 19
   1358 20
   1371 21
   1356 22
   1367 23
   1342 24
   1383 25
   1352 26
   1354 27
   1347 28
   1382 29
   1348 30
   1347 31
   1351 32
   1346 33
   1350 34
   1365 35
   1346 36
   1361 37
   1358 38
   1337 39
   1356 40
   1371 41
   1347 42
   1335 43
   1378 44
   1370 45
   1372 46
   1334 47
   1363 48
   1355 49
   1365 50
   1353 51
   1370 52
   1346 53
   1369 54
   1356 55
   1381 56
   1349 57
   1365 58
   1356 59
   1351 60
   1345 61
   1379 62
   1351 63

> > Specifically, that discussion touched on problems your workload
> > induces in metadata layout and locality:
> >
> > http://oss.sgi.com/archives/xfs/2015-04/msg00300.html
> >
> > And you are using agcount=64 on these machines, so that's going to
> > cause you all sorts of locality problems, which will translate into
> > seek bound IO performance....
> >
> >> - IOStat and vmstat output
> >> (attached)
> >
> > I am assuming these are 1 second samples, based on your 18s fast/12s
> > slow description earlier.
>
> Yes, these are 1 second samples.
>
> >
> > The vmstat shows fast writeback at 150-200MB/s, with no idle time,
> > anything up to 200 processes in running or blocked state and 20-30%
> > iowait, followed by idle CPU time with maybe 10 running/blocked
> > processes, writeback at 15-20MB/s with 70% idle time and 30% iowait.
> >
> > IOWs, the workload is cyclic - lots of incoming data with lots of
> > throughput, followed by zero incoming data processing with only small
> > amounts of writeback.
>
> My understanding is that the workload is either
>
> a) waiting for issued IOs to complete.
> b) not able to issue more IOs because XFS is busy flushing the journal entries.
>
> Is this not true?

It's just an *observation* that the incoming processing has stopped
from the data presented, and it doesn't speak to the cause of why
incoming data is not being processed. You're jumping to conclusions
again before there is supporting evidence to make such a statement.

> > The vmstat information implies that front end application processing
> > is stopping for some period of time, but it does not indicate why it
> > is doing so. When the disks are doing 4k writeback, can you grab
> > the output of 'echo w > /proc/sysrq-trigger' from dmesg and post the
> > output? That will tell us if the front end processing is blocked on
> > the filesystem at all...
>
> Aah.. ok. Will do and get back to you soon.

See? More information is required. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
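
For reference, the write path described in steps 1-9 of the quoted Swift
workload boils down to the following sequence of system calls. This is
only a minimal illustrative sketch in Python (Swift's implementation
language); the function name write_object, its arguments and the literal
file names are invented for the example, and it is not Swift's actual code:

    import os

    def write_object(mount, partition, suffix, hsh, name, data):
        # Directory layout as described in the quoted steps; all names
        # here are hypothetical placeholders, not Swift's real naming.
        tmp_dir = os.path.join(mount, "objects", partition, suffix, "tmp")
        dst_dir = os.path.join(mount, "objects", partition, suffix, hsh)

        # Steps 2-5: create a temp file, write the object data (~256KB),
        # fsync the file, then close it.
        tmp_path = os.path.join(tmp_dir, "tmpfoo")
        fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)
        finally:
            os.close(fd)

        # Step 6: create the destination directory if it doesn't exist.
        os.makedirs(dst_dir, exist_ok=True)

        # Step 7: rename() the temp file into the destination directory.
        os.rename(tmp_path, os.path.join(dst_dir, name))

        # Steps 8-9: readdir the destination directory and iterate over
        # the entries (usually just the file renamed above).
        return os.listdir(dst_dir)

Error handling and Swift's real hashing/naming scheme are omitted; the
sketch only shows the order of operations (write, fsync, close, mkdir,
rename, readdir) that the thread is discussing.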