On Mon, Jan 25, 2016 at 11:38:07AM -0500, Mark Seger wrote:
> since getting your last reply I've been doing a lot more trying to
> understand the behavior of what I'm seeing by writing some non-swift code
> that sort of does what swift does with respect to a directory structure.
> in my case I have 1024 top level dirs, 4096 under each. each 1k file I'm
> creating gets its own directory under these so there are clearly a lot of
> directories.

I'm not sure you understood what I said in my last reply: your
directory structure is the problem, and that's what needs changing.

> xfs writes out about 25M objects and then the performance goes into the
> toilet. I'm sure what you said before about having to flush data and
> causing big delays, but would it be continuous?

Go read the previous thread on this subject. Or, alternatively, try
some of the suggestions I made, like reducing the log size, to see
how this affects such behaviour.

> each entry in the following table shows the time to write 10K files
> so the 2 blocks are 1M each
>
> Sat Jan 23 12:15:09 2016
>  16.114386  14.656736  14.789760  17.418389  14.613157  15.938176
>  14.865369  14.962058  17.297193  15.953590
.....
>  62.667990  46.334603  53.546195  69.465447  65.006016  68.761229
>  70.754684  97.571669 104.811261 104.229302
> 105.605257 105.166030 105.058075 105.519703 106.573306 106.708545
> 106.114733 105.643131 106.049387 106.379378

Your test goes from operating wholly in memory to being limited by
disk speed because it no longer fits in memory.

> if I look at the disk loads at the time, I see a dramatic increase in disk
> reads that correspond to the slow writes so I'm guessing at least some
.....
> next I played back the collectl process data and sorted by disk reads and
> discovered the top process, corresponding to the long disk reads was
> xfsaild. btw - I also see the slab xfs_inode using about 60GB.

And there's your problem. You're accumulating gigabytes of dirty
inodes in memory, then wondering why everything goes to crap when
memory fills up and we have to start cleaning inodes. To clean those
inodes, we have to do RMW cycles on the inode cluster buffers,
because the inode cache memory pressure has caused the inode buffers
to be reclaimed from memory before the cached dirty inodes are
written.

All the changes I recommended you make also happen to address this
problem, too....

> It's also worth noting that I'm only doing 1-2MB/sec of writes and the rest
> of the data looks like it's coming from xfs journaling because when I look
> at the xfs stats I'm seeing on the order of 200-400MB/sec xfs logging
> writes - clearly they're not all going to disk.

Before delayed logging was introduced 5 years ago, it was quite
common to see XFS writing >500MB/s to the journal. The thing is,
your massive fan-out directory structure is mostly going to defeat
the relogging optimisations that make delayed logging work, so it's
entirely possible that you are seeing this much throughput through
the journal.

> Once the read waits increase everything slows down including xfs
> logging (since it's doing less).

Of course, because we can't journal more changes until the dirty
inodes in the journal are cleaned. That's what the xfsaild does -
clean dirty inodes, and the reads coming from that thread are for
cleaning inodes...

> I'm sure the simple answer may be that it is what it is, but I'm also
> wondering without changes to swift itself, might there be some ways to
> improve the situation by adding more memory or making any other tuning
> changes?
The system I'm currently running my tests on has 128GB. I've
already described what you need to do to both the swift directory
layout and the XFS filesystem configuration to minimise the impact
of storing millions of tiny records in a filesystem. I'll leave the
quote from my last email for you:

> > We've been through this problem several times now with different
> > swift users over the past couple of years. Please go and search the
> > list archives, because every time the solution has been the same:
> >
> >   - reduce the directory hierarchy to a single level with, at
> >     most, the number of directories matching the expected
> >     *production* concurrency level
> >   - reduce the XFS log size down to 32-128MB to limit dirty
> >     metadata object buildup in memory
> >   - reduce the number of AGs to as small as necessary to
> >     maintain /allocation/ concurrency to limit the number of
> >     different locations XFS writes to the disks (typically
> >     10-20x less than the application level concurrency)
> >   - use a 3.16+ kernel with the free inode btree on-disk
> >     format feature to keep inode allocation CPU overhead low
> >     and consistent regardless of the number of inodes already
> >     allocated in the filesystem.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
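
[For illustration, a minimal Python sketch of the kind of single-level
layout the first point in the quoted list describes. The directory
count, base path and hash choice here are assumptions for the example,
not Swift's actual on-disk layout:]

import hashlib
import os

NUM_DIRS = 64                   # assumed ~ expected *production* concurrency
BASE = "/srv/node/d1/objects"   # hypothetical XFS mount point

def object_path(name):
    # One hash, one fixed set of top-level directories, and no
    # per-object subdirectories underneath them.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    top = "%02x" % (int(digest[:8], 16) % NUM_DIRS)
    return os.path.join(BASE, top, digest)

def store_object(name, data):
    # The top-level directories are created once and then reused, so
    # the set of directory inodes being dirtied stays small and cacheable.
    path = object_path(name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

[The filesystem-side settings are mkfs-time knobs rather than runtime
tunables - something along the lines of
"mkfs.xfs -m crc=1,finobt=1 -l size=64m -d agcount=4 <dev>"
with a recent enough xfsprogs.]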