Re: sleeps and waits during io_submit

On 12/01/2015 11:19 PM, Dave Chinner wrote:
On Tue, Dec 01, 2015 at 09:07:14PM +0200, Avi Kivity wrote:
On 12/01/2015 08:03 PM, Carlos Maiolino wrote:
Hi Avi,

[something] else is going to execute in our place until this thread
can make progress.
For us, nothing else can execute in our place, we usually have exactly one
thread per logical core.  So we are heavily dependent on io_submit not
sleeping.
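
(For context, the submission path we depend on looks roughly like the
sketch below -- the file name and sizes are made up and error handling
is omitted; the point is the io_submit() call on the hot path:)

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	/* O_DIRECT requires an aligned buffer. */
	static char buf[4096] __attribute__((aligned(4096)));

	/* O_DIRECT is what makes io_submit() asynchronous at all;
	 * buffered AIO completes synchronously. */
	int fd = open("datafile", O_RDWR | O_CREAT | O_DIRECT, 0644);

	io_setup(128, &ctx);
	io_prep_pwrite(&cb, fd, buf, sizeof(buf), 0);

	/* The problem under discussion: this call can sleep inside the
	 * filesystem (locks, block allocation) instead of returning
	 * immediately, stalling the whole core. */
	io_submit(ctx, 1, cbs);

	io_getevents(ctx, 1, 1, &ev, NULL);
	io_destroy(ctx);
	close(fd);
	return 0;
}

(Build with -laio.)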

The case of a contended lock is, to me, less worrying.  It can be reduced by
using more allocation groups, which is apparently the shared resource under
contention.

I apologize if I misread your previous comments, but, IIRC you said you can't
change the directory structure your application is using, and IIRC your
application does not spread files across several directories.
I miswrote somewhat: the application writes data files and commitlog
files.  The data file directory structure is fixed due to
compatibility concerns (it is not a single directory, but some
workloads will see most access on files in a single directory).  The
commitlog directory structure is more relaxed, and we can split it
into a directory per shard (=cpu) or arrange it some other way.

If worst comes to worst, we'll hack around this and distribute the
data files into more directories, and provide some hack for
compatibility.

XFS spreads files across the allocation groups based on the directory
the files are created in,
Idea: create the files in some subdirectory, and immediately move
them to their required location.
See xfs_fsr.

Can you elaborate?  I don't see how it is applicable.

My hack involves creating the file in a random directory and, while it is still zero sized, moving it to its final directory. This is simply to defeat the AG selection heuristic. No data is copied.
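
In outline (a sketch, with minimal error handling; the
scratch-directory layout is made up):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: defeat the per-directory AG selection by creating the file
 * in a randomly chosen scratch directory, then rename()ing it into
 * place while it is still zero length.  A rename within one
 * filesystem only touches metadata, so no data is copied. */
int create_spread(const char *final_path)
{
	char tmp[256];

	/* The "scratch.N" directories are an illustrative layout;
	 * each one pulls the new inode into a different AG. */
	snprintf(tmp, sizeof(tmp), "scratch.%d/tmpfile.%d",
		 rand() % 16, getpid());

	int fd = open(tmp, O_CREAT | O_EXCL | O_RDWR, 0644);
	if (fd < 0)
		return -1;

	/* The file is still zero sized: the rename is pure metadata. */
	if (rename(tmp, final_path) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	return fd;
}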

  trying to keep files as close as possible to their
metadata.
This is pointless for an SSD. Perhaps XFS should randomize the AG on
nonrotational media instead.
Actually, no, it is not pointless. SSDs do not require optimisation
for minimal seek time, but data locality is still just as important
as on spinning disks, if not more so. Why? Because the garbage
collection routines in SSDs are all about locality, and we can't
drive garbage collection effectively via discard operations if the
filesystem is not keeping temporally related files close together in
its block address space.

In my case, files in the same directory are not temporally related. But I understand where the heuristic comes from.

Maybe an ioctl to set a directory attribute "the files in this directory are not temporally related"?

I imagine this will be useful for many server applications.
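
Something like this, perhaps (entirely hypothetical --
FS_XFLAG_NOTEMPORAL does not exist; the sketch just borrows the
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctl pair already used for
per-inode XFS flags):

#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <unistd.h>

/* HYPOTHETICAL flag value -- stands in for the proposed "files in
 * this directory are not temporally related" attribute. */
#define FS_XFLAG_NOTEMPORAL	0x40000000

int mark_not_temporal(const char *dir)
{
	struct fsxattr fsx;
	int fd = open(dir, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		goto fail;
	fsx.fsx_xflags |= FS_XFLAG_NOTEMPORAL;	/* hypothetical */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}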

e.g. If the files in a directory are all close together, and the
directory is removed, we then leave a big empty contiguous region in
the filesystem free space map, and when we send discards over that
we end up with a single big trim and the drive handles that far more

Would this not be defeated if another directory that happens to share the allocation group is being populated at the same time?

effectively than lots of little trims (i.e. one per file) that the
drive cannot do anything useful with because they are all smaller
than the internal SSD page/block sizes and so get ignored.  This is
one of the reasons fstrim is so much more efficient and effective
than using the discard mount option.
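
(fstrim(8) is essentially one FITRIM ioctl over the filesystem's free
space, which is what lets the kernel coalesce discards over
contiguous free regions; a minimal sketch of the equivalent call:)

#include <sys/ioctl.h>
#include <linux/fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct fstrim_range range = {
		.start	= 0,
		.len	= (__u64)-1,	/* whole filesystem */
		.minlen	= 0,		/* kernel/device picks a floor */
	};
	/* Any descriptor on the mounted filesystem will do; the
	 * mount point path here is just an example. */
	int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);

	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror("FITRIM");
		return 1;
	}
	/* On return, range.len holds the number of bytes trimmed. */
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}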

In my use case, the files are fairly large, and there is constant rewriting (not in-place: files are read, merged, and written back). So I'm worried that an fstrim can come too late.


And, well, XFS is designed to operate on storage devices made up of
more than one drive, so the way AGs are selected is designed to
give long term load balancing (both for space usage and
instantaneous performance). With the existing algorithms we've not
had any issues with SSD lifetimes, long term performance
degradation, etc., so there's no evidence that we actually need to
change the fundamental allocation algorithms specially for SSDs.


OK. Maybe the SSDs can deal with untrimmed overwrites efficiently, provided the I/O sizes are large enough.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs


