On Thu, Jan 20, 2011 at 08:45:03AM -0600, Geoffrey Wehrman wrote:
> On Thu, Jan 20, 2011 at 12:33:46PM +1100, Dave Chinner wrote:
> | Given that XFS is aimed towards optimising for the large file/large
> | IO/high throughput type of application, I'm comfortable with saying
> | that avoiding sub-page writes for optimal throughput IO is an
> | application problem and going from there. Especially considering
> | that stuff like rsync and untarring kernel tarballs are all
> | appending writes so won't take any performance hit at all...
>
> I agree.  I do not expect systems with 64K pages to be used for single
> bit manipulations.  However, I do see a couple of potential problems.
>
> The one place where page size I/O may not work though is for a DMAPI/HSM
> (DMF) managed filesystem where some blocks are managed on near-line media.
> The DMAPI needs to be able to remove and restore extents on a filesystem
> block boundary, not a page boundary.

DMAPI is still free to remove extents at whatever boundary it wants.
The only difference is that it would be asked to restore extents to the
page boundary the write covers rather than a block boundary.  The
allocation and direct IO boundaries do not change, so the only thing
that needs to change is the range that DMAPI is told the read/write is
going to cover....

> The other downside is that for sparse files, we could end up allocating
> space for zero filled blocks.  There may be some workloads where
> significant quantities of space are wasted.

Yes, that is possible, though on the other hand it will reduce worst
case fragmentation caused by pathological sparse file filling
applications, e.g. out-of-core solvers that do strided writes across
the file to write the first column of the result matrix as it is
calculated, then the second column, then the third ... until all
columns are written.

----

Realistically, for every disadvantage or advantage we can enumerate for
specific workloads, I think one of us will be able to come up with a
counter example that shows the opposite of the original point.  I don't
think this sort of argument is particularly productive. :/

Instead, I look at it from the point of view that a 64k IO is barely
slower than a 4k IO, so such a change would not make much difference to
performance.  And given that terabytes of storage capacity are _cheap_
these days (and getting cheaper all the time), the extra space used by
allocating 64k instead of 4k for sparse blocks isn't a big deal.  When
I combine that with my experience from SGI, where we always recommended
using filesystem block size == page size for best IO performance on HPC
setups, there's a fair argument that using page size extents for small
sparse writes isn't a problem we really need to care about.

I'd prefer to design for where we expect storage to be in the next few
years, e.g. 10TB spindles.  Minimising space usage is not a big
priority when we consider that in 2-3 years 100TB of storage will cost
less than $5000 (it's about $15-20k right now).  Even on desktops we're
going to have more capacity than we know what to do with, so trading
off storage space for lower memory overhead, lower metadata IO overhead
and lower potential fragmentation seems like the right way to move
forward to me.

Does that seem like a reasonable position to take, or are there other
factors that you think I should be considering?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
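
To illustrate the DMAPI boundary change discussed above, here is a
minimal userspace sketch of the rounding involved.  It is not actual
XFS or DMAPI code; the macro names, the write offsets and the 64k/4k
sizes are all made up for this example:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE       65536ULL        /* e.g. 64k pages */
#define FS_BLOCKSIZE    4096ULL         /* e.g. 4k filesystem blocks */

/* the usual power-of-two rounding helpers */
#define round_down(x, y)        ((x) & ~((y) - 1))
#define round_up(x, y)          round_down((x) + (y) - 1, (y))

int main(void)
{
        uint64_t write_off = 130000, write_len = 100;   /* a small write */

        /* current behaviour: HSM asked to restore to block boundaries */
        printf("block-aligned restore range: [%" PRIu64 ", %" PRIu64 ")\n",
               (uint64_t)round_down(write_off, FS_BLOCKSIZE),
               (uint64_t)round_up(write_off + write_len, FS_BLOCKSIZE));

        /* proposed behaviour: restore the page range the write covers */
        printf("page-aligned restore range:  [%" PRIu64 ", %" PRIu64 ")\n",
               (uint64_t)round_down(write_off, PAGE_SIZE),
               (uint64_t)round_up(write_off + write_len, PAGE_SIZE));

        return 0;
}

The removal side is untouched; only the range reported to DMAPI for the
restore gets wider.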
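
Similarly, a toy version of the out-of-core solver write pattern
mentioned above (the file name, matrix size and write_column() helper
are invented for illustration; a real solver is obviously more
involved):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define N       4096            /* N x N row-major matrix of doubles */

/* write one column: N small writes, each one row length (N * 8 bytes) apart */
static void write_column(int fd, int col, const double *column)
{
        int row;

        for (row = 0; row < N; row++) {
                off_t off = ((off_t)row * N + col) * (off_t)sizeof(double);

                if (pwrite(fd, &column[row], sizeof(double), off) < 0) {
                        perror("pwrite");
                        exit(1);
                }
        }
}

int main(void)
{
        static double column[N];        /* results for the current column */
        int fd = open("result.matrix", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        int col;

        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /* columns are produced and written in order: 0, 1, 2, ... */
        for (col = 0; col < N; col++)
                write_column(fd, col, column);

        close(fd);
        return 0;
}

With 4k filesystem blocks, each of those small strided writes can
trigger a separate allocation far away from the previous one, so the
worst case is a badly fragmented file; with 64k (page sized)
allocations far fewer, larger extents are needed, which is the
fragmentation reduction being traded against the extra space consumed
by zero-filled blocks.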