Re: Need advice on building a new XFS setup for large files

On Tue, Jan 22, 2013 at 06:49:46AM -0600, Stan Hoeppner wrote:
> On 1/21/2013 10:22 PM, Alvin Ong wrote:
> > Hi,
> > 
> > We are building a solution with a web front end for our users to store
> > large files.
> > Large files starting from the size of 500GB and above and can grow up to
> > 1-2TB's per file.
> > This is the reason we are trying out XFS to see if we can get a test system
> > running.
> 
> Tell us more about these files.  Is this simply bulk file storage?
> Start at 500GB and append until 2TB?  How often will the files be
> appended and at what rate?  I.e. will it take 3 days to append from
> 500GB to 2TB or take 3 months?  The answer to this dictates how the
> files and filesystem will fragment over time.  Constantly expanding with
> additional 6 spindle constituent arrays, LVM concatenation, and
> xfs_growfs may leave you with an undesirable, possibly disastrous,
> fragmentation pattern.

I'd say it's guaranteed, not a possibility.

> > We plan to use a 6+2 RAID6 to start off with. Then when it gets filled up
> > to maybe 60-70% we will
> > expand by adding another 6+2 RAID6 to the array.
> > The max we can grow this configuration is up to 252TB usable which should
> > be enough for a year.
> > Our requirements might grow up to 2PB in 2 years time if all goes well.
> 
> I'd not attempt growing a single XFS to the scale you're describing, via
> the methods you describe.  The odds of catastrophe are too great.

It's a recipe for disaster and not recommended at all.

> > So I have been testing all of this out on a VM running 3 vmdk's and using
> > LVM to create a single logical volume of the 3 disks.
> > I noticed that out of sdb, sdc and sdd, files keep getting written to sdc.
> > This is probably due to our web app creating a single folder and all files
> > are written under that folder.
> > Is this the nature of XFS allocation groups? Is there a way to avoid
> > this?
> 
> Yes.
> 
> 1.  Don't put all files in a single directory.
> 
> 2.  Use the inode32 allocator on a filesystem greater than 1TB in size.
>  This will cause inodes to be located in the first 1TB and files to be
> allocated round robin across the AGs via rotor stepping.  See page 10:
> http://oss.sgi.com/projects/xfs/training/xfs_slides_06_allocators.pdf

3: Use a storage layout that is not affected by hotspots due to
filesystem locality.

That is, build the storage to the scale that you are likely to need
in the future, i.e. use all 112 disks (14x 6+2 RAID6 = 112 disks) to
begin with and lay the storage and filesystem out optimally
accordingly.  I'd build seven 14+2 hardware RAID6 LUNs (112 disks)
and stripe them in RAID0, setting the XFS stripe unit to be the
width of a hardware RAID6 LUN. That way sequential IO to a single
region of the disk still hits every single disk in the array, and
hotspots don't occur.
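
For illustration only - assuming a 256k RAID6 chunk (so each 14+2
LUN presents a 14 * 256k = 3584k wide data stripe) and a made-up
device name - that layout gets described to mkfs.xfs roughly like:

  # su = data width of one RAID6 LUN, sw = number of LUNs in the RAID0
  mkfs.xfs -d su=3584k,sw=7 /dev/mapper/bigvol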

If you do this, it doesn't matter whether you use inode64 or inode32
from a hotspot perspective, only from a file fragmentation
perspective. This is the way XFS has been used for exactly this sort
of storage for the last 15 years....
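
If you do want to force one allocator or the other, it's just a
mount option. Exactly which one is the default depends on kernel
version, and the device and mount point names here are made up:

  # inode64: inodes allocated close to their data, across all AGs
  mount -o inode64 /dev/mapper/bigvol /storage

  # inode32: inodes kept below 1TB, file data rotor-stepped across
  # AGs (the explicit inode32 option needs a recent kernel)
  mount -o inode32 /dev/mapper/bigvol /storage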

> > As we will have files keep writing to the same disk thus creating a
> > hot spot.
> > Although it might not hurt us that much if we fill up a single RAID6 to
> > 60-70% then adding another RAID6 to the mix. We could go up to a total of
> > 14 RAID6 sets.
> 
> Again, you probably don't want to do this.  Too many eggs in one basket.
> 
> You should investigate using GlusterFS to tie multiple XFS storage
> servers together into a single file tree.

Another possible solution. You should talk to RedHat (says the
RedHat employee ;)....

> Running an xfs_repair on a single filesystem denies all access, and with
> a 252TB XFS this could take some time.

For a filesystem with 1-2TB files, it'll take 30s to run. That's not
an issue.
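
If you want a feel for that yourself, a no-modify run against a test
filesystem (device name made up) gives you a repair time estimate
without changing anything:

  xfs_repair -n /dev/mapper/bigvol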

> > Is LVM a good choice of doing this configuration? Or do you have a better
> > recommendation?
> > The reason we thought LVM would be good was so that we could easily grow
> > XFS.
> 
> Why not do the concatenation within the SAN array controller?

Same problem as LVM concatenation. Hot spots.

> > If I was to use the 8-disk RAID6 array with a 256kB stripe size, it will have
> > a sunit of 512 and a swidth of (8-2)*512=3072.
> 
> So a 256KB strip and a 1.5MB stripe.  With RAID6 RMW?  I wouldn't
> recommend this.

Large files, sequential IO, there will be no RMW cycles in the RAID.
The write cache of the RAID controller will do the aggregation of
individual IOs into full stripe writes just fine.
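
FWIW, those sector counts are just the byte-based mkfs.xfs geometry
written a different way (device name made up):

  # sunit=512 sectors = 256k, swidth=3072 sectors = 6 x 256k
  mkfs.xfs -d su=256k,sw=6 /dev/mapper/testvol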

> It appears most of your writes will be appends, meaning little
> allocation, which means little stripe aligned write out.  Here you are
> trying to optimize for large IOs which would be fine if you had an all
> or mostly allocation workload, but you don't.  You have an append heavy
> workload.
> 
> Using large strips (stripe units, chunks) with parity RAID, especially
> RAID6, will simply murder your append performance due to massive
> read-modify-write operations on large strips.

No, that's wrong. Sequential IO will always fill full stripes in the
cache, so RMW cycles simply will not happen. Remember that RMW
occurs when the cache has to be flushed to the back end disks, not
when writes come in to the front end cache....

> With RAID6 with a mostly append workload, you should be using a small
> strip size.  This has been discussed here at length and the consensus is
> anything over a 32KB strip size doesn't improve performance, but can
> hurt performance, especially with parity RAID.  Thus you should create
> your 6+2 arrays with a 32KB strip and (6*32)=192KB stripe, and create
> your XFS with "-d su=32k,sw=6".  This should yield significantly better
> append performance.

That's a tuning for an IOPS intensive workload, not for a large
scale, large file storage workload.

While sequential writes are an append workload, it's an append
workload that the RAID controller is optimised to avoid causing RMW
cycles for. As such, the above is bad advice for large files with
sequential IO workloads. Large files, a large filesystem and
sequential IO are ideal for large RAID6 widths....

> External log devices are for systems that modify metadata at rates of
> hundreds of IOs per second.  So don't specify a log device.

Even at hundreds of thousands of IOs per second, external logs don't
provide much in the way of benefit thanks to delayed logging. The
only reason for using an external log these days is a fsync heavy or
synchronous write workload. And in most cases a BBWC means even
those workloads don't need an external log...
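
For reference, if you did end up with a fsync heavy workload, an
external log is just a mkfs/mount pair (device names made up):

  mkfs.xfs -l logdev=/dev/ssd1,size=512m /dev/mapper/bigvol
  mount -o logdev=/dev/ssd1 /dev/mapper/bigvol /storage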

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

