Re: Need advice on building a new XFS setup for large files

On 1/22/2013 4:05 PM, Dave Chinner wrote:
> On Tue, Jan 22, 2013 at 06:49:46AM -0600, Stan Hoeppner wrote:
>> On 1/21/2013 10:22 PM, Alvin Ong wrote:
>>> Hi,
>>>
>>> We are building a solution with a web front end for our users to store
>>> large files.
>>> Large files starting from 500GB and above, which can grow up to
>>> 1-2TB per file.
>>> This is the reason we are trying out XFS to see if we can get a test system
>>> running.
>>
>> Tell us more about these files.  Is this simply bulk file storage?
>> Start at 500GB and append until 2TB?  How often will the files be
>> appended and at what rate?  I.e. will it take 3 days to append from
>> 500GB to 2TB or take 3 months?  The answer to this dictates how the
>> files and filesystem will fragment over time.  Constantly expanding with
>> additional 6 spindle constituent arrays, LVM concatenation, and
>> xfs_growfs may leave you with an undesirable, possibly disastrous,
>> fragmentation pattern.
> 
> I'd say it's guaranteed, not a possibility.
> 
>>> We plan to use a 6+2 RAID6 to start off with. Then when it gets filled up
>>> to maybe 60-70% we will
>>> expand by adding another 6+2 RAID6 to the array.
>>> The max we can grow this configuration is up to 252TB usable which should
>>> be enough for a year.
>>> Our requirements might grow up to 2PB in 2 years time if all goes well.
>>
>> I'd not attempt growing a single XFS to the scale you're describing, via
>> the methods you describe.  The odds of catastrophe are too great.
> 
> It's a recipe for disaster and not recommended at all.
> 
>>> So I have been testing all of this out on a VM running 3 vmdk's and using
>>> LVM to create a single logical volume of the 3 disks.
>>> I noticed that out of sdb, sdc and sdd, files keep getting written to sdc.
>>> This is probably due to our web app creating a single folder and all files
>>> are written under that folder.
>>> Is this the nature of the Allocation Groups in XFS? Is there a way to
>>> avoid this?
>>
>> Yes.
>>
>> 1.  Don't put all files in a single directory.
>>
>> 2.  Use the inode32 allocator on a filesystem greater than 1TB in size.
>>  This will cause inodes to be located in the first 1TB and files to be
>> allocated round robin across the AGs via rotor stepping.  See page 10:
>> http://oss.sgi.com/projects/xfs/training/xfs_slides_06_allocators.pdf
> 
> 3: Use a storage layout that is not affected by hotspots due to
> filesystem locality.
> 
> That is, build the storage to the scale that you are likely to need
> in the future. i.e. use all 112 disks (14x 6+2 RAID = 112 disks) to
> begin with and lay the storage and filesystem out optimally
> accordingly.  I'd build seven 14+2 hardware RAID6 luns (112 disks)
> and stripe them in RAID0, setting the XFS stripe unit to be the
> width of a hardware RAID6 lun. That way sequential IO to a single
> region of the disk still hits every single disk in the array, and
> hotspots don't occur.
>
> If you do this, it doesn't matter if you use inode64 or inode32 from
> a hotspot perspective, only from a file fragmentation perspective. This
> is the way XFS has been used for exactly this sort of storage for
> the last 15 years....
> 
>>> As we will have files keep being written to the same disk, thus creating
>>> a hot spot.
>>> Although it might not hurt us that much if we fill up a single RAID6 to
>>> 60-70% then adding another RAID6 to the mix. We could go up to a total of
>>> 14 RAID6 sets.
>>
>> Again, you probably don't want to do this.  Too many eggs in one basket.
>>
>> You should investigate using GlusterFS to tie multiple XFS storage
>> servers together into a single file tree.
> 
> Another possible solution. You should talk to RedHat (says the
> RedHat employee ;)....

I get the impression the "grow as you go" mindset here is probably due
to budget/cash flow issues, as well as evaluating the system at small
scale before committing to going larger.  Thus I'd guess building the
112 drive system up front isn't a real possibility.  And this is where
something like Gluster atop XFS would really come in handy, as it would
make "grow as you go" much more feasible, while avoiding the 'game over'
fragmentation issue with simply growing XFS in the manner described by
the OP.

Emmanuel states Gluster is slow, but that's a very relative statement.
For clients streaming single large files over GbE or slower links it
should be plenty fast.  Gluster and similar network file systems tend to
be slow with metadata intensive or transactional workloads.
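
To give a rough idea of how "grow as you go" looks with Gluster, a plain
distribute volume ties independent XFS bricks into one namespace and can
be expanded brick by brick.  The hostnames, brick paths, and volume name
below are made up, purely for illustration:

  gluster volume create bigfiles transport tcp \
      server1:/export/brick1 server2:/export/brick1
  gluster volume start bigfiles
  mount -t glusterfs server1:/bigfiles /mnt/bigfiles

  # later, as budget allows
  gluster volume add-brick bigfiles server3:/export/brick1
  gluster volume rebalance bigfiles start

Each brick is just a modestly sized XFS filesystem on its own server, so
no single XFS ever has to be grown past the hardware underneath it.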

>> Running an xfs_repair on a single filesystem denies all access, and with
>> a 252TB XFS this could take some time.
> 
> For a filesystem with 1-2TB files, it'll take 30s to run. That's not
> an issue.

For some reason I was thinking data size instead of metadata.  With only
a few hundred to a few thousand files it would be quick indeed, a non-issue.
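
For anyone who wants a quick sanity check on repair time, a no-modify run
against the unmounted filesystem gives a decent feel for it (the device
name below is just an example):

  xfs_repair -n /dev/vg_store/lv_store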

>>> Is LVM a good choice of doing this configuration? Or do you have a better
>>> recommendation?
>>> The reason we thought LVM would be good was so that we could easily grow
>>> XFS.
>>
>> Why not do the concatenation within the SAN array controller?
> 
> Same problem as LVM concatenation. Hot spots.

I was simply suggesting hardware vs. software concatenation there,
unrelated to his current flawed expansion path idea, as his SAN
controller probably has some nice features and better performance for it.
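
For reference, the mechanics of the LVM grow path the OP describes are
simple enough; it's the resulting allocation pattern Dave objects to, not
the mechanics.  The device, volume group, and mount point names below are
made up:

  vgextend vg_store /dev/mapper/new_raid6_lun
  lvextend -l +100%FREE /dev/vg_store/lv_store
  xfs_growfs /srv/store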

>>> If I was to use the 8-disk RAID6 array with a 256kB stripe size, it will have a
>>> sunit of 512 and a swidth of (8-2)*512=3072.
>>
>> So a 256KB strip and a 1.5MB stripe.  With RAID6 RMW?  I wouldn't
>> recommend this.
> 
> Large files, sequential IO, there will be no RMW cycles in the RAID.
> The write cache of the RAID controller will do the aggregation of
> individual IOs into full stripe writes just fine. 
> 
>> It appears most of your writes will be appends, meaning little
>> allocation, which means little stripe aligned write out.  Here you are
>> trying to optimize for large IOs which would be fine if you had an all
>> or mostly allocation workload, but you don't.  You have an append heavy
>> workload.
>>
>> Using large strips (stripe units, chunks) with parity RAID, especially
>> RAID6, will simply murder your append performance due to massive
>> read-modify-write operations on large strips.
> 
> No, that's wrong. Sequential IO will always fill full stripes in the
> cache, so RMW cycles simply will not happen. Remember that RMW
> occurs when the cache has to be flushed to the back end disks, not
> when writes come in to the front end cache....
> 
>> With RAID6 with a mostly append workload, you should be using a small
>> strip size.  This has been discussed here at length and the consensus is
>> anything over a 32KB strip size doesn't improve performance, but can
>> hurt performance, especially with parity RAID.  Thus you should create
>> your 6+2 arrays with a 32KB strip and (6*32)=192KB stripe, and create
>> your XFS with "-d su=32k,sw=6".  This should yield significantly better
>> append performance.
> 
> That's a tuning for an IOPS intensive workload, not a large scale,
> large file storage workload.
> 
> While sequential writes are an append workload, it's an append
> workload that the RAID controller is optimised to avoid causing RMW
> cycles for. As such, the above is bad advice for large files with
> sequential IO workloads. Large files, large filesystem, sequential
> IO is ideal for large RAID6 widths....

Yes, of course.  WRT XFS you've drilled "allocation=aligned" and "non
allocation=unaligned" so thoroughly into my head that I failed to
actually think for a second about what the hardware does with this type
of large append data stream.  I feel a bit silly making this juvenile
oversight.  Won't happen again. ;)
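
For the archives, the stripe alignment for the two layouts discussed in
this thread would be set at mkfs time roughly as follows.  The chunk size
is just the 256kB example from the OP, and the device names are
placeholders:

  # OP's single 6+2 RAID6 lun, 256kB chunk, 6 data spindles
  mkfs.xfs -d su=256k,sw=6 /dev/sdX

  # Dave's seven 14+2 RAID6 luns striped in RAID0: stripe unit is one
  # lun's full data width (14 x 256kB), stripe width is the 7 luns
  mkfs.xfs -d su=3584k,sw=7 /dev/md0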

>> External log devices are for systems that modify metadata at rates of
>> hundreds of IOs per second.  So don't specify a log device.
> 
> Even at hundreds of thousands of IOs per second, external logs don't
> provide much in the way of benefit thanks to delayed logging. The only
> reason for using an external log these days is a fsync heavy or
> synchronous write workload. And in most cases a BBWC means even
> those workloads don't need an external log...

Which bloke provided us with this journal magic code again?  Can't
recall his name... ;)

-- 
Stan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

