Re: Need advice on building a new XFS setup for large files

On 1/23/2013 9:09 AM, Alvin Ong wrote:
> Thanks Stan, Dave and Emmanuel for such informative replies. I will take
> some time to digest this information and make some considerations.
> As for the files, they start at 500GB at a minimum. The rate of
> growth is not known as of yet.
> But it won't be high loads. The idea is sort of like cloud storage for
> the customer to dump data.

So you're going to be using some kind of virtual disk setup for each
customer, and these virtual disks are sparse files on XFS?  They start
as empty 500GB files and you will increase this size over time as they
fill up?  The user files that each customer uploads could be either
small or large, correct?

If so, then my caution about RMW cycles with large strips/stripes may
have been justified after all.  What customer behavior do you
anticipate?  I.e. will they use this resource like a local disk,
copying individual files up one at a time, or more as a backup target,
doing batch copies with a substantial total transfer size?

If it's mostly the small-file, small-transfer scenario, then you
probably don't want the large stripe setup Dave described, as you will
incur a big RMW penalty.  If the transfers are all or mostly multiple
megabytes in size, then the large stripe may still be fine.
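
To put rough numbers on that (illustrative only; I'm assuming a 256KB
chunk on a 6+2 RAID6, i.e. a 6 * 256KB = 1.5MB full stripe):  a lone
64KB write that the controller can't aggregate into a full stripe has
to read the old 64KB of data plus the two old parity chunks, then
write the new data plus two recomputed parity chunks.  That's roughly
3 reads and 3 writes on the back end for one small front end write.
Transfers of multiple megabytes fill whole 1.5MB stripes in cache and
avoid all of that.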

> With that said, we also do not want to have issues with fragmentation or
> any failure that could cause data loss in the future.

I'm not familiar enough with sparse files to comment on fragmentation
patterns.  Others may have some answers/recommendations for you here.
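
If you do go the sparse image route, it's at least worth measuring how
the images fragment as they fill during your testing.  Something along
these lines (the paths below are just placeholders for your setup):

  # extent map of one customer image; the extent count is a direct
  # measure of how fragmented that file has become
  xfs_bmap -v /storage/customers/cust0001.img

  # overall fragmentation report for the filesystem, read-only
  xfs_db -r -c frag /dev/mapper/yourvol

Rerun those periodically while load testing and you'll see quickly
whether the allocator is keeping the images in large contiguous
extents.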

-- 
Stan


> Thanks
> Alvin
> 
> 
> On 23-Jan-13 8:35 PM, Stan Hoeppner wrote:
>> On 1/22/2013 4:05 PM, Dave Chinner wrote:
>>> On Tue, Jan 22, 2013 at 06:49:46AM -0600, Stan Hoeppner wrote:
>>>> On 1/21/2013 10:22 PM, Alvin Ong wrote:
>>>>> Hi,
>>>>>
>>>>> We are building a solution with a web front end for our users to store
>>>>> large files.
>>>>> Large files starting from the size of 500GB and above and can grow
>>>>> up to
>>>>> 1-2TB's per file.
>>>>> This is the reason we are trying out XFS to see if we can get a
>>>>> test system
>>>>> running.
>>>> Tell us more about these files.  Is this simply bulk file storage?
>>>> Start at 500GB and append until 2TB?  How often will the files be
>>>> appended and at what rate?  I.e. will it take 3 days to append from
>>>> 500GB to 2TB or take 3 months?  The answer to this dictates how the
>>>> files and filesystem will fragment over time.  Constantly expanding
>>>> with
>>>> additional 6 spindle constituent arrays, LVM concatenation, and
>>>> xfs_growfs may leave you with an undesirable, possibly disastrous,
>>>> fragmentation pattern.
>>> I'd say it's guaranteed, not a possibility.
>>>
>>>>> We plan to use a 6+2 RAID6 to start off with. Then when it gets
>>>>> filled up
>>>>> to maybe 60-70% we will
>>>>> expand by adding another 6+2 RAID6 to the array.
>>>>> The max we can grow this configuration is up to 252TB usable which
>>>>> should
>>>>> be enough for a year.
>>>>> Our requirements might grow up to 2PB in 2 years time if all goes
>>>>> well.
>>>> I'd not attempt growing a single XFS to the scale you're describing,
>>>> via
>>>> the methods you describe.  The odds of catastrophe are too great.
>>> It's a recipe for disaster and not recommended at all.
>>>
>>>>> So I have been testing all of this out on a VM running 3 vmdk's and
>>>>> using
>>>>> LVM to create a single logical volume of the 3 disks.
>>>>> I noticed that out of sdb, sdc and sdd, files keep getting written
>>>>> to sdc.
>>>>> This is probably due to our web app creating a single folder and
>>>>> all files
>>>>> are written under that folder.
>>>>> This is the nature of the Allocation Group of XFS? Is there a way
>>>>> to avoid
>>>>> this?
>>>> Yes.
>>>>
>>>> 1.  Don't put all files in a single directory.
>>>>
>>>> 2.  Use the inode32 allocator on a filesystem greater than 1TB in size.
>>>>   This will cause inodes to be located in the first 1TB and files to be
>>>> allocated round robin across the AGs via rotor stepping.  See page 10:
>>>> http://oss.sgi.com/projects/xfs/training/xfs_slides_06_allocators.pdf
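
(Quick addendum to my #2 above:  whether you get inode32 behaviour by
default depends on the kernel.  Older kernels default to inode32,
while newer ones default to inode64 and you have to ask for the old
behaviour explicitly, something like

  mount -o inode32 /dev/mapper/yourvol /storage

so check the documentation for whichever kernel you end up deploying.)
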
>>> 3: Use a storage layout that is not affected by hotspots due to
>>> filesystem locality.
>>>
>>> That is, build the storage to the scale that you are likely to need
>>> in the future. i.e. use all 112 disks (14x 6+2 RAID = 112 disks) to
>>> begin with and lay the storage and filesystem out optimally
>>> accordingly.  I'd build seven 14+2 hardware RAID6 luns (112 disks)
>>> and stripe them in RAID0, setting the XFS stripe unit to be the
>>> width of a hardware RAID6 lun. That way sequential IO to a single
>>> region of the disk still hits every single disk in the array, and
>>> hotspots don't occur.
>>>
>>> If you do this, it doesn't matter whether you use inode64 or inode32
>>> from a hotspot perspective, only from a file fragmentation
>>> perspective. This is the way XFS has been used for exactly this sort
>>> of storage for
>>> the last 15 years....
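
(To make Dave's geometry concrete for the OP, and assuming a 256KB
hardware chunk purely as an example (use whatever your controller is
actually configured with):  each 14+2 LUN exposes 14 data disks * 256KB
= 3584KB of data per hardware stripe, and seven of those LUNs make up
the RAID0, so the mkfs would look roughly like

  mkfs.xfs -d su=3584k,sw=7 /dev/mapper/bigvol

i.e. XFS stripe unit = one LUN's data width, stripe width = the number
of LUNs, exactly as Dave describes.  The device path is made up.)
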
>>>
>>>>> As we will have files constantly being written to the same disk, thus creating a
>>>>> hot spot.
>>>>> Although it might not hurt us that much if we fill up a single
>>>>> RAID6 to
>>>>> 60-70% then adding another RAID6 to the mix. We could go up to a
>>>>> total of
>>>>> 14 RAID6 sets.
>>>> Again, you probably don't want to do this.  Too many eggs in one
>>>> basket.
>>>>
>>>> You should investigate using GlusterFS to tie multiple XFS storage
>>>> servers together into a single file tree.
>>> Another possible solution. You should talk to RedHat (says the
>>> RedHat employee ;)....
>> I get the impression the "grow as you go" mindset here is probably due
>> to budget/cash flow issues, as well as evaluating the system at small
>> scale before committing to going larger.  Thus I'd guess building the
>> 112 drive system up front isn't a real possibility.  And this is where
>> something like Gluster atop XFS would really come in handy, as it would
>> make "grow as you go" much more feasible, while avoiding the 'game over'
>> fragmentation issue with simply growing XFS in the manner described by
>> the OP.
>>
>> Emmanuel states Gluster is slow, but that's a very relative statement.
>> For clients streaming single large files over GbE or slower links it
>> should be plenty fast.  Gluster and similar network file systems tend to
>> be slow with metadata intensive or transactional workloads.
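
(Purely to illustrate how "grow as you go" looks on the Gluster side;
the hostnames and brick paths below are made up.  You would start with
a distributed volume over a couple of XFS bricks:

  gluster volume create custstore server1:/bricks/xfs1 server2:/bricks/xfs1
  gluster volume start custstore

and when more space is needed you add bricks and rebalance, rather
than growing any single XFS:

  gluster volume add-brick custstore server3:/bricks/xfs1
  gluster volume rebalance custstore start

Each brick stays a modest, independently repairable XFS.)
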
>>
>>>> Running an xfs_repair on a single filesystem denies all access, and
>>>> with
>>>> a 252TB XFS this could take some time.
>>> For a filesystem with 1-2TB files, it'll take 30s to run. That's not
>>> an issue.
>> For some reason I was thinking data size instead of metadata.  With only
>> a few hundred to a few thousand files it would be quick indeed, a
>> non-issue.
>>
>>>>> Is LVM a good choice of doing this configuration? Or do you have a
>>>>> better
>>>>> recommendation?
>>>>> The reason we thought LVM would be good was so that we could easily
>>>>> grow
>>>>> XFS.
>>>> Why not do the concatenation within the SAN array controller?
>>> Same problem as LVM concatenation. Hot spots.
>> I was simply suggesting hardware vs software concatenation here,
>> unrelated to his current flawed expansion path idea, as his SAN
>> controller probably offers some nice features and better performance.
>>
>>>>> If I was to use the 8-disk RAID6 array with a 256kB stripe size it
>>>>> will have a sunit of 512 and a swidth of (8-2)*512=3072.
>>>> So a 256KB strip and a 1.5MB stripe.  With RAID6 RMW?  I wouldn't
>>>> recommend this.
>>> Large files, sequential IO, there will be no RMW cycles in the RAID.
>>> The write cache of RAID controller will do the aggregation of
>>> individual IOs into full stripe writes just fine.
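
(Side note for the OP:  sunit/swidth as you quoted them are in units
of 512-byte sectors, so sunit=512,swidth=3072 describes the same
geometry as the more readable form

  mkfs.xfs -d su=256k,sw=6 <device>

where 512 sectors is 256KB and 3072 sectors is six of those.)
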
>>>
>>>> It appears most of your writes will be appends, meaning little
>>>> allocation, which means little stripe aligned write out.  Here you are
>>>> trying to optimize for large IOs which would be fine if you had an all
>>>> or mostly allocation workload, but you don't.  You have an append heavy
>>>> workload.
>>>>
>>>> Using large strips (stripe units, chunks) with parity RAID, especially
>>>> RAID6, will simply murder your append performance due to massive
>>>> read-modify-write operations on large strips.
>>> No, that's wrong. Sequential IO will always fill full stripes in the
>>> cache, so RMW cycles simply will not happen. Remember that RMW
>>> occurs when the cache has to be flushed to the back end disks, not
>>> when writes come in to the front end cache....
>>>
>>>> With RAID6 with a mostly append workload, you should be using a small
>>>> strip size.  This has been discussed here at length and the
>>>> consensus is
>>>> anything over a 32KB strip size doesn't improve performance, but can
>>>> hurt performance, especially with parity RAID.  Thus you should create
>>>> your 6+2 arrays with a 32KB strip and (6*32)=192KB stripe, and create
>>>> your XFS with "-d su=32k,sw=6".  This should yield significantly better
>>>> append performance.
>>> That's a tuning for an IOPS intensive workload, not a large scale,
>>> large file storage workload.
>>>
>>> While sequential writes are an append workload, it's an append
>>> workload that the RAID controller is optimised to avoid causing RMW
>>> cycles for. As such, the above is bad advice for large files with
>>> sequential IO workloads. Large files, large filesystem, sequential
>>> IO is ideal for large RAID6 widths....
>> Yes, of course.  WRT XFS you've drilled "allocation=aligned" and "non
>> allocation=unaligned" so thoroughly into my head that I failed to
>> actually think for a second about what the hardware does with this type
>> of large append data stream.  I feel a bit silly making this juvenile
>> oversight.  Won't happen again. ;)
>>
>>>> External log devices are for systems that modify metadata at rates of
>>>> hundreds of IOs per second.  So don't specify a log device.
>>> Even at hundreds of thousands of IOs per second, external logs don't
>>> provide much in way of benefit thanks to delayed logging. The only
>>> reason for using an external log these days is a fsync heavy or
>>> synchronous write workload. And in most cases a BBWC means even
>>> those workloads don't need an external log...
>> Which bloke provided us with this journal magic code again?  Can't
>> recall his name... ;)
>>
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs

