On 4/4/2012 6:57 PM, troby wrote:

The high points:

> 20 TB filesystem
> directory with 10000 pre-allocated 2GB files
> written sequentially, writes of about 120KB
> overwritten (not deleted)
> single writer process using a single thread
> MongoDB
> minimize seek activity
> contiguous file allocation

Current filesystem configuration:

> The filesystem as currently created looks like this:
>
> meta-data=/dev/sdb1              isize=256    agcount=20, agsize=268435448 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=5127012091, imaxpct=1
>          =                       sunit=8      swidth=56 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0

RAID configuration based on xfs_info analysis:

Hardware RAID controller, unknown type, unknown cache configuration
8 x 3TB disks, possibly AF and/or 'green', RPM 5600-7200, cache 32-64MB
RAID5 w/32KB chunk, 7 spindle stripe

> However what I see is that the earliest created files start about 5TB into
> the filesystem. The files are not being created in contiguous block ranges.
> Here is an xfs_bmap example of three files created in sequence:
> 0: [0..4192255]: 24075083136..24079275391
> 0: [0..4192255]: 26222566720..26226758975
> 0: [0..4192255]: 28370050304..28374242559

This is a result of the inode32 allocator behavior. Try the inode64
allocator, as Eric recommended.

> Using seekwatcher I've determined that the actual I/O pattern, even when a
> small number of files is being written to, is spread over a fairly wide
> range of filesystem offsets, resulting in about 250 seeks per second. I

I don't care for the use of the term "seek" here, as there is not a 1:1
correlation between these "seeks" and actual disk head seeks. The latter
is what always matters, because that's where all the latency is. Folks
most often become head seek bound due to the extra read seeks of RMW
operations when using parity RAID. These extra RMW seeks are completely
hidden from Seekwatcher when using hardware RAID, though they should be
somewhat visible with md RAID.

> don't know how to determine how long the seeks are. (I tried to upload the
> seekwatcher image but apparently that's not allowed). Seekwatcher shows the
> I/O activity is in a range between 15 and 17 TB into the filesystem. During
> this time there was a set of about 4 files being actively written as far as
> I know.

It's coarse, but you might start with iostat interactively to get a rough
idea of overall latency. Look at the await column (see man iostat). This
runs at a 1 second interval:

$ iostat -d 1 -x /dev/sdb1

> I'm guessing that the use of multiple allocation groups may explain the
> non-contiguous block allocation, although I read at one point that even with
> multiple allocation groups, files within a single directory would use the
> same group.

What you describe is the behavior of the inode64 allocator, which is
optional. IIRC the [default] inode32 allocator puts all the directories in
the first AG and spreads the files around the remaining AGs, which is what
you are seeing.

> I don't believe I need multiple allocation groups for this
> application due to the single writer and the fact that all files will be
> preallocated before use. Would it be reasonable to force mkfs to use a
> single 20TB allocation group, and would this be likely to produce contiguous
> block allocation?

As Eric mentioned, the max AG size is 1TB, though this may increase in the
future as areal density increases.
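On the inode64 point: switching allocators doesn't require a mkfs, just a
mount option. A rough sketch, assuming the filesystem is mounted at /data
(substitute your real mount point), and keeping in mind that older kernels
won't accept inode64 on a remount, so unmount first:

# umount /data
# mount -o inode64 /dev/sdb1 /data

Add inode64 to the options field of the fstab entry as well so it sticks
across reboots. Files that are already allocated keep their existing
extents; only new allocations follow the inode64 placement, so you'd want
to re-create your preallocated files after the switch.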
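Once the files have been re-created under inode64, you can check whether
consecutive files are actually landing in adjacent block ranges with
xfs_bmap in verbose mode. The path below is just a placeholder for one of
your MongoDB data files:

$ xfs_bmap -v /data/db/collection.0

The -v output includes an AG column for each extent, so it's easy to see
whether sequentially created files end up in the same allocation group
with contiguous block ranges, rather than rotoring across AGs the way your
earlier bmap output shows.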
One final thought I'd like to share: parity RAID, level 5 or 6, is rarely,
if ever, a good fit for write-mostly applications, especially ones that
overwrite in place. You're preallocating all of your files on a fresh XFS,
so you should be able to avoid the extra seeks of RMW until you start
overwriting them. Once that happens your request latency will double, if
not more, and your throughput will take a 2x or greater nose dive.

If you can get by with 11-12TB, I'd rebuild those 8 drives as a RAID10
array. Here are your advantages:

1. No RMW penalty, near zero RAID computation, high throughput
2. No RAID5-style 5-20x write throughput drop when degraded/rebuilding
3. RAID10 loses zero performance when degraded
4. RAID10 rebuild time is 10-20x faster than RAID5
5. RAID10 suffers only a small performance drop during a rebuild

Do you need all 20TB, or something between 10 and 20? What is your current
hardware setup and allowed upgrade budget, if any? I'd be willing to offer
some hardware choices/advice based on your current setup to get the most
out of a RAID10, if you would like.

--
Stan