Re: raid10n2/xfs setup guidance on write-cache/barrier

>>> I want to create a raid10,n2 using 3 1TB SATA drives.
>>> I want to create an xfs filesystem on top of it. The
>>> filesystem will be used as NFS/Samba storage.

Consider also an 'o2' layout (it is probably the same thing for a
3-drive RAID10) or even a RAID5, as 3 drives and this usage seem to
be one of the few cases where RAID5 may be plausible.
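
For reference, a minimal sketch of creating either layout (device
names and the 256KiB chunk are only placeholders, adjust to taste):

  # near-2 layout, 3 drives, 256KiB chunk
  mdadm --create /dev/md0 --level=10 --layout=n2 --chunk=256 \
        --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # or the offset-2 variant
  mdadm --create /dev/md0 --level=10 --layout=o2 --chunk=256 \
        --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd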

> [ ... ] I've run some benchmarks with dd trying the different
> chunks and 256k seems like the sweetspot.  dd if=/dev/zero
> of=/dev/md0 bs=64k count=655360 oflag=direct

That's for bulk sequential transfers. Random-ish access, as on a
fileserver perhaps with many smaller files, may not behave the same,
but larger chunks are probably still a good choice.
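
If you want a quick random-ish comparison as well, rather than just
the sequential 'dd', something along these lines with 'fio' is a
plausible sketch (the numbers are placeholders, and note it writes to
the raw array, so do it before 'mkfs'):

  fio --name=randtest --filename=/dev/md0 --direct=1 \
      --ioengine=libaio --rw=randrw --rwmixread=70 \
      --bs=64k --iodepth=8 --runtime=60 --time_based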

>> [ ... ] What kernel version?  This can make a significant
>> difference in XFS metadata performance.

As an aside, that's a myth that has been propagandized by DaveC
in his entertaining presentation not long ago.

There have been decent but no major improvements in XFS metadata
*performance*, but weaker implicit *semantics* have been made an
option, and these have a different safety/performance tradeoff
(less implicit safety, somewhat more performance), not "just"
better performance.

http://lwn.net/Articles/476267/
 «In other words, instead of there only being a maximum of 2MB of
  transaction changes not written to the log at any point in time,
  there may be a much greater amount being accumulated in memory.

  Hence the potential for loss of metadata on a crash is much
  greater than for the existing logging mechanism.

  It should be noted that this does not change the guarantee that
  log recovery will result in a consistent filesystem.

  What it does mean is that as far as the recovered filesystem is
  concerned, there may be many thousands of transactions that
  simply did not occur as a result of the crash.

  This makes it even more important that applications that care
  about their data use fsync() where they need to ensure
  application level data integrity is maintained.»

>>  Your NFS/Samba workload on 3 slow disks isn't sufficient to
>> need that much in memory journal buffer space anyway.

That's probably true, but does no harm.

>>  XFS uses relatime which is equivalent to noatime WRT IO
>> reduction performance, so don't specify 'noatime'.

Uhm, not so sure, and 'noatime' does not hurt either.

> I just wanted to be explicit about it so that I know what is
> set just in case the defaults change

That's what I do as well, because relying on remembering exactly
what the defaults are can sometimes cause confusion. But it is a
matter of taste to a large degree, like 'noatime'.
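
For example, an explicit fstab line might look like this (the UUID
and mount point are placeholders, and the options shown are just the
defaults/preferences discussed here spelled out):

  # 'barrier' is the default anyhow, listed only to be explicit
  UUID=xxxxxxxx-xxxx  /srv/store  xfs  noatime,barrier  0  0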

>> In fact, it appears you don't need to specify anything in
>> mkfs.xfs or fstab, but just use the defaults.  Fancy that.

For NFS/Samba, especially with ACLs (SMB protocol), and
especially if one expects largish directories, I would in general
recommend a larger inode size, at least 1024B if not 2048B.

Also, as a rule I want to make sure that the sector size is set
to 4096B, for future proofing (recent drives not only have 4096B
sectors but usually lie about it).
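
As a sketch, with the above preferences made explicit (the 1024B
inode size and 4096B sector size are my suggestions, not
requirements):

  # larger inodes for ACLs/EAs, 4096B sectors for future proofing
  mkfs.xfs -i size=1024 -s size=4096 /dev/md0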

>>  And the one thing that might actually increase your
>> performance a little bit you didn't specify--sunit/swidth.

Especially 'sunit', as XFS ideally would align metadata on chunk
boundaries.

>>  However, since you're using mdraid, mkfs.xfs will calculate
>> these for you (which is nice as mdraid10 with odd disk count
>> can be a tricky calculation).

Ambiguous more than tricky, and not very useful except for the
chunk size.
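
One can always check what mkfs.xfs picked up, and force it by hand if
the guess looks wrong; a sketch (the mount point and the 'sw' value
are only illustrative, the latter being exactly the ambiguous part
for a 3-disk RAID10):

  # after mounting, see what was auto-detected
  xfs_info /srv/store | grep -E 'sunit|swidth'

  # or set it explicitly at mkfs time, 'su' matching the md chunk
  mkfs.xfs -d su=256k,sw=2 /dev/md0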

>>> Will my files be safe even on sudden power loss?

The answer is NO, if you mean "absolutely safe". But see the
discussion at the end.

>> [ ... ]  Application write behavior does play a role.

Indeed; see the discussion at the end for ways to mitigate this.

>>  UPS with shutdown scripts, and persistent write cache prevent
>> this problem. [ ... ]

There is always the problem of system crashes that don't depend
on power....

>>> Is barrier=1 enough?  Do i need to disable the write cache?
>>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd

>> Disabling drive write caches does decrease the likelihood of
>> data loss.

>>> I tried it but performance is horrendous.

>> And this is why you should leave them enabled and use
>> barriers.  Better yet, use a RAID card with BBWC and disable
>> the drive caches.

> Budget does not allow for RAID card with BBWC

You'd be surprised by how cheap you can get one. But many HW host
adapters with builtin cache have bad performance or horrid bugs,
so you'd have to be careful.

In any case that's not the major problem you have.

>>> Am I better of with ext4? Data safety/integrity is the
>>> priority and optimization affecting it is not acceptable.

XFS is the filesystem of the future ;-). I would choose it over
'ext4' in every plausible case.

> nightly backups will be stored on an external USB disk

USB is an unreliable, buggy and slow transport; eSATA is
enormously better and faster.

> is xfs going to be prone to more data loss in case the
> non-redundant power supply goes out?

That's the wrong question entirely. Data loss can happen for many
other reasons, and XFS is probably one of the safest designs, if
properly used and configured. The problems are elsewhere.

> I just updated the kernel to 3.0.0-16.  Did they take out
> barrier support in mdraid? or was the implementation replaced
> with FUA?  Is there a definitive test to determine if the off
> the shelf consumer sata drives honor barrier or cache flush
> requests?

Usually they do, but that's the least of your worries. Anyhow a
test that occurs to me is to write a known pattern to a file,
let's say 1GiB, then 'fsync', and as soon as 'fsync' completes,
power off. Then check whether the whole 1GiB is the known pattern.
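
A rough sketch of that test (file name, pattern and sizes are
arbitrary; 'conv=fsync' makes 'dd' call 'fsync' before it exits):

  # write 1GiB of a non-zero known pattern, fsync'ed before dd returns
  yes ABCDEFGH | dd of=/srv/store/pattern bs=1M count=1024 \
      iflag=fullblock conv=fsync

  # cut the power as soon as dd returns; after reboot, verify
  yes ABCDEFGH | head -c 1G | cmp - /srv/store/pattern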

> I think I'd like to go with device cache turned ON and barrier
> enabled.

That's how it is supposed to work.
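
To double check that is indeed what you have (device names are
placeholders): the write cache state can be queried rather than
changed with 'hdparm -W' and no value, and the kernel log is worth a
look, as older kernels used to complain there if the trial barrier
write failed:

  # query (not change) the write cache setting on each drive
  hdparm -W /dev/sdb /dev/sdc /dev/sdd

  # look for any barrier/flush related complaints
  dmesg | grep -iE 'barrier|flush'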

As to general safety issues, there seems to be some misunderstanding,
and I'll try to be more explicit than the "lob the grenade" notion.

It matters a great deal what "safety" means in your mind and that
of your users. As a previous comment pointed out, that usually
involves backups, that is, data that has already been stored.

But your insistence on power off and disk caches etc. seems to
indicate that "safety" in your mind means "when I click the
'Save' button it is really saved and not partially".

As to that there are quite a lot of qualifiers:

  * Most users don't understand that even in the best scenario a
    file is really saved not when they *click* the 'Save' button,
    but when they get the "Saved!" message. In between anything
    can happen. Also, work in progress (not yet saved explicitly)
    is fair game.

  * "Really saved" is an *application* concern first and foremost.
    The application *must* say (via 'fsync') that it wants the
    data really saved. Unfortunately most applications don't do
    that because "really saved" is a very expensive operation, and
    usually sytems don't crash, so the application writer looks
    like a genius if he has an "optimistic" attitude. If you do a
    web search look for various O_PONIES discussions. Some intros:

      http://lwn.net/Articles/351422/
      http://lwn.net/Articles/322823/

  * XFS (and to a point 'ext4') is designed for applications that
    work correctly and issue 'fsync' appropriately, and if they do
    it is very safe, because it tries hard to ensure that either
    'fsync' means "really saved" or you know that it does not. XFS
    takes advantage of the assumption that applications do the
    right thing to do various latency-based optimizations between
    calls to 'fsync'.

  * Unfortunately most GUI applications don't do the right thing,
    but fortunately you can compensate for that. The key here is
    to make sure that the flusher's parameters are set for rather
    more frequent flushing than the default, which is equivalent
    to issuing 'fsync' systemwide fairly frequently. Ideally set
    'vm/dirty_bytes' to something like 1-3 seconds of IO transfer
    rate (and, in a reversal of some of my previous advice, leave
    'vm/dirty_background_bytes' at something quite large unless
    you *really* want safety), and shorten significantly
    'vm/dirty_expire_centisecs' and 'vm/dirty_writeback_centisecs'.
    This defeats some XFS optimizations, but that's inevitable.
    (A concrete sketch follows after this list.)

  * In any case you are using NFS/Samba, and that opens a much
    bigger set of issues, because caching happens on the clients
    too: http://www.sabi.co.uk/0707jul.html#070701b
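
As a concrete sketch of the flusher tuning just mentioned (the
numbers are only examples, to be scaled to the array's streaming
rate), in '/etc/sysctl.conf' and loaded with 'sysctl -p':

  # roughly 1-3 seconds of writes at ~100-150MB/s
  vm.dirty_bytes = 300000000
  # expire and write back dirty pages much sooner than the defaults
  vm.dirty_expire_centisecs = 300
  vm.dirty_writeback_centisecs = 100
  # per the above, 'vm.dirty_background_bytes' can be left alone
  # unless you *really* want safety at a further cost in speed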

Then Von Neumann help you if your users or you decide to store lots
of messages in MH/Maildir style mailstores, or VM images on
"growable" virtual disks.