Re: raid10n2/xfs setup guidance on write-cache/barrier

Hey Peter,

On Thu, Mar 15, 2012 at 10:07 PM, Peter Grandi <pg@xxxxxxxxxxxxxxxxxxx> wrote:
>>>> I want to create a raid10,n2 using 3 1TB SATA drives.
>>>> I want to create an xfs filesystem on top of it. The
>>>> filesystem will be used as NFS/Samba storage.
>
> Consider also an 'o2' layout (it is probably the same thing for a
> 3-drive RAID10) or even a RAID5; 3 drives and this usage seem like
> one of the few cases where RAID5 may be plausible.

Thanks for reminding me about raid5. I'll probably give it a try and
do some benchmarks. I'd also like to try raid10f2.
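
For my own notes, I assume creating those variants would look roughly like
this (device names are placeholders for my three drives, 256K is the chunk
size I have been benchmarking, and I would of course build them one at a
time):

  # raid10 with the far-2 layout; swap f2 for o2 or n2 to try the others
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
        --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # plain raid5 on the same three drives, for comparison
  mdadm --create /dev/md0 --level=5 --chunk=256 \
        --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

Please correct me if any of those options look wrong.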

>> [ ... ] I've run some benchmarks with dd trying the different
>> chunks and 256k seems like the sweet spot.
>>   dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct
>
> That's for bulk sequential transfers. Random-ish, as in a
> fileserver perhaps with many smaller files, may not be the same,
> but probably larger chunks are good.
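
Fair point. For a rougher random-ish test I was thinking of something like
the loop below (4k direct reads at arbitrary offsets on the raw array; the
400000000 modulus is a placeholder for roughly the number of 4k blocks on
the array):

  for i in $(seq 1 1000); do
      dd if=/dev/md0 of=/dev/null bs=4k count=1 iflag=direct \
         skip=$(( (RANDOM * 32768 + RANDOM) % 400000000 )) 2>/dev/null
  done

Not sure how representative that is of an NFS/Samba file-serving load, though.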
>>> [ ... ] What kernel version?  This can make a significant
>>> difference in XFS metadata performance.
>
> As an aside, that's a myth that has been propagandized by DaveC
> in his entertaining presentation not long ago.
>
> There have been decent but no major improvements in XFS metadata
> *performance*, but weaker implicit *semantics* have been made an
> option, and these have a different safety/performance tradeoff
> (less implicit safety, somewhat more performance), not "just"
> better performance.
>
> http://lwn.net/Articles/476267/
>  «In other words, instead of there only being a maximum of 2MB of
>  transaction changes not written to the log at any point in time,
>  there may be a much greater amount being accumulated in memory.
>
>  Hence the potential for loss of metadata on a crash is much
>  greater than for the existing logging mechanism.
>
>  It should be noted that this does not change the guarantee that
>  log recovery will result in a consistent filesystem.
>
>  What it does mean is that as far as the recovered filesystem is
>  concerned, there may be many thousands of transactions that
>  simply did not occur as a result of the crash.
>
>  This makes it even more important that applications that care
>  about their data use fsync() where they need to ensure
>  application level data integrity is maintained.»
>
>>>  Your NFS/Samba workload on 3 slow disks isn't sufficient to
>>> need that much in memory journal buffer space anyway.
>
> That's probably true, but does no harm.
>
>>>  XFS uses relatime which is equivalent to noatime WRT IO
>>> reduction performance, so don't specify 'noatime'.
>
> Uhm, not so sure, and 'noatime' does not hurt either.
>
>> I just wanted to be explicit about it so that I know what is
>> set just in case the defaults change
>
> That's what I do as well, because relying on remembering exactly
> what the defaults are can sometimes cause confusion. But it is a
> matter of taste to a large degree, like 'noatime'.
>
>>> In fact, it appears you don't need to specify anything in
>>> mkfs.xfs or fstab, but just use the defaults.  Fancy that.
>
> For NFS/Samba, especially with ACLs (SMB protocol), and
> especially if one expects largish directories, I would in
> general recommend a larger inode size: at least 1024B, if not
> 2048B.

Thanks for this tip, I will look into adjusting the inode size.
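If I understand correctly, that is just the -i option at mkfs time, so
something like this (1024 per your suggestion, device name hypothetical):

  mkfs.xfs -i size=1024 /dev/md0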

>
> Also, as a rule I want to make sure that the sector size is set
> to 4096B, for future proofing (and recent drives not only have
> 4096B physical sectors, but usually lie and report 512B ones).
>

It seems the 1TB drives that I have still have 512-byte sectors.
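(I assume something like the sysfs check below is the right way to confirm
that; a 512e drive should show physical 4096 with logical 512:

  for d in sdb sdc sdd; do
      echo "$d: logical $(cat /sys/block/$d/queue/logical_block_size)," \
           "physical $(cat /sys/block/$d/queue/physical_block_size)"
  done

and if I follow your future-proofing advice I would presumably force the
larger sector size at mkfs time with 'mkfs.xfs -s size=4096 ...'.)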

>>>  And the one thing that might actually increase your
>>> performance a little bit you didn't specify--sunit/swidth.
>
> Especially 'sunit', as XFS ideally would align metadata on chunk
> boundaries.
>
>>>  However, since you're using mdraid, mkfs.xfs will calculate
>>> these for you (which is nice as mdraid10 with odd disk count
>>> can be a tricky calculation).
>
> Ambiguous more than tricky, and not very useful, except the chunk
> size.
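
Just to make sure I get the mechanics right: I assume I can preview what
mkfs.xfs would pick up from the md geometry with -N before committing, and
override the stripe unit by hand if needed; the sw value below is purely
hypothetical given the ambiguity you mention:

  mdadm --detail /dev/md0 | grep -i chunk    # confirm the chunk size
  mkfs.xfs -N /dev/md0                       # dry run: print the parameters it would use
  mkfs.xfs -d su=256k,sw=3 /dev/md0          # or set stripe unit/width explicitly

Is that roughly the right approach?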
>
>>>> Will my files be safe even on sudden power loss?
>
> The answer is NO, if you mean "absolutely safe". But see the
> discussion at the end.
>
>>> [ ... ]  Application write behavior does play a role.
>
> Indeed, see the discussion at the end and ways to mitigate.
>
>>>  UPS with shutdown scripts, and persistent write cache prevent
>>> this problem. [ ... ]
>
> There is always the problem of system crashes that don't depend
> on power....
>
>>>> Is barrier=1 enough?  Do i need to disable the write cache?
>>>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd
>
>>> Disabling drive write caches does decrease the likelihood of
>>> data loss.
>
>>>> I tried it but performance is horrendous.
>
>>> And this is why you should leave them enabled and use
>>> barriers.  Better yet, use a RAID card with BBWC and disable
>>> the drive caches.
>
>> Budget does not allow for RAID card with BBWC
>
> You'd be surprised by how cheap you can get one. But many HW host
> adapters with builtin cache have bad performance or horrid bugs,
> so you'd have to be careful.

Could you please suggest a hardware RAID card with BBU that's cheap?

>
> In any case that's not the major problem you have.
>
>>>> Am I better of with ext4? Data safety/integrity is the
>>>> priority and optimization affecting it is not acceptable.
>
> XFS is the filesystem of the future ;-). I would choose it over
> 'ext4' in every plausible case.
>
>> nightly backups will be stored on an external USB disk
>
> USB is an unreliable, buggy and slow transport; eSATA is
> enormously better and faster.
>
>> is xfs going to be prone to more data loss in case the
>> non-redundant power supply goes out?
>
> That's the wrong question entirely. Data loss can happen for many
> other reasons, and XFS is probably one of the safest designs, if
> properly used and configured. The problems are elsewhere.

Can you please elaborate on how XFS can be properly used and configured?
>
>> I just updated the kernel to 3.0.0-16.  Did they take out
>> barrier support in mdraid? or was the implementation replaced
>> with FUA?  Is there a definitive test to determine if the off
>> the shelf consumer sata drives honor barrier or cache flush
>> requests?
>
> Usually they do, but that's the least of your worries. Anyhow, a
> test that occurs to me is to write a known pattern to a file,
> let's say 1GiB, then 'fsync', and as soon as 'fsync' completes,
> power off. Then check whether the whole 1GiB is the known pattern.
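
That is a test I can actually run. My reading of it in shell form, with
paths as placeholders and the reference pattern kept on a disk that is not
under test:

  # generate a 1GiB reference pattern somewhere safe
  dd if=/dev/urandom of=/root/pattern.ref bs=1M count=1024

  # copy it onto the filesystem under test; conv=fsync makes dd call
  # fsync on the output file before it exits
  dd if=/root/pattern.ref of=/mnt/test/pattern bs=1M conv=fsync
  # ... pull the power the moment dd returns ...

  # after reboot and log recovery
  cmp /root/pattern.ref /mnt/test/pattern && echo "pattern intact"

Does that match what you had in mind?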
>
>> I think I'd like to go with device cache turned ON and barrier
>> enabled.
>
> That's how it is supposed to work.
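
For the record, I take that to mean leaving the drive write caches on and
making sure barriers are not disabled at mount time, i.e. roughly:

  hdparm -W /dev/sdb /dev/sdc /dev/sdd    # should report write-caching = 1 (on)

plus an fstab line along these lines (mount point hypothetical; as far as I
can tell XFS takes a plain 'barrier' option, which is the default anyway,
rather than ext4-style 'barrier=1'):

  /dev/md0  /srv/storage  xfs  noatime,barrier  0  0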
>
> As to general safety issues, there seems to be some misunderstanding,
> and I'll try to be more explicit than the "lob the grenade" notion.
>
> It matters a great deal what "safety" means in your mind and that
> of your users. As a previous comment pointed out, that usually
> involves backups, that is, data that has already been stored.
>
> But your insistence on power off and disk caches etc. seems to
> indicate that "safety" in your mind means "when I click the
> 'Save' button it is really saved and not partially".
>
Let me define safety as needed by my use case:
fileA is a 2MB OpenOffice document that already exists on the filesystem.
userA opens fileA locally, modifies a lot of lines, and attempts to save it.
As the save operation is proceeding, the PSU goes haywire and power
is cut abruptly.
When the system is turned back on, I expect some sort of recovery process
to bring the filesystem to a consistent state.
I expect fileA to be as it was before the save operation and not to be
corrupted in any way.
Am I asking/expecting too much?

> As to that there are quite a lot of qualifiers:
>
>  * Most users don't understand that even in the best scenario a
>    file is really saved not when they *click* the 'Save' button,
>    but when they get the "Saved!" message. In between anything
>    can happen. Also, work in progress (not yet saved explicitly)
>    is fair game.
>
>  * "Really saved" is an *application* concern first and foremost.
>    The application *must* say (via 'fsync') that it wants the
>    data really saved. Unfortunately most applications don't do
>    that because "really saved" is a very expensive operation, and
>    usually systems don't crash, so the application writer looks
>    like a genius if he has an "optimistic" attitude. If you do a
>    web search, look for the various O_PONIES discussions. Some intros:
>
>      http://lwn.net/Articles/351422/
>      http://lwn.net/Articles/322823/
>
>  * XFS (and to a point 'ext4') is designed for applications that
>    work correctly and issue 'fsync' appropriately, and if they do
>    it is very safe, because it tries hard to ensure that either
>    'fsync' means "really saved" or you know that it does not. XFS
>    takes advantage of the assumption that applications do the
>    right thing to do various latency-based optimizations between
>    calls to 'fsync'.
>
>  * Unfortunately most GUI applications don't do the right thing,
>    but fortunately you can compensate for that. The key here is
>    to make sure that the flusher's parameters are set for rather
>    more frequent flushing than the default, which is roughly
>    equivalent to issuing 'fsync' systemwide fairly frequently.
>    Ideally set 'vm/dirty_bytes' to something like 1-3 seconds of
>    IO transfer rate (and, reversing some of my previous advice,
>    leave 'vm/dirty_background_bytes' at something quite large
>    unless you *really* want safety), and shorten
>    'vm/dirty_expire_centisecs' and 'vm/dirty_writeback_centisecs'
>    significantly. This defeats some XFS optimizations, but that's
>    inevitable.
>
>  * In any case you are using NFS/Samba, and that opens a much
>    bigger set of issues, because caching happens on the clients
>    too: http://www.sabi.co.uk/0707jul.html#070701b
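
Regarding the flusher settings: to check I am reading that right, would
something along these lines be in the right ballpark? The numbers are only
illustrative, assuming very roughly 100MB/s of array write bandwidth, and
the file name is arbitrary:

  # /etc/sysctl.d/99-flusher.conf
  vm.dirty_bytes = 300000000            # ~3 seconds of IO at 100MB/s
  vm.dirty_expire_centisecs = 500       # dirty data eligible for writeback after 5s
  vm.dirty_writeback_centisecs = 100    # wake the flusher every 1s
  # vm.dirty_background_bytes left alone, per your note above

  sysctl -p /etc/sysctl.d/99-flusher.conf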
>
> Then may Von Neumann help you if you or your users decide to store
> lots of messages in MH/Maildir-style mailstores, or VM images on
> "growable" virtual disks.

What's wrong with VM images on "growable" virtual disks? Are you
saying not to rely on LVM2 volumes?

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

