On 09/19/12 00:20, Nicholas A. Bellinger wrote:
Are you enabling emulate_write_cache=1 with your iblock
backends..? This can have a gigantic effect on initiator
performance for both MSFT + Linux SCSI clients.
That sounds interesting, but also potentially rather dangerous,
unless there is a very reliable implementation of IO barriers.
Just like with enabling write caches on real disks...
Not exactly. The name of the 'emulate_write_cache' device attribute is
a bit misleading here. This bit simply reports (to the SCSI client)
that WCE=1 is set when the SCSI mode sense (caching page) is read
during the initial LUN scan.
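To make the "it only reports a bit" point concrete, here is a minimal sketch of how a client could read WCE out of a caching mode page. It assumes the buffer passed in is the page itself (page code 0x08), with any mode parameter header and block descriptors already stripped; the function name and the sample page bodies are made up for illustration.

```python
CACHING_PAGE_CODE = 0x08  # SBC caching mode page
WCE_MASK = 0x04           # WCE is bit 2 of byte 2 in that page


def wce_enabled(page: bytes) -> bool:
    """Return True if the WCE bit is set in a caching mode page."""
    if len(page) < 3:
        raise ValueError("caching mode page too short")
    # The low six bits of byte 0 carry the page code.
    if page[0] & 0x3F != CACHING_PAGE_CODE:
        raise ValueError("not a caching mode page")
    return bool(page[2] & WCE_MASK)


# Hypothetical page bodies: 2-byte header plus 0x12 bytes of payload.
write_back = bytes([0x08, 0x12, 0x04]) + bytes(17)
write_through = bytes([0x08, 0x12, 0x00]) + bytes(17)
print(wce_enabled(write_back), wce_enabled(write_through))
```

An initiator that sees WCE=0 is entitled to treat every completed WRITE as stable, which is why the reported value matters independently of what the backend really does.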
Then can I say that the default is wrong?
You are declaring as writethrough a device that is almost certainly
writeback (because at least HDDs will have caches).
If power is lost at the iscsi target, there WILL be data loss. People do
not expect that. Change the default!
Besides this, I don't understand how declaring an iscsi target as
writethrough could make initiators voluntarily slow down operations. That
would be a bug of the initiators because writethrough is "better" than
writeback for all purposes: initiators should just skip the queue drain
/ flush / FUA, and all the rest should be the same.
[ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
RAID to make sure the WRITEs are stripe aligned to get best
performance with software MD raid.
That does not quite ensure that the writes are stripe aligned,
but perhaps a larger stripe cache would help.
I'm talking about what MD raid has chosen as its underlying
max_sectors_kb to issue I/O to the underlying raid member devices.
Depending on what backend storage hardware is in use, this may end up as
'127', which will result in ugly misaligned writes that end up killing
performance.
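A toy calculation of why a cap of 127 hurts, assuming a 64 KB chunk and a 256 KB full-stripe write (both numbers are picked for illustration, they are not from this thread). It only models a request being cut into cap-sized pieces; how the block layer and MD really split and merge requests is more involved.

```python
def split_request(length_kb: int, cap_kb: int) -> list[tuple[int, int]]:
    """Cut one write into (offset, size) pieces no larger than cap_kb."""
    pieces = []
    offset = 0
    while offset < length_kb:
        size = min(cap_kb, length_kb - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


def misaligned_pieces(pieces, chunk_kb: int):
    """Pieces that do not both start and end on a chunk boundary."""
    return [(o, s) for o, s in pieces if o % chunk_kb or (o + s) % chunk_kb]


CHUNK_KB = 64    # assumed chunk size
WRITE_KB = 256   # assumed full-stripe write (4 data disks x 64 KB)

for cap in (127, 128):
    pieces = split_request(WRITE_KB, cap)
    print(cap, pieces, len(misaligned_pieces(pieces, CHUNK_KB)))
```

With these assumed numbers, a cap of 127 turns the write into three pieces, none of which covers whole chunks, while a cap of 128 gives two pieces that both do.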
Interesting observation.
For writes from local processes, MD probably waits long enough for other
requests to arrive and fill a stripe before initiating an RMW; but maybe
iscsi is too slow for that and MD initiates an RMW for each request,
which would mean a zillion RMWs.
Can that be? Does anyone know MD well enough to say whether MD waits a
little for more data, in an attempt to fill an entire stripe, before
proceeding with the RMW? If so, can that timeout be set?
We've (RTS) changed this with a one-liner patch to raid456.c code on .32
based distro kernels in the past to get proper stripe aligned writes,
and it obviously makes a huge difference with fast storage hardware.
This value is writable via sysfs, why do you need a patch?
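What I have in mind is roughly the following sketch. The queue directory is passed in as an argument because the exact path depends on the device; the function names are made up, and whether MD actually honors a value changed this way is precisely the open question. The demo runs against a temporary directory so it needs no root.

```python
import tempfile
from pathlib import Path


def read_kb(queue_dir, name="max_sectors_kb") -> int:
    """Read one integer attribute from a block queue directory."""
    return int(Path(queue_dir, name).read_text().strip())


def set_max_sectors_kb(queue_dir, value_kb: int) -> int:
    """Write a new max_sectors_kb and return the value read back."""
    limit = read_kb(queue_dir, "max_hw_sectors_kb")
    if value_kb > limit:
        raise ValueError(f"{value_kb} exceeds hardware limit {limit}")
    Path(queue_dir, "max_sectors_kb").write_text(f"{value_kb}\n")
    return read_kb(queue_dir)


def demo() -> int:
    # Stand-in directory with made-up values, not a real device.
    with tempfile.TemporaryDirectory() as d:
        Path(d, "max_hw_sectors_kb").write_text("512\n")
        Path(d, "max_sectors_kb").write_text("127\n")
        return set_max_sectors_kb(d, 128)


print(demo())
```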
That's exactly what I'm talking about.
With buffered FILEIO enabled an incoming WRITE payload will already have
been ACKed back to the SCSI fabric and up the storage -> filesystem
stack, but if a power loss were to occur before that data has been
written out (using a battery back-up unit for example), then the FS on
the client will have (silently) lost data.
This is why we removed the buffered FILEIO from mainline in the first
place, but in retrospect if people understand the consequences and still
want to use buffered FILEIO for performance reasons they should be able
to do so.
If you declare the target as writeback and implement flush+FUA, no data
loss should occur AFAIU, isn't that so?
AFAIR, hard disks do normally declare all operations to be complete
immediately after you submit (while they are still in the cache in
reality), but if you issue a flush+FUA they make an exception to this
rule and make sure that this operation and all previously submitted
operations are indeed on the platter before returning. Do I remember
correctly?
Can you do the same for buffered FILEIO?
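As a userspace analogy only (I am not claiming this is how the FILEIO backend is or would be implemented): a buffered write returns once the data is in the page cache, and an explicit fsync is what forces it down to the backing store. The helper name and the demo file are made up.

```python
import os
import tempfile


def write_durably(path: str, payload: bytes) -> int:
    """Buffered write followed by fsync, the file-level analogue of a flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        while written < len(payload):
            # os.write may accept fewer bytes than offered, so loop.
            written += os.write(fd, payload[written:])
        # Up to here the data may exist only in the page cache.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "backing.img")
        n = write_durably(path, b"x" * 4096)
        return n, os.path.getsize(path)


print(demo())
```

If the target advertised WCE=1 and mapped an incoming flush onto something like that fsync, the initiator would know when its data is really stable.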