Re: SYNCHRONIZE_CACHE from sd_preppare_flush does not have retries.!

Ric Wheeler <rwheeler@xxxxxxxxxx> · Thu, 29 Apr 2010 16:13:03 -0400

On 04/19/2010 02:45 PM, James Bottomley wrote:
On Mon, 2010-04-19 at 20:14 +0200, Bernd Schubert wrote:

On Monday 19 April 2010, Mike Christie wrote:

On 04/19/2010 06:32 AM, Desai, Kashyap wrote:

I am facing an issue with the SCSI stack.
Here is the background of my test.

Mount an ext3 file system with journaling, using barrier=1,commit=5.
With this setup the file system calls submit_bh() with the WRITE_BARRIER
flag set every 5 seconds. (This is part of journaling.)
Eventually it calls queue_flush(), which builds a SCSI command with
CDB SYNCHRONIZE_CACHE and inserts it into the request queue. I observed
that the SYNCHRONIZE_CACHE request is set up in sd_prepare_flush(). There
the timeout is set to SD_TIMEOUT, but retries are not set. Because the
retries field of the request is never set, the mid layer allows no
retries for SYNCHRONIZE_CACHE.

Because SYNCHRONIZE_CACHE gets zero retries at the mid layer, it causes
trouble for the file system. In the current situation, even though the LLD
sends commands back with DID_RESET while the HBA is in a recovery state,
SYNCHRONIZE_CACHE fails immediately without any retry.
Eventually this failure reaches the file system, which sees that
SYNCHRONIZE_CACHE has failed and goes into read-only mode.

My question is: can we add "rq->retries = X" in sd_prepare_flush(), with
some reasonable retry value?
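
To make the failure mode above concrete, here is a minimal userspace
sketch, not kernel code. Every model_* and MODEL_* name is a hypothetical
stand-in, the retry count of 5 is an assumed value for the "X" in the
question, and the loop is a deliberately simplified picture of what the
mid layer does when the LLD returns DID_RESET.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel names; values are illustrative only. */
#define MODEL_SYNCHRONIZE_CACHE 0x35  /* SYNCHRONIZE CACHE(10) opcode */
#define MODEL_SD_TIMEOUT        30    /* seconds, the default cited in this thread */
#define MODEL_FLUSH_RETRIES     5     /* assumed value for the "X" in the question */

enum model_host_status { MODEL_DID_OK, MODEL_DID_RESET };

struct model_request {
    unsigned char cmd[16];
    unsigned int  cmd_len;
    unsigned int  timeout;  /* seconds */
    int           retries;  /* retry budget the mid layer may spend */
};

/* Modeled on the description of sd_prepare_flush(), plus a retry count. */
static void model_prepare_flush(struct model_request *rq, int retries)
{
    assert(rq != NULL);
    rq->cmd[0]  = MODEL_SYNCHRONIZE_CACHE;
    rq->cmd_len = 10;
    rq->timeout = MODEL_SD_TIMEOUT;
    rq->retries = retries;
}

/*
 * Simplified mid-layer behaviour: one initial attempt plus rq->retries
 * further attempts. `results` is the sequence of statuses the modeled
 * LLD returns. Returns true if any permitted attempt completes OK.
 */
static bool model_issue_flush(const struct model_request *rq,
                              const enum model_host_status *results,
                              size_t nresults)
{
    int attempts_left = rq->retries + 1;
    size_t i;

    for (i = 0; i < nresults && attempts_left > 0; i++, attempts_left--) {
        if (results[i] == MODEL_DID_OK)
            return true;
    }
    return false;
}
```

With retries left at 0, a single DID_RESET during HBA recovery is enough
for the modeled flush to fail; with a non-zero budget the same sequence
of statuses can end in success.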

I am not sure where we want it, but I think we want to be able to set
both the retries and the timeout. I have seen cases where a sync cache
takes longer than the default 30 secs.

Do you think we want the block layer to manage retries/timeouts for
all block device flushes, or is this more device specific? I was thinking
that we may want to create a sysfs interface under the block dirs and
have blk-sysfs.c and blk-barrier.c handle this. queue_flush could set
the timeout and retries that are set by some new files under
/sys/block/sdX/queue/ ?
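
A rough userspace sketch of that idea, under heavy assumptions: no such
files are named in this thread, so the structures, field names and
functions below are all hypothetical. It only shows the shape of it: a
sysfs-style store parses decimal text into per-queue flush settings, and
a queue_flush-like helper copies those settings into the flush request.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-queue flush settings; not an existing kernel structure. */
struct model_flush_limits {
    unsigned int timeout_secs;
    unsigned int retries;
};

struct model_queue {
    struct model_flush_limits flush;
};

struct model_flush_request {
    unsigned int timeout_secs;
    unsigned int retries;
};

/*
 * Parse a sysfs-style write ("120" or "120\n") into an unsigned int.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int model_store_uint(const char *buf, unsigned int *out)
{
    char *end;
    unsigned long v;

    if (buf == NULL || !isdigit((unsigned char)buf[0]))
        return -1;
    errno = 0;
    v = strtoul(buf, &end, 10);
    if (errno != 0 || v > 0xffffffffUL)
        return -1;
    if (*end != '\0' && *end != '\n')
        return -1;
    *out = (unsigned int)v;
    return 0;
}

/* What a queue_flush-like helper might do: copy queue settings to the request. */
static void model_queue_flush(const struct model_queue *q,
                              struct model_flush_request *rq)
{
    rq->timeout_secs = q->flush.timeout_secs;
    rq->retries      = q->flush.retries;
}
```

Keeping the values on the queue would let one interface cover every
block device flush; whether that belongs in the block layer or stays
device specific is exactly the open question above.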

Good that other people are now running into it too. 30s is far too short
for any hardware RAID unit with SATA disks.

It's far too short for just about any HW RAID since they all tend to
have multi-megabytes to gigabytes of cache (some of the high end have
terabytes).  It has to be said that most arrays with battery backed
caches lie when asked to flush the cache, but we probably need to get
users into the habit of not using flush barriers with external arrays.
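
As a back-of-the-envelope illustration of why 30 seconds can be too
short: draining a cache takes roughly cache size divided by sustained
write rate. The helper below is a hypothetical sketch, and the numbers
in the note after it are assumed for illustration, not measurements
from any array.

```c
#include <stdbool.h>

/*
 * Rough drain-time estimate in seconds: dirty cache size divided by the
 * sustained write rate of the backing disks. Returns a negative value
 * if the rate is not positive.
 */
static double model_drain_seconds(double cache_bytes, double write_bytes_per_sec)
{
    if (write_bytes_per_sec <= 0.0)
        return -1.0;
    return cache_bytes / write_bytes_per_sec;
}

/* True if the estimated drain time exceeds the given timeout. */
static bool model_timeout_too_short(double cache_bytes,
                                    double write_bytes_per_sec,
                                    double timeout_secs)
{
    double secs = model_drain_seconds(cache_bytes, write_bytes_per_sec);

    return secs < 0.0 || secs > timeout_secs;
}
```

For example, an assumed 4 GiB of dirty cache draining at an assumed
100 MiB/s needs about 41 seconds, which already exceeds a 30 second
timeout.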

As you say, a lot of arrays don't do anything with these requests. We 
should be using barriers (and sending down the synch commands) when the 
device advertises a write back cache.
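
For reference, "advertises a write back cache" comes down to the WCE
(write cache enable) bit in the SCSI caching mode page (page code 0x08,
byte 2, bit 2). A minimal userspace sketch of that check follows; the
function name is hypothetical, and it assumes the caller has already
skipped the mode parameter header and any block descriptors, so that
`page` points at the first byte of the page itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CACHING_MODE_PAGE 0x08  /* caching mode page code */
#define MODEL_WCE_BIT           0x04  /* byte 2, bit 2: write cache enable */

/*
 * Returns true if `page` is a caching mode page with WCE set.
 * `page` must point at the start of the mode page, not at the start of
 * the MODE SENSE response.
 */
static bool model_wce_enabled(const unsigned char *page, size_t len)
{
    if (page == NULL || len < 3)
        return false;
    /* Low 6 bits of byte 0 carry the page code; bit 7 is the PS flag. */
    if ((page[0] & 0x3f) != MODEL_CACHING_MODE_PAGE)
        return false;
    return (page[2] & MODEL_WCE_BIT) != 0;
}
```

A device that reports WCE clear is in write through mode, so sending
flushes to it buys nothing.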

For a simple hardware RAID card, it is not sufficient simply to flush 
the write cache on the HBA. It must also either disable the write cache 
on the downstream devices (SATA/SAS disks, etc.) or send cache flush
commands to each device.

In those cases, this could be a quite lengthy process...

http://markmail.org/message/ewicheafcvgwm4p7

I wrote this patch while having trouble with Infortrend RAIDs, but the
problem also comes up with DDN storage if the write back cache is enabled.
Shall I update the patch, add retries and then resend the entire series?

rq->timeout is the timeout of the request triggering the flush ... it's
likely the wrong value since it's for a fast completing r/w operation,
whereas this is a slow drain operation.

James
