Hi, similar suspicious code path can be found in the queuecommand functions in other drivers too these are - pmcraid.c arcmsr_hba.c cc-ing maintainers - (but both drivers seem to be unmaintained for a while - I've added Ching for arcmsr and Raghava for pmcraid) please read this thread and verify that your driver and firmware is safe. It is expected that a raid card controls the disk write cache (switches it off) but this might not be true for all modes of operation a raid adapter handles - pass through/non-RAID etc. In such case when disk write cache is enabled and your driver skips the SYNCHRONIZE_CACHE command a FS corruption is very much possible during unexpected power loss or even a clean shutdown. tomash On 17.10.2016 19:51, James Bottomley wrote: > On Mon, 2016-10-17 at 23:06 +0530, Kashyap Desai wrote: >>> -----Original Message----- >>> From: James Bottomley [mailto:jejb@xxxxxxxxxxxxxxxxxx] >>> Sent: Monday, October 17, 2016 10:50 PM >>> To: Ric Wheeler; Hannes Reinecke; Sumit Saxena; >>> linux-scsi@xxxxxxxxxxxxxxx >>> Cc: martin.petersen@xxxxxxxxxx; thenzl@xxxxxxxxxx; >>> kashyap.desai@xxxxxxxxxxxx; Christoph Hellwig; Martin K. Petersen; >>> Jeff >>> Moyer; Gris Ge; Ewan Milne; Jens Axboe >>> Subject: Re: [PATCH 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE >>> command >>> to firmware >>> >>> On Mon, 2016-10-17 at 12:27 -0400, Ric Wheeler wrote: >>>> On 10/17/2016 12:20 PM, James Bottomley wrote: >>>>> On Mon, 2016-10-17 at 09:01 -0400, Ric Wheeler wrote: >>>>>> On 10/17/2016 07:34 AM, Hannes Reinecke wrote: >>>>>>> On 10/17/2016 12:24 PM, Sumit Saxena wrote: >>>>>>>> megaraid_sas driver returns SYNCHRONIZE_CACHE command >>>>>>>> back to >>>>>>>> SCSI layer without sending it to firmware as firmware >>>>>>>> takes >>>>>>>> care of flushing cache. This patch will change the >>>>>>>> driver >>>>>>>> behavior wrt SYNCHRONIZE_CACHE handling. If underlying >>>>>>>> firmware has support to handle the SYNCHORNIZE_CACHE, >>>>>>>> driver >>>>>>>> will send it for firmware otherwise complete it back to >>>>>>>> SCSI >>>>>>>> layer with SUCCESS immediately. If Firmware handle >>>>>>>> SYNCHORNIZE_CACHE for both VD and JBOD >>>>>>>> "canHandleSyncCache" >>>>>>>> bit in scratch pad register(offset >>>>>>>> 0x00B4) will be set. >>>>>>>> >>>>>>>> This behavior can be controlled via module parameter and >>>>>>>> user >>>>>>>> can fallback to old behavior of returning >>>>>>>> SYNCHRONIZE_CACHE by >>>>>>>> driver only without sending it to firmware. >>>>>>>> >>>>>>>> Signed-off-by: Sumit Saxena <sumit.saxena@xxxxxxxxxxxx> >>>>>>>> Signed-off-by: Kashyap Desai <kashyap.desai@xxxxxxxxxxxx> >>>>>>>> --- >>>>>>>> drivers/scsi/megaraid/megaraid_sas.h | 3 +++ >>>>>>>> drivers/scsi/megaraid/megaraid_sas_base.c | 14 >>>>>>>> ++++++--- >>>>>>>> ----- >>>>>>>> drivers/scsi/megaraid/megaraid_sas_fusion.c | 3 +++ >>>>>>>> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 ++ >>>>>>>> 4 files changed, 14 insertions(+), 8 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h >>>>>>>> b/drivers/scsi/megaraid/megaraid_sas.h >>>>>>>> index ca86c88..43fd14f 100644 >>>>>>>> --- a/drivers/scsi/megaraid/megaraid_sas.h >>>>>>>> +++ b/drivers/scsi/megaraid/megaraid_sas.h >>>>>>>> @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT { >>>>>>>> #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT 14 >>>>>>>> #define MR_MAX_MSIX_REG_ARRAY 16 >>>>>>>> #define MR_RDPQ_MODE_OFFSET 0X0 >>>>>>>> 0800 >>>>>>>> 000 >>>>>>>> +#define MR_CAN_HANDLE_SYNC_CACHE_OFFSET 0 >>>>>>>> X010 >>>>>>>> 0000 >>>>>>>> 0 >>>>>>>> + >>>>>>>> /* >>>>>>>> * register set for both 1068 and 1078 controllers >>>>>>>> * structure extended for 1078 registers @@ -2140,6 >>>>>>>> +2142,7 >>>>>>>> @@ struct megasas_instance { >>>>>>>> u8 is_imr; >>>>>>>> u8 is_rdpq; >>>>>>>> bool dev_handle; >>>>>>>> + bool fw_sync_cache_support; >>>>>>>> }; >>>>>>>> struct MR_LD_VF_MAP { >>>>>>>> u32 size; >>>>>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c >>>>>>>> b/drivers/scsi/megaraid/megaraid_sas_base.c >>>>>>>> index 092387f..a4a8e2f 100644 >>>>>>>> --- a/drivers/scsi/megaraid/megaraid_sas_base.c >>>>>>>> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c >>>>>>>> @@ -104,6 +104,10 @@ unsigned int scmd_timeout = >>>>>>>> MEGASAS_DEFAULT_CMD_TIMEOUT; >>>>>>>> module_param(scmd_timeout, int, S_IRUGO); >>>>>>>> MODULE_PARM_DESC(scmd_timeout, "scsi command timeout >>>>>>>> (10 >>>>>>>> -90s), default 90s. See megasas_reset_timer."); >>>>>>>> >>>>>>>> +bool block_sync_cache; >>>>>>>> +module_param(block_sync_cache, bool, S_IRUGO); >>>>>>>> +MODULE_PARM_DESC(block_sync_cache, "Block SYNC CACHE by >>>>>>>> driver >>>>>>>> Default: 0(Send it to firmware)"); >>>>>>>> + >>>>>>>> MODULE_LICENSE("GPL"); >>>>>>>> MODULE_VERSION(MEGASAS_VERSION); >>>>>>>> MODULE_AUTHOR("megaraidlinux.pdl@xxxxxxxxxxxxx"); >>>>>>>> @@ -1700,16 +1704,10 @@ megasas_queue_command(struct >>> Scsi_Host >>>>>>>> *shost, struct scsi_cmnd *scmd) >>>>>>>> goto out_done; >>>>>>>> } >>>>>>>> >>>>>>>> - switch (scmd->cmnd[0]) { >>>>>>>> - case SYNCHRONIZE_CACHE: >>>>>>>> - /* >>>>>>>> - * FW takes care of flush cache on its >>>>>>>> own >>>>>>>> - * No need to send it down >>>>>>>> - */ >>>>>>>> + if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && >>>>>>>> + (!instance->fw_sync_cache_support)) { >>>>>>>> scmd->result = DID_OK << 16; >>>>>>>> goto out_done; >>>>>>>> - default: >>>>>>>> - break; >>>>>>>> } >>>>>>>> >>>>>>>> return instance->instancet >>>>>>>> ->build_and_issue_cmd(instance, scmd); >>>>>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c >>>>>>>> b/drivers/scsi/megaraid/megaraid_sas_fusion.c >>>>>>>> index 2159f6a..8237580 100644 >>>>>>>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c >>>>>>>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c >>>>>>>> @@ -747,6 +747,9 @@ megasas_ioc_init_fusion(struct >>>>>>>> megasas_instance *instance) >>>>>>>> ret = 1; >>>>>>>> goto fail_fw_init; >>>>>>>> } >>>>>>>> + if (!block_sync_cache) >>>>>>>> + instance->fw_sync_cache_support = >>>>>>>> (scratch_pad_2 >>>>>>>> & >>>>>>>> + MR_CAN_HANDLE_SYNC_CACHE_OFFSET) >>>>>>>> ? 1 >>>>>>>> : >>>>>>>> 0; >>>>>>>> >>>>>>>> IOCInitMessage = >>>>>>>> dma_alloc_coherent(&instance->pdev->dev, >>>>>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.h >>>>>>>> b/drivers/scsi/megaraid/megaraid_sas_fusion.h >>>>>>>> index e3bee04..2857154 100644 >>>>>>>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.h >>>>>>>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.h >>>>>>>> @@ -72,6 +72,8 @@ >>>>>>>> #define MPI2_SUP_REPLY_POST_HOST_INDEX_OFFSET >>>>>>>> (0x0000030C) >>>>>>>> #define MPI2_REPLY_POST_HOST_INDEX_OFFSET (0x00 >>>>>>>> 0000 >>>>>>>> 6C) >>>>>>>> >>>>>>>> +extern bool block_sync_cache; >>>>>>>> + >>>>>>>> /* >>>>>>>> * Raid context flags >>>>>>>> */ >>>>>>>> >>>>>>> Be extra careful with that. >>>>>>> >>>>>>> SYNCHRONIZE_CACHE has (potentially) a global scope, as >>>>>>> there >>>>>>> might be an array-wide cache, and a cache flush would >>>>>>> affect the >>>>>>> entire cache. Linux is using SYNCHRONIZE_CACHE as a per >>>>>>> device >>>>>>> setting, ie it assumes that the effects of a cache flush is >>>>>>> restricted to the device in question. >>>>>>> >>>>>>> If this is _not_ the case I'd rather not enable it. >>>>>>> Have you checked that enabling this functionality doesn't >>>>>>> have >>>>>>> negative performance impact? >>>>>>> >>>>>>> Cheers, >>>>>>> >>>>>>> Hannes >>>>>> This must go in - without this fix, there is no data >>>>>> integrity for >>>>>> any file system. >>>>> That's not what I get from the change log. What it says to me >>>>> is >>>>> that the caches are currently firmware managed. Barring >>>>> firmware >>>>> bugs, that means that we currently don't have any integrity >>>>> issues. >>>> Your understanding (or the change log) is incorrect. >>> There's no current way in English of construing "as firmware takes >>> care of >>> flushing cache" to mean its inverse, so the changelog needs >>> updating to >>> explain >>> that firmware does *not* take care of cache flushing, so >>> effectively >>> nothing >>> does and we'll need a cc to stable if the problem is that nothing >>> takes >>> care of >>> flushing the drive caches. >>> >>> James >> Sorry for confusion. More accurate to say - >> >> Current megaraid_sas driver returns SYNCHRONIZE_CACHE command back to >> SCSI layer without sending it down to firmware as firmware supposed >> to takes care of flushing cache for all Virtual Disk at the time of >> system reboot/shutdown. Because of above FW rule driver coded wrongly >> and it added bug that SYNCHRONIZE_CACHE is not passed to FW for JBOD >> as well. (MR Firmware wanted OS to manage SYNCHRONIZE_CACHE and pass >> the same to the Firmware. ). We figure out that even for VD, why >> driver should send back SYNC_CACHE command...let's have the same in >> FW (giving all control in FW) >> >> New behavior described - >> >> IF application requests SYNCH_CACHE >> IF any FW which supports new API bit called canHandleSyncCache >> Driver sends SYNCH_CACHE command to the FW >> IF 'VirtualDisk' >> FW does not send SYNCH_CACHE to drives >> FW returns SUCCESS >> ELSE IF 'JBOD' >> FW sends SYNCH_CACHE to drive >> FW obtains status from drive and returns same status back to >> driver >> ENDIF >> >> Fixing this is robust if FW and driver changes are inline. See below >> for more detail. >> >> Case - 1 >> This patch is attempt to fix one level problem where Virtual Disks >> are not managed correctly in MR FW. There are some MR FW (E.a Gen2 >> and Gen2.5 FW), which set WCE bit for Virtual Disk but FW is not >> reply correct for SYNCH_CACHE. This was handled in past and carry >> forward (in driver + FW ) to all new generation products as well. We >> tried to collect all possible details why it was handled such way to >> provide better driver fix for this issue(mainly to avoid this FW >> checks and module parameters etc..). But now it looks like to make >> generic fix (Device ID based etc..)is even risky, so went with this >> protective approach. It is very risky to find out which Device ID >> and FW has capability to manage SYNC_CACHE, so we managed to figure >> out which are the new FW using FW capability bit. >> >> E.g Liberator/Thunderbolt MR FW SYNC_CACHE is write Unsupported >> command (this is a firmware bug) and immediately it will be failed to >> SML. This will cause File system to go Read-only. > That's a better description ... what you're saying is that the patch > doesn't actually fix the bug Ric is worrying about. > >> Case -2 >> One more thing which we are trying and known driver can send >> "SYNC_CACHE" unconditionally to all generation of FW. For this we >> are doing more unit testing on various controllers/FW and planning to >> provide another patch to fix JBOD path for any FW. (It will not be >> based on FW capability/module parameter). > OK, but we really need this ASAP. As Ric said, data integrity is at > risk until this is fixed. Can you also reference the commit that > introduced the problem so we know how far to do the sable backports? > >> If we remove module parameter, we can ask customer to disable WCE on >> drive to get similar impact. > Thanks, > > James > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html