On 7/14/20 7:11 AM, Maciej S. Szmigiero wrote: > On 14.07.2020 16:02, Guenter Roeck wrote: >> On 7/11/20 1:41 PM, Maciej S. Szmigiero wrote: >>> It has been observed that Toshiba DT01ACA family drives have >>> WRITE FPDMA QUEUED command timeouts and sometimes just freeze until >>> power-cycled under heavy write loads when their temperature is getting >>> polled in SCT mode. The SMART mode seems to be fine, though. >>> >>> Let's make sure we don't use SCT mode for these drives then. >>> >>> While only the 3 TB model was actually caught exhibiting the problem let's >>> play safe here to avoid data corruption and extend the ban to the whole >>> family. >>> >>> Fixes: 5b46903d8bf3 ("hwmon: Driver for disk and solid state drives with temperature sensors") >>> Cc: stable@xxxxxxxxxxxxxxx >>> Signed-off-by: Maciej S. Szmigiero <mail@xxxxxxxxxxxxxxxxxxxxx> >>> --- >>> >>> Notes: >>> This behavior was observed on two different DT01ACA3 drives. >>> >>> Usually, a series of queued WRITE FPDMA QUEUED commands just time out, >>> but sometimes the whole drive freezes. Merely disconnecting and >>> reconnecting SATA interface cable then does not unfreeze the drive. >>> >>> One has to disconnect and reconnect the drive power connector for the >>> drive to be detected again (suggesting the drive firmware itself has >>> crashed). >>> >>> This only happens when the drive temperature is polled very often (like >>> every second), so occasional SCT usage via smartmontools is probably >>> safe. >>> >>> Changes from v1: >>> 'SCT blacklist' -> 'SCT avoid models' >>> >>> drivers/hwmon/drivetemp.c | 37 +++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 37 insertions(+) >>> >>> diff --git a/drivers/hwmon/drivetemp.c b/drivers/hwmon/drivetemp.c >>> index 0d4f3d97ffc6..da9cfcbecc96 100644 >>> --- a/drivers/hwmon/drivetemp.c >>> +++ b/drivers/hwmon/drivetemp.c >>> @@ -285,6 +285,36 @@ static int drivetemp_get_scttemp(struct drivetemp_data *st, u32 attr, long *val) >>> return err; >>> } >>> >>> +static const char * const sct_avoid_models[] = { >>> +/* >>> + * These drives will have WRITE FPDMA QUEUED command timeouts and sometimes just >>> + * freeze until power-cycled under heavy write loads when their temperature is >>> + * getting polled in SCT mode. The SMART mode seems to be fine, though. >>> + * >>> + * While only the 3 TB model was actually caught exhibiting the problem >>> + * let's play safe here to avoid data corruption and ban the whole family. >>> + */ >>> + "TOSHIBA DT01ACA0", >>> + "TOSHIBA DT01ACA1", >>> + "TOSHIBA DT01ACA2", >>> + "TOSHIBA DT01ACA3", >>> +}; >>> + >>> +static bool drivetemp_sct_avoid(struct drivetemp_data *st) >>> +{ >>> + struct scsi_device *sdev = st->sdev; >>> + unsigned int ctr; >>> + >>> + if (!sdev->model) >>> + return false; >>> + >>> + for (ctr = 0; ctr < ARRAY_SIZE(sct_avoid_models); ctr++) >>> + if (strncmp(sdev->model, sct_avoid_models[ctr], 16) == 0) >> >> Why strncmp, and why length 16 ? Both strings are, as far as I can see, >> 0 terminated. A fixed length only asks for trouble later on as more models >> are added to the list. > > The first 16 bytes of sdev->model contain the model number, the rest > seems to be the drive serial number. > There is no NULL separator between them. > If the "16" is based on some SCSI standard, there should be a define for it somewhere. Otherwise, the comparison should use strlen(sct_avoid_models[ctr]) and explain the reason. In the latter case, it might possibly make sense to match "TOSHIBA DT01ACA" to also catch later models. Thanks, Guenter > See how the SCSI device model is printed for the sysfs "model" > attribute: > https://elixir.bootlin.com/linux/v5.8-rc4/source/drivers/scsi/scsi_sysfs.c#L654 > >> Also, please use "!" instead of "== 0". > > Will do. > > Thanks, > Maciej >