On 14.07.2020 19:14, Guenter Roeck wrote: > On 7/14/20 7:11 AM, Maciej S. Szmigiero wrote: >> On 14.07.2020 16:02, Guenter Roeck wrote: >>> On 7/11/20 1:41 PM, Maciej S. Szmigiero wrote: >>>> It has been observed that Toshiba DT01ACA family drives have >>>> WRITE FPDMA QUEUED command timeouts and sometimes just freeze until >>>> power-cycled under heavy write loads when their temperature is getting >>>> polled in SCT mode. The SMART mode seems to be fine, though. >>>> >>>> Let's make sure we don't use SCT mode for these drives then. >>>> >>>> While only the 3 TB model was actually caught exhibiting the problem let's >>>> play safe here to avoid data corruption and extend the ban to the whole >>>> family. >>>> >>>> Fixes: 5b46903d8bf3 ("hwmon: Driver for disk and solid state drives with temperature sensors") >>>> Cc: stable@xxxxxxxxxxxxxxx >>>> Signed-off-by: Maciej S. Szmigiero <mail@xxxxxxxxxxxxxxxxxxxxx> >>>> --- >>>> >>>> Notes: >>>> This behavior was observed on two different DT01ACA3 drives. >>>> >>>> Usually, a series of queued WRITE FPDMA QUEUED commands just time out, >>>> but sometimes the whole drive freezes. Merely disconnecting and >>>> reconnecting SATA interface cable then does not unfreeze the drive. >>>> >>>> One has to disconnect and reconnect the drive power connector for the >>>> drive to be detected again (suggesting the drive firmware itself has >>>> crashed). >>>> >>>> This only happens when the drive temperature is polled very often (like >>>> every second), so occasional SCT usage via smartmontools is probably >>>> safe. >>>> >>>> Changes from v1: >>>> 'SCT blacklist' -> 'SCT avoid models' >>>> >>>> drivers/hwmon/drivetemp.c | 37 +++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 37 insertions(+) >>>> >>>> diff --git a/drivers/hwmon/drivetemp.c b/drivers/hwmon/drivetemp.c >>>> index 0d4f3d97ffc6..da9cfcbecc96 100644 >>>> --- a/drivers/hwmon/drivetemp.c >>>> +++ b/drivers/hwmon/drivetemp.c >>>> @@ -285,6 +285,36 @@ static int drivetemp_get_scttemp(struct drivetemp_data *st, u32 attr, long *val) >>>> return err; >>>> } >>>> >>>> +static const char * const sct_avoid_models[] = { >>>> +/* >>>> + * These drives will have WRITE FPDMA QUEUED command timeouts and sometimes just >>>> + * freeze until power-cycled under heavy write loads when their temperature is >>>> + * getting polled in SCT mode. The SMART mode seems to be fine, though. >>>> + * >>>> + * While only the 3 TB model was actually caught exhibiting the problem >>>> + * let's play safe here to avoid data corruption and ban the whole family. >>>> + */ >>>> + "TOSHIBA DT01ACA0", >>>> + "TOSHIBA DT01ACA1", >>>> + "TOSHIBA DT01ACA2", >>>> + "TOSHIBA DT01ACA3", >>>> +}; >>>> + >>>> +static bool drivetemp_sct_avoid(struct drivetemp_data *st) >>>> +{ >>>> + struct scsi_device *sdev = st->sdev; >>>> + unsigned int ctr; >>>> + >>>> + if (!sdev->model) >>>> + return false; >>>> + >>>> + for (ctr = 0; ctr < ARRAY_SIZE(sct_avoid_models); ctr++) >>>> + if (strncmp(sdev->model, sct_avoid_models[ctr], 16) == 0) >>> >>> Why strncmp, and why length 16 ? Both strings are, as far as I can see, >>> 0 terminated. A fixed length only asks for trouble later on as more models >>> are added to the list. >> >> The first 16 bytes of sdev->model contain the model number, the rest >> seems to be the drive serial number. >> There is no NULL separator between them. >> > > If the "16" is based on some SCSI standard, there should be > a define for it somewhere. The "model" field seems to contain just the raw SCSI INQUIRY "product identification" field, which is space-filled to 16 bytes, but is not NULL-terminated. The SCSI layer seems to just open-code the number 16 everywhere, can't see any specific define for this. > Otherwise, the comparison should > use strlen(sct_avoid_models[ctr]) and explain the reason. > In the latter case, it might possibly make sense to match > "TOSHIBA DT01ACA" to also catch later models. Will use strlen() for prefix matching and match on the "TOSHIBA DT01ACA" prefix then. > Thanks, > Guenter Thanks, Maciej