On Tue, Nov 12, 2019 at 12:56:21AM +0900, Akinobu Mita wrote: > 2019年11月11日(月) 1:30 Guenter Roeck <linux@xxxxxxxxxxxx>: > > > [ ... ] > > > Example output from the "sensors" command: > > > > > > nvme-pci-0100 > > > Adapter: PCI adapter > > > Composite: +53.0 C (low = -273.0 C, high = +70.0 C) > > > (crit = +80.0 C) > > > Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) > > > Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) > > > Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) > > > > > > > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing > > any minimum limit on the Composite temperature sensor results in a temperature > > warning, and that warning is sticky until I reset the controller. > > I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others. > > I have Crucial CT500P1SSD8 and WDC WDS512G1X0C-00ENX0, and I have no > problem with these devices. > > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > nvme-pci-0100 > > Adapter: PCI adapter > > Composite: +30.0°C (low = -273.0°C, high = +70.0°C) > > (crit = +80.0°C) > > > > root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > nvme-pci-0100 > > Adapter: PCI adapter > > Composite: +30.0°C (low = +0.0°C, high = +70.0°C) ALARM > > (crit = +80.0°C) > > > > It doesn't seem to matter which temperature I write; writing -273000 has > > the same result. > > > > [This is actually why I didn't use the features commands; not that I had observed > > the problem, but I was concerned that problems like this would show up.] > > Maybe we should introduce a new quirk so that we can avoid changing > temperature threshold for such devices. Could you tell SSDPEKKW512G7's > vendor and device ID? Quick googling answers it's 8086:f1a5, but I want > to make sure. Yes, that is correct. 01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03) I'll see if I can test this tonight on my other NVMEs. I also dug up an old NVMe drive from Toshiba; I'll see if I can connect and test it as well. [ ... ] > > > */ > > > switch (attr) { > > > case hwmon_temp_max: > > > - *val = (data->ctrl->wctemp - 273) * 1000; > > > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > > + if (err) > > > + *val = (data->ctrl->wctemp - 273) * 1000; > > > > This would report WCTEMP for all sensors on errors, including errors seen while > > the controller is resetting. I think it should be something like > > > > int err = 0; > > ... > > > > if (!channel) > > *val = (data->ctrl->wctemp - 273) * 1000; > > else > > err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > return err; > > > > assuming we keep using ctrl->wctemp (see below). If changing the upper Composite > > temperature sensor limit changes wctemp, and we don't update it, we should not > > use it at all after registration and just report the error. > > > > > return 0; > > > + case hwmon_temp_min: > > > + return nvme_get_temp_thresh(data->ctrl, channel, true, val); > > > case hwmon_temp_crit: > > > *val = (data->ctrl->cctemp - 273) * 1000; > > > return 0; > > > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > > > return err; > > > } > > > > > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, > > > + u32 attr, int channel, long val) > > > +{ > > > + struct nvme_hwmon_data *data = dev_get_drvdata(dev); > > > + > > > + switch (attr) { > > > + case hwmon_temp_max: > > > + return nvme_set_temp_thresh(data->ctrl, channel, false, val); > > > > Does this change WCTEMP if written on channel 0 ? If so, we would have to update > > the cached value of ctrl->wctemp (or never use it after registration). > > At least for the devices I have, setting the over temperature threshold > doesn't change the WCTEMP. > I have checked with 'nvme id-ctrl /dev/nvme0 | grep ctemp'. > Interesting. I just tested this, and the result is the same with Samsung SSD 970 EVO. With that in mind, maybe we should really not use wctemp at all after initialization, as I had suggested above. What do you think ? Thanks, Guenter