Re: XFS and nobarriers on Intel SSD

Richard Bade <hitrich@xxxxxxxxx> · Mon, 14 Sep 2015 12:30:43 +1200

Hi Everyone,
I updated the firmware on 3 S3710 drives (one host) last Tuesday and have not seen any ATA resets or Task Aborts on that host in the 5 days since.
I also set nobarrier on another host on Wednesday and have only seen one Task Abort, and that was on an S3710.
I have seen 18 ATA resets or Task Aborts on the two hosts that I made no changes on.
It looks like this firmware has fixed my issues, but nobarrier also seems to improve the situation significantly, which correlates with your experience, Christian.
Thanks everyone for the info in this thread. I plan to update the firmware on the remainder of the S3710 drives this week and also set nobarrier.
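
For anyone wanting to compare hosts the same way, here is a minimal sketch of how the Task Abort counts could be tallied per device. It assumes the kernel log uses the "attempting task abort!" line format quoted further down in this thread; the function name and the log path in the usage comment are hypothetical, and ATA reset messages are not covered because their format is not shown here.

```python
import re
from collections import Counter

# Matches kernel log lines of the form quoted in this thread, e.g.
#   "... sd 0:0:1:0: attempting task abort! scmd(ffff880fdc85b680)"
ABORT_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): attempting task abort!")


def count_task_aborts(lines):
    """Return a Counter mapping SCSI address (H:C:T:L) to abort attempts."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical usage (the log path is an assumption and varies by distro):
#   with open("/var/log/kern.log", errors="replace") as log:
#       print(count_task_aborts(log))
```

Only the "attempting" line is counted, so the matching "task abort: SUCCESS" line does not inflate the total.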
Regards,
Richard

On 8 September 2015 at 14:27, Richard Bade <hitrich@xxxxxxxxx> wrote:
Hi Christian,

On 8 September 2015 at 14:02, Christian Balzer <chibi@xxxxxxx> wrote:
Indeed. But first a word about the setup where I'm seeing this.
These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
with the Samsung DC SSDs, one with the Intel S3610.
2 of these chassis to be precise:
https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm

We are using the same box, but the DC0R variant (no Infiniband), so I guess it's not surprising we're seeing the same thing happening.

Of course with the latest firmware, and I tried this with every kernel from
Debian 3.16 to stock 4.1.6.
With nobarrier I managed to trigger the error only once yesterday on the
DRBD replication target, not the machine that actually has the FS mounted.
Usually I'd be able to trigger it quite a bit more often during those tests.
So this morning I updated the firmware of all S3610s on one node and
removed the nobarrier flag. It took a lot of punishment, but eventually
this happened:

---
Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task abort! scmd(ffff880fdc85b680)
Sep  8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB: Write(10) 2a 00 0e 9a fb b8 00 00 08 00
Sep  8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Sep  8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1: enclosure_logical_id(0x5003048019e98d00), slot(1)
Sep  8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort: SUCCESS scmd(ffff880fdc85b680)
---
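
As a reading aid for the CDB line in the log above, here is a small sketch that decodes a WRITE(10) command descriptor block. It relies on the standard 10-byte layout (opcode 0x2a, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8); the function name is hypothetical, and the byte size of the write depends on the drive's logical block size, which the log does not state.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of logical blocks to be written.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


lba, blocks = decode_write10("2a 00 0e 9a fb b8 00 00 08 00")
print(lba, blocks)  # 245038008 8
```

So the aborted command was a write of 8 logical blocks, which would be 4 KiB if the drive presents 512-byte logical blocks (an assumption, not something the log confirms).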

Note that on the un-patched node (DRBD replication target) I managed to
trigger this bug 3 times in the same period.
So unless Intel has something to say (and given that this happens with
Samsungs as well), I'd still look beady eyed at LSI/Avago...

Yes, I think there may be more than one issue here. The reduction in occurrences seems to prove there is an issue fixed by the Intel firmware, but something is still happening.
Once I have updated the firmware on the drives on one of our hosts tonight, hopefully I can get some more statistics and pinpoint if there is another issue specifically with the LSI3008.
I'd be interested to know if the combination of nobarrier and the updated firmware fixes the issue.
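
When comparing hosts like this it helps to confirm which filesystems actually have the option in effect. Below is a rough sketch that scans mount-table text in the /proc/mounts format and reports which XFS mounts list nobarrier. The function name is hypothetical, and it assumes the running kernel reports the option in the mount table at all.

```python
def xfs_nobarrier_status(mounts_text):
    """Map each XFS mount point to True if 'nobarrier' is in its options.

    Expects text in the /proc/mounts format:
        device mount_point fstype options dump pass
    """
    status = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "xfs":
            continue  # skip malformed lines and non-XFS filesystems
        mount_point = fields[1]
        options = fields[3].split(",")
        status[mount_point] = "nobarrier" in options
    return status


# Hypothetical usage on a live Linux system:
#   with open("/proc/mounts") as f:
#       print(xfs_nobarrier_status(f.read()))
```

Splitting the option string on commas avoids false matches on options that merely contain the word as a substring.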

Regards,
Richard

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com