Are you absolutely sure there's nothing in dmesg before this? Something seems to be missing. Is this from dmesg or from a different log?

Usually, if a drive drops out, there is an I/O error first (itself caused by a timed-out SCSI command), and then the error recovery kicks in and emits messages like this one. This message by itself just should not be there. Or is that with the debugging already enabled? In that case it's a red herring, not _the_ problem.

Synchronize cache is a completely ordinary command - you absolutely _want_ it in there. The only case where you could avoid it is if you trust the capacitors on the drives _and_ the OS to order the requests right (a bold assumption IMO), and disable barriers (btw, that will not work for a journal on a block device).

How often does this happen? You could try recording the events with "btrace", so you know what the block device is really doing from the kernel's block-layer perspective.

In any case, this command should be harmless and is expected to occur quite often, so LSI telling you "don't do it" is like Ford telling me "your brakes are broken, so don't use them when driving".

I'm getting really angry at LSI. We have problems with them as well, and their support is just completely useless. And for the record, _ALL_ the drives I tested are faster on Intel SAS than on LSI (2308), and often faster on regular SATA AHCI than on their "high throughput" HBAs. The driver parameters are barely documented, and if you google a bit you'll find many people having problems with them (not limited to Linux). I'll definitely avoid LSI HBAs in the future if I can.

Feel free to mail me off-list. I'm very interested in your issue because I have the same combination (LSI + Intels) in my cluster right now. They seem to work fine, though.

Jan
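
P.S. To check whether anything precedes the Synchronize Cache message, something along these lines might help. It is only a rough sketch: the function name `errors_before`, the window size and the `ERROR_HINTS` patterns are my own assumptions about what typical I/O error and timeout lines look like, so adjust them to whatever your kernel and driver actually print.

```python
import re

# Assumption: rough wording of kernel I/O error / timeout lines.
# These patterns are illustrative, not an exhaustive or driver-specific list.
ERROR_HINTS = re.compile(r"I/O error|timed out|timeout", re.IGNORECASE)


def errors_before(lines, needle="Synchronize Cache", window=50):
    """For every line containing `needle`, return that line together with
    the error-looking lines found in the `window` lines preceding it."""
    results = []
    for i, line in enumerate(lines):
        if needle.lower() in line.lower():
            start = max(0, i - window)
            hits = [prev for prev in lines[start:i] if ERROR_HINTS.search(prev)]
            results.append((line, hits))
    return results
```

Feed it the log as a list of lines, for example `open("dmesg.txt").read().splitlines()` on a saved copy of the dmesg output (the file name is just a placeholder). An empty list of hits next to a Synchronize Cache line would mean the message really did show up with nothing suspicious in front of it, at least within the chosen window.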
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com