OSD crashed due to filestore EIO

GuangYang <yguang11@xxxxxxxxxxx> · Wed, 29 Oct 2014 13:27:15 +0000

Recently we observed an OSD crash caused by file corruption in the underlying filesystem. The corruption led to an assertion failure in FileStore::read, because EIO is not tolerated there. Since file corruption is to be expected in a large deployment, I wonder whether that behavior is too aggressive, especially for EC pools.

After searching, I found a flag that might help: filestore_fail_eio. It is true by default, so presumably setting it to false would let the OSD survive an EIO failure. I haven't tested it yet.

Does it make sense to adjust the behavior a little? If a filestore read fails due to file corruption, the OSD could return the failure and at the same time mark the PG as inconsistent. Thanks to the redundancy (replication or EC), the request can still be served, and we would get an alert about the inconsistency so that we can manually trigger a PG repair.
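
To make the proposal concrete, here is a rough sketch of the behavior I have in mind, written as a toy Python model rather than actual Ceph code. ToyOSD and ToyPG are hypothetical names invented for illustration; they only model replicated reads, not EC reconstruction.

```python
import errno


class ToyOSD:
    """Toy stand-in for one replica's object store (hypothetical, not Ceph code)."""

    def __init__(self, name, objects, corrupt=()):
        self.name = name
        self.objects = dict(objects)
        self.corrupt = set(corrupt)  # object ids whose backing file is "corrupted"

    def read(self, oid):
        # Simulate the filesystem returning EIO for a corrupted file.
        if oid in self.corrupt:
            raise OSError(errno.EIO, "simulated filestore EIO", oid)
        return self.objects[oid]


class ToyPG:
    """Toy placement group: an ordered list of replicas plus an inconsistency flag."""

    def __init__(self, osds):
        self.osds = list(osds)
        self.inconsistent = False

    def read(self, oid):
        last_err = None
        for osd in self.osds:
            try:
                return osd.read(oid)
            except OSError as e:
                if e.errno != errno.EIO:
                    raise  # only EIO is tolerated in this sketch
                # Instead of asserting, remember the problem and try the next copy.
                self.inconsistent = True
                last_err = e
        # Every copy failed with EIO: return the failure to the caller.
        raise last_err


primary = ToyOSD("osd.0", {"obj": b"data"}, corrupt={"obj"})
replica = ToyOSD("osd.1", {"obj": b"data"})
pg = ToyPG([primary, replica])
data = pg.read("obj")  # served from the replica despite EIO on the primary
```

After the read, pg.inconsistent is set, which stands in for the alert that would prompt an operator to trigger a PG repair. If every copy returns EIO, the error is passed back to the caller instead of crashing.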

Thanks,
Guang