On 23/11/2015 21:01, Jose Tavares wrote:
Ceph currently uses filesystems to store its data. As no supported filesystem or software layer handles bad blocks dynamically, you *will* see OSD filesystems remounted read-only and OSDs failing as soon as a single sector misbehaves (and if the drive has already emptied its reserve, you are almost guaranteed to get new defective sectors later; see below). If bad drives are distributed over your whole cluster, the chance of simultaneous failures rises sharply, and with it the chance of degraded or inactive PGs (which freeze any I/O to them). You will then have to bring these OSDs back online manually to recover (unfreeze I/O). If you can't, because the drives have failed to the point where the OSD content is unrecoverable, you will simply lose data.

From what I can read here, the main filesystems for Ceph are XFS, Btrfs and ext4, with some people using ZFS. Of those four, only ext4 supports manually setting bad blocks, and only on an unmounted filesystem. If you don't have the precise offset of each bad sector, you'll have to scan the whole device (e2fsck -c) before e2fsck can *try* to put your filesystem back in shape after any bad block is detected. You will have to be very careful to remove every file that used a bad block, to avoid corrupting data before restarting the OSD (e2fsck should hopefully move them to lost+found for you). Depending on which files are missing, you may not be able to restart the OSD at all. Finally, at least XFS and Btrfs have no bad-block support at all AFAIK, so you simply can't keep using such drives with these two filesystems: the filesystems will fail and fsck won't help. MD RAID won't help you either, as it has zero support for bad blocks.

The near-total absence of bad-block support is easy to understand from past history. Only old filesystems, designed when drives had no internal reserve to handle bad blocks transparently and bad sectors were a normal occurrence, still support keeping tabs on them (ext4 inherited it from ext2, vfat/fat32 has it too, ...). Today, a disk drive that starts reporting bad sectors on reads has emptied its reserve, so it already has a long history of bad sectors behind it. It isn't failing one sector; it's in the process of failing thousands of them, so there's no reason to expect it to behave correctly anymore. The application layers above (md, lvm, filesystems, ...) simply don't try to fight a battle that can't be won and that would add complexity and cost performance on a healthy drive.

Lionel
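PS: for anyone who ends up walking the ext4 path described above, here is a rough sketch of the commands involved, assuming a hypothetical, already unmounted OSD data partition /dev/sdX1 (device names, block units and paths are illustrative only; adapt them to your setup):

    # Check the drive's own view first: non-zero reallocated or pending
    # sector counts mean the internal reserve is already being consumed.
    smartctl -A /dev/sdX | grep -Ei 'Reallocated_Sector|Current_Pending'

    # No precise offsets known: scan the whole device (read-only test)
    # and record the hits in the filesystem's bad-block inode. This
    # reads every sector, so expect it to take hours on a large drive.
    e2fsck -f -c /dev/sdX1

    # If you already have a list of bad block numbers (one per line,
    # in filesystem block units), you can skip the full scan:
    #   e2fsck -f -l /path/to/badblocks.txt /dev/sdX1

    # Map a marked block back to the file using it, so the file can be
    # removed before the OSD is restarted:
    #   debugfs -R "icheck <block_number>" /dev/sdX1   -> inode number
    #   debugfs -R "ncheck <inode_number>" /dev/sdX1   -> file path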