Re: CEPH over SW-RAID

On Mon, Nov 23, 2015 at 6:40 PM, Lionel Bouton <lionel-subscription@xxxxxxxxxxx> wrote:
On 23/11/2015 21:01, Jose Tavares wrote:



> My new question regarding Ceph is whether it isolates these bad sectors where it found bad data when scrubbing, or whether there will always be a replica of something sitting on a known bad block?
Ceph OSDs don't know about bad sectors; they delegate I/O to the filesystems beneath them. Some filesystems (ZFS or Btrfs when using redundancy at their level) can recover from corrupted data on one drive, and those same filesystems, when used without redundancy, will refuse to hand the Ceph OSD corrupted data. Ceph detects this (usually during scrubs), and a manual Ceph repair will then rewrite the data over the corrupted copy (at that point, if the underlying drive detected a bad sector, it will not reuse it).
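In practice that detection/repair step looks roughly like this (the pg id below is only a placeholder):

    # list the pgs that scrubbing flagged as inconsistent
    ceph health detail | grep inconsistent
    # re-check one of them and rewrite the bad copy from a good replica
    ceph pg deep-scrub 2.3f
    ceph pg repair 2.3f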

Just forget about the hardware bad block remap list. It got filled as soon as we started to use the drive .. :)

Then you can move this drive to the trash pile/ask for a replacement. It is basically unusable.

Why?
1 (or more) out of every 8 drives I see has the remap list full...
If you isolate the rest using software you can continue to use the drive .. there are no performance issues, etc.
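For what it's worth, the state of that remap list is easy to check from SMART, something like this (device name is a placeholder, attribute names vary slightly between vendors):

    # a non-zero, growing Reallocated_Sector_Ct means the reserve is
    # being consumed; pending/uncorrectable sectors are already failing
    smartctl -A /dev/sdX | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'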


Ceph currently uses filesystems to store its data. As there is no supported filesystem or software layer handling badblocks dynamically, you *will* have OSD filesystems remounted read-only and OSD failures as soon as you hit one misbehaving sector (and if the drive has already emptied its reserve, you are almost guaranteed to get new defective sectors later, see below). If your bad drives are distributed over your whole cluster, you have a far higher chance of simultaneous failures and of degraded or inactive pgs (which freeze any I/O to them). You will then have to manually put these OSDs back online to recover (unfreeze the I/O). If you don't succeed, because the drives have failed to the point that you can't recover the OSD content, you will simply lose data.
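If you run such drives anyway, getting out of that situation is a manual dance; a rough sketch, assuming a systemd deployment and osd.12 as a placeholder:

    # keep the cluster from marking the osd out and rebalancing
    ceph osd set noout
    # see which pgs are degraded or inactive because of the failed osd
    ceph health detail | grep -E 'degraded|inactive|down'
    # repair/remount the OSD filesystem, then bring the daemon back
    systemctl start ceph-osd@12      # or: service ceph start osd.12
    ceph osd unset noout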

From what I can read here, the main filesystems used for Ceph are XFS, Btrfs and ext4, with some people using ZFS. Of those four, only ext4 has support for manually setting badblocks on an unmounted filesystem. If you don't have the precise offset of each bad sector, you'll have to scan the whole device (e2fsck -c) before e2fsck can *try* to put your filesystem back in shape after any bad block is detected. You will have to be very careful to remove any file using a bad block to avoid corrupting data before restarting the OSD (hopefully e2fsck will move them to lost+found for you). Depending on which files are missing, you might not be able to restart the OSD at all.
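Concretely, on an ext4 OSD filesystem that scan would look something like this (device and OSD id are placeholders, and the OSD must be stopped first so the filesystem is unmounted):

    umount /var/lib/ceph/osd/ceph-12
    # -c scans the device with badblocks (read-only); giving it twice
    # (-cc) does a non-destructive read-write test instead; very slow
    e2fsck -f -y -cc /dev/sdX1
    # show the bad blocks now recorded in the filesystem
    dumpe2fs -b /dev/sdX1
    # anything e2fsck had to detach should end up in lost+found
    mount /dev/sdX1 /var/lib/ceph/osd/ceph-12
    ls /var/lib/ceph/osd/ceph-12/lost+found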

Finally, as far as I know neither XFS nor Btrfs has any support for bad blocks. So you simply can't use such drives with these two filesystems without the filesystems failing and fsck being unable to help. MD RAID won't help you either, as it has zero support for badblocks.

The fact that badblocks support is almost non-existent is easy to understand from history. Only old filesystems, which were in use back when drives didn't have an internal reserve to handle bad blocks transparently and bad blocks were a normal occurrence, still support keeping tabs on bad sectors (ext4 got it from ext2, vfat/fat32 has it too, ...). Today, a disk drive that starts reporting bad sectors on reads has emptied its reserve, so it already has a large history of bad sectors. It isn't failing one sector, it's in the process of failing thousands of them, so there's no reason to expect it to behave correctly anymore: all the application layers above (md, lvm, filesystems, ...) simply don't try to fight a battle that can't be won and that would add complexity and hurt performance on a normally working drive.

Lionel

Lionel, you made some really good points here.

XFS really doesn't have support for isolating bad blocks. I don't know about Btrfs. Maybe XFS was designed to be used with a hardware RAID that would handle hardware problems. I don't know.

A good strategy would be to use ext4 and have some schedule: put the OSD down, run fsck with badblocks to isolate the disk problems, then bring the OSD back up. If scrubbing finds something bad, it will be repaired, but the bad blocks will no longer be available, so no data will be put back there. It would be a two-layer verification.
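Roughly, that scheduled cycle would be something like the following (untested; osd.12 and the device are placeholders, and the init-system command depends on the distro):

    ceph osd set noout
    systemctl stop ceph-osd@12          # or: service ceph stop osd.12
    umount /var/lib/ceph/osd/ceph-12
    # record any new bad blocks in the ext4 badblock inode so the
    # allocator never hands them out again
    e2fsck -f -y -cc /dev/sdX1
    mount /dev/sdX1 /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12
    ceph osd unset noout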

AFAIK, people are complaining about lots of bad blocks on the new big disks. The hardware remap list seems to be small and unable to replace these blocks.


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
