Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected

Jan Kara <jack@xxxxxxx> · Wed, 31 Jul 2013 20:52:43 +0200



On Tue 30-07-13 08:31:09, Zheng Liu wrote:
> Hi Jeff,
> 
> On Mon, Jul 29, 2013 at 11:28:38AM -0400, Jeff Moyer wrote:
> > Zheng Liu <gnehzuil.liu@xxxxxxxxx> writes:
> > 
> > > My idea is to let the file system ignore the corrupted block.  Namely,
> > > when we meet a corrupted block, we will track it as a bad block in the bad
> > > block inode and find another block to save the data.  This corrupted block
> > > will never be used again.  The first step in my mind is to detect a corrupted
> > > block and mark it as a bad block.  After reading the thread and Darrick's
> > > original patch, I think Darrick's patch is a good start.
> > 
> > I think it's important to call out the exact failure scenario you're
> > trying to address.  For hard disks, if you get a read error, it can
> > typically be recovered by re-writing the block.  I imagine this is what
> > fsck would be doing for metadata repair.  So, I'm not at all sure why
> > you'd want to track bad blocks in the file system itself.  Could you
> > elaborate, please?
> 
> In our production system at Taobao, we have a large CDN system around the
> country.  These servers cache most of the web pages, images, etc.
> These servers have some disks, and a disk must break down at some
> time.  Currently we need to umount this disk, and the whole disk is just left
> in the server until the whole server is retired.  But as you have pointed
> out, when we meet a disk failure, most of the disk might still work.  So
> we hope that the file system could track the bad blocks and not allocate
> them, so that the rest of the space can still be used.  This can help us to
> reduce the cost.
  Well, before spending too much time on this, try finding some study
(I've read some from Google, I think, I just don't have the URL at hand) on
the estimated lifetime of a disk after bad sectors start appearing.
What I remember is that once bad sectors start appearing, the disk is
usually going to die within weeks with high probability. So I'm not sure the
cost saving from a few additional weeks of lifetime is worth the trouble. As
Ted said, there may be other reasons why you'd want a feature like this:
a kernel error causing bitmap corruption, or just needing to keep the
machine up for a few more hours before you can take it down for
maintenance.

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR