Re: [Lsf-pc] [LSF/MM TOPIC] improving writeback error handling

Trond Myklebust <trondmy@hammer.space> · Thu, 19 Apr 2018 02:12:33 +0000

On Wed, 2018-04-18 at 18:57 -0700, Matthew Wilcox wrote:
> On Thu, Apr 19, 2018 at 01:47:49AM +0000, Trond Myklebust wrote:
> > If the main use case is something like Postgresql, where you care
> > about just one or two critical files, rather than monitoring the
> > entire filesystem could we perhaps use a dedicated mmap() mode? It
> > should be possible to throw up a bitmap that displays the exact
> > blocks or pages that are affected, once the file has been damaged.
> 
> Perhaps we need to have a quick summary of the postgres problem ...
> they're not concerned with "one or two files", otherwise they could
> just keep those files open and the wb_err mechanism would work fine.
> The problem is that they have too many files to keep open in their
> checkpointer process, and when they come along and open the files,
> they don't see the error.

I thought I understood that there were at least two issues here:

1) Monitoring lots of files to figure out which ones may have an error.
2) Drilling down to see what might be wrong with an individual file.

Unless you are in a situation where millions of files can all go wrong
at the same time, it would seem that the former is the operation that
needs to scale. Once you're talking about large numbers of files all
getting errors, it would appear that an fsck-like recovery would be
necessary. Am I wrong?
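
For concreteness, the "keep the file open" case Matthew describes could
look roughly like the sketch below. It is only an illustration, not
anything Postgres actually does: write_and_sync() is a hypothetical
helper name, and the assumption is that a writeback failure is reported
by fsync() on a descriptor that was open when the failure occurred.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): write buf to path and
 * fsync() through the same descriptor. The fd stays open from the
 * first write until fsync() returns, so a writeback error in that
 * window is expected to be reported here.
 * Returns 0 on success or a negative errno value.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the write */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Check fsync() before closing: this is where a failed
     * writeback would show up for this descriptor. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

A caller would treat any negative return (for example -EIO) as "this
file may be damaged". The scaling problem is exactly that a checkpointer
cannot hold a descriptor like this open for every file it cares about.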

Cheers
  Trond