----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Prashanth Pai" <ppai@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Thiago da Silva" <thiago@xxxxxxxxxx>
> Sent: Monday, October 5, 2015 11:37:00 AM
> Subject: Re: Handling Failed flushes in write-behind
>
> ----- Original Message -----
> > From: "Prashanth Pai" <ppai@xxxxxxxxxx>
> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Thiago da Silva" <thiago@xxxxxxxxxx>
> > Sent: Wednesday, September 30, 2015 11:38:38 AM
> > Subject: Re: Handling Failed flushes in write-behind
> >
> > > > > As far as 2 goes, the application can checkpoint by doing fsync and,
> > > > > on write failures, roll back to the last checkpoint and replay writes
> > > > > from that checkpoint. Or, glusterfs can retry the writes on behalf of
> > > > > the application. However, glusterfs retrying writes cannot be a
> > > > > complete solution, as the error condition we've run into might never
> > > > > get resolved (e.g., running out of space). So, glusterfs has to give
> > > > > up after some time.
> >
> > The application should not be expected to replay writes. glusterfs must
> > retry the failed write.
>
> Well, failed writes can fail due to two categories of errors:
>
> 1. The error condition is transient, or the file system can do something to
>    alleviate the error.
> 2. The error condition is permanent, or the file system has no control over
>    how to recover from the failure condition (e.g., a network failure).
>
> The best a file system can do in scenario 1 is:
> 1. Try to do things to alleviate the error.
> 2. Retry the writes.
>
> For example, ext4, on seeing a writeback failure with ENOSPC, tries to free
> some space by freeing some extents (again, extents are managed by the
> filesystem) and retries. This retry happens only once after the failure;
> after that, the page is marked with an error.
> As far as failure scenario 2 goes, there is no point in retrying, and it is
> difficult to have a well-defined policy on how long we can keep retrying.
> The purpose of this mail is to identify errors that fall into scenario 1
> above and to have a recovery policy for them. I am afraid glusterfs cannot
> do much in scenario 2. If you have ideas that can help for scenario 2, I am
> open to incorporating them.
>
> I did a quick look at how various filesystems handle writeback failures
> (this is not extensive research, and hence there might be some
> incorrectness):
>
> 1. FUSE:
> ======
> FUSE implemented write-back caching in kernel version 3.15. In its current
> version, it doesn't replay the writes at all on writeback failure.
>
> 2. xfs:
> ====
> xfs seems to have an intelligent failure-handling mechanism on writeback
> failure. For some errors, it marks the pages as dirty again after the
> writeback failure; for other errors, it doesn't retry. I couldn't look into
> the details of which errors are retried and which are not.
>
> 3. ext4:
> =====
> Only ENOSPC errors are retried, and only once.
>
> Also, please note that, to the best of my knowledge, POSIX only guarantees
> that writes checkpointed by fsync have been persisted. Given the above
> constraints, I am curious to know how applications handle similar issues on
> other filesystems.
>
> > In gluster-swift, we had hit a case where the application would get EIO
> > but the write had actually failed because of ENOSPC.
>
> From the linux kernel source tree:
>
> static inline void mapping_set_error(struct address_space *mapping,
>                                      int error)
> {
>         if (unlikely(error)) {
>                 if (error == -ENOSPC)
>                         set_bit(AS_ENOSPC, &mapping->flags);
>                 else
>                         set_bit(AS_EIO, &mapping->flags);
>         }
> }
>
> Seems like only ENOSPC is stored; the rest of the errors are transformed
> into EIO. Again, we are ready to comply with whatever the standard practice
> is.

Very informative, thanks for looking into this.
On getting ENOSPC or EDQUOT from glusterfs, gluster-swift gracefully tells
the HTTP client that the particular account (a volume) has run out of space.
With write-behind turned on, because we get EIO instead, the application
code that handles ENOSPC/EDQUOT in a specific way has no chance to do its
job. It is good enough if the errno returned by the filesystem
(ENOSPC/EDQUOT, to be specific) on the brick is propagated as-is to the
upper layers by the write-behind xlator.

Thanks.

> > https://bugzilla.redhat.com/show_bug.cgi?id=986812
> >
> > Regards,
> > -Prashanth Pai
> >
> > ----- Original Message -----
> > > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > > To: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> > > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Ben Turner" <bturner@xxxxxxxxxx>, "Ira Cooper" <icooper@xxxxxxxxxx>
> > > Sent: Tuesday, September 29, 2015 4:56:33 PM
> > > Subject: Re: Handling Failed flushes in write-behind
> > >
> > > + gluster-devel
> > >
> > > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote:
> > > > > Hi All,
> > > > >
> > > > > Currently, on failure of flushing the writeback cache, we mark the fd
> > > > > bad. The rationale behind this is that, since the application doesn't
> > > > > know which of the cached writes failed, the fd is in a bad state and
> > > > > cannot possibly do a meaningful/correct read. However, this approach
> > > > > (though posix-compliant) is not acceptable for long-standing
> > > > > applications like QEMU [1]. So, a two-part solution was decided:
> > > > >
> > > > > 1. No longer mark the fd bad on failures while flushing data to the
> > > > > backend from the write-behind cache.
> > > > > 2. Retry the writes.
> > > > >
> > > > > As far as 2 goes, the application can checkpoint by doing fsync and,
> > > > > on write failures, roll back to the last checkpoint and replay writes
> > > > > from that checkpoint.
> > > > > Or, glusterfs can retry the writes on behalf of the application.
> > > > > However, glusterfs retrying writes cannot be a complete solution, as
> > > > > the error condition we've run into might never get resolved (e.g.,
> > > > > running out of space). So, glusterfs has to give up after some time.
> > > > >
> > > > > It would be helpful if you could give your inputs on how other
> > > > > writeback systems (e.g., the kernel page cache, nfs, samba, ceph,
> > > > > lustre, etc.) behave in this scenario and what would be a sane
> > > > > policy for glusterfs.
> > > > >
> > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1200862
> > > > >
> > > > > regards,
> > > > > Raghavendra
> > > > >
> > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel@xxxxxxxxxxx
> > > http://www.gluster.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel