----- Original Message -----
> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> To: "Prashanth Pai" <ppai@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Thiago da Silva" <thiago@xxxxxxxxxx>
> Sent: Monday, October 5, 2015 11:37:00 AM
> Subject: Re: Handling Failed flushes in write-behind
>
> ----- Original Message -----
> > From: "Prashanth Pai" <ppai@xxxxxxxxxx>
> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Thiago da Silva" <thiago@xxxxxxxxxx>
> > Sent: Wednesday, September 30, 2015 11:38:38 AM
> > Subject: Re: Handling Failed flushes in write-behind
> >
> > > > > As far as 2 goes, the application can checkpoint by doing fsync and,
> > > > > on write failures, roll back to the last checkpoint and replay writes
> > > > > from that checkpoint. Or, glusterfs can retry the writes on behalf of
> > > > > the application. However, glusterfs retrying writes cannot be a
> > > > > complete solution, as the error condition we've run into might never
> > > > > get resolved (e.g., running out of space). So, glusterfs has to give
> > > > > up after some time.
> >
> > The application should not be expected to replay writes. glusterfs must
> > retry the failed write.
>
> Well, failed writes can fail due to two categories of errors:
>
> 1. The error condition is transient, or the file system can do something to
>    alleviate the error.
> 2. The error condition is permanent, or the file system has no control over
>    how to recover from the failure condition (e.g., a network failure).
>
> The best a file system can do in scenario 1 is:
> 1. Try to do things to alleviate the error.
> 2. Retry the writes.
>
> For example, ext4, on seeing a writeback failure with ENOSPC, tries to free
> some space by freeing some extents (again, extents are managed by the
> filesystem) and retries. This retry happens only once after the failure;
> after that, the page is marked with an error.
> As far as failure scenario 2 goes, there is no point in retrying, and it is
> difficult to have a well-defined policy on how long we can keep retrying.
> The purpose of this mail is to identify errors that fall into scenario 1
> above and to have a recovery policy for them. I am afraid glusterfs cannot
> do much in scenario 2. If you have ideas that can help for scenario 2, I am
> open to incorporating them.
>
> I did a quick look at how various filesystems handle writeback failures
> (this is not extensive research, and hence there might be some
> incorrectness):
>
> 1. FUSE:
> ======
> FUSE implemented write-back caching in kernel version 3.15. In its current
> version, it doesn't replay the writes at all on writeback failure.
>
> 2. xfs:
> ====
> xfs seems to have an intelligent failure-handling mechanism on writeback
> failure. For some errors, it marks the pages as dirty again after the
> writeback failure; for other errors, it doesn't retry. I couldn't look into
> the details of which errors are retried and which are not.
>
> 3. ext4:
> =====
> Only ENOSPC errors are retried, and only once.
>
> Also, please note that, to the best of my knowledge, POSIX only guarantees
> that writes checkpointed by fsync have been persisted. Given the above
> constraints, I am curious to know how applications handle similar issues on
> other filesystems.
>
> > In gluster-swift, we had hit a case where the application would get EIO
> > but the write had actually failed because of ENOSPC.
>
> From the linux kernel source tree:
>
> static inline void mapping_set_error(struct address_space *mapping,
>                                      int error)
> {
>         if (unlikely(error)) {
>                 if (error == -ENOSPC)
>                         set_bit(AS_ENOSPC, &mapping->flags);
>                 else
>                         set_bit(AS_EIO, &mapping->flags);
>         }
> }
>
> Seems like only ENOSPC is stored; the rest of the errors are transformed
> into EIO. Again, we are ready to comply with whatever the standard practice
> is.

Very informative, thanks for looking into this.
On getting ENOSPC or EDQUOT from glusterfs, gluster-swift gracefully tells
the HTTP client that the particular account (a volume) has run out of space.
With write-behind turned on, because we get EIO instead, the application
code that handles ENOSPC/EDQUOT in a specific way has no chance to do its
job. It is good enough if the errno returned by the filesystem
(ENOSPC/EDQUOT, to be specific) on the brick is propagated as-is to the
upper layers by the write-behind xlator.

Thanks.

> > https://bugzilla.redhat.com/show_bug.cgi?id=986812
> >
> > Regards,
> > -Prashanth Pai
> >
> > ----- Original Message -----
> > > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > > To: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> > > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Ben Turner" <bturner@xxxxxxxxxx>, "Ira Cooper" <icooper@xxxxxxxxxx>
> > > Sent: Tuesday, September 29, 2015 4:56:33 PM
> > > Subject: Re: Handling Failed flushes in write-behind
> > >
> > > + gluster-devel
> > >
> > > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote:
> > > > > Hi All,
> > > > >
> > > > > Currently, on failure of flushing the writeback cache, we mark the fd
> > > > > bad. The rationale behind this is that, since the application doesn't
> > > > > know which of the cached writes failed, the fd is in a bad state and
> > > > > cannot possibly do a meaningful/correct read. However, this approach
> > > > > (though posix-compliant) is not acceptable for long-standing
> > > > > applications like QEMU [1]. So, a two-part solution was decided:
> > > > >
> > > > > 1. No longer mark the fd bad on failures while flushing data to the
> > > > > backend from the write-behind cache.
> > > > > 2. Retry the writes.
> > > > >
> > > > > As far as 2 goes, the application can checkpoint by doing fsync and,
> > > > > on write failures, roll back to the last checkpoint and replay writes
> > > > > from that checkpoint.
> > > > > Or, glusterfs can retry the writes on behalf of the application.
> > > > > However, glusterfs retrying writes cannot be a complete solution, as
> > > > > the error condition we've run into might never get resolved (e.g.,
> > > > > running out of space). So, glusterfs has to give up after some time.
> > > > >
> > > > > It would be helpful if you could give your inputs on how other
> > > > > writeback systems (e.g., the kernel page cache, nfs, samba, ceph,
> > > > > lustre, etc.) behave in this scenario and what would be a sane
> > > > > policy for glusterfs.
> > > > >
> > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1200862
> > > > >
> > > > > regards,
> > > > > Raghavendra
> > > > >
> > >
> > > _______________________________________________
> > > Gluster-devel mailing list
> > > Gluster-devel@xxxxxxxxxxx
> > > http://www.gluster.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel