On 10/23/07, Kevan Benson <kbenson@xxxxxxxxxxxxxxx> wrote:
>
> Alexey Filin wrote:
> > correctly, if the afr xlator node doesn't crash during writing. If it
> > crashes, the close() is not issued at all and the version attribute is
> > not updated (according to your description). If the children update the
> > version on their own as a workaround after an afr xlator node crash, the
> > new versions of the replicas are equal but the data can be different
> > (because operations are queued in the interconnect and issued
> > sequentially as the datagrams come out). Such a situation can't occur if
> > every operation is atomic relative to the version attribute, i.e. the
> > attribute is updated instantly after every operation.
> >
> > I'll be happy if I just don't know something that handles the situation
> > correctly in the current implementation.
>
> Actually, I just thought of a major problem with this. I think the
> extended attributes need to be set as atomic operations. Imagine the
> case where two processes are writing the file at the same time, the op
> counters could get very messed up.

Atomic operations are an ideal that is sometimes not achievable in
practice; ideal hardware exists only in the mind, and on real hardware
developers always choose a compromise between complexity, performance,
reliability, flexibility, etc. To keep the operation counter (or the
version, if it is updated after each operation) consistent, concurrent
access to the same file has to be handled either:

* with one thread (so that concurrent operations on _one_ file are
  serviced by _one_ thread only), which can provide atomicity with
  explicit queuing, or
* with sync primitive(s) shared by many threads.

io-threads help to decrease latencies when many clients use the same
brick (as e.g. a glfs doc says) or to overlap network and disk io to
increase per-client performance (is the latter implemented in glfs?).

> Another solution comes to mind. Just set another extended attribute
> denoting that the file is being written to currently (and unset it
> afterwards). If the AFR subvolume notices that the file is listed as
> being written to but no clients have it open (I hope this is easily
> determinable), a flag is returned for the file. If all subvolumes return
> this flag for the file in the AFR (and all the trusted_afr_versions are
> the same), choose one version of the file (for example from the first
> AFR subvolume) as the legit copy and copy it to the other AFR nodes. It
> doesn't matter which version is the most up to date, they will all be
> fairly close, and since this is from a failed write operation there was
> no guarantee the file was in a valid state after the write. It doesn't
> matter which copy you get, as long as it's consistent across AFR
> members.

I like it more than the op counter. Its advantage over the op counter is
that the flag is set only twice (at open()/close()), so the overhead is
minimal (concurrent access to the flag still has to be synchronized).
The disadvantage is that if a file that was never closed is big enough,
it sometimes has to be copied even when that isn't really required;
that is acceptable if afr crashes rarely.
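To make the flag idea a bit more concrete, a minimal user-space sketch
(not afr translator code) might look like the following. The xattr name
user.afr.writing and the helper names are made up for illustration only;
the real translator keeps its bookkeeping (e.g. trusted_afr_version) in
the trusted.* namespace and inside the xlator call path. It assumes
Linux and a filesystem mounted with extended attributes enabled.

/*
 * Sketch only: mark a file as "being written" with an extended
 * attribute for the lifetime of an open-for-write descriptor, and
 * detect a leftover mark afterwards.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

#define WRITING_XATTR "user.afr.writing"   /* hypothetical flag name */

/* Set the flag once, right after opening the file for writing. */
static int mark_writing(int fd)
{
    const char one = '1';
    return fsetxattr(fd, WRITING_XATTR, &one, sizeof(one), 0);
}

/* Clear the flag once, just before a successful close(). */
static int clear_writing(int fd)
{
    return fremovexattr(fd, WRITING_XATTR);
}

/*
 * Self-heal side of the idea: a replica that still carries the flag
 * while no client has the file open was interrupted mid-write.  If
 * every subvolume reports the flag (and the versions match), one copy
 * is picked and replicated to the others -- which copy doesn't matter,
 * only that all members end up identical.
 */
static int needs_heal(const char *path)
{
    char buf;
    return getxattr(path, WRITING_XATTR, &buf, sizeof(buf)) >= 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (mark_writing(fd) < 0) perror("fsetxattr");     /* flag set: open()    */

    if (write(fd, "payload\n", 8) < 0) perror("write");

    /* A crash here would leave the flag on disk, so needs_heal() would
     * later report this replica as a candidate for healing.             */

    if (clear_writing(fd) < 0) perror("fremovexattr"); /* flag cleared: close() */
    close(fd);

    printf("needs heal now? %s\n", needs_heal(argv[1]) ? "yes" : "no");
    return 0;
}

Nothing in this sketch is atomic across subvolumes, of course; the only
point is that a crash between mark_writing() and clear_writing() leaves
a durable hint on the replica, which is exactly what the self-heal
decision described above relies on.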
P.S.

> For those still unsure what we are referring to, it's the case where a
> write to an AFR fails, so no AFR subvolume finishes and calls close().
> In this case the trusted_afr_version hasn't been incremented, but the
> actual data in the files may not be consistent across AFR subvolumes.
> As I've seen in prior testing, subsequent operations on the file will
> happen independently on each subvolume, and the files may continue to
> stay out of sync. The data in the file may not be entirely trusted due
> to the failed write, but it should at least be consistent across AFR
> subvolumes. An AFR subvolume failure should not change what data is
> returned.

Surely, replica consistency after an afr crash is the core of my concern.

> --
> -Kevan Benson
> -A-1 Networks

Regards, Alexey.