----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Cc: "Vijaikumar Mallikarjuna" <vmallika@xxxxxxxxxx>, "Sachin Pandit" <spandit@xxxxxxxxxx>
Sent: Friday, 22 May, 2015 10:50:14 AM
Subject: Enhancing Quota enforcement during parallel writes

All,

As pointed out in [1], parallel writes can result in incorrect quota
enforcement. [2] was an (unsuccessful) attempt to solve the issue. Some
points about [2]:

in_progress_writes is updated _after_ we fetch the size. Because of this,
two writes can see the same size, and hence the issue is not solved. What
we should be doing is to update in_progress_writes even before we fetch
the size. If we do that, it is guaranteed that at least one write sees the
other's size accounted in in_progress_writes. This approach has two
issues:

1. Since we have already added the current write's size to
   in_progress_writes, the current write would be accounted in the size of
   the directory. This is a minor issue and can be solved by subtracting
   the size of the current write from the resultant cluster-wide
   in-progress size of the directory.

2. We might prematurely fail writes even though there is some space
   available. Assume there is 5MB of free space. If two 5MB writes are
   issued in parallel, both might fail, as each might see the other's size
   already accounted even though neither has succeeded. Of course, we can
   live with this limitation, since it errs on the conservative side, if
   the following logic seems too complicated.

To solve this issue, I am proposing the following algorithm:

* We assign each write an identity that is unique across the cluster, say
  a uuid.
* Among all the in-progress writes we pick one. The policy can be an
  arbitrary but globally agreed-upon criterion, like the smallest of all
  the uuids.
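As a sketch of that selection policy (hypothetical names, not the quota
translator's actual API), each 16-byte uuid can be compared bytewise and
the smallest one wins:

```c
#include <stddef.h>
#include <string.h>

/* One in-progress write, identified by a 16-byte uuid. memcmp() gives
 * the same ordering as libuuid's uuid_compare() for this purpose. */
typedef struct {
        unsigned char uuid[16];
} qw_write_t;

/* Return the write with the smallest uuid among writes[0..n-1],
 * or NULL if the list is empty. */
static qw_write_t *
least_of_uuids (qw_write_t *writes, size_t n)
{
        qw_write_t *least = NULL;

        for (size_t i = 0; i < n; i++) {
                if (least == NULL ||
                    memcmp (writes[i].uuid, least->uuid, 16) < 0)
                        least = &writes[i];
        }

        return least;
}
```

The important property is only that the comparison is deterministic, so
bricks and clustering translators like dht all agree on the same winner.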
So, each brick selects a candidate among its own in-progress writes _AND_
the incoming candidate (see the pseudocode of get_dir_size below for more
clarity). It sends this candidate back along with the size of the
directory. The brick also remembers the last candidate it approved.
Clustering translators like dht then pick one write among these replies,
using the same logic the bricks had used. So, along with the size, we also
get a candidate chosen from the in-progress writes. However, there might
be a new write on the brick, started in the time window in which we fetch
the size, which could be the candidate. We should therefore compare the
resultant cluster-wide candidate with the per-brick candidate. The
enforcement logic will be as below:

/* Both enforcer and get_dir_size are executed in the brick process. I've
 * left out the corresponding logic of get_dir_size in cluster translators
 * like dht.
 */
enforcer ()
{
        /* Note that this logic is executed independently for each
         * directory on which a quota limit is set. All the in-progress
         * writes, sizes and candidates below are valid in the context of
         * that directory.
         */

        my_delta = iov_length (input_iovec, input_count);
        my_id = getuuid ();

        add_my_delta_to_in_progress_size ();

        get_dir_size (my_id, &size, &in_progress_size, &cluster_candidate);

        in_progress_size -= my_delta;

        if (((size + my_delta) < quota_limit) &&
            ((size + in_progress_size + my_delta) > quota_limit)) {

                /* we have to choose among the in-progress writes */

                brick_candidate =
                        least_of_uuids (directory->in_progress_write_list,
                                        directory->last_winning_candidate);

                if ((my_id == cluster_candidate) &&
                    (my_id == brick_candidate)) {
                        /* 1. subtract my_delta from per-brick in-progress
                         *    writes
                         * 2. add my_delta to per-brick sizes of all parents
                         * 3. allow the write
                         *
                         * Getting brick_candidate above, along with 1 and
                         * 2, should be done atomically.
                         */
                } else {
                        /* 1. subtract my_delta from per-brick in-progress
                         *    writes
                         * 2. fail the write
                         */
                }
        } else if ((size + my_delta) < quota_limit) {
                /* 1. subtract my_delta from per-brick in-progress writes
                 * 2. add my_delta to per-brick sizes of all parents
                 * 3. allow the write
                 *
                 * 1 and 2 should be done atomically.
                 */
        } else {
                /* subtract my_delta from per-brick in-progress writes */
                fail_write ();
        }
}

get_dir_size (IN incoming_candidate_id, IN directory,
              OUT *winning_candidate, ...)
{
        directory->last_winning_candidate = *winning_candidate =
                least_uuid (directory->in_progress_write_list,
                            incoming_candidate_id);

        ....
}

Comments?

[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/

regards,
Raghavendra.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
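To make the enforcement decision above concrete, here is a minimal
single-brick, single-directory simulation of it. All names are made up
for illustration; the i_am_candidate flag stands in for the uuid
comparison, and the network round-trips of the real quota translator are
ignored:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one quota-limited directory on one brick; sizes are in
 * arbitrary units. */
typedef struct {
        long long size;         /* accounted directory size       */
        long long in_progress;  /* sum of in-flight write deltas  */
        long long quota_limit;
} qd_dir_t;

enum qd_verdict { QD_ALLOW, QD_FAIL };

/* Decide one write, assuming my_delta was already added to
 * dir->in_progress before the size snapshot was taken, as the mail
 * requires. i_am_candidate is nonzero iff this write's uuid is both the
 * cluster-wide and the per-brick least uuid. */
static enum qd_verdict
qd_decide (qd_dir_t *dir, long long my_delta, int i_am_candidate)
{
        long long size        = dir->size;
        long long in_progress = dir->in_progress - my_delta;
        enum qd_verdict v;

        if (((size + my_delta) < dir->quota_limit) &&
            ((size + in_progress + my_delta) > dir->quota_limit))
                /* contended: only the agreed candidate may proceed */
                v = i_am_candidate ? QD_ALLOW : QD_FAIL;
        else if ((size + my_delta) < dir->quota_limit)
                v = QD_ALLOW;   /* uncontended: room for everyone */
        else
                v = QD_FAIL;    /* over the limit on its own      */

        /* every exit path removes the write from in-progress; an
         * allowed write is accounted in the directory size. */
        dir->in_progress -= my_delta;
        if (v == QD_ALLOW)
                dir->size += my_delta;
        return v;
}
```

With a limit of 5 units and two concurrent 3-unit writes, each fits on
its own but not together: the candidate is admitted and the other write
fails, matching the walk-through in the mail.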