Re: Orangefs ABI documentation

On Sun, Jan 24, 2016 at 12:16:15AM +0000, Al Viro wrote:
> On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
> > AV> IOW, would the following do the right thing?
> > AV> That would've left us with only one caller of
> > AV> handle_io_error()...
> > 
> > It works.  With your simplified code all
> > the needed things still happen: complete and
> > bufmap_put...
> > 
> > I've never had an error there unless I forgot
> > to turn on the client-core...
> > 
> > You must be looking for a way to get rid of
> > another macro <g>...
> 
> That as well, but mostly I want to sort the situation with cancels out and
> get a better grasp on when can that code be reached.  BTW, an error at that
> spot is trivial to arrange - just pass read() a destination with a munmapped
> page in the middle and it'll trigger just fine.  IOW,
> 	p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
> 		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	munmap(p + 16384, 16384);
> 	read(fd, p, 65536);
> with fd being a file on orangefs should step into that.
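
Spelled out as a compilable fragment, that's (a minimal sketch; error
checks omitted, and fd is assumed to be an open file on orangefs):

	#include <sys/mman.h>
	#include <unistd.h>

	void poke_a_hole(int fd)		/* name invented, obviously */
	{
		char *p = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		munmap(p + 16384, 16384);	/* hole in the middle */
		read(fd, p, 65536);		/* copy-out steps into the hole */
	}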

Hmm...  I've just realized that I don't understand why you are waiting in
the orangefs_devreq_write_iter() at all.  After all, you've reserved the
slot in wait_for_direct_io() and do not release it until right after you've
done complete(&op->done).  Why should the server block while your read(2) copies
the data from bufmap to its real destination?  After all, any further
request hitting the same slot won't come until the slot is released anyway,
right?  What (and why) would the server be doing to that slot in the
meanwhile?  It's already copied whatever data it was going to copy as part of
that file_read...

Speaking of your slot allocator - surely it would be better to maintain the
count of unused slots there?  For example, replace that waitq with a
semaphore, initialize it to the number of slots, and have wait_for_a_slot() do
	if (down_interruptible(&slargs->slots_sem)) {
		/* whine "interrupted" */
		return -EINTR;
	}
	/* now we know that there's an empty slot for us */
	spin_lock(&slargs->slots_lock);
	n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count);
	set_bit(n, slargs->slots_bitmap);
	spin_unlock(&slargs->slots_lock);
	// or a lockless variant thereof, for that matter - just need to be
	// careful with barriers
	return n;
with put_back_slot() being
	spin_lock(&slargs->slots_lock);
	clear_bit(n, slargs->slots_bitmap);
	spin_unlock(&slargs->slots_lock);
	up(&slargs->slots_sem);
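
For concreteness, the slargs those two operate on would be something like
this - field names are my guesses, not necessarily the actual orangefs ones:

	struct slot_args {
		struct semaphore slots_sem;	/* counts free slots */
		spinlock_t slots_lock;		/* protects the bitmap */
		int slot_count;
		unsigned long *slots_bitmap;	/* bit set <=> slot in use */
	};

	/* at bufmap setup time */
	sema_init(&slargs->slots_sem, slargs->slot_count);

That way a task sleeping in wait_for_a_slot() is woken only when a slot
really is available, instead of every waiter retrying on each wakeup.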

Wait a minute...  What happens if you have a daemon disconnect while somebody
is holding a bufmap slot?  Each of those (as well as whoever's waiting for
a slot to come free) is holding a reference to bufmap.  devreq_mutex won't
do a damn thing, contrary to the comment in ->release() - it's _not_ held
across the duration of wait_for_direct_io().

We'll just decrement the refcount of bufmap, do nothing since it hasn't
reached zero, proceed to mark all ops as purged, wake each service_operation()
up and sod off.  Now, the holders of those slots will call
orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
wait_for_direct_io() noticing that, freeing the slot and going into restart.
And if that was the only one, we are fine, but what if there were several?

Suppose there had been two read()s going on at the time of disconnect.
The first one drops the slot.  Good, refcount of bufmap is down to 1 now.
And we go back to orangefs_bufmap_get().  Which calls orangefs_bufmap_ref().
Which sees that __orangefs_bufmap is still non-NULL.  And promptly regains
the reference and allocates the slot in that sucker.  Then it's back to
service_operation()...  Which will check is_daemon_in_service(), possibly
getting "yes, it is" (if the new one has already started).  Ouch.

The fun part in all of that is that the new daemon won't be able to get a new
bufmap in place until all users of the old one run down.  What a mess...

Ho-hum...  So we need to
	a) have ->release() wait for all slots to get freed
	b) have orangefs_bufmap_get() recognize that situation and
_not_ get a new reference to a bufmap that is already going down.
	c) move the "wait for new daemon to come and install a new bufmap"
to some point past the release of the slot - service_operation() is too
early for that to work.
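
Something with roughly this shape, perhaps - a sketch only, with the dying
flag, the slots_drained completion and the lock name all invented here:

	/* (a), in ->release(): mark the bufmap as going down, then wait
	   for the last slot holder to drop out */
	spin_lock(&orangefs_bufmap_lock);
	bufmap->dying = 1;
	spin_unlock(&orangefs_bufmap_lock);
	wait_for_completion(&bufmap->slots_drained);	/* fired by the final
							   put_back_slot() */

	/* (b), in orangefs_bufmap_get(): refuse to regain a reference to
	   a bufmap that is on its way out */
	spin_lock(&orangefs_bufmap_lock);
	if (!__orangefs_bufmap || __orangefs_bufmap->dying) {
		spin_unlock(&orangefs_bufmap_lock);
		return -EAGAIN;	/* (c): caller frees its slot first and
				   only then waits for the new daemon */
	}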

Or am I missing something simple that makes the scenario above not go that
way?  Confused...


