On 2019/7/26 9:47 PM, Michael S. Tsirkin wrote:
On Fri, Jul 26, 2019 at 08:53:18PM +0800, Jason Wang wrote:
On 2019/7/26 8:38 PM, Michael S. Tsirkin wrote:
On Fri, Jul 26, 2019 at 08:00:58PM +0800, Jason Wang wrote:
On 2019/7/26 7:49 PM, Michael S. Tsirkin wrote:
On Thu, Jul 25, 2019 at 10:25:25PM +0800, Jason Wang wrote:
On 2019/7/25 9:26 PM, Michael S. Tsirkin wrote:
Exactly, and that's actually the reason I use synchronize_rcu() there.
So the concern is still the possible synchronize_expedited()?
I think synchronize_srcu_expedited.
synchronize_expedited sends lots of IPIs and is bad for realtime VMs.
Can I do this through another series on top of the incoming V2?
Thanks
The question is this: is this still a gain if we switch to the
more expensive srcu? If yes then we can keep the feature on,
I think we only care about the cost of srcu_read_lock(), which looks pretty
tiny from my point of view. It is basically a READ_ONCE() + WRITE_ONCE().
Of course I can benchmark to see the difference.
if not we'll put it off until the next release and think
of better solutions. rcu->srcu is just a find-and-replace;
I don't see why we need to defer that. It can be a separate patch
for sure, but we need to know how well it works.
I think I get it here; let me try to do that in V2 and let's see the numbers.
Thanks
It looks to me that for tree RCU, srcu_read_lock() has an smp_mb(), which is
too expensive for us.
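[For reference, the tree-SRCU read-side lock at around this kernel version looks roughly like the sketch below, paraphrased from kernel/rcu/srcutree.c; exact field names vary between releases. The full barrier is the cost being called out here.]

    /* Paraphrased sketch of kernel/rcu/srcutree.c, approximate and version-dependent. */
    int __srcu_read_lock(struct srcu_struct *sp)
    {
            int idx;

            idx = READ_ONCE(sp->srcu_idx) & 0x1;
            this_cpu_inc(sp->sda->srcu_lock_count[idx]);
            smp_mb(); /* The full barrier that makes this costlier than plain rcu_read_lock(). */
            return idx;
    }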
I will try to ponder using the vq lock in some way.
Maybe with trylock somehow ...
Ok, let me retry if necessary (but I do remember I ended up with deadlocks
on my last try).
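[Purely illustrative: one shape the trylock idea could take. The helper vhost_uninit_vq_maps() and the invalidate_pending flag are hypothetical names, not anything in the series; this is a sketch of the design direction, not an implementation.]

    /* Hypothetical sketch: if the MMU notifier cannot take the vq mutex,
     * it only flags the map as stale and lets the vq handler tear it down
     * later, so the notifier never sleeps on the lock.
     */
    if (mutex_trylock(&vq->mutex)) {
            vhost_uninit_vq_maps(vq);
            mutex_unlock(&vq->mutex);
    } else {
            vq->invalidate_pending = true; /* vq handler drops the map later */
    }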
If we just worry about the IPI,
With synchronize_rcu what I would worry about is that the guest is stalled
Can this synchronize_rcu() be triggered by the guest? If yes, there are several
other MMU notifiers that can block. Is vhost something special here?
Sorry, let me explain: guests (and tasks in general)
can trigger activity that will
make synchronize_rcu take a long time.
Yes, I get this.
Thus blocking
an mmu notifier until synchronize_rcu finishes
is a bad idea.
The question is, MMU notifiers are allowed to block in
invalidate_range_start(), which could take much longer to finish than
synchronize_rcu().
Looking at amdgpu_mn_invalidate_range_start_gfx(), which calls
amdgpu_mn_invalidate_node(), which does:
    r = reservation_object_wait_timeout_rcu(bo->tbo.resv,
            true, false, MAX_SCHEDULE_TIMEOUT);
...
because the system is busy because of other guests.
With expedited it's the IPIs...
The current synchronize_rcu() can force an expedited grace period:
    void synchronize_rcu(void)
    {
            ...
            if (rcu_blocking_is_gp())
                    return;
            if (rcu_gp_is_expedited())
                    synchronize_rcu_expedited();
            else
                    wait_rcu_gp(call_rcu);
    }
    EXPORT_SYMBOL_GPL(synchronize_rcu);
An admin can force rcu to finish faster, trading
interrupts for responsiveness.
Yes, so when it is set, each synchronize_rcu() will go through
synchronize_rcu_expedited().
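[As a minimal sketch of how that global switch gets flipped: in-kernel users can call rcu_expedite_gp()/rcu_unexpedite_gp(), which is also roughly what the admin-visible /sys/kernel/rcu_expedited knob toggles; the snippet below just illustrates the effect on synchronize_rcu().]

    /* Sketch: force all subsequent synchronize_rcu() calls onto the
     * expedited (IPI-heavy) path, then restore normal grace periods.
     */
    rcu_expedite_gp();
    synchronize_rcu();      /* now behaves like synchronize_rcu_expedited() */
    rcu_unexpedite_gp();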
can we do something like in
vhost_invalidate_vq_start()?
    if (map) {
            /* In order to avoid possible IPIs with
             * synchronize_rcu_expedited() we use call_rcu() +
             * completion.
             */
            init_completion(&c.completion);
            call_rcu(&c.rcu_head, vhost_finish_vq_invalidation);
            wait_for_completion(&c.completion);
            vhost_set_map_dirty(vq, map, index);
            vhost_map_unprefetch(map);
    }
?
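[For what it's worth, the helper this snippet assumes would presumably look something like the sketch below; the struct name and the body of vhost_finish_vq_invalidation() are guesses based only on the names used above.]

    /* Hypothetical container for the snippet's local variable 'c'. */
    struct vhost_vq_invalidation {
            struct rcu_head rcu_head;
            struct completion completion;
    };

    /* RCU callback: runs once a normal grace period has elapsed and wakes
     * the waiter, so no expedited grace period (and hence no IPIs) is needed.
     */
    static void vhost_finish_vq_invalidation(struct rcu_head *head)
    {
            struct vhost_vq_invalidation *c =
                    container_of(head, struct vhost_vq_invalidation, rcu_head);

            complete(&c->completion);
    }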
Why would that be faster than synchronize_rcu?
It's no faster, but there's no IPI.
Sorry I still don't see the point.
synchronize_rcu doesn't normally do an IPI either.
Not when rcu_expedited is set. This just makes 100% sure there's no IPI.
There's one other thing that bothers me, and that is that
for large rings which are not physically contiguous
we don't implement the optimization.
For sure, that can wait, but I think eventually we should
vmap large rings.
Yes, it's worth trying. But using the direct map has its own advantage: it can
use hugepages, which vmap can't.
Thanks
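[To make that trade-off concrete, a minimal sketch of what vmapping a non-contiguous ring could look like; vhost_vmap_ring() is a hypothetical helper name. vmap() gives one virtually contiguous view of the scattered ring pages but, unlike going through the kernel direct map, cannot benefit from hugepage mappings.]

    #include <linux/vmalloc.h>

    /* Hypothetical helper: map the scattered ring pages into one virtually
     * contiguous area so the fast path can use plain pointer arithmetic.
     */
    static void *vhost_vmap_ring(struct page **pages, unsigned int npages)
    {
            return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
    }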
Sure, so we can do that for small rings.
Yes, that's possible but should be done on top.
Thanks
Absolutely. Need to fix up the bugs first.
Yes.
Thanks