Re: [PATCH] xfs: set aside allocation btree blocks from block reservation


 



On Fri, Feb 19, 2021 at 01:24:11PM +1100, Dave Chinner wrote:
> On Thu, Feb 18, 2021 at 08:25:20AM -0500, Brian Foster wrote:
> > On Thu, Feb 18, 2021 at 11:34:51AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 17, 2021 at 08:23:39AM -0500, Brian Foster wrote:
...
> 
> > > > Note that the counter uses a small percpu batch size to allow the
> > > > allocation paths to keep the primary count accurate enough that the
> > > > reservation path doesn't ever need to lock and sum the counter.
> > > > Absolute accuracy is not required here, just that the counter
> > > > reflects the majority of unavailable blocks so the reservation path
> > > > fails first.
> > > 
> > > And this makes the per-cpu counter scale almost no better than a
> > > simple atomic counter, because a spinlock requires two atomic
> > > operations (lock and unlock). Hence a batch count of 4 only reduces
> > > the atomic op count by half but introduces a lot of extra
> > > complexity. It won't make a difference to the scalability of
> > > workloads that hammer the btree block count because contention on
> > > the internal counter spinlock will occur at close to the same
> > > concurrency rate as would occur on an atomic counter.
> > > 
> > 
> > Right, but percpu_counter_read_positive() allows a fast read in the
> > xfs_mod_fdblocks() path. I didn't use an atomic because I was concerned
> > about introducing overhead in that path. If we're Ok with whatever
> > overhead an atomic read might introduce (a spin lock in the worst case
> > for some arches), then I don't mind switching over to that. I also don't
> 
> The generic definition of atomic_read() is this:
> 
> /**
>  * atomic_read - read atomic variable
>  * @v: pointer of type atomic_t
>  *
>  * Atomically reads the value of @v.
>  */
> #ifndef atomic_read
> #define atomic_read(v)  READ_ONCE((v)->counter)
> #endif
> 
> And the only arch specific implementations (x86 and arm64) both have
> the same implementation.
> 

That's the 32-bit variant, FWIW. It looks like the x86-64 and arm64
atomic64 variants are essentially the same, but the generic 64-bit
implementation is:

s64 atomic64_read(const atomic64_t *v)
{
        unsigned long flags;
        raw_spinlock_t *lock = lock_addr(v);
        s64 val;

        raw_spin_lock_irqsave(lock, flags);
        val = v->counter;
        raw_spin_unlock_irqrestore(lock, flags);
        return val;
}

Arm, powerpc, and s390 appear to have custom implementations, which I
assume are more efficient than this. 32-bit x86 has an
arch_atomic64_read() that falls through a maze of directives with at
least a couple of underlying implementations. One appears to be
atomic64_read_cx8(), which uses a lock-prefixed cmpxchg8b. It's not
clear to me where the other one goes, or whether this ever falls back
to the generic implementation...
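
For reference, the x86-64 and arm64 variants I'm referring to both boil
down to a plain load, something like this (paraphrasing from memory,
modulo the instrumentation wrappers):

static __always_inline s64 arch_atomic64_read(const atomic64_t *v)
{
        return __READ_ONCE(v->counter);
}

i.e. no locked instructions or spinlocks on the read side for either of
those arches.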

> And percpu_counter_read_positive() is:
> 
> /*
>  * It is possible for the percpu_counter_read() to return a small negative
>  * number for some counter which should never be negative.
>  *
>  */
> static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
> {
>         /* Prevent reloads of fbc->count */
>         s64 ret = READ_ONCE(fbc->count);
> 
>         if (ret >= 0)
>                 return ret;
>         return 0;
> }
> 
> IOWs, atomic_read() has lower read side overhead than the percpu
> counter. atomic reads do not require spinlocks or even locked memory
> accesses - they are only needed on the write side. Hence the only
> reason for using a pcp counter over an atomic is that the atomic is
> update rate bound....
> 
> FWIW, generic atomics don't use spin locks anymore - they use
> cmpxchg() which is generally much more efficient than a spin lock,
> even when emulated on platforms that don't have native support for
> cmpxchg(). And, really, we don't care one bit about performance or
> scalability on those niche platforms; all the high CPU count
> platforms where scalability matters have native atomic cmpxchg
> operations.
> 
> > mind defining a new spin lock and explicitly implementing the lockless
> > read in xfs_mod_fdblocks(), I just thought it was extra code for little
> > benefit over the percpu counter. Preference?
> 
> Spinlocks really hurt scalability in fast paths. atomics are much,
> much less hurty, and so the threshold for needing percpu counters is
> *much higher* than the threshold for needing lockless algorithms to
> avoid catastrophic lock contention.[*]
> 

As shown above, this is why I wanted to avoid introducing a spinlock on
the read side. ;) IOW, the functional behavior I was aiming for was:

update code:
	spin_lock();
	counter += delta;
	spin_unlock();

read code:
	READ_ONCE(counter);

I was initially going to pass a batch size of 0 because performance of
the update side is not important. The use of percpu was not for
scalability reasons at all. It was a somewhat lazy reuse of code to
avoid defining a new spinlock just for this.
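
FWIW, that behavior is pretty much what percpu_counter_add_batch() with
a batch of 0 gives us: every update takes fbc->lock and folds the delta
straight into the central fbc->count, and percpu_counter_read_positive()
is just a READ_ONCE() of fbc->count plus the clamp at zero. Roughly
(the field name below is just for illustration, not necessarily what
the patch uses):

        /*
         * Update side: a batch of 0 means the delta is always folded
         * into the shared fbc->count under fbc->lock.
         */
        percpu_counter_add_batch(&mp->m_allocbt_blks, delta, 0);

        /*
         * Read side (i.e. the xfs_mod_fdblocks() path): lockless read
         * of the shared count, clamped at zero.
         */
        set_aside = percpu_counter_read_positive(&mp->m_allocbt_blks);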

In any event, the atomic64 read side is essentially equivalent to this
on x86-64 and arm64. If we're comfortable with the remaining custom
atomic64 implementations for other common arches (ppc64, s390x, x86,
etc.) and simply don't care enough about the additional overhead on the
arches that might fall back to the generic implementation, then that is
reason enough for me to switch to an atomic...
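
For concreteness, the atomic variant I have in mind would be nothing
more than something like this (again, the field name is purely
illustrative):

        /* update side, wherever allocbt blocks are allocated or freed */
        atomic64_add(delta, &mp->m_allocbt_blks);

        /* read side in the reservation path: a plain load on x86-64/arm64 */
        set_aside = atomic64_read(&mp->m_allocbt_blks);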

Brian

> Cheers,
> 
> Dave.
> 
> [*] For example, I just added an atomic counter into
> xlog_cil_insert_items() that is incremented on every transaction
> commit (provides global ordering of items in the CIL checkpoint
> placed on per-cpu lists). With the spinlocks removed from the CIL commit
> path, we increase performance from 700k commits/s to 1.4 million
> commits/s on a 32p machine and only increase CPU time in
> xlog_cil_insert_items() from 1.8% to 3.3%.
> 
> IOWs, the atomic counter has almost zero overhead compared to a set
> of critical sections protected by multiple spinlocks that were
> consuming crazy amounts of CPU time.
> 
> Before, at 700,000 commits/sec:
> 
>  -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
>     - 46.35% xfs_log_commit_cil
>        - 41.54% _raw_spin_lock
>           - 67.30% do_raw_spin_lock
>                66.96% __pv_queued_spin_lock_slowpath
> 
> 
> After, at 1.4 million commits/sec:
> 
> -   21.78%     3.32%  [kernel]            [k] xlog_cil_commit
>    - 17.32% xlog_cil_commit
>       - 9.04% xfs_log_ticket_ungrant
>            2.22% xfs_log_space_wake
>         2.13% memcpy_erms
>       - 1.42% xfs_buf_item_committing
>          - 1.44% xfs_buf_item_release
>             - 0.63% xfs_buf_unlock
>                  0.63% up
>               0.55% xfs_buf_rele
>         1.06% xfs_inode_item_format
>         1.00% down_read
>         0.77% up_read
>         0.60% xfs_buf_item_format
> 
> You can see that the atomic cmpxchg loops that update the grant heads in
> xfs_log_ticket_ungrant() are now looking like the most heavily contended
> cachelines in the fast path, and the CIL ctx rwsem is showing up on
> the profile, too. But there's no spin lock contention, and the
> atomics I added to the xlog_cil_commit() fast path aren't having any
> adverse impact on CPU usage at this point.
> 
> FWIW, the grant head accounting is definitely getting hot here,
> as the xfs_trans_reserve side is showing:
> 
>      7.35%     7.29%  [kernel]            [k] xlog_grant_add_space 
> 
> Which indicates at least 15% of the CPU time on this workload is now
> being spent in the grant head accounting loops. Those atomic
> counters are where I'd be considering per-cpu counters now as they
> are showing signs of contention at ~2.8 million updates/s each.
> 
> However, the reason I didn't use per-cpu counters way back when I
> made this stuff lockless was that we need an accurate sum of the
> grant head space frequently. Hence the summing cost of a per-cpu
> counter is just too great for a hot limit accounting path like this,
> so I used cmpxchg loops. PCP counters are still too expensive to sum
> accurately regularly, so if we want to support a higher transaction
> throughput rate we're going to have to get creative here.
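
(As an aside for anyone following along: the cmpxchg loops being
described are the standard compare-and-swap retry pattern, roughly like
the sketch below, modulo the log-specific packing of the grant head
cycle/space values:

        do {
                old = atomic64_read(&head);
                new = old + delta;
        } while (atomic64_cmpxchg(&head, old, new) != old);

i.e. lockless, but every CPU still updates the same cacheline, which is
why it starts showing contention at high update rates.)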
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 



