Re: [PATCH 1/3] lockdep: Apply crossrelease to PG_locked locks

Michal Hocko <mhocko@xxxxxxxxxx> · Fri, 24 Nov 2017 09:11:49 +0100

On Fri 24-11-17 12:02:36, Byungchul Park wrote:
> On Thu, Nov 16, 2017 at 02:07:46PM +0100, Michal Hocko wrote:
> > On Thu 16-11-17 21:48:05, Byungchul Park wrote:
> > > On 11/16/2017 9:02 PM, Michal Hocko wrote:
> > > > for each struct page. So you are doubling the size. Who is going to
> > > > enable this config option? You are moving this to page_ext in a later
> > > > patch which is a good step but it doesn't go far enough because this
> > > > still consumes those resources. Is there any problem with making
> > > > this controllable from the kernel command line? Something we do for
> > > > page_owner, for example?
> > > 
> > > Sure. I will add it.
> > > 
> > > > Also it would be really great if you could give us some measurements of
> > > > the runtime overhead. I do not expect it to be very large but this is
> > > 
> > > The major overhead would come from the amount of additional memory
> > > consumption for 'lockdep_map's.
> > 
> > yes
> > 
> > > Do you want me to measure the overhead by the additional memory
> > > consumption?
> > > 
> > > Or do you expect another overhead?
> > 
> > I would also be interested in how much impact this has on performance. I
> > do not expect it to be too large, but it would be good to have some
> > numbers for a cache cold parallel kbuild or other heavy page lock workloads.
> 
> Hello Michal,
> 
> I measured 'cache cold parallel kbuild' on my qemu machine. The results
> vary a lot, so I cannot be certain, but I think there's no meaningful
> difference between before and after applying crossrelease to page locks.
> 
> Actually, I expect little overhead in lock_page() and unlock_page() even
> after applying crossrelease to page locks. I only expect a bit of overhead
> from the additional memory consumed by the per-page 'lockdep_map's.
> 
> I ran the following commands within "QEMU x86_64 4GB memory 4 cpus":
> 
>    make clean
>    echo 3 > drop_caches
>    time make -j4

Maybe FS people will help you find a more representative workload. E.g.
a linear cache cold file read should be good as well. Maybe there are some
tests in fstests (or whatever xfstests is called these days).
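
A linear cache cold file read of the kind suggested above could be timed
roughly as follows. This is only a sketch: the helper names and the file size
are made up for illustration, and it uses posix_fadvise(POSIX_FADV_DONTNEED)
to ask the kernel to drop the file's cached pages, which is advisory and not
equivalent to the system-wide drop_caches used in the kbuild run.

```python
import os
import tempfile
import time


def make_file(size):
    """Create a temporary file of `size` bytes and flush it to disk."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())  # pages must be clean before they can be dropped
    return path


def cold_linear_read(path, chunk=1 << 20):
    """Read `path` front to back; return (bytes_read, elapsed_seconds)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Assumption: advisory hint only; the kernel may keep some pages cached.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        start = time.perf_counter()
        total = 0
        while True:
            buf = os.read(fd, chunk)
            if not buf:
                break
            total += len(buf)
        return total, time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = make_file(64 * 1024 * 1024)  # hypothetical 64 MiB test file
    try:
        nbytes, secs = cold_linear_read(path)
        print(f"read {nbytes} bytes in {secs:.3f}s")
    finally:
        os.remove(path)
```

Each page read this way goes through the page lock while it is brought in,
so running it with and without the tracking enabled would exercise the path
in question more directly than a kernel build does.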

> The results are:
> 
>    # w/o page lock tracking
> 
>    At the 1st try,
>    real     5m28.105s
>    user     17m52.716s
>    sys      3m8.871s
> 
>    At the 2nd try,
>    real     5m27.023s
>    user     17m50.134s
>    sys      3m9.289s
> 
>    At the 3rd try,
>    real     5m22.837s
>    user     17m34.514s
>    sys      3m8.097s
> 
>    # w/ page lock tracking
> 
>    At the 1st try,
>    real     5m18.158s
>    user     17m18.200s
>    sys      3m8.639s
> 
>    At the 2nd try,
>    real     5m19.329s
>    user     17m19.982s
>    sys      3m8.345s
> 
>    At the 3rd try,
>    real     5m19.626s
>    user     17m21.363s
>    sys      3m9.869s
> 
> I think there's no meaningful difference on my small machine.

Yeah, this doesn't seem to indicate anything. Maybe moving the build to
shmem to rule out IO cost would tell more. But as I've said previously,
I do not really expect this would be very visible. It was more a
matter of my curiosity than an acceptance requirement. I think it is
much more important to make this runtime configurable because almost
nobody is going to enable the feature if it is only a build-time option.
The cost is just too high.
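
For reference, the six quoted runs can be reduced to means with a few lines
of Python. The timings are copied from the results above; the variable names
are only illustrative, and three runs per configuration are too few to say
anything statistically.

```python
import re
from statistics import mean


def to_seconds(t):
    """Convert a `time` style value such as '5m28.105s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    return int(m.group(1)) * 60 + float(m.group(2))


# Timings as reported in the thread (real and sys, three runs each).
real_without = ["5m28.105s", "5m27.023s", "5m22.837s"]
real_with = ["5m18.158s", "5m19.329s", "5m19.626s"]
sys_without = ["3m8.871s", "3m9.289s", "3m8.097s"]
sys_with = ["3m8.639s", "3m8.345s", "3m9.869s"]


def compare(without, with_tracking):
    """Return (mean without, mean with, relative change in percent)."""
    a = mean(map(to_seconds, without))
    b = mean(map(to_seconds, with_tracking))
    return a, b, (b - a) / a * 100


real_stats = compare(real_without, real_with)
sys_stats = compare(sys_without, sys_with)

for name, (a, b, pct) in (("real", real_stats), ("sys", sys_stats)):
    print(f"{name}: without={a:.3f}s with={b:.3f}s change={pct:+.2f}%")
```

The mean real time comes out about 2% lower with tracking enabled, and the
mean sys time differs by roughly 0.1%, so these numbers do not show a cost in
either direction.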

-- 
Michal Hocko
SUSE Labs