Re: [PATCH] zram: remove global tb_lock by using lock-free CAS

On Wed, May 7, 2014 at 4:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> On Wed, May 07, 2014 at 03:51:35PM +0800, Weijie Yang wrote:
>> On Tue, May 6, 2014 at 6:22 AM, Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>> > On Mon, 2014-05-05 at 13:46 -0700, Andrew Morton wrote:
>> >> On Mon, 05 May 2014 11:00:44 -0700 Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>> >>
>> >> > > > @@ -339,12 +338,14 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
>> >> > > >         unsigned long handle;
>> >> > > >         u16 size;
>> >> > > >
>> >> > > > -       read_lock(&meta->tb_lock);
>> >> > > > +       while(atomic_cmpxchg(&meta->table[index].state, IDLE, ACCESS) != IDLE)
>> >> > > > +               cpu_relax();
>> >> > > > +
>> >> > >
>> >> > > So... this might be dumb question, but this looks like a spinlock
>> >> > > implementation.
>> >> > >
>> >> > > What advantage does this have over a standard spinlock?
>> >> >
>> >> > I was wondering the same thing. Furthermore by doing this you'll lose
>> >> > the benefits of sharing the lock... your numbers do indicate that it is
>> >> > for the better. Also, note that hopefully rwlock_t will soon be updated
>> >> > to be fair and perform up to par with spinlocks, something which is long
>> >> > overdue. So you could reduce the critical region by implementing the
>> >> > same granularity, just don't implement your own locking schemes, like
>> >> > this.
>>
>> Actually, the main reason I use a CAS rather than a standard lock here is
>> that I want to minimize the meta table memory overhead. A minor reason is
>> my vague recollection that CAS is more efficient than a spinlock (please
>> correct me if I am wrong).
>>
>> Anyway, I changed the CAS to a spinlock and an rwlock, and re-tested them:
>>
>>       Test       lock-free       spinlock     rwlock
>> ------------------------------------------------------
>>  Initial write   1424141.62   1426372.84   1423019.21
>>        Rewrite   1652504.81   1623307.14   1653682.04
>>           Read  11404668.35  11242885.05  10938125.00
>>        Re-read  11555483.75   11253906.6  10837773.50
>>   Reverse Read   8394478.17   8277250.34   7768057.39
>>    Stride read   9372229.95   9010498.53   8692871.77
>>    Random read   9187221.90   8988080.55   8661184.60
>> Mixed workload   5843370.85   5414729.54   5451055.03
>>   Random write   1608947.04   1572276.64   1588866.51
>>         Pwrite   1311055.32   1302463.04   1302001.06
>>          Pread   4652056.11   4555802.18   4469672.34
>
> I'd like to clear this up.
> Are the spinlock and rwlock you mention per-meta-entry locks like the state
> field you added, or a global lock for the meta? If the latter, does rwlock
> mean the base?

The spinlock and rwlock are per-meta-entry locks, like the state field.
Because the base data was included in the first mail, I don't repeat
it here.

>>
>> And I can't say which one is the best; they have similar performance.
>
> The most popular use of zram is in-memory swap for small embedded systems,
> so I don't want to increase the memory footprint without a good reason, even
> if it helps a synthetic benchmark. Although it's 1M for 1G, that isn't small
> if we consider the compression ratio and the real free memory after boot.
> (But the data I'm interested in is the mixed-workload improvement. It would
> matter for latency under heavy memory pressure, so it attracts me a lot.
> Anyway, I need numbers to back up the justification with a real swap use
> case rather than zram-blk. Recently I have been considering per-process
> reclaim based on zram, so maybe I will run a test for that.)
> But recently I received private mail from some server folks who use
> zram-blk, not zram-swap, so for that case such an enhancement would be
> desirable. My point is that I'm not asking for the patch to be dropped;
> let's find a proper solution that meets both needs and gather more data.
>
>>
>> Wait, iozone creates temporary files for every test thread, so there is no
>> possibility that these threads access the same table[index] concurrently.
>> So I used fio to test the raw zram block device.
>> To increase the chance of conflicting accesses to the same table[index], I
>> set up zram with a small disksize (10M) and let the threads run with a
>> large loop count.
>>
>> On the same test machine, the fio test command is:
>> fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
>> --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
>> --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
>> --name=seq-read --rw=read --stonewall --name=seq-readwrite
>> --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
>>
>>     Test      base    lock-free   spinlock   rwlock
>> ------------------------------------------------------
>> seq-write   935109.2   999580.5   998134.8   994384.6
>>  seq-read  5598064.6  6444011.5  6243184.6  6197514.2
>>    seq-rw  1403963.0  1635673.0  1633823.0  1635972.2
>>   rand-rw  1389864.4  1612520.4  1613403.6  1612129.8
>
> What's the difference between base and rwlock?
> Does base mean the global rwlock, while rwlock means the per-meta-entry
> rwlock?
>

Base means the global rwlock; it is the 3.15.0-rc3 code.
rwlock means the per-meta-entry rwlock.

Sorry for the confusion.

>>
>> These results (KB/s, average of 5 runs) show the performance improvement
>> over the base version; however, I can't say which method is the best.
>>
>> >>
>> >> It sounds like seqlocks will match this access pattern pretty well?
>> >
>> > Indeed. And after a closer look, except for zram_slot_free_notify(),
>> > that lock is always shared. So, unless fine graining it implies taking
>> > the lock exclusively like in this patch (if so, that needs to be
>> > explicitly documented in the changelog), we would ideally continue to
>> > share it. That _should_ provide nicer performance numbers when using the
>> > correct lock.
>> >
>>
>> Andrew mentioned seqlocks; however, after rechecking the code, I think it
>> is hard to use a seqlock here, whether as a global meta lock or as a
>> table[index] lock. The main reason is that the writer frees the handle
>> rather than just changing some value, so a reader forced to retry may
>> already have dereferenced freed memory.
>> If I misunderstand you, please let me know.
>>
>> Now I am in a dilemma. To minimize the memory overhead I would like to use
>> CAS; however, it is not a standard approach.
>>
>> Any complaints or suggestions are welcome.
>>
>> Regards,
>>
>> >
>> >
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@xxxxxxxxx
>
> --
> Kind regards,
> Minchan Kim
