On 11/27/2013 03:31 AM, Marcelo Tosatti wrote:
> On Tue, Nov 26, 2013 at 11:21:37AM +0800, Xiao Guangrong wrote:
>> On 11/26/2013 02:12 AM, Marcelo Tosatti wrote:
>>> On Mon, Nov 25, 2013 at 02:29:03PM +0800, Xiao Guangrong wrote:
>>>>>> Also, there is no guarantee of termination (as long as sptes are
>>>>>> deleted with the correct timing). BTW, can't see any guarantee of
>>>>>> termination for rculist nulls either (a writer can race with a lockless
>>>>>> reader indefinitely, restarting the lockless walk every time).
>>>>>
>>>>> Hmm, that can be avoided by checking dirty-bitmap before rewalk,
>>>>> that means, if the dirty-bitmap has been set during lockless write-protection,
>>>>> it's unnecessary to write-protect its sptes. Your idea?
>>>> This idea is based on the fact that the number of rmap is limited by
>>>> RMAP_RECYCLE_THRESHOLD. So, in the case of adding new spte into rmap,
>>>> we can break the rewalk at once, in the case of deleting, we can only
>>>> rewalk RMAP_RECYCLE_THRESHOLD times.
>>>
>>> Please explain in more detail.
>>
>> Okay.
>>
>> My proposal is like this:
>>
>> pte_list_walk_lockless()
>> {
>> restart:
>>
>> +	if (__test_bit(slot->arch.dirty_bitmap, gfn-index))
>> +		return;
>>
>> 	code-doing-lockless-walking;
>> 	......
>> }
>>
>> Before do lockless-walking, we check the dirty-bitmap first, if
>> it is set we can simply skip write-protection for the gfn, that
>> is the case that new spte is being added into rmap when we lockless
>> access the rmap.
>
> The dirty bit could be set after the check.
>
>> For the case of deleting spte from rmap, the number of entry is limited
>> by RMAP_RECYCLE_THRESHOLD, that is not endlessly.
>
> It can shrink and grow while lockless walk is performed.

Yes, indeed.

Hmm, another idea to fix this is to encode the position into the
reserved bits of the desc->more pointer, for example:

          +------+    +------+    +------+
rmapp ->  |Desc 0| -> |Desc 1| -> |Desc 2|
          +------+    +------+    +------+

There are 3 descs on the rmap, and:

rmapp = &desc0 | 1UL | 3UL << 50;
desc0->more = desc1 | 2UL << 50;
desc1->more = desc2 | 1UL << 50;
desc2->more = &rmapp | 1UL; (The nulls pointer)

We will walk to the next desc only if the "position" of the current desc
is >= the position of the next desc. That makes sure we can reach the
last desc anyway.

And in order to avoid doing too many rewalks, we will go to the slow
path (walk while holding the lock) once the walk has been retried more
than N times.

Thanks to all of you on Thanksgiving Day. :)
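
To make the position idea concrete, here is a rough user-space sketch of it, not the actual KVM code. Everything named here (struct desc, encode_link(), walk_lockless(), POS_SHIFT, the WALK_* results) is made up for illustration. It assumes pointers fit below bit 50 and are at least 2-byte aligned, and it uses bit 0 of a link only as the nulls marker, which is simpler than the real rmap encoding. A real lockless walker would also need proper atomic/RCU loads; this version is single-threaded and only shows the position check and the retry limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POS_SHIFT 50                    /* assumed: bits 50+ are free to carry the position */
#define NULLS_BIT UINT64_C(1)           /* bit 0 marks the nulls (end-of-list) link */
#define ADDR_MASK (((UINT64_C(1) << POS_SHIFT) - 1) & ~NULLS_BIT)

/* Hypothetical stand-in for a pte_list_desc; only the link matters here. */
struct desc {
	uint64_t more;                  /* encoded link to the next desc */
};

enum walk_result { WALK_DONE, WALK_NEED_LOCK };

/* Pack a pointer, the position of the desc it points to, and the nulls flag. */
static uint64_t encode_link(const void *next, unsigned int pos, bool nulls)
{
	uint64_t addr = (uint64_t)(uintptr_t)next;

	assert((addr & ~ADDR_MASK) == 0);   /* pointer must leave the tag bits clear */
	return addr | ((uint64_t)pos << POS_SHIFT) | (nulls ? NULLS_BIT : 0);
}

static unsigned int link_pos(uint64_t link)
{
	return (unsigned int)(link >> POS_SHIFT);
}

static bool link_is_nulls(uint64_t link)
{
	return (link & NULLS_BIT) != 0;
}

static void *link_ptr(uint64_t link)
{
	return (void *)(uintptr_t)(link & ADDR_MASK);
}

/*
 * Walk the chain starting at *head. Step to the next desc only if its
 * position is not larger than the current one; otherwise restart from
 * the head. After max_retries restarts, give up and ask the caller to
 * take the slow path (walk with the lock held).
 */
static enum walk_result walk_lockless(const uint64_t *head, int max_retries,
				      int *visited)
{
	for (int retry = 0; retry <= max_retries; retry++) {
		uint64_t link = *head;
		unsigned int cur_pos = link_pos(link);
		bool ok = true;
		int count = 0;

		while (!link_is_nulls(link)) {
			struct desc *d = link_ptr(link);
			uint64_t next = d->more;

			count++;
			if (link_pos(next) > cur_pos) {
				ok = false;     /* desc was moved: rewalk */
				break;
			}
			cur_pos = link_pos(next);
			link = next;
		}
		if (ok) {
			*visited = count;
			return WALK_DONE;
		}
	}
	return WALK_NEED_LOCK;
}
```

With three descs linked at positions 3, 2 and 1 and a nulls link at the end, walk_lockless() visits all three. If desc1->more is changed to point back at a desc with position 3, every attempt fails the position check and the function returns WALK_NEED_LOCK after the retry budget is used up.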