On 08/28/2013 04:58 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>>
>>>> +
>>>> +        rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>>>> +        desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +
>>>> +        /* No empty position in the desc. */
>>>> +        if (desc->sptes[PTE_LIST_EXT - 1]) {
>>>> +                struct pte_list_desc *new_desc;
>>>> +                new_desc = mmu_alloc_pte_list_desc(vcpu);
>>>> +                new_desc->more = desc;
>>>> +                desc = new_desc;
>>>> +                *pte_list = (unsigned long)desc | 1;
>>>>          }
>>>> -        return count;
>>>> +
>>>> +        free_pos = find_first_free(desc);
>>>> +        desc->sptes[free_pos] = spte;
>>>> +        return count_spte_number(desc);
>>> Should it be count_spte_number(desc) - 1? The function should return
>>> the number of pte entries before the spte was added.
>>
>> Yes. We have handled it in count_spte_number(), we count the number like this:
>>
>>         return first_free + desc_num * PTE_LIST_EXT;
>>
>> The first_free is indexed from 0.
>>
> Suppose when pte_list_add() is called there is one full desc, so the
> number that should be returned is PTE_LIST_EXT, correct? But since
> before calling count_spte_number() one more desc will be added and
> desc->sptes[0] will be set in it the first_free in count_spte_number
> will be 1 and PTE_LIST_EXT + 1 will be returned.

Oh, yes, you are right. Will fix it in the next version, thanks for
pointing it out.

>
>> Maybe it is clearer to let count_spte_number() return the real number.
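To make the off-by-one concrete, here is a minimal userspace sketch of the
scenario Gleb describes. It is not the kernel code: PTE_LIST_EXT is set to 4
only for illustration, the desc chain is NULL-terminated, add_and_count() and
full_desc_scenario() are hypothetical helpers, and count_spte_number() is
reconstructed from the one-line formula quoted above under the assumption that
desc_num is the number of descs chained behind the first one.

```c
#include <assert.h>
#include <stddef.h>

#define PTE_LIST_EXT 4  /* illustrative size for this sketch only */

struct pte_list_desc {
        unsigned long *sptes[PTE_LIST_EXT];
        struct pte_list_desc *more;
};

/* Index of the first empty slot, or PTE_LIST_EXT if the desc is full. */
static int find_first_free(const struct pte_list_desc *desc)
{
        int i;

        for (i = 0; i < PTE_LIST_EXT; i++)
                if (!desc->sptes[i])
                        break;
        return i;
}

/*
 * Assumed reading of "first_free + desc_num * PTE_LIST_EXT": desc_num
 * counts the (full) descs chained behind the first one.
 */
static int count_spte_number(const struct pte_list_desc *desc)
{
        const struct pte_list_desc *d;
        int desc_num = 0;

        for (d = desc->more; d; d = d->more)
                desc_num++;
        return find_first_free(desc) + desc_num * PTE_LIST_EXT;
}

/*
 * Mimics the tail of the patched pte_list_add(). 'spare' stands in for
 * mmu_alloc_pte_list_desc(). The count is taken after the spte is stored,
 * exactly as in the quoted hunk.
 */
static int add_and_count(struct pte_list_desc **head,
                         struct pte_list_desc *spare, unsigned long *spte)
{
        struct pte_list_desc *desc = *head;

        assert(desc && spare && spte);

        /* No empty position in the desc: chain a new one in front. */
        if (desc->sptes[PTE_LIST_EXT - 1]) {
                spare->more = desc;
                desc = spare;
                *head = desc;
        }
        desc->sptes[find_first_free(desc)] = spte;
        return count_spte_number(desc);
}

/* Gleb's scenario: exactly one full desc exists before the add. */
static int full_desc_scenario(void)
{
        static unsigned long dummy[PTE_LIST_EXT + 1];
        struct pte_list_desc full = { { 0 }, NULL };
        struct pte_list_desc spare = { { 0 }, NULL };
        struct pte_list_desc *head = &full;
        int i;

        for (i = 0; i < PTE_LIST_EXT; i++)
                full.sptes[i] = &dummy[i];
        return add_and_count(&head, &spare, &dummy[PTE_LIST_EXT]);
}
```

Under these assumptions full_desc_scenario() yields PTE_LIST_EXT + 1, while the
caller expects the number of entries before the add, which is PTE_LIST_EXT.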
>>
>>>
>>>>  }
>>>>
>>>>  static void
>>>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
>>>> -                           int i, struct pte_list_desc *prev_desc)
>>>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>>>> +                           struct pte_list_desc *desc, int i)
>>>>  {
>>>> -        int j;
>>>> +        struct pte_list_desc *first_desc;
>>>> +        int last_used;
>>>> +
>>>> +        first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>>>> +        last_used = find_last_used(first_desc);
>>>>
>>>> -        for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>>>> -                ;
>>>> -        desc->sptes[i] = desc->sptes[j];
>>>> -        desc->sptes[j] = NULL;
>>>> -        if (j != 0)
>>>> +        /*
>>>> +         * Move the entry from the first desc to this position we want
>>>> +         * to remove.
>>>> +         */
>>>> +        desc->sptes[i] = first_desc->sptes[last_used];
>>>> +        first_desc->sptes[last_used] = NULL;
>>>> +
>>> What if desc == first_desc and i < last_used? You still move spte
>>> backwards so lockless walk may have already examined entry at i and
>>> will miss spte that was moved there from last_used position, no?
>>
>> Right. I noticed it too and fixed it in the v2 which is being tested.
>> I fixed it by walking the desc bottom-up, like this:
>>
>> pte_list_walk_lockless():
>>
>>         desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>>         while (!desc_is_a_nulls(desc)) {
>>                 /*
>>                  * We should do bottom-up walk since we always use the
>>                  * bottom entry to replace the deleted entry if only
>>                  * one desc is used in the rmap when a spte is removed.
>>                  * Otherwise the moved entry will be missed.
>>                  */
> I would call it top-down walk since we are walking from big indices to
> smaller ones.

Okay, will fix the comments.

>
>>                 for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>>                         fn(desc->sptes[i]);
>>
>>                 desc = ACCESS_ONCE(desc->more);
>>
>>                 /* It is being initialized. */
>>                 if (unlikely(!desc))
>>                         goto restart;
>>         }
>>
>> How about this?
>>
> Tricky, very very tricky :)
>
>>>
>>>> +        /* No valid entry in this desc, we can free this desc now. */
>>>> +        if (!first_desc->sptes[0]) {
>>>> +                struct pte_list_desc *next_desc = first_desc->more;
>>>> +
>>>> +                /*
>>>> +                 * Only one entry existing but still use a desc to store it?
>>>> +                 */
>>>> +                WARN_ON(!next_desc);
>>>> +
>>>> +                mmu_free_pte_list_desc(first_desc);
>>>> +                first_desc = next_desc;
>>>> +                *pte_list = (unsigned long)first_desc | 1ul;
>>>>                  return;
>>>> -        if (!prev_desc && !desc->more)
>>>> -                *pte_list = (unsigned long)desc->sptes[0];
>>>> -        else
>>>> -                if (prev_desc)
>>>> -                        prev_desc->more = desc->more;
>>>> -                else
>>>> -                        *pte_list = (unsigned long)desc->more | 1;
>>>> -        mmu_free_pte_list_desc(desc);
>>>> +        }
>>>> +
>>>> +        WARN_ON(!first_desc->sptes[0]);
>>>> +
>>>> +        /*
>>>> +         * Only one entry in this desc, move the entry to the head
>>>> +         * then the desc can be freed.
>>>> +         */
>>>> +        if (!first_desc->sptes[1] && !first_desc->more) {
>>>> +                *pte_list = (unsigned long)first_desc->sptes[0];
>>>> +                mmu_free_pte_list_desc(first_desc);
>>>> +        }
>>>>  }
>>>>
>>>>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>>>>  {
>>>>          struct pte_list_desc *desc;
>>>> -        struct pte_list_desc *prev_desc;
>>>>          int i;
>>>>
>>>>          if (!*pte_list) {
>>>> -                printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>>> -                BUG();
>>>> -        } else if (!(*pte_list & 1)) {
>>>> +                WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>>> Why change BUG() to WARN() here and below?
>>
>> WARN(1, "xxx") can replace two lines in the original code. And personally,
>> I prefer WARN() to BUG() since sometimes BUG() can stop my box and I need to
>> get the full log by using kdump.
>>
>> If you object to it, I will change it back in the next version. :)
>>
> For debugging WARN() is doubtlessly better, but outside of development
> you do not want to allow kernel to run after serious MMU corruption is
> detected. It may be exploitable further, we do not know, so the safe
> choice is to stop the kernel.

Okay, will keep BUG() in the next version.
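On the walk direction discussed earlier in this mail, the following userspace
sketch shows why walking from big indices to small ones cannot miss the moved
entry in the single-desc case. It is only a sequential simulation: it does not
model real concurrency, memory ordering or ACCESS_ONCE(). PTE_LIST_EXT is set
to 4 for illustration, and remove_entry(), walk_sees_survivor() and
never_misses() are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTE_LIST_EXT 4  /* illustrative size for this sketch only */

/* Remove slot i by moving the last used entry into it, as the patch does. */
static void remove_entry(int *sptes[], int i)
{
        int last = PTE_LIST_EXT - 1;

        assert(i >= 0 && i < PTE_LIST_EXT);
        while (last > 0 && !sptes[last])
                last--;
        sptes[i] = sptes[last];
        sptes[last] = NULL;
}

/*
 * Walk one full desc in the given direction. The removal of slot 0 is
 * injected after 'steps_before_removal' slots have been visited. Returns
 * true if the walker saw the entry that started in the last slot, which
 * is the entry the removal moves.
 */
static bool walk_sees_survivor(bool top_down, int steps_before_removal)
{
        int vals[PTE_LIST_EXT] = { 0 };
        int *sptes[PTE_LIST_EXT];
        int *survivor;
        bool seen = false;
        int step, i;

        for (i = 0; i < PTE_LIST_EXT; i++)
                sptes[i] = &vals[i];
        survivor = sptes[PTE_LIST_EXT - 1];

        for (step = 0; step < PTE_LIST_EXT; step++) {
                if (step == steps_before_removal)
                        remove_entry(sptes, 0);
                i = top_down ? PTE_LIST_EXT - 1 - step : step;
                if (sptes[i] == survivor)
                        seen = true;
        }
        return seen;
}

/* True if no injection point makes the walker miss the moved entry. */
static bool never_misses(bool top_down)
{
        int s;

        for (s = 0; s <= PTE_LIST_EXT; s++)
                if (!walk_sees_survivor(top_down, s))
                        return false;
        return true;
}
```

In this model never_misses(true) holds for every injection point, while the
small-to-big walk misses the moved entry as soon as the removal happens after
slot 0 has been visited. The walker may see an entry twice, but never zero
times.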