On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote: > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote: >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote: >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney >> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote: >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: >> >> >> Function is going to be used in transport over RDMA module >> >> >> in subsequent patches. >> >> >> >> >> >> Function returns next element in round-robin fashion, >> >> >> i.e. head will be skipped. NULL will be returned if list >> >> >> is observed as empty. >> >> >> >> >> >> Signed-off-by: Roman Pen <roman.penyaev@xxxxxxxxxxxxxxxx> >> >> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> >> >> >> Cc: linux-kernel@xxxxxxxxxxxxxxx >> >> >> --- >> >> >> include/linux/rculist.h | 19 +++++++++++++++++++ >> >> >> 1 file changed, 19 insertions(+) >> >> >> >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h >> >> >> index 127f534fec94..b0840d5ab25a 100644 >> >> >> --- a/include/linux/rculist.h >> >> >> +++ b/include/linux/rculist.h >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, >> >> >> }) >> >> >> >> >> >> /** >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. >> >> >> + * @head: the head for the list. >> >> >> + * @ptr: the list head to take the next element from. >> >> >> + * @type: the type of the struct this is embedded in. >> >> >> + * @memb: the name of the list_head within the struct. >> >> >> + * >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, >> >> >> + * but if list is observed as empty, NULL will be returned. >> >> >> + * >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). >> >> > >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that >> >> > are round-robining a given list must all be under the same RCU read-side >> >> > critical section. For example, the following will break badly: >> >> > >> >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) >> >> > { >> >> > struct foo *ret; >> >> > >> >> > rcu_read_lock(); >> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); >> >> > rcu_read_unlock(); /* BUG */ >> >> > return ret; >> >> > } >> >> > >> >> > You need a big fat comment stating this, at the very least. The resulting >> >> > bug can be very hard to trigger and even harder to debug. >> >> > >> >> > And yes, I know that the same restriction applies to list_next_rcu() >> >> > and friends. The difference is that if you try to invoke those in an >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit >> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings >> >> > might tempt people to do something broken like take_rr_step() above. >> >> >> >> Hi Paul, >> >> >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA >> >> connection from the list in order to send a request, i.e. my code is >> >> something like the following: >> >> >> >> static struct conn *get_and_set_next_conn(void) >> >> { >> >> struct conn *conn; >> >> >> >> conn = rcu_dereferece(rcu_conn); >> >> if (unlikely(!conn)) >> >> return conn; >> > >> > Wait. Don't you need to restart from the beginning of the list in >> > this case? Or does the list never have anything added to it and is >> > rcu_conn initially the first element in the list? >> >> Hi Paul, >> >> No, I continue from the pointer, which I assigned on the previous IO >> in order to send IO fairly and keep load balanced. >> >> Initially @rcu_conn points to the first element, but elements can >> be deleted from the list and list can become empty. >> >> The deletion code is below. >> >> > >> >> conn = list_next_or_null_rr_rcu(&conn_list, >> >> &conn->entry, >> >> typeof(*conn), >> >> entry); >> >> rcu_assign_pointer(rcu_conn, conn); >> > >> > Linus is correct to doubt this code. You assign a pointer to the current >> > element to rcu_conn, which is presumably a per-CPU or global variable. >> > So far, so good ... >> >> I use per-CPU, in the first example I did not show that not to overcomplicate >> the code. >> >> > >> >> return conn; >> >> } >> >> >> >> rcu_read_lock(); >> >> conn = get_and_set_next_conn(); >> >> if (unlikely(!conn)) { >> >> /* ... */ >> >> } >> >> err = rdma_io(conn, request); >> >> rcu_read_unlock(); >> > >> > ... except that some other CPU might well remove the entry referenced by >> > rcu_conn at this point. It would have to wait for a grace period (e.g., >> > synchronize_rcu()), but the current CPU has exited its RCU read-side >> > critical section, and therefore is not blocking the grace period. >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it >> > might well be referencing the freelist, or, even worse, some other type >> > of structure. >> > >> > What is your code doing to prevent this from happening? (There are ways, >> > but I want to know what you were doing in this case.) >> >> Probably I should have shown the way of removal at the very beginning, >> my fault. So deletion looks as the following (a bit changed and >> simplified for the sake of clearness): > > Thank you! Let's see... > >> static void remove_connection(conn) >> { >> bool need_to_wait = false; >> int cpu; >> >> /* Do not let RCU list add/delete happen in parallel */ >> mutex_lock(&conn_lock); >> >> list_del_rcu(&conn->entry); >> >> /* Make sure everybody observes element removal */ >> synchronize_rcu(); > > At this point, any reader who saw the element in the list is done, as you > comment in fact says. But there might be a pointer to that element in the > per-CPU variables, however, from this point forward it cannot be the case > that one of the per-CPU variables gets set to the newly deleted element. > Which is your next block of code... > >> /* >> * At this point nobody sees @conn in the list, but >> still we have >> * dangling pointer @rcu_conn which _can_ point to @conn. Since >> * nobody can observe @conn in the list, we guarantee >> that IO path >> * will not assign @conn to @rcu_conn, i.e. @rcu_conn >> can be equal >> * to @conn, but can never again become @conn. >> */ >> >> /* >> * Get @next connection from current @conn which is going to be >> * removed. >> */ >> next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, >> typeof(*next), entry); >> >> /* >> * Here @rcu_conn can be changed by reader side, so use @cmpxchg >> * in order to keep fairness in load-balancing and do not touch >> * the pointer which can be already changed by the IO path. >> * >> * Current path can be faster than IO path and the >> following race >> * exists: >> * >> * CPU0 CPU1 >> * ---- ---- >> * conn = rcu_dereferece(rcu_conn); >> * next = list_next_or_null_rr_rcu(conn) >> * >> * conn == >> cmpxchg(rcu_conn, conn, next); >> * synchronize_rcu(); >> * >> * rcu_assign_pointer(rcu_conn, next); >> * ^^^^^^^^^^^^^^^^^^ >> * >> * Here @rcu_conn is already equal to @next (done by >> @cmpxchg), >> * so assignment to the same pointer is harmless. >> * >> */ >> for_each_possible_cpu(cpu) { >> struct conn **rcu_conn; >> >> rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); >> if (*rcu_conn != conn) >> /* >> * This @cpu will never again pick up @conn, >> * so it is safe just to choose next CPU. >> */ >> continue; > > ... Someone else might have picked up rcu_conn at this point... > >> if (conn == cmpxchg(rcu_conn, conn, next)) >> /* >> * @rcu_conn was successfully replaced >> with @next, >> * that means that someone can also hold a @conn >> * and dereferencing it, so wait for a >> grace period >> * is required. >> */ >> need_to_wait = true; > > ... But if there was any possibility of that, need_to_wait is true, and it > still cannot be the case that a reader finds the newly deleted element > in the list, so they cannot find that element, so the pcpu_rcu_conn > variables cannot be set to it. > >> } >> if (need_to_wait) >> synchronize_rcu(); > > And at this point, the reader that might have picked up rcu_conn > just before the cmpxchg must have completed. (Good show, by the way! > Many people miss the fact that they need this second synchronize_rcu().) > > Hmmm... What happens if this was the last element in the list, and > the relevant pcpu_rcu_conn variable references that newly removed > element? Taking a look at list_next_or_null_rcu() and thus at > list_next_or_null_rcu(), and it does appear that you get NULL in that > case, as is right and good. Thanks for explicit comments. What I always lack is a good description. Indeed it is worth to mention that @next can become NULL if that was the last element, will add comments then. > >> mutex_unlock(&conn_lock); >> >> kfree(conn); >> } >> >> >> > >> >> i.e. usage of the @next pointer is under an RCU critical section. >> >> >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like >> >> > macro that makes it more obvious that the whole thing need to be under >> >> > a single RCU read-side critical section? Such a macro would of course be >> >> > an infinite loop if the list never went empty, so presumably there would >> >> > be a break or return statement in there somewhere. >> >> >> >> The difference is that I do not need a loop, I take the @next conn pointer, >> >> save it for the following IO request and do IO for current IO request. >> >> >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body >> >> of the loop does not look nice, I personally do not like it, i.e.: >> >> >> >> >> >> static struct conn *get_and_set_next_conn(void) >> >> { >> >> struct conn *conn; >> >> >> >> conn = rcu_dereferece(rcu_conn); >> >> if (unlikely(!conn)) >> >> return conn; >> >> list_for_each_entry_rr_rcu(conn, &conn_list, >> >> entry) { >> >> break; >> >> } >> >> rcu_assign_pointer(rcu_conn, conn); >> >> return conn; >> >> } >> >> >> >> >> >> or maybe I did not fully get your idea? >> > >> > That would not help at all because you are still leaking the pointer out >> > of the RCU read-side critical section. That is completely and utterly >> > broken unless you are somehow cleaning up rcu_conn when you remove >> > the element. And getting that cleanup right is -extremely- tricky. >> > Unless you have some sort of proof of correctness, you will get a NACK >> > from me. >> >> I understand all the consequences of the leaking pointer, and of course >> wrapped loop with RCU lock/unlock is simpler, but in order to keep >> load-balancing and IO fairness avoiding any locks on IO path I've come >> up with these RCU tricks and list_next_or_null_rr_rcu() macro. > > At first glance, it appears that you have handled this correctly. But I > can make mistakes just as easily as the next guy, so what have you done > to validate your algorithm? What we only have is a set of unit-tests which run every night. For this particular case there is a special stress test which adds/removes RDMA connections in a loop while IO is performing. Unfortunately I did not write any synthetic test-case just for testing this isolated algorithm. (e.g. module with only these RCU functions can be created and list modification can be easily simulated. Should not be very much difficult) Do you think it is worth to do? Unfortunately it also does not prove correctness, like Linus said it is fragile as hell, but for sure I can burn CPUs testing it for couple of days. >> > More like this: >> > >> > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { >> > do_something_with(conn); >> > if (done_for_now()) >> > break; >> > } >> > >> >> >> + */ >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ >> >> >> +({ \ >> >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ >> >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ >> >> > >> >> > Are there any uses for this outside of RDMA? If not, I am with Linus. >> >> > Define this within RDMA, where a smaller number of people can more >> >> > easily be kept aware of the restrictions on use. If it turns out to be >> >> > more generally useful, we can take a look at exactly what makes sense >> >> > more globally. >> >> >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in >> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): >> >> >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 >> >> >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing >> >> my list_next_or_null_rr_rcu() variant? >> > >> > Let's start with the basics: It absolutely does not make sense to leak >> > pointers across rcu_read_unlock() unless you have arranged something else >> > to protect the pointed-to data in the meantime. There are a number of ways >> > of implementing this protection. Again, what protection are you using? >> > >> > Your code at the above URL looks plausible to me at first glance: You >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then >> > rcu_read_unlock(). But at second glance, it looks like htcx->queue >> > might have the same vulnerability as rcu_conn in your earlier code. >> >> I am not the author of the code at the URL I specified. I provided the >> link answering the question to show other possible users of round-robin >> semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' >> case I can't use a loop, I leak the pointer and indeed have to be very >> careful. But perhaps we can come up with some generic solution to cover >> both cases: -rr loop and -rr next. > > Ah. Could you please check their update-side code to make sure that it > looks correct to you? BTW authors of this particular -rr loop are in CC, but they keep silence so far :) According to my shallow understanding @queue can't disappear from the list on this calling path, i.e. existence of a @queue should be guaranteed by the fact that the queue has no IO in-flights and only then it can be removed. -- Roman -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html