On 07/30/2012 06:05 PM, Josh Durgin wrote:
> On 07/26/2012 12:12 PM, Alex Elder wrote:
>> There is a bug in the way __rbd_init_snaps_header() handles newly-
>> discovered snapshots, in the unusual case where the new snapshot id
>> is less than an existing rbd snapshot id.
>>
>> When a new snapshot for an rbd image is created it is assigned a new
>> snapshot id. The client requests the id from a monitor, and later
>> updates the osds with information about the snapshot. The sequence
>> of snapshot ids is monotonically increasing, which guarantees each
>> is unique. And typically all new snapshots will have higher ids
>> than any others a client has encountered.
>>
>> Because of the delay between getting assigned a snapshot id and
>> recording the snapshot, there is a possible race between two clients
>> requesting new snapshots on the same rbd image. It's possible in
>> this case for the client granted a higher id to be the first to
>> update the set of snapshots for the rbd image. This leaves a sort
>> of gap in the set of snapshots until the other client's update gets
>> recorded.
>>
>> This is an unlikely scenario, but it appears to be possible. I'm
>> not sure whether we even support multiple concurrent clients for the
>> same rbd image. However there is code in place to handle this case
>> and it's wrong, so it needs to be fixed.
>
> To fully handle this race, we'd need to sort the snapshot context and
> set its seq to the maximum id, instead of the lower id set by the race.

I had assumed both of these were true. In fact, I was under the
impression that the value of "seq" was precisely that--the maximum
snapshot id issued for this rbd image by the server.
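For reference, here is a rough sketch of the ordering assumption being
discussed. The structure and field names only approximate the kernel's
ceph_snap_context, and nothing like this check exists in the rbd code;
it only spells out the invariant being assumed:

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: approximate shape of a snapshot context.  The assumed
 * invariants are that snaps[] holds snapshot ids newest-first (strictly
 * descending) and that seq is the highest id ever issued for the image,
 * so seq >= snaps[0] whenever any snapshots exist.
 */
struct snap_context_sketch {
        uint64_t seq;           /* highest snapshot id issued so far */
        uint32_t num_snaps;     /* number of entries in snaps[] */
        uint64_t snaps[];       /* snapshot ids, newest first */
};

static bool snap_context_is_ordered(const struct snap_context_sketch *sc)
{
        uint32_t i;

        /* seq must never lag behind the newest snapshot id */
        if (sc->num_snaps && sc->seq < sc->snaps[0])
                return false;

        /* ids must be strictly decreasing: sorted and duplicate-free */
        for (i = 1; i < sc->num_snaps; i++)
                if (sc->snaps[i - 1] <= sc->snaps[i])
                        return false;

        return true;
}

A context that fails a check like this is what the race described above
can briefly leave behind.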
> If we don't do this, the race makes the image unwritable since the
> non-sorted snapshot context would be rejected by the osds.

Looking at the kernel client only (which was how I found this bug,
via inspection), this code is still buggy and could cause a kernel
fault if it were to be hit. The risk is low right now, though.

So I'll just set this patch aside now until we can get a chance to
discuss the best course of action.

                                        -Alex

> This would complicate things a bunch, so I'm not sure if we should
> try to handle this bug on the client side. I've got a fix to prevent
> it from happening in the first place, and as you said, it's rare,
> and requires you to be taking snapshots of the same image from multiple
> clients at once, which is generally not a good idea (that is, writing
> to the image from multiple clients).
>
>> The job of __rbd_init_snaps_header() is to compare an rbd's existing
>> list of snapshots to a new set, and remove any existing snapshots
>> that are not present in the new set and create any new ones not
>> present in the existing set.
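To make that reconciliation concrete, here is a simplified model of it
over plain arrays of ids held newest-first. The helper names are made
up, and the kernel's rbd_snap list and snapshot-name handling are left
out; this only shows the walk over the two sorted sets:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for removing and creating snapshot devices. */
static void drop_snap(uint64_t id)
{
        printf("remove snapshot %llu\n", (unsigned long long)id);
}

static void create_snap(uint64_t id)
{
        printf("create snapshot %llu\n", (unsigned long long)id);
}

/*
 * Walk the existing ids and the ids in the new snapshot context in
 * step, newest first.  Ids only in the old set are removed, ids only
 * in the new set are created, and ids present in both are kept.
 */
static void reconcile(const uint64_t *old_ids, int n_old,
                      const uint64_t *new_ids, int n_new)
{
        int i = 0, j = 0;

        while (i < n_old || j < n_new) {
                if (j >= n_new || (i < n_old && old_ids[i] > new_ids[j])) {
                        drop_snap(old_ids[i++]);    /* gone from the new context */
                } else if (i >= n_old || old_ids[i] < new_ids[j]) {
                        create_snap(new_ids[j++]);  /* newly discovered snapshot */
                } else {
                        i++;                        /* present in both sets */
                        j++;
                }
        }
}

int main(void)
{
        /* The racy case: id 8 shows up "in the middle" of what we already had. */
        uint64_t old_ids[] = { 9, 5, 2 };
        uint64_t new_ids[] = { 9, 8, 5, 2 };

        reconcile(old_ids, 3, new_ids, 4);  /* keeps 9, creates 8, keeps 5 and 2 */
        return 0;
}

(The real function works against rbd_dev->snaps and the snapshot name
buffer rather than bare id arrays, but the intended outcome is the same.)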
>> If any new snapshots exist after the entire list of existing ones
>> has been examined, all the new ones that remain need to be created.
>> The logic for handling these last snapshots is correct.
>>
>> However, if a new snapshot is found whose id is less than an
>> existing one, the logic for handling this is wrong. The basic
>> problem is that the name and current id used here are taken from
>> a place one position beyond where they should be. In the event
>> this block of code is executed the first pass through the outer
>> loop, for example, these values will refer to invalid memory.
>>
>> The fix is to make this block of code more closely match the
>> block that handles adding new snapshots at the end. There are
>> three differences:
>>   - we use a do..while loop because we know we initially have an
>>     entry to process.
>>   - we insert the new entry at a position within the list instead
>>     of at the beginning
>>   - because we use it in the loop condition, we need to update
>>     our notion of "current id" each pass through.
>>
>> Signed-off-by: Alex Elder <elder@xxxxxxxxxxx>
>> ---
>>  drivers/block/rbd.c |   55 +++++++++++++++++++++++++++++----------------------
>>  1 file changed, 31 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 71e3f3b..da09c0d 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2090,13 +2090,14 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
>>  {
>>          const char *name, *first_name;
>>          int i = rbd_dev->header.total_snaps;
>> -        struct rbd_snap *snap, *old_snap = NULL;
>> +        struct rbd_snap *snap;
>>          struct list_head *p, *n;
>>
>>          first_name = rbd_dev->header.snap_names;
>>          name = first_name + rbd_dev->header.snap_names_len;
>>
>>          list_for_each_prev_safe(p, n, &rbd_dev->snaps) {
>> +                struct rbd_snap *old_snap;
>>                  u64 cur_id;
>>
>>                  old_snap = list_entry(p, struct rbd_snap, node);
>>
>> @@ -2104,12 +2105,16 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
>>                  if (i)
>>                          cur_id = rbd_dev->header.snapc->snaps[i - 1];
>>
>> +                /*
>> +                 * If old id is not present in the new context, or
>> +                 * if there are no more snapshots in the new
>> +                 * context, this snapshot should be removed.
>> +                 */
>>                  if (!i || old_snap->id < cur_id) {
>>                          /*
>> -                         * old_snap->id was skipped, thus was
>> -                         * removed. If this rbd_dev is mapped to
>> -                         * the removed snapshot, record that it no
>> -                         * longer exists, to prevent further I/O.
>> +                         * If this rbd_dev is mapped to the removed
>> +                         * snapshot, record that it no longer exists,
>> +                         * to prevent further I/O.
>>                           */
>>                          if (rbd_dev->snap_id == old_snap->id)
>>                                  rbd_dev->snap_exists = false;
>> @@ -2122,34 +2127,36 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
>>                          name = rbd_prev_snap_name(name, first_name);
>>                          continue;
>>                  }
>> -                for (; i > 0;
>> -                     i--, name = rbd_prev_snap_name(name, first_name)) {
>> -                        if (!name) {
>> -                                WARN_ON(1);
>> +
>> +                /*
>> +                 * A new snapshot (or possibly more than one)
>> +                 * appeared in the middle of our set of snapshots.
>> +                 * This could happen of two hosts raced while
>
> of -> if, and hosts -> clients (could be the same host, i.e. via the
> rbd snap create).
>
>> +                 * creating snapshots, and the one that was granted
>> +                 * a higher snapshot id managed to get its snapshot
>> +                 * saved before the other.
>> +                 */
>> +                do {
>> +                        i--;
>> +                        name = rbd_prev_snap_name(name, first_name);
>> +                        if (WARN_ON(!name))
>>                                  return -EINVAL;
>> -                        }
>> -                        cur_id = rbd_dev->header.snapc->snaps[i];
>> -                        /* snapshot removal? handle it above */
>> -                        if (cur_id >= old_snap->id)
>> -                                break;
>> -                        /* a new snapshot */
>> -                        snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
>> +                        snap = __rbd_add_snap_dev(rbd_dev, i, name);
>>                          if (IS_ERR(snap))
>>                                  return PTR_ERR(snap);
>> -
>>                          /* note that we add it backward so using n and not p */
>>                          list_add(&snap->node, n);
>> -                        p = &snap->node;
>> -                }
>> +
>> +                        cur_id = rbd_dev->header.snapc->snaps[i];
>> +                } while (i && cur_id < old_snap->id);
>>          }
>>          /* we're done going over the old snap list, just add what's left */
>> -        for (; i > 0; i--) {
>> +        while (i) {
>> +                i--;
>>                  name = rbd_prev_snap_name(name, first_name);
>> -                if (!name) {
>> -                        WARN_ON(1);
>> +                if (WARN_ON(!name))
>>                          return -EINVAL;
>> -                }
>> -                snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
>> +                snap = __rbd_add_snap_dev(rbd_dev, i, name);
>>                  if (IS_ERR(snap))
>>                          return PTR_ERR(snap);
>>                  list_add(&snap->node, &rbd_dev->snaps);
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html