On 07/26/2012 12:12 PM, Alex Elder wrote:
There is a bug in the way __rbd_init_snaps_header() handles newly- discovered snapshots, in the unusual case where the new snapshot id is less than an existing rbd snapshot id. When a new snapshot for an rbd image is created it is assigned a new snapshot id. The client requests the id from a monitor, and later updates the osds with information about the snapshot. The sequence of snapshot ids is monotonically increasing, which guarantees each is unique. And typically all new snapshots will have higher ids than any others a client has encountered. Because of the delay between getting assigned a snapshot id and recording the snapshot, there is a possible race between two clients requesting new snapshots on the same rbd image. It's possible in this case for the client granted a higher id to be the first to update the set of snapshots for the rbd image. This leaves a sort of gap in the set of snapshots until the other client's update gets recorded. This is an unlikely scenario, but it appears to be possible. I'm not sure whether we even support multiple concurrent clients for the same rbd image. However there is code in place to handle this case and it's wrong, so it needs to be fixed.
To fully handle this race, we'd need to sort the snapshot context and set its seq to the maximum id, instead of the lower id set by the race. If we don't do this, the race makes the image unwritable since the non-sorted snapshot context would be rejected by the osds. This would complicate things a bunch, so I'm not sure if we should try to handle this bug on the client side. I've got a fix to prevent it from happening in the first place, and as you said, it's rare, and requires you to be taking snapshots of the same image from multiple clients at once, which is generally not a good idea (that is, writing to the image from multiple clients).
The job of __rbd_init_snaps_header() is to compare an rbd's existing list of snapshots to a new set, and remove any existing snapshots that are not present in the new set and create any new ones not present in the existing set. If any new snapshots exist after the entire list of existing ones has been examined, all the new ones that remain need to be created. The logic for handling these last snapshots correct. However, if a new snapshot is found whose id is less than an existing one, the logic for handling this is wrong. The basic problem is that the name and current id used here are taken from a place one position beyond where they should be. In the event this block of code is executed the first pass through the outer loop, for example, these values will refer to invalid memory. The fix is to make this block of code more closely match the block that handles adding new snapshots at the end. There are three differences: - we use a do..while loop because we know we initially have an entry to process. - we insert the new entry at a position within the list instead of at the beginning - because we use it in the loop condition, we need to update our notion of "current id" each pass through. Signed-off-by: Alex Elder <elder@xxxxxxxxxxx> --- drivers/block/rbd.c | 55 +++++++++++++++++++++++++++++---------------------- 1 file changed, 31 insertions(+), 24 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 71e3f3b..da09c0d 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2090,13 +2090,14 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) { const char *name, *first_name; int i = rbd_dev->header.total_snaps; - struct rbd_snap *snap, *old_snap = NULL; + struct rbd_snap *snap; struct list_head *p, *n; first_name = rbd_dev->header.snap_names; name = first_name + rbd_dev->header.snap_names_len; list_for_each_prev_safe(p, n, &rbd_dev->snaps) { + struct rbd_snap *old_snap; u64 cur_id; old_snap = list_entry(p, struct rbd_snap, node); @@ -2104,12 +2105,16 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) if (i) cur_id = rbd_dev->header.snapc->snaps[i - 1]; + /* + * If old id is not present in the new context, or + * if there are no more snapshots in the new + * context, this snapshot should be removed. + */ if (!i || old_snap->id < cur_id) { /* - * old_snap->id was skipped, thus was - * removed. If this rbd_dev is mapped to - * the removed snapshot, record that it no - * longer exists, to prevent further I/O. + * If this rbd_dev is mapped to the removed + * snapshot, record that it no longer exists, + * to prevent further I/O. */ if (rbd_dev->snap_id == old_snap->id) rbd_dev->snap_exists = false; @@ -2122,34 +2127,36 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) name = rbd_prev_snap_name(name, first_name); continue; } - for (; i > 0; - i--, name = rbd_prev_snap_name(name, first_name)) { - if (!name) { - WARN_ON(1); + + /* + * A new snapshot (or possibly more than one) + * appeared in the middle of our set of snapshots. + * This could happen of two hosts raced while
of -> if, and hosts -> clients (could be the same host, i.e. via the rbd snap create).
+ * creating snapshots, and the one that was granted + * a higher snapshot id managed to get its snapshot + * saved before the other. + */ + do { + i--; + name = rbd_prev_snap_name(name, first_name); + if (WARN_ON(!name)) return -EINVAL; - } - cur_id = rbd_dev->header.snapc->snaps[i]; - /* snapshot removal? handle it above */ - if (cur_id >= old_snap->id) - break; - /* a new snapshot */ - snap = __rbd_add_snap_dev(rbd_dev, i - 1, name); + snap = __rbd_add_snap_dev(rbd_dev, i, name); if (IS_ERR(snap)) return PTR_ERR(snap); - /* note that we add it backward so using n and not p */ list_add(&snap->node, n); - p = &snap->node; - } + + cur_id = rbd_dev->header.snapc->snaps[i]; + } while (i && cur_id < old_snap->id); } /* we're done going over the old snap list, just add what's left */ - for (; i > 0; i--) { + while (i) { + i--; name = rbd_prev_snap_name(name, first_name); - if (!name) { - WARN_ON(1); + if (WARN_ON(!name)) return -EINVAL; - } - snap = __rbd_add_snap_dev(rbd_dev, i - 1, name); + snap = __rbd_add_snap_dev(rbd_dev, i, name); if (IS_ERR(snap)) return PTR_ERR(snap); list_add(&snap->node, &rbd_dev->snaps);
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html