I messed up a crush map the other day, mixing components of different
types in a single rule. The crushmap compiler didn't complain, but mons
and osds would crash when applying those rules. I had to use this patch
to recover the cluster. Only the second hunk was relevant, but I
figured a BUG_ON that stops you from fixing the problem is best avoided
;-)
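For anyone who wants to poke at the idea outside the tree, here is a minimal sketch of the range checks the patch performs, returning an error instead of aborting. The names (toy_map, toy_bucket, toy_lookup_bucket, toy_item_in_range, toy_self_check) are made up for illustration and are not part of crush; only the two range tests mirror the patch below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the CRUSH structures. */
struct toy_bucket {
	int id;			/* bucket ids are negative */
};

struct toy_map {
	struct toy_bucket **buckets;
	int max_buckets;
	int max_devices;
};

/*
 * Look up the bucket named by a negative item id.  Returns NULL
 * instead of aborting when the id is out of range, so the caller
 * can skip the replica and keep going.
 */
static struct toy_bucket *toy_lookup_bucket(const struct toy_map *map,
					    int item)
{
	if (item >= 0 || (-1 - item) >= map->max_buckets)
		return NULL;	/* bad item: report it, don't crash */
	return map->buckets[-1 - item];
}

/* Same test as the first hunk: any item must be below max_devices. */
static int toy_item_in_range(const struct toy_map *map, int item)
{
	return item < map->max_devices;
}

/* Usage: a two-bucket, four-device map with good and bad ids. */
static void toy_self_check(void)
{
	struct toy_bucket root = { -1 }, rack = { -2 };
	struct toy_bucket *slots[] = { &root, &rack };
	struct toy_map map = { slots, 2, 4 };

	assert(toy_lookup_bucket(&map, -2) == &rack);	/* valid bucket */
	assert(toy_lookup_bucket(&map, -3) == NULL);	/* past max_buckets */
	assert(toy_lookup_bucket(&map, 1) == NULL);	/* device, not bucket */
	assert(toy_item_in_range(&map, 3));
	assert(!toy_item_in_range(&map, 4));		/* device id too large */
}
```

The point of returning NULL rather than asserting is that the map is input data: a daemon that aborts on a bad map cannot be handed the corrected one.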
--- Begin Message ---
- Subject: [PATCH 2/2] [crush] don't BUG_ON within crush_choose
- From: Alexandre Oliva <lxoliva@xxxxxxxxx>
- Date: Sun, 29 Jan 2012 05:15:18 -0200
It's very hard to recover from an invalid crushmap if mons fail
assertions while processing the map, and osds crash while advancing
past an already-fixed map. Skip such broken rules instead of
aborting.
Signed-off-by: Alexandre Oliva <oliva@xxxxxxxxxxxxxxxxx>
---
src/crush/mapper.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 1e475b40..6ce4c97 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -354,7 +354,11 @@ static int crush_choose(const struct crush_map *map,
item = bucket_perm_choose(in, x, r);
else
item = crush_bucket_choose(in, x, r);
- BUG_ON(item >= map->max_devices);
+ if (item >= map->max_devices) {
+ dprintk(" bad item %d\n", item);
+ skip_rep = 1;
+ break;
+ }
/* desired type? */
if (item < 0)
@@ -365,8 +369,12 @@ static int crush_choose(const struct crush_map *map,
/* keep going? */
if (itemtype != type) {
- BUG_ON(item >= 0 ||
- (-1-item) >= map->max_buckets);
+ if (item >= 0 ||
+ (-1-item) >= map->max_buckets) {
+ dprintk(" bad item type %d\n", type);
+ skip_rep = 1;
+ break;
+ }
in = map->buckets[-1-item];
retry_bucket = 1;
continue;
--
1.7.7.6
--- End Message ---
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist Red Hat Brazil Compiler Engineer