Multiple skiplist bugs found, patches attached

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 12, 2007 at 12:34:34AM +1100, Bron Gondwana wrote:
> Anyway - here it is.  A "recovery()" that copes if the logstart
> parameter in the database header is wrong.  No, I don't have a
> clue how that happened unless lseek() lied.  Maybe it sometimes
> lies, I don't know.  I'll be writing a test case for that soon
> too!

I have some more suspicions now, but I wrote it all up in the
patch header, so here's the bugfixes only patch, a "robustness"
extras patch and the tool I used for testing.

Ken, I know you've done some other work on the file changing
types.  I'd like to be even more agressive and convert just
about everything to bit32 and also rename some variables, but
I restricted myself in this to only fixing the most ugly case:
offset = htonl(offset).

These patches are all against 2.3.10 (in this order), and may
need some fuzz fixing to apply against your latest CVS thanks
to those changes - sorry I haven't done that, but it's getting
on 1am for me, and I've just finished doing a lot of testing
and paring these down to simple and clear patches that don't
touch more than they need to fix the issues.

cyrus-skiplist-bugfixes-2.3.10.diff:

  Ken, please review this patch and consider it for pushing out
  in the next release, preferably soon.  There really are a lot
  of issues I found reviewing the code, and even more so with
  the attached tool that can be used to "hammer" a file with
  all sorts of interesting requests.  There are some really
  nasty skiplist corruptions available if even one process
  is ever killed or segfaults half way through a transaction
  and the first operation to touch said file after this is
  a write.

cyrus-skiplist-robustify-2.3.10.diff:

  If you have been running skiplist on your systems and suspect
  that you may have corrupted databases, you probably want this.
  It adds extra robustness and fixupability to recovery().  It's
  still not going to be crash proof reading line-noise, but it
  detects and fixes all the corruptions I have personally seen.

  (I say this having tested it on all the bogus DBs I had, eg:)

  DBERROR: skiplist recovery /tmp/mb2/mailboxes.db.1194508562: 
  -> 8E6DD8 should be ADD or DELETE

  became:

  skiplist recovery /tmp/mb2/mailboxes.db.1194508562: 
  ->  skipped 136 bytes of zeros at 8E6DD8
  skiplist: recovered /tmp/mb2/mailboxes.db.1194508562 
  ->  (44594 records, 9547108 bytes) in 1 second
  skiplist: checkpointed /tmp/mb2/mailboxes.db.1194508562
  ->  (44594 records, 5350556 bytes) in 1 second

  and:

  skiplist recovery /tmp/mailboxes.db.fail: 
  -> 288C00 should be ADD or DELETE

  became:

  skiplist recovery /tmp/mailboxes.db.fail: 
  -> incorrect logstart 288BD8 changed to 356F94
  skiplist: recovered /tmp/mailboxes.db.fail
  -> (28811 records, 4109664 bytes) in 1 second

  In both cases a second run of the same command (I used
  cyr_dbtool 'show') came up clean - no issues remaining
  and no log entries.


cyrus-hammer-skiplist-2.3.10.diff:

  I used this command to hammer a skiplist database like so:

  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db &
  ...

  I've turned down the "forget about this transaction" option
  to a lot less common than my original tests.  It should still
  fire a couple of times per hammer, but it creates log entries 
  even on the fully patched code (rolling back incomplete txn),
  so I didn't want to spam the logs.

Enjoy,

Bron.
SKIPLIST bugfixes

In the past we have had issues with bugs in skiplist on seen
files, and we truncated files at the offset with the issue
since they were only seen data.

Lately, we have had more tools updating mailboxes.db more
often, and have lost multiple mailboxes.db files.

There are two detectable issues:

1) incorrect header "logstart" values causing recovery to
   fail with either unexpected ADD/DELETE records or
   unexpected INORDER records depending which side of the
   correct location the logstart value is wrong.
2) a bunch of zero bytes between transactions in the log
   section.

The attached patch fixes the following issues:

a) recovery failed to update db->map_base if it truncated
   a partial transaction.  This reliably recreated the
   zero bytes issues above by having the next store command
   lseek to a location past the new end of the file, and
   hence fill the remainder with blanks.

b) the logic in the "delete" handler for detecting "no
   record exists" (ptr == db->map_base) was backwards,
   meaning that a delete on a record which didn't exist
   caused reads of PTR(db->map_base, i), which is bogus
   and nasty.  This is the suspect for logstart breakage
   though I haven't proven this yet.

c) unsure if this is a real risk, but added a ftruncate
   to checkpoint to ensure new file really is empty,
   since we don't open it with O_EXCL.

d.1) when abort is called it needs to update_lock() to
     ensure that the records it's about to rollback are
     actually locked.  This fix stopped segfaults in
     my testing.

d.2) delete didn't check retry_write for success, and
     also suffered from the same problem as:

d.3) if retry_writev in store failed, then it called
     abort, but without the records actually written,
     abort had no way of knowing which offsets needed
     to be switched back, meaning bogus pointers could
     be left in the file until the next recovery().
     Changed it to write the records first, then only
     once that succeeded update the pointers.  This way
     abort will do the right thing regardless.


*PHEW* - that's three days I want back, but it has survived
10 concurrent processes doing nasty things to it for 10000
operations each and it's still going strong:
skiplist: recovered /tmp/hammer.db (143177 records, 5965096 bytes) in 0 seconds
Index: cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c
===================================================================
--- cyrus-imapd-2.3.10.orig/lib/cyrusdb_skiplist.c	2007-11-12 07:25:23.000000000 -0500
+++ cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c	2007-11-12 07:25:51.000000000 -0500
@@ -1086,7 +1086,7 @@
     int addrectype = htonl(ADD);
     int delrectype = htonl(DELETE);
     bit32 todelete;
-    bit32 newoffset;
+    bit32 newoffset, netnewoffset;
     int r, i;
 
     assert(db != NULL);
@@ -1174,17 +1174,7 @@
     klen = htonl(keylen);
     dlen = htonl(datalen);
     
-    newoffset = htonl(newoffset);
-
-    /* set pointers appropriately */
-    for (i = 0; i < lvl; i++) {
-	/* write pointer updates */
-	/* FORWARD(updates[i], i) = newoffset; */
-	lseek(db->fd, 
-	      PTR(db->map_base + updateoffsets[i], i) - db->map_base,
-	      SEEK_SET);
-	retry_write(db->fd, (char *) &newoffset, 4);
-    }
+    netnewoffset = htonl(newoffset);
 
     WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *) &addrectype, 4);
     WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *) &klen, 4);
@@ -1212,6 +1202,17 @@
     }
     tp->logend += r;		/* update where to write next */
 
+    /* update pointers after writing record so abort is guaranteed to
+     * see which records need reverting */
+    for (i = 0; i < lvl; i++) {
+	/* write pointer updates */
+	/* FORWARD(updates[i], i) = newoffset; */
+	lseek(db->fd,
+	      PTR(db->map_base + updateoffsets[i], i) - db->map_base,
+	      SEEK_SET);
+	retry_write(db->fd, (char *) &netnewoffset, 4);
+    }
+
     if (tid) {
 	if (!*tid) {
 	    /* return the txn structure */
@@ -1290,12 +1291,28 @@
     }
 
     ptr = find_node(db, key, keylen, updateoffsets);
-    if (ptr == db->map_base ||
+    if (ptr != db->map_base &&
 	!db->compar(KEY(ptr), KEYLEN(ptr), key, keylen)) {
 	/* gotcha */
 	offset = ptr - db->map_base;
 
-	/* update pointers */
+	/* log the deletion */
+	getsyncfd(db, tp);
+	lseek(tp->syncfd, tp->logend, SEEK_SET);
+	writebuf[0] = delrectype;
+	writebuf[1] = htonl(offset);
+
+	/* update end-of-log */
+	r = retry_write(tp->syncfd, (char *) writebuf, 8);
+	if (r < 0) {
+	    syslog(LOG_ERR, "DBERROR: retry_write(): %m");
+	    myabort(db, tp);
+	    return CYRUSDB_IOERROR;
+	}
+	tp->logend += r;
+
+	/* update pointers after writing record so abort is guaranteed to
+	 * see which records need reverting */
 	for (i = 0; i < db->curlevel; i++) {
 	    int newoffset;
 
@@ -1308,15 +1325,6 @@
 		  SEEK_SET);
 	    retry_write(db->fd, (char *) &newoffset, 4);
 	}
-
-	/* log the deletion */
-	getsyncfd(db, tp);
-	lseek(tp->syncfd, tp->logend, SEEK_SET);
-	writebuf[0] = delrectype;
-	writebuf[1] = htonl(offset);
-
-	/* update end-of-log */
-	tp->logend += retry_write(tp->syncfd, (char *) writebuf, 8);
     }
 
     if (tid) {
@@ -1429,6 +1437,9 @@
     int r = 0;
 
     assert(db && tid);
+
+    /* update the mmap so we can see the log entries we need to remove */
+    update_lock(db, tid);
     
     /* look at the log entries we've written, and undo their effects */
     while (tid->logstart != tid->logend) {
@@ -1562,6 +1573,15 @@
 	return CYRUSDB_IOERROR;
     }
 
+    /* truncate it just in case! */
+    r = ftruncate(db->fd, 0);
+    if (r < 0) {
+	syslog(LOG_ERR, "DBERROR: skiplist checkpoint %s: ftruncate %m", fname);
+	if (!locked) unlock(db);
+	db->fd = oldfd;
+	return CYRUSDB_IOERROR;
+    }
+
     /* write dummy record */
     if (!r) {
 	int dsize = DUMMY_SIZE(db);
@@ -2059,6 +2079,10 @@
 		       db->fname);
 		r = CYRUSDB_IOERROR;
 	    }
+
+	    /* set the map size back as well */
+	    db->map_size = offset;
+
 	    break;
 	}
 
Index: cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c
===================================================================
--- cyrus-imapd-2.3.10.orig/lib/cyrusdb_skiplist.c	2007-11-12 08:18:12.000000000 -0500
+++ cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c	2007-11-12 08:44:53.000000000 -0500
@@ -1884,6 +1884,7 @@
     int r = 0;
     time_t start = time(NULL);
     int i;
+    int need_checkpoint = 0;
 
     if (!(flags & RECOVERY_CALLER_LOCKED) && (r = write_lock(db, NULL)) < 0) {
 	return r;
@@ -1946,20 +1947,13 @@
     
     /* reset the data that was written INORDER by the last checkpoint */
     offset = DUMMY_OFFSET(db) + DUMMY_SIZE(db);
-    while (!r && (offset < (bit32) db->logstart)) {
+    while (!r && (offset < db->map_size)
+	      && TYPE(db->map_base + offset) == INORDER) {
 	ptr = db->map_base + offset;
 	offsetnet = htonl(offset);
 
 	db->listsize++;
 
-	/* make sure this is INORDER */
-	if (TYPE(ptr) != INORDER) {
-	    syslog(LOG_ERR, "DBERROR: skiplist recovery: %04X should be INORDER",
-		   offset);
-	    r = CYRUSDB_IOERROR;
-	    continue;
-	}
-	    
 	/* xxx check \0 fill on key */
 
 	/* xxx check \0 fill on data */
@@ -2000,6 +1994,12 @@
 	}
     }
 
+    if (offset != db->logstart) {
+	syslog(LOG_NOTICE, "skiplist recovery %s: incorrect logstart %04X changed to %04X", 
+	       db->fname, db->logstart, offset);
+	db->logstart = offset; /* header will be committed later */
+    }
+
     /* zero out the remaining pointers */
     if (!r) {
 	for (i = 0; !r && i < db->maxlevel; i++) {
@@ -2033,6 +2033,19 @@
 		    db->fname, 0);
 
 	ptr = db->map_base + offset;
+
+	/* bugs in recovery truncates could have left some bogus zeros here */
+	if (TYPE(ptr) == 0) {
+	    int orig = offset;
+	    while (TYPE(ptr) == 0 && offset < db->map_size) {
+		offset += 4;
+		ptr = db->map_base + offset;
+	    }
+	    syslog(LOG_ERR, "skiplist recovery %s: skipped %d bytes of zeros at %04X",
+			    db->fname, offset - orig, orig);
+	    need_checkpoint = 1;
+	}
+
 	offsetnet = htonl(offset);
 
 	/* if this is a commit, we've processed everything in this txn */
@@ -2205,6 +2218,10 @@
 	       db->map_size, diff, diff == 1 ? "" : "s"); 
     }
 
+    if (!r && need_checkpoint) {
+	r = mycheckpoint(db, 1);
+    }
+
     if(r || !(flags & RECOVERY_CALLER_LOCKED)) {
 	unlock(db);
     }
Index: cyrus-imapd-2.3.10/imap/Makefile.in
===================================================================
--- cyrus-imapd-2.3.10.orig/imap/Makefile.in	2007-11-12 05:14:03.000000000 -0500
+++ cyrus-imapd-2.3.10/imap/Makefile.in	2007-11-12 06:33:26.000000000 -0500
@@ -115,7 +115,7 @@
 
 PROGS = imapd lmtpd pop3d \
 	fud smmapd reconstruct quota mbpath ipurge cyr_dbtool cyr_synclog \
-	cyrdump chk_cyrus cvt_cyrusdb deliver ctl_mboxlist \
+	cyrdump chk_cyrus cvt_cyrusdb deliver ctl_mboxlist hammer_skiplist \
 	ctl_deliver ctl_cyrusdb squatter mbexamine cyr_expire arbitron \
 	unexpunge @IMAP_PROGS@
 
@@ -261,6 +261,10 @@
 	$(CC) $(LDFLAGS) -o cyr_dbtool cyr_dbtool.o $(CLIOBJS) \
 	libimap.a $(DEPLIBS) $(LIBS)
 
+hammer_skiplist: hammer_skiplist.o mutex_fake.o libimap.a $(DEPLIBS)
+	$(CC) $(LDFLAGS) -o hammer_skiplist hammer_skiplist.o $(CLIOBJS) \
+	libimap.a $(DEPLIBS) $(LIBS)
+
 cyr_synclog: cyr_synclog.o mutex_fake.o libimap.a $(DEPLIBS)
 	$(CC) $(LDFLAGS) -o cyr_synclog cyr_synclog.o $(CLIOBJS) \
 	libimap.a $(DEPLIBS) $(LIBS)
Index: cyrus-imapd-2.3.10/imap/hammer_skiplist.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ cyrus-imapd-2.3.10/imap/hammer_skiplist.c	2007-11-12 06:35:28.000000000 -0500
@@ -0,0 +1,210 @@
+/* hammer_skiplist.c - tool to harass a skiplist file
+ * 
+ * Copyright (c) 1998-2003 Carnegie Mellon University.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer. 
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The name "Carnegie Mellon University" must not be used to
+ *    endorse or promote products derived from this software without
+ *    prior written permission. For permission or any other legal
+ *    details, please contact  
+ *      Office of Technology Transfer
+ *      Carnegie Mellon University
+ *      5000 Forbes Avenue
+ *      Pittsburgh, PA  15213-3890
+ *      (412) 268-4387, fax: (412) 268-7395
+ *      tech-transfer@xxxxxxxxxxxxxx
+ *
+ * 4. Redistributions of any form whatsoever must retain the following
+ *    acknowledgment:
+ *    "This product includes software developed by Computing Services
+ *     at Carnegie Mellon University (http://www.cmu.edu/computing/)."
+ *
+ * CARNEGIE MELLON UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO
+ * THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS, IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY BE LIABLE
+ * FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
+ * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
+ * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ *
+ */
+/*
+ * $Id: hammer_skiplist.c,v 1.4 2007/09/28 02:27:46 murch Exp $
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/uio.h>
+#include <fcntl.h>
+#include <ctype.h>
+#include <syslog.h>
+
+#include <sys/ipc.h>
+#include <sys/msg.h>
+
+#include "acl.h"
+#include "assert.h"
+#include "auth.h"
+#include "cyrusdb.h"
+#include "exitcodes.h"
+#include "glob.h"
+#include "imap_err.h"
+#include "global.h"
+#include "mailbox.h"
+#include "util.h"
+#include "xmalloc.h"
+
+const int config_need_data = 0;
+struct cyrusdb_backend *DB_OLD = NULL;
+
+void hammer(struct db *db)
+{
+    int c;
+    for (c = 0;c < 10000; c++) { /* should be enough! */
+        struct txn *tid = NULL;
+        struct txn **tp;
+        char key[100];
+        char value[100];
+        int klen, vlen, i, r = 0;
+        int rop;
+
+        /* protect against silly things */
+
+        tp = (rand() % 2) ? &tid : NULL;
+        klen = 1 + (rand() % 6);
+        for (i = 0; i < klen; i++) {
+          key[i] = 'A' + (rand() % 26);
+        }
+        key[klen] = '\0';
+        vlen = rand() % 20;
+        for (i = 0; i < vlen; i++) {
+          value[i] = 'a' + (rand() % 26);
+        }
+        value[vlen] = '\0';
+        rop = rand() % 1000;
+        if (rop >= 999) {
+          if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp);
+          /* forget to commit */
+        }
+        else if (rop >= 800) {
+          if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp);
+          if (!r) r = DB_OLD->delete(db, key, klen, tp, 0);
+          if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp);
+          if (!r && tp) DB_OLD->commit(db, *tp);
+        }
+        else if (rop >= 700) {
+          if (!r) r =  DB_OLD->delete(db, key, klen, tp, 0);
+          if (!r && tp) DB_OLD->commit(db, *tp);
+        }
+        else if (rop >= 600) { /* will fail */
+          if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp);
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+        }
+        else if (rop > 200) {
+          if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'a';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'b';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'd';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'c';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          if (!r && tp) DB_OLD->commit(db, *tp);
+        }
+        else {
+          if (!r) r =  DB_OLD->store(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'a';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'b';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'd';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          key[klen-1] = 'c';
+          if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp);
+          if (!r && tp) DB_OLD->abort(db, *tp);
+        }
+
+    }
+
+
+}
+
+int main(int argc, char *argv[])
+{
+    struct db *odb;
+    const char *old_db;
+    int opt, i, r;
+    char *alt_config = NULL;
+    int db_flags = 0;
+
+    while ((opt = getopt(argc, argv, "C:n")) != EOF) {
+	switch (opt) {
+	case 'C': /* alt config file */
+	    alt_config = optarg;
+	    break;
+	case 'n': /* create new */
+	    db_flags |= CYRUSDB_CREATE;
+	    break;
+	}
+    }
+	
+    if((argc - optind) < 1) {
+	fprintf(stderr, "Usage: %s [-C altconfig] <dbfile>\n", argv[0]);
+
+	exit(-1);
+    }
+
+    old_db = argv[optind];
+
+    if(old_db[0] != '/') {
+	printf("\nSorry, you cannot use this tool with relative path names.\n"
+	       "This is because some database backends (mainly berkeley) do not\n"
+	       "always do what you would expect with them.\n"
+	       "\nPlease use absolute pathnames instead.\n\n");
+	exit(EC_OSERR);
+    }
+
+    for(i=0; cyrusdb_backends[i]; i++) {
+	if(!strcmp(cyrusdb_backends[i]->name, "skiplist")) {
+	    DB_OLD = cyrusdb_backends[i]; break;
+	}
+    }
+    if(!cyrusdb_backends[i]) {
+	fatal("unknown backend", EC_TEMPFAIL);
+    }   
+
+    cyrus_init(alt_config, "hammer_skiplist", 0);
+
+
+    r = (DB_OLD->open)(old_db, db_flags, &odb);
+    if(r != CYRUSDB_OK)
+	fatal("can't open database", EC_TEMPFAIL);
+
+    hammer(odb);
+    
+    cyrus_done();
+
+    return 0;
+}
+
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html

[Index of Archives]     [Cyrus SASL]     [Squirrel Mail]     [Asterisk PBX]     [Video For Linux]     [Photo]     [Yosemite News]     [gtk]     [KDE]     [Gimp on Windows]     [Steve's Art]

  Powered by Linux