On Mon, Nov 12, 2007 at 12:34:34AM +1100, Bron Gondwana wrote: > Anyway - here it is. A "recovery()" that copes if the logstart > parameter in the database header is wrong. No, I don't have a > clue how that happened unless lseek() lied. Maybe it sometimes > lies, I don't know. I'll be writing a test case for that soon > too! I have some more suspicions now, but I wrote it all up in the patch header, so here's the bugfixes only patch, a "robustness" extras patch and the tool I used for testing. Ken, I know you've done some other work on the file changing types. I'd like to be even more agressive and convert just about everything to bit32 and also rename some variables, but I restricted myself in this to only fixing the most ugly case: offset = htonl(offset). These patches are all against 2.3.10 (in this order), and may need some fuzz fixing to apply against your latest CVS thanks to those changes - sorry I haven't done that, but it's getting on 1am for me, and I've just finished doing a lot of testing and paring these down to simple and clear patches that don't touch more than they need to fix the issues. cyrus-skiplist-bugfixes-2.3.10.diff: Ken, please review this patch and consider it for pushing out in the next release, preferably soon. There really are a lot of issues I found reviewing the code, and even more so with the attached tool that can be used to "hammer" a file with all sorts of interesting requests. There are some really nasty skiplist corruptions available if even one process is ever killed or segfaults half way through a transaction and the first operation to touch said file after this is a write. cyrus-skiplist-robustify-2.3.10.diff: If you have been running skiplist on your systems and suspect that you may have corrupted databases, you probably want this. It adds extra robustness and fixupability to recovery(). It's still not going to be crash proof reading line-noise, but it detects and fixes all the corruptions I have personally seen. (I say this having tested it on all the bogus DBs I had, eg:) DBERROR: skiplist recovery /tmp/mb2/mailboxes.db.1194508562: -> 8E6DD8 should be ADD or DELETE became: skiplist recovery /tmp/mb2/mailboxes.db.1194508562: -> skipped 136 bytes of zeros at 8E6DD8 skiplist: recovered /tmp/mb2/mailboxes.db.1194508562 -> (44594 records, 9547108 bytes) in 1 second skiplist: checkpointed /tmp/mb2/mailboxes.db.1194508562 -> (44594 records, 5350556 bytes) in 1 second and: skiplist recovery /tmp/mailboxes.db.fail: -> 288C00 should be ADD or DELETE became: skiplist recovery /tmp/mailboxes.db.fail: -> incorrect logstart 288BD8 changed to 356F94 skiplist: recovered /tmp/mailboxes.db.fail -> (28811 records, 4109664 bytes) in 1 second In both cases a second run of the same command (I used cyr_dbtool 'show') came up clean - no issues remaining and no log entries. cyrus-hammer-skiplist-2.3.10.diff: I used this command to hammer a skiplist database like so: sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & sudo -u cyrus ./hammer_skiplist -n /tmp/hammer.db & ... I've turned down the "forget about this transaction" option to a lot less common than my original tests. It should still fire a couple of times per hammer, but it creates log entries even on the fully patched code (rolling back incomplete txn), so I didn't want to spam the logs. Enjoy, Bron.
SKIPLIST bugfixes In the past we have had issues with bugs in skiplist on seen files, and we truncated files at the offset with the issue since they were only seen data. Lately, we have had more tools updating mailboxes.db more often, and have lost multiple mailboxes.db files. There are two detectable issues: 1) incorrect header "logstart" values causing recovery to fail with either unexpected ADD/DELETE records or unexpected INORDER records depending which side of the correct location the logstart value is wrong. 2) a bunch of zero bytes between transactions in the log section. The attached patch fixes the following issues: a) recovery failed to update db->map_base if it truncated a partial transaction. This reliably recreated the zero bytes issues above by having the next store command lseek to a location past the new end of the file, and hence fill the remainder with blanks. b) the logic in the "delete" handler for detecting "no record exists" (ptr == db->map_base) was backwards, meaning that a delete on a record which didn't exist caused reads of PTR(db->map_base, i), which is bogus and nasty. This is the suspect for logstart breakage though I haven't proven this yet. c) unsure if this is a real risk, but added a ftruncate to checkpoint to ensure new file really is empty, since we don't open it with O_EXCL. d.1) when abort is called it needs to update_lock() to ensure that the records it's about to rollback are actually locked. This fix stopped segfaults in my testing. d.2) delete didn't check retry_write for success, and also suffered from the same problem as: d.3) if retry_writev in store failed, then it called abort, but without the records actually written, abort had no way of knowing which offsets needed to be switched back, meaning bogus pointers could be left in the file until the next recovery(). Changed it to write the records first, then only once that succeeded update the pointers. This way abort will do the right thing regardless. *PHEW* - that's three days I want back, but it has survived 10 concurrent processes doing nasty things to it for 10000 operations each and it's still going strong: skiplist: recovered /tmp/hammer.db (143177 records, 5965096 bytes) in 0 seconds Index: cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c =================================================================== --- cyrus-imapd-2.3.10.orig/lib/cyrusdb_skiplist.c 2007-11-12 07:25:23.000000000 -0500 +++ cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c 2007-11-12 07:25:51.000000000 -0500 @@ -1086,7 +1086,7 @@ int addrectype = htonl(ADD); int delrectype = htonl(DELETE); bit32 todelete; - bit32 newoffset; + bit32 newoffset, netnewoffset; int r, i; assert(db != NULL); @@ -1174,17 +1174,7 @@ klen = htonl(keylen); dlen = htonl(datalen); - newoffset = htonl(newoffset); - - /* set pointers appropriately */ - for (i = 0; i < lvl; i++) { - /* write pointer updates */ - /* FORWARD(updates[i], i) = newoffset; */ - lseek(db->fd, - PTR(db->map_base + updateoffsets[i], i) - db->map_base, - SEEK_SET); - retry_write(db->fd, (char *) &newoffset, 4); - } + netnewoffset = htonl(newoffset); WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *) &addrectype, 4); WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *) &klen, 4); @@ -1212,6 +1202,17 @@ } tp->logend += r; /* update where to write next */ + /* update pointers after writing record so abort is guaranteed to + * see which records need reverting */ + for (i = 0; i < lvl; i++) { + /* write pointer updates */ + /* FORWARD(updates[i], i) = newoffset; */ + lseek(db->fd, + PTR(db->map_base + updateoffsets[i], i) - db->map_base, + SEEK_SET); + retry_write(db->fd, (char *) &netnewoffset, 4); + } + if (tid) { if (!*tid) { /* return the txn structure */ @@ -1290,12 +1291,28 @@ } ptr = find_node(db, key, keylen, updateoffsets); - if (ptr == db->map_base || + if (ptr != db->map_base && !db->compar(KEY(ptr), KEYLEN(ptr), key, keylen)) { /* gotcha */ offset = ptr - db->map_base; - /* update pointers */ + /* log the deletion */ + getsyncfd(db, tp); + lseek(tp->syncfd, tp->logend, SEEK_SET); + writebuf[0] = delrectype; + writebuf[1] = htonl(offset); + + /* update end-of-log */ + r = retry_write(tp->syncfd, (char *) writebuf, 8); + if (r < 0) { + syslog(LOG_ERR, "DBERROR: retry_write(): %m"); + myabort(db, tp); + return CYRUSDB_IOERROR; + } + tp->logend += r; + + /* update pointers after writing record so abort is guaranteed to + * see which records need reverting */ for (i = 0; i < db->curlevel; i++) { int newoffset; @@ -1308,15 +1325,6 @@ SEEK_SET); retry_write(db->fd, (char *) &newoffset, 4); } - - /* log the deletion */ - getsyncfd(db, tp); - lseek(tp->syncfd, tp->logend, SEEK_SET); - writebuf[0] = delrectype; - writebuf[1] = htonl(offset); - - /* update end-of-log */ - tp->logend += retry_write(tp->syncfd, (char *) writebuf, 8); } if (tid) { @@ -1429,6 +1437,9 @@ int r = 0; assert(db && tid); + + /* update the mmap so we can see the log entries we need to remove */ + update_lock(db, tid); /* look at the log entries we've written, and undo their effects */ while (tid->logstart != tid->logend) { @@ -1562,6 +1573,15 @@ return CYRUSDB_IOERROR; } + /* truncate it just in case! */ + r = ftruncate(db->fd, 0); + if (r < 0) { + syslog(LOG_ERR, "DBERROR: skiplist checkpoint %s: ftruncate %m", fname); + if (!locked) unlock(db); + db->fd = oldfd; + return CYRUSDB_IOERROR; + } + /* write dummy record */ if (!r) { int dsize = DUMMY_SIZE(db); @@ -2059,6 +2079,10 @@ db->fname); r = CYRUSDB_IOERROR; } + + /* set the map size back as well */ + db->map_size = offset; + break; }
Index: cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c =================================================================== --- cyrus-imapd-2.3.10.orig/lib/cyrusdb_skiplist.c 2007-11-12 08:18:12.000000000 -0500 +++ cyrus-imapd-2.3.10/lib/cyrusdb_skiplist.c 2007-11-12 08:44:53.000000000 -0500 @@ -1884,6 +1884,7 @@ int r = 0; time_t start = time(NULL); int i; + int need_checkpoint = 0; if (!(flags & RECOVERY_CALLER_LOCKED) && (r = write_lock(db, NULL)) < 0) { return r; @@ -1946,20 +1947,13 @@ /* reset the data that was written INORDER by the last checkpoint */ offset = DUMMY_OFFSET(db) + DUMMY_SIZE(db); - while (!r && (offset < (bit32) db->logstart)) { + while (!r && (offset < db->map_size) + && TYPE(db->map_base + offset) == INORDER) { ptr = db->map_base + offset; offsetnet = htonl(offset); db->listsize++; - /* make sure this is INORDER */ - if (TYPE(ptr) != INORDER) { - syslog(LOG_ERR, "DBERROR: skiplist recovery: %04X should be INORDER", - offset); - r = CYRUSDB_IOERROR; - continue; - } - /* xxx check \0 fill on key */ /* xxx check \0 fill on data */ @@ -2000,6 +1994,12 @@ } } + if (offset != db->logstart) { + syslog(LOG_NOTICE, "skiplist recovery %s: incorrect logstart %04X changed to %04X", + db->fname, db->logstart, offset); + db->logstart = offset; /* header will be committed later */ + } + /* zero out the remaining pointers */ if (!r) { for (i = 0; !r && i < db->maxlevel; i++) { @@ -2033,6 +2033,19 @@ db->fname, 0); ptr = db->map_base + offset; + + /* bugs in recovery truncates could have left some bogus zeros here */ + if (TYPE(ptr) == 0) { + int orig = offset; + while (TYPE(ptr) == 0 && offset < db->map_size) { + offset += 4; + ptr = db->map_base + offset; + } + syslog(LOG_ERR, "skiplist recovery %s: skipped %d bytes of zeros at %04X", + db->fname, offset - orig, orig); + need_checkpoint = 1; + } + offsetnet = htonl(offset); /* if this is a commit, we've processed everything in this txn */ @@ -2205,6 +2218,10 @@ db->map_size, diff, diff == 1 ? "" : "s"); } + if (!r && need_checkpoint) { + r = mycheckpoint(db, 1); + } + if(r || !(flags & RECOVERY_CALLER_LOCKED)) { unlock(db); }
Index: cyrus-imapd-2.3.10/imap/Makefile.in =================================================================== --- cyrus-imapd-2.3.10.orig/imap/Makefile.in 2007-11-12 05:14:03.000000000 -0500 +++ cyrus-imapd-2.3.10/imap/Makefile.in 2007-11-12 06:33:26.000000000 -0500 @@ -115,7 +115,7 @@ PROGS = imapd lmtpd pop3d \ fud smmapd reconstruct quota mbpath ipurge cyr_dbtool cyr_synclog \ - cyrdump chk_cyrus cvt_cyrusdb deliver ctl_mboxlist \ + cyrdump chk_cyrus cvt_cyrusdb deliver ctl_mboxlist hammer_skiplist \ ctl_deliver ctl_cyrusdb squatter mbexamine cyr_expire arbitron \ unexpunge @IMAP_PROGS@ @@ -261,6 +261,10 @@ $(CC) $(LDFLAGS) -o cyr_dbtool cyr_dbtool.o $(CLIOBJS) \ libimap.a $(DEPLIBS) $(LIBS) +hammer_skiplist: hammer_skiplist.o mutex_fake.o libimap.a $(DEPLIBS) + $(CC) $(LDFLAGS) -o hammer_skiplist hammer_skiplist.o $(CLIOBJS) \ + libimap.a $(DEPLIBS) $(LIBS) + cyr_synclog: cyr_synclog.o mutex_fake.o libimap.a $(DEPLIBS) $(CC) $(LDFLAGS) -o cyr_synclog cyr_synclog.o $(CLIOBJS) \ libimap.a $(DEPLIBS) $(LIBS) Index: cyrus-imapd-2.3.10/imap/hammer_skiplist.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ cyrus-imapd-2.3.10/imap/hammer_skiplist.c 2007-11-12 06:35:28.000000000 -0500 @@ -0,0 +1,210 @@ +/* hammer_skiplist.c - tool to harass a skiplist file + * + * Copyright (c) 1998-2003 Carnegie Mellon University. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * + * 3. The name "Carnegie Mellon University" must not be used to + * endorse or promote products derived from this software without + * prior written permission. For permission or any other legal + * details, please contact + * Office of Technology Transfer + * Carnegie Mellon University + * 5000 Forbes Avenue + * Pittsburgh, PA 15213-3890 + * (412) 268-4387, fax: (412) 268-7395 + * tech-transfer@xxxxxxxxxxxxxx + * + * 4. Redistributions of any form whatsoever must retain the following + * acknowledgment: + * "This product includes software developed by Computing Services + * at Carnegie Mellon University (http://www.cmu.edu/computing/)." + * + * CARNEGIE MELLON UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO + * THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS, IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY BE LIABLE + * FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN + * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING + * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + * + */ +/* + * $Id: hammer_skiplist.c,v 1.4 2007/09/28 02:27:46 murch Exp $ + */ + +#include <config.h> + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif +#include <errno.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/uio.h> +#include <fcntl.h> +#include <ctype.h> +#include <syslog.h> + +#include <sys/ipc.h> +#include <sys/msg.h> + +#include "acl.h" +#include "assert.h" +#include "auth.h" +#include "cyrusdb.h" +#include "exitcodes.h" +#include "glob.h" +#include "imap_err.h" +#include "global.h" +#include "mailbox.h" +#include "util.h" +#include "xmalloc.h" + +const int config_need_data = 0; +struct cyrusdb_backend *DB_OLD = NULL; + +void hammer(struct db *db) +{ + int c; + for (c = 0;c < 10000; c++) { /* should be enough! */ + struct txn *tid = NULL; + struct txn **tp; + char key[100]; + char value[100]; + int klen, vlen, i, r = 0; + int rop; + + /* protect against silly things */ + + tp = (rand() % 2) ? &tid : NULL; + klen = 1 + (rand() % 6); + for (i = 0; i < klen; i++) { + key[i] = 'A' + (rand() % 26); + } + key[klen] = '\0'; + vlen = rand() % 20; + for (i = 0; i < vlen; i++) { + value[i] = 'a' + (rand() % 26); + } + value[vlen] = '\0'; + rop = rand() % 1000; + if (rop >= 999) { + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + /* forget to commit */ + } + else if (rop >= 800) { + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + if (!r) r = DB_OLD->delete(db, key, klen, tp, 0); + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + if (!r && tp) DB_OLD->commit(db, *tp); + } + else if (rop >= 700) { + if (!r) r = DB_OLD->delete(db, key, klen, tp, 0); + if (!r && tp) DB_OLD->commit(db, *tp); + } + else if (rop >= 600) { /* will fail */ + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + } + else if (rop > 200) { + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + key[klen-1] = 'a'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'b'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'd'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'c'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + if (!r && tp) DB_OLD->commit(db, *tp); + } + else { + if (!r) r = DB_OLD->store(db, key, klen, value, vlen, tp); + key[klen-1] = 'a'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'b'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'd'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + key[klen-1] = 'c'; + if (!r) r = DB_OLD->create(db, key, klen, value, vlen, tp); + if (!r && tp) DB_OLD->abort(db, *tp); + } + + } + + +} + +int main(int argc, char *argv[]) +{ + struct db *odb; + const char *old_db; + int opt, i, r; + char *alt_config = NULL; + int db_flags = 0; + + while ((opt = getopt(argc, argv, "C:n")) != EOF) { + switch (opt) { + case 'C': /* alt config file */ + alt_config = optarg; + break; + case 'n': /* create new */ + db_flags |= CYRUSDB_CREATE; + break; + } + } + + if((argc - optind) < 1) { + fprintf(stderr, "Usage: %s [-C altconfig] <dbfile>\n", argv[0]); + + exit(-1); + } + + old_db = argv[optind]; + + if(old_db[0] != '/') { + printf("\nSorry, you cannot use this tool with relative path names.\n" + "This is because some database backends (mainly berkeley) do not\n" + "always do what you would expect with them.\n" + "\nPlease use absolute pathnames instead.\n\n"); + exit(EC_OSERR); + } + + for(i=0; cyrusdb_backends[i]; i++) { + if(!strcmp(cyrusdb_backends[i]->name, "skiplist")) { + DB_OLD = cyrusdb_backends[i]; break; + } + } + if(!cyrusdb_backends[i]) { + fatal("unknown backend", EC_TEMPFAIL); + } + + cyrus_init(alt_config, "hammer_skiplist", 0); + + + r = (DB_OLD->open)(old_db, db_flags, &odb); + if(r != CYRUSDB_OK) + fatal("can't open database", EC_TEMPFAIL); + + hammer(odb); + + cyrus_done(); + + return 0; +} +
---- Cyrus Home Page: http://cyrusimap.web.cmu.edu/ Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html