On Wed, 11 Jun 2008 15:07:02 +0200, "Rudy Gevaert" <Rudy.Gevaert@xxxxxxxx> said:
> Bron Gondwana wrote:
>
> > Try a 2.6.20 kernel, just for an interesting datapoint.  We changed
> > back to 2.6.20 (64 bit still) and haven't seen a corrupted seen file
> > since.
>
> I hope to try that still today.
>
> I'm now running on 2.6.24-2, 32bit.  I have cleaned up the users that
> were having a corrupted mailbox on replica.  Surprisingly I can count
> them on both hands.
>
> So now I'm again running with rolling replication and I'm doing a
> sync_client session for each user.  When that is finished I'll try to
> downgrade the kernel.
>
> Btw, I tested my sarge -> etch upgrade in a xen virtual machine, 64bit
> kernel + 32 bit userspace.  But this was 2.6.18.
>
> I'm still wondering if I should run 2.6.20 in 32bit or 64bit...

It's been fine for us as 64bit for a while now.  Though note - 64bit will
allow lots more process space, which allows broken cache files to REALLY
SCREW WITH YOU.  Bah.  We have 4GB core dumps being written into our cores
directory - and let me tell you, while something is dumping core it uses
some trick which totally nukes all other IO on the same device.  It gets
ioniced up there really happy.  Ouch.

The cause - mailbox_cache_size hits a bogus "length" field and returns
like 1.7GB as the size of the record.  This then causes an xrealloc to
"size * 2", or 3.4GB.  At least in the case of one mailbox that's been
causing us fun.

In a second I'll gdb that awfully large core and figure out which mailbox
is the culprit.

One reconstruct later....

> >>> Oh - can you tell me.  Did the file checkpoint sometime not too long before it
> >>> got corrupted?
> >> The cases I saw it did.
> >
> > Ditto here.  Interesting.  They also had quite long records, but
> > I don't know how common that is.  Lots of little bits of seen
> > spread around the space.
>
> I'm not sure how I would see that?  I'm not familiar with the internals
> of skiplist.

I find they show up pretty well as ^@^@^@^@^@^@ in less.  The skiplist
format doesn't have many all-zero blocks otherwise.  Lots of other special
characters show up for binary bits.

Sadly, I can pretty much read a hexdump of a skiplist.  Sad because that's
a lot of braincells that could be doing something useful like absorbing
alcohol.

I've written a little patch for the mailbox_cache_size issue that returns 0
if the result ever looks like it's going negative or more than 100 million
bytes.  Then sync_support is patched to treat a zero cache size as "say we
failed to reserve this message".  It will do for now...

Bron ( also found a theoretical bug in the skiplist code and patched it
       today, but I might fix the whole function before I submit it
       upstream.  I say theoretical because I don't see that the codepath
       gets exercised unless you already have a corrupt file, so meh )
-- 
  Bron Gondwana
  brong@xxxxxxxxxxx
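
For anyone who would rather not page through less looking for ^@ runs, here is a rough sketch of the same check done mechanically. It is not part of Cyrus: the function name `find_zero_runs` and the default threshold of 6 bytes are made up for illustration, and since a healthy skiplist file can contain a few short zero runs, the threshold probably needs tuning before the output means much.

```python
import re
import sys


def find_zero_runs(data: bytes, min_len: int = 6):
    """Return (offset, length) for every run of at least min_len NUL bytes.

    min_len is an arbitrary illustrative threshold, not a value taken
    from the skiplist format.
    """
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    # Usage: python zero_runs.py /path/to/user.seen [more files...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            contents = fh.read()
        for offset, length in find_zero_runs(contents):
            print(f"{path}: {length} zero bytes at offset {offset}")
```

Run it against a copy of a suspect seen file. It only flags candidates to look at by hand; a reported run is not proof of corruption.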