So we've discussed this at length in #cyrus on freenode and concluded that the issue is that twoskip is doing far too many munmap/mmap calls during an unlocked foreach (which is what the LIST command uses). I've filed a bug: https://github.com/cyrusimap/cyrus-imapd/issues/5

I'm now looking at what it would take to fix this behaviour in twoskip (yes, I know I should be sleeping, but I'm not going to be able to sleep until I understand how this got broken!). A rough sketch of the costly pattern is at the end of this mail.

Bron.

On Fri, Jul 15, 2016, at 20:41, Hynek Schlawack via Info-cyrus wrote:
> Hello,
>
> we've updated one of our Cyrus IMAP backends from 2.4 to 2.5.8 on FreeBSD 10.3 with ZFS, and now we have an operational emergency.
>
> Cyrus IMAPd starts fine and keeps working for about 5 to 20 minutes (rather sluggishly, though). At some point the server load starts growing and eventually explodes, until we have to restart the IMAP daemons, which buys us another 5 to 20 minutes.
>
> It doesn't really matter whether we run `reconstruct` in the background or not.
>
>
> # Observations:
>
> 1. While healthy, the imapd daemons' states are mostly `select` or `RUN`. Once things get critical, they are mostly in `zfs` (but do occasionally switch).
> 2. Customers report that their mail clients are downloading all e-mails again. That's obviously extra bad given that we already seem to be running into some kind of I/O problem. Running `truss` on busy imapd processes seems to confirm that.
> 3. Once hell breaks loose, I/O collapses even on other file systems/hard disks.
> 4. `top` shows processes in the `lock` state, sometimes more than 200 of them. That's nothing we see on our other backends.
> 5. There seems to be a correlation between processes hanging in the `zfs` state and `truss` showing them accessing mailboxes.db. We don't know whether it's related, but soon after the upgrade mailboxes.db broke and we had to reconstruct it.
>
>
> # Additional key data:
>
> - 25,000 accounts
> - 4.5 TB of data
> - 64 GB RAM, no apparent swapping
> - 16 CPU cores
> - nginx in front of it
>
> ## zpool iostat 5
>
>                capacity     operations    bandwidth
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> tank        4.52T   697G    144  2.03K  1.87M  84.2M
> tank        4.52T   697G     84    730  2.13M  3.94M
> tank        4.52T   697G    106    904  2.78M  4.52M
> tank        4.52T   697G    115    917  3.07M  5.11M
> tank        4.52T   697G    101   1016  4.04M  5.06M
> tank        4.52T   697G    124  1.03K  3.27M  6.59M
>
> This doesn't look unusual.
>
> The data used to live on HDDs and worked fine with an SSD ZIL. After the upgrade and the ensuing problems we tried a Hail Mary and replaced the HDDs with SSDs (migrating a ZFS snapshot for that), to no avail.
>
> So we do *not* believe this is really a traditional I/O bottleneck, since it only started *after* the upgrade to 2.5 and did not go away when we added SSDs. The change notes led us to believe that there shouldn't be an I/O storm from mailbox conversions, but is that true in every case? How could we double-check? Observation #2 above leads us to believe that there are in fact some metadata problems. We're reconstructing in the background, but that will take days, which is sadly time we don't really have.
>
> ## procstat -w 1 of an active imapd
>
>   PID  PPID  PGID   SID TSID THR LOGIN WCHAN     EMUL          COMM
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  -         FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  -         FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  -         FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  *vm objec FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  -         FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  zfs       FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  -         FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
> 45016 43150 43150 43150    0   1 toor  select    FreeBSD ELF64 imapd
>
>
> Has anyone had similar problems (and, ideally, solved them)?
>
> Are there any known incompatibilities between Cyrus 2.5.8 and FreeBSD/ZFS?
>
> Has anyone ever successfully downgraded from 2.5.8 back to 2.4?
>
> Do we have any other options?
>
> Any help would be *very much* appreciated!
>
> —h

--
Bron Gondwana
brong@xxxxxxxxxxx

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
To Unsubscribe:
https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
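
P.S. Below is the rough sketch I mentioned above. It is NOT the actual
twoskip code (struct db, db_refresh_map() and the foreach_* functions
are all hypothetical stand-ins); it only illustrates the pattern: an
unlocked foreach that rebuilds its mapping for every record pays a
munmap/mmap pair per record, while mapping once per iteration pays a
single pair in total. On FreeBSD/ZFS each remap also means page-mapping
work against the ARC, which would be consistent with the `zfs` and
`*vm objec` wait channels in the procstat output above.

    /*
     * Sketch only -- NOT the actual twoskip implementation. All names
     * here (struct db, db_refresh_map, foreach_*) are hypothetical.
     * Run it against any non-empty file and count the mmap/munmap
     * calls with truss(1) to see the difference between the patterns.
     */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    struct db {
        int fd;       /* open file descriptor for the database file */
        char *base;   /* current read-only mapping, or NULL */
        size_t len;   /* length of the current mapping */
    };

    /* Hypothetical: drop and rebuild the mapping so an unlocked reader
     * sees any growth from concurrent writers. Two mapping syscalls
     * (munmap + mmap) every time it is called. */
    static int db_refresh_map(struct db *db)
    {
        struct stat sb;

        if (db->base) munmap(db->base, db->len);
        if (fstat(db->fd, &sb) < 0 || sb.st_size == 0) return -1;
        db->len = (size_t)sb.st_size;
        db->base = mmap(NULL, db->len, PROT_READ, MAP_SHARED, db->fd, 0);
        if (db->base == MAP_FAILED) { db->base = NULL; return -1; }
        return 0;
    }

    /* Costly pattern: refresh the mapping once per record, i.e. a
     * munmap/mmap pair for every record the foreach visits. */
    static int foreach_remap_each(struct db *db, size_t nrecords)
    {
        for (size_t i = 0; i < nrecords; i++) {
            if (db_refresh_map(db) < 0) return -1;
            /* ... read record i from db->base ... */
        }
        return 0;
    }

    /* Cheap pattern: map once up front; one munmap/mmap pair per
     * iteration no matter how many records there are. */
    static int foreach_map_once(struct db *db, size_t nrecords)
    {
        if (db_refresh_map(db) < 0) return -1;
        for (size_t i = 0; i < nrecords; i++) {
            /* ... read record i from db->base ... */
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        struct db db = { -1, NULL, 0 };

        if (argc < 2 || (db.fd = open(argv[1], O_RDONLY)) < 0) return 1;
        foreach_remap_each(&db, 10000); /* ~20000 mapping syscalls */
        foreach_map_once(&db, 10000);   /* 2 more */
        return 0;
    }

Presumably any fix will amount to moving twoskip's unlocked foreach
from something like the first pattern towards something like the
second, revalidating the mapping only when the file has actually
changed.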