Statistics on message sizes, folders, etc

Bron Gondwana <brong@xxxxxxxxxxx> · Wed, 30 Sep 2009 14:15:18 +1000

Well, I figure this is interesting anyway - rough
statistics on how many folders users have, how
big their messages are, how much "churn" the folders
have.

Source code attached (including a copy of 
IndexFile.pm which it requires to read the index
files)

I've tried to make it compatible with different hashing
layouts and mailbox setups, but no guarantees, it's
pretty untested!

Here's a dump for the store that my mailbox is on.  300Gb
total mail storage space, sitting at about 85% full.  Note
that the output is totally unsorted because I'm lazy.

Some notes:

* we autocreate INBOX, INBOX.Drafts, INBOX.Trash and
  'INBOX.Sent Items' - hence the huge '4' spike in
  folders per user.

* lots of empty folders, but they don't explain all
  the 'NONE' for gappyness, because it checks for a
  gap between the end and 'LastUid' as well, so
  they'll actually end up with 'ALL' if there was ever
  a message in the folder.

* don't ask me what that 12 byte "message" is.
  Obviously some imap client puked it into a folder.
  I'm not going searching for it!

* if you want different buckets, the format is pretty
  simple, and the specification is at the top of the
  file!  Enjoy.

Bron ( yes, I'd love to see some other people's stats.
       Thanks for asking.  I'm planning to feed these
       stats into my massively realistic IMAP server
       stresstester... )

STATS for /etc/imapd-slots10a1p3-master.conf
Partitions: 1
Users: 3237
Folders: 25619

Folders per user:
 10 - 30              273
 100 - 300            19
 3                    1
 30 - 100             70
 300 - 1000           1
 4                    1974
 5                    491
 6                    190
 7                    89
 8                    71
 9                    57
 > 1000               1

Messages per folder:
 0                    8856
 1                    2055
 1,000 - 3,000        343
 10 - 30              3368
 10,000 - 30,000      34
 100 - 300            1548
 2                    1476
 20,000 - 100,000     15
 3                    958
 3,000 - 10,000       142
 30 - 100             2885
 300 - 1,000          853
 4                    759
 5                    595
 6                    535
 7                    444
 8                    421
 9                    330
 > 100,000            2

Message sizes:
 1 - 3 KB             1094127
 1 - 3 MB             22340
 10 - 30 KB           1073445
 100 - 300 B          40
 100 - 300 KB         89560
 12                   1
 3 - 10 KB            1555633
 3 - 10 MB            13849
 30 - 100 KB          307347
 300 B - 1 KB         124574
 300 KB - 1 MB        46155
 > 10 MB              2335

Ratio of gaps between UIDs: (expunge tracking)
 ALL                  4143
 HIGH                 3494
 LOW                  3399
 MEDIUM               2056
 NONE                 12527

#!/usr/bin/perl -c

package Cyrus::IndexFile;

use strict;
use warnings;

use IO::File;
use IO::Handle;
use String::CRC32 qw(crc32);

=pod

=head1 NAME

Cyrus::IndexFile - A pure perl interface to the "cyrus.index" file
format as generated by Cyrus IMAPd.

=head1 EXAMPLES

 use Cyrus::IndexFile;

 # Note: requires IO::File::fcntl module installed for locking support
 my $index = Cyrus::IndexFile->new_file("$path/cyrus.index", ['lock_ex', 5]);

 print "EXISTS: " . $index->header('Exists') . "\n";
 while (my $record = $index->next_record_hash()) {
   print "$record->{Uid}: $record->{MessageGuid} $record->{Size}\n";
 }

=head1 SUPPORTED FORMAT VERSIONS

 Definitions:
 ============

 * int32 4 - 32 bit value taking 4 octets on disk.  Visible in perl as an integer
 * int64 8 - 64 bit value taking 8 octets on disk.  Visible in perl as an integer
 * time_t 4 - same as int32
 * bitmap N - a bitmap taking up N octets on disk.  Visible in perl as a string of 1s and 0s.
 * hex N - a big value taking up N octets on disk.  Visible in perl as a hexadecimal string (0-9a-f)

 These values can be referenced by name using the hash API, or by index using the array API.
 You can also use the 'raw' API to get the record in on-disk format.

 All numbers are in network byte order as per Cyrus standard encoding.  Bitmap and hex values are
 layed out as octets on disk and encoded directly in order.

 Version 9:
 ==========

 Header:
  0: Generation            int32  4
  1: Format                int32  4
  2: MinorVersion          int32  4
  3: StartOffset           int32  4
  4: RecordSize            int32  4
  5: Exists                int32  4
  6: LastAppenddate        time_t 4
  7: LastUid               int32  4
  8: QuotaUsed             int64  8
  9: Pop3LastLogin         time_t 4
 10: UidValidity           int32  4
 11: Deleted               int32  4
 12: Answered              int32  4
 13: Flagged               int32  4
 14: Options               bitmap 4
 15: LeakedCache           int32  4
 16: HighestModseq         int64  8
 17: Spare0                int32  4
 18: Spare1                int32  4
 19: Spare2                int32  4
 20: Spare3                int32  4
 21: Spare4                int32  4

 Record:
  0: Uid                   int32  4
  1: InternalDate          time_t 4
  2: SentDate              time_t 4
  3: Size                  int32  4
  4: HeaderSize            int32  4
  5: ContentOffset         int32  4
  6: CacheOffset           int32  4
  7: LastUpdated           time_t 4
  8: SystemFlags           bitmap 4
  9: UserFlags             bitmap 16
 10: ContentLines          int32  4
 11: CacheVersion          int32  4
 12: MessageUuid           hex    12
 13: Modseq                int64  8

 Version 10:
 ===========

 Header:
  0: Generation            int32  4
  1: Format                int32  4
  2: MinorVersion          int32  4
  3: StartOffset           int32  4
  4: RecordSize            int32  4
  5: Exists                int32  4
  6: LastAppenddate        time_t 4
  7: LastUid               int32  4
  8: QuotaUsed             int64  8
  9: Pop3LastLogin         time_t 4
 10: UidValidity           int32  4
 11: Deleted               int32  4
 12: Answered              int32  4
 13: Flagged               int32  4
 14: Options               bitmap 4
 15: LeakedCache           int32  4
 16: HighestModseq         int64  8
 17: Spare0                int32  4
 18: Spare1                int32  4
 19: Spare2                int32  4
 20: Spare3                int32  4
 21: Spare4                int32  4

 Record:
  0: Uid                   int32  4
  1: InternalDate          time_t 4
  2: SentDate              time_t 4
  3: Size                  int32  4
  4: HeaderSize            int32  4
  5: ContentOffset         int32  4
  6: CacheOffset           int32  4
  7: LastUpdated           time_t 4
  8: SystemFlags           bitmap 4
  9: UserFlags             bitmap 16
 10: ContentLines          int32  4
 11: CacheVersion          int32  4
 12: MessageGuid           hex    20
 13: Modseq                int64  8

=cut

# Set up header and record formatting information {{{

my $VersionFormats = {
  9 => {
    HeaderSize => 96,
    _make_fields('Header',<<EOF),
Generation            int32  4
Format                int32  4
MinorVersion          int32  4
StartOffset           int32  4
RecordSize            int32  4
Exists                int32  4
LastAppenddate        time_t 4
LastUid               int32  4
QuotaUsed             int64  8
Pop3LastLogin         time_t 4
UidValidity           int32  4
Deleted               int32  4
Answered              int32  4
Flagged               int32  4
Options               bitmap 4
LeakedCache           int32  4
HighestModseq         int64  8
Spare0                int32  4
Spare1                int32  4
Spare2                int32  4
Spare3                int32  4
Spare4                int32  4
EOF
    RecordSize => 80, # defined in file too, check it!
    _make_fields('Record', <<EOF),
Uid                   int32  4
InternalDate          time_t 4
SentDate              time_t 4
Size                  int32  4
HeaderSize            int32  4
ContentOffset         int32  4
CacheOffset           int32  4
LastUpdated           time_t 4
SystemFlags           bitmap 4
UserFlags             bitmap 16
ContentLines          int32  4
CacheVersion          int32  4
MessageUuid           hex    12
Modseq                int64  8
EOF
  },
  10 => {
    HeaderSize => 96,
    _make_fields('Header',<<EOF),
Generation            int32  4
Format                int32  4
MinorVersion          int32  4
StartOffset           int32  4
RecordSize            int32  4
Exists                int32  4
LastAppenddate        time_t 4
LastUid               int32  4
QuotaUsed             int64  8
Pop3LastLogin         time_t 4
UidValidity           int32  4
Deleted               int32  4
Answered              int32  4
Flagged               int32  4
Options               bitmap 4
LeakedCache           int32  4
HighestModseq         int64  8
Spare0                int32  4
Spare1                int32  4
Spare2                int32  4
Spare3                int32  4
Spare4                int32  4
EOF
    RecordSize => 88, # defined in file too, check it!
    _make_fields('Record', <<EOF),
Uid                   int32  4
InternalDate          time_t 4
SentDate              time_t 4
Size                  int32  4
HeaderSize            int32  4
ContentOffset         int32  4
CacheOffset           int32  4
LastUpdated           time_t 4
SystemFlags           bitmap 4
UserFlags             bitmap 16
ContentLines          int32  4
CacheVersion          int32  4
MessageGuid           hex    20
Modseq                int64  8
EOF
  },
  11 => {
    HeaderSize => 96,
    _make_fields('Header',<<EOF),
Generation            int32  4
Format                int32  4
MinorVersion          int32  4
StartOffset           int32  4
RecordSize            int32  4
Exists                int32  4
LastAppenddate        time_t 4
LastUid               int32  4
QuotaUsed             int64  8
Pop3LastLogin         time_t 4
UidValidity           int32  4
Deleted               int32  4
Answered              int32  4
Flagged               int32  4
Options               bitmap 4
LeakedCache           int32  4
HighestModseq         int64  8
Spare0                int32  4
Spare1                int32  4
Spare2                int32  4
Spare3                int32  4
HeaderCrc             int32  4
EOF
    RecordSize => 96, # defined in file too, check it!
    _make_fields('Record', <<EOF),
Uid                   int32  4
InternalDate          time_t 4
SentDate              time_t 4
Size                  int32  4
HeaderSize            int32  4
ContentOffset         int32  4
CacheOffset           int32  4
LastUpdated           time_t 4
SystemFlags           bitmap 4
UserFlags             bitmap 16
ContentLines          int32  4
CacheVersion          int32  4
MessageGuid           hex    20
Modseq                int64  8
CacheCrc              int32  4
RecordCrc             int32  4
EOF
  },
};

# parse our the plaintext field definitions into a useful datastructure
sub _make_fields {
  my $prefix = shift;
  my $string = shift;

  my @lines = grep { m/\S/ } split /\n/, $string;

  my @names;
  my @items;
  my @packitems;
  my $Pos = 0;
  my $Num = 0;
  foreach my $line (@lines) {
    my ($Name, $Type, $Size) = split /\s+/, $line;

    push @names, $Name;
    push @items, [$Name, $Type, $Size, $Num, $Pos];
    push @packitems, _make_pack($Type, $Size);

    $Pos += $Size;
    $Num++;
  }

  return (
    $prefix . 'Names' => { map { $names[$_] => $_ } 0..$#names },
    $prefix . 'Fields' => \@items,
    $prefix . 'Pack' => join("", @packitems),
  );
}

# build the pack/unpack expression for a single field
sub _make_pack {
  my $format = shift;
  my $size = shift;
  if ($format eq 'int32' or $format eq 'time_t') {
    return 'N';
  }
  elsif ($format eq 'int64') { # ignore start..
    return 'x[N]N';
  }
  elsif ($format eq 'bitmap') {
    return 'B' . (8 * $size);
  }
  elsif ($format eq 'hex') {
    return 'H' . (2 * $size);
  }
}

# end format definitions
# }}}

=head1 PUBLIC API

=item Cyrus::IndexFile->new($fh)

Build a new Cyrus::IndexFile object from a filehandle.  The handle is not 
required to be seekable, so make sure you have rewound it before use.

 seek($fh, 0, 0);
 my $index = Cyrus::IndexFile->new($fh);

This function reads the header from the file and returns a Cyrus::IndexFile
object.  The filehandle will be pointing at the start of the first record.

If there is a problem, then the position of the filehandle is undefined 
(though probably either at 12 bytes or the end of the header) and the
function will "die".

Causes of death:

 * unable to read a full header's length of data from the file
 * version of the file is not one of the supported versions

=cut

sub new {
  my $class = shift;
  my $handle = shift;

  my $buf;

  # read initial header information to determine version
  my $read = sysread($handle, $buf, 12);
  die "Unable to read header information\n" 
    unless $read == 12;

  # version is always at this offset!
  my $version = unpack('N', substr($buf, 8));

  # check that it's a supported version
  my $frm = $VersionFormats->{$version}
    || die "Unknown header format $version\n";

  # read the rest of the header (length depends on version)
  sysread($handle, $buf, $frm->{HeaderSize} - 12, 12);
  my $Self = bless { 
    @_,
    version => $version, 
    handle => $handle,
    format => $frm,
    rawheader => $buf,
    msgno => 0,
  }, ref($class) || $class;

  $Self->{header} = $Self->_header_b2h($buf);
  die "Unable to parse header" unless $Self->{header};

  return $Self;
}

=item Cyrus::IndexFile->new_file($filename, $lockopts)

Open the file to read, optionally locking it with IO::File::fcntl.  If you
pass a scalar for lockopts then it will be locked with ['lock_ex'], otherwise
you can pass a tuple, e.g. ['lock_ex', 5] for a 5 second timeout.

This function will die if it can't open or lock the file.  On success, it
calls $class->new() with the filehandle.

=cut

sub new_file {
  my $class = shift;
  my $filename = shift;
  my $lockopts = shift;

  my $fh;
  if ($lockopts) {
    require 'IO/File/fcntl.pm' || die "can't lock without IO::File::fcntl module";
    $lockopts = ['lock_ex'] unless ref($lockopts) eq 'ARRAY';
    $fh = IO::File::fcntl->new($filename, '+<', @$lockopts)
          || die "Can't open $filename for locked read: $!";
  } else {
    $fh = IO::File->new("< $filename") 
          || die "Can't open $filename for read: $!";
  }

  return $class->new($fh, @_);
}

=item Cyrus::IndexFile->new_empty($version)

Create a new empty index file with the specified version.  This is
useful when you want to generate a new index file, as you can use
the write_record function and set header fields on the new object.

=cut

sub new_empty {
  my $class = shift;
  my $version = shift;

  # check that the version is supported
  my $frm = $VersionFormats->{$version} 
    || die "unknown version $version";

  my $Self = bless { 
    @_,
    version => $version,
    format => $frm,
  }, ref($class) || $class;

  return $Self;
}

=item $index->stream_copy($outfh, $decider, %Opts)

Currently broken!  Supposed to copy this file into the output filehandle.

NOTE: outfh must be seekable, as we write an initial header record with
Exists == 0, then update the header at the end with a new Exists and a
new LastUpdated.

=cut

sub stream_copy {
  my $Self = shift;
  my $outfh = shift;
  my $decide = shift;
  my %Opts = @_;

  my $out = $Self->new_empty($Opts{version} || $Self->{version});

  my $newheader = $Self->header_copy();
  if ($Opts{headerfields}) {
    foreach my $field (keys %{$Opts{headerfields}}) {
      $newheader->{$field} = $Opts{headerfields}{$field};
    }
  }

  # initially empty
  $newheader->{Exists} = 0;
  # Important!  Otherwise you get versions out of skew!
  $newheader->{MinorVersion} = $out->{version};
  $newheader->{RecordSize} = $out->{format}{RecordSize};
  $out->write_header($outfh, $newheader);

  $Self->reset();
  while (my $record = $Self->next_record()) {
    if ($decide->($newheader, $record)) {
      $newheader->{Exists}++;
      $out->write_record($outfh, $record);
    }
  }

  # update exists and last updated
  $newheader->{LastUpdated} = time();
  sysseek($outfh, 0, 0);
  $out->write_header($outfh, $newheader);
}

=item $index->header()

=item $index->header_hash()

Returns a hash reference of the entire header

=item $index->header($field)

Returns just the single named field from the header.  Dies if there is no
field with that name in the header.

=item $index->header_array($field)

Returns an array reference with the values in the order given in the version
information above.

=item $index->header_raw()

Returns the raw packed header as it is on disk.

=cut

sub header {
  my $Self = shift;
  my $Field = shift;

  if ($Field) {
    die "No such header field $Field\n" unless exists $Self->{header}{$Field};
    return $Self->{header}{$Field};
  }

  return $Self->{header};
}

sub header_array {
  my $Self = shift;
  return $Self->_header_h2a($Self->{header});
}

sub header_hash {
  my $Self = shift;
  return $Self->{header};
}

sub header_raw {
  my $Self = shift;
  return $Self->{rawheader};
}

=item $index->header_copy()

Returns a hashref the same as header_hash, but it's "non live", so you can
make destructive changes without affecting the original.

=cut

sub header_copy {
  my $Self = shift;
  my $orig = $Self->{header};
  return { %$orig };
}

=item $index->reset($num)

Deletes the cached 'current record' and seeks back to the given record 
number, or the end of the header (record 0) if no number given.

Requires the input filehandle to be seekable.

=cut

sub reset {
  my $Self = shift;
  my $num = shift || 0;

  die "Invalid record $num (must be >= 0 and <= $Self->{header}{Exists}" 
    unless ($num >= 0 and $num <= $Self->{header}{Exists});

  my $HeaderSize = $Self->{format}{HeaderSize};
  my $RecordSize = $Self->{format}{RecordSize};

  sysseek($Self->{handle}, $HeaderSize + ($num * $RecordSize), 0)
    || die "unable to seek on this filehandle";

  $Self->{msgno} = $num;

  delete $Self->{record};
  delete $Self->{rawrecord};
  delete $Self->{checksum_failure};
}

=item $index->next_record()
=item $index->next_record_hash()

Read the next record from the file and parse it in to a hash reference
per the format of the index file.

This works even on non-seekable files.

Returns undef when there are no more records (until you call reset)

=item $index->next_record_array()

As above, but returns the array in the format order.

More efficient, as the hash doesn't need to be created.

=item $index->next_record_raw()

Returns the raw bytes of the index file.  Most efficient, as no unpacking
is done, but then you have to deal with all the version checking and
offsets yourself.

=cut

sub next_record {
  my $Self = shift;
  $Self->next_record_raw();
  return $Self->record(@_);
}

sub next_record_hash {
  my $Self = shift;
  $Self->next_record_raw();
  return $Self->record_hash(@_);
}

sub next_record_array {
  my $Self = shift;
  $Self->next_record_raw();
  return $Self->record_array(@_);
}

sub next_record_raw {
  my $Self = shift;

  delete $Self->{record};
  delete $Self->{checksum_failure};

  # use direct access for speed
  my $Exists = $Self->{header}{Exists};
  my $RecordSize = $Self->{header}{RecordSize};

  return undef unless $RecordSize;

  if ($Self->{msgno} < $Exists) {
    my $res = sysread($Self->{handle}, $Self->{rawrecord}, $RecordSize);
    die "Failed to read entire record" unless $RecordSize == $res;
    # rewrite if passed so save the allocation cost
    $Self->{msgno}++;
    return $Self->{rawrecord};
  }
  else {
    delete $Self->{rawrecord};
    return undef; # no more records!
  }
}

=item $index->record()
=item $index->record_hash()

Returns the "current" record, i.e. the last record returned by
next_record_*() as a hash reference.

Returns undef if there is no current record (either next_record has never
been called, reset has just been called, or the file is finished)

=item $index->record_array()

As above, but return the version-dependant arrayref or undef

=item $index->record_raw()

As above, but return just the raw record bytes as a string or undef

=item $index->record($field)

If a field name is given, return that field only from the record, or die if
it doesn't exist in this version.

Returns undef if there is no current record.  No legitimate field ever
returns undef, because there's no such concept in the datastructure.

=cut

sub record {
  my $Self = shift;
  my $Field = shift;

  my $record = $Self->record_hash();
  return undef unless $record;

  if ($Field) {
    die "No such record field $Field\n" unless exists $record->{$Field};
    return $record->{$Field};
  }

  return $record;
}

sub record_hash {
  my $Self = shift;
  unless (exists $Self->{record}{hash}) {
    $Self->{record}{hash} = $Self->_record_a2h($Self->record_array(@_));
  }
  return $Self->{record}{hash};
}

sub record_array {
  my $Self = shift;
  unless (exists $Self->{record}{array}) {
    $Self->{record}{array} = $Self->_record_b2a($Self->{rawrecord});
  }
  return $Self->{record}{array};
}

sub record_raw {
   my $Self = shift;
   return $Self->{rawrecord};
}

=item $index->field_number($Field)

Return the field number in a record array for the named field, or die 
if there isn't one.

=cut

sub field_number {
  my $Self = shift;
  my $Field = shift;
  my $names = $Self->{format}{RecordNames};
  die "No such record field $Field\n" unless exists $names->{$Field};
  return $names->{$Field};
}

=item $index->write_header($fh, $header)

Writes a header to $fh - you need to make sure it's seeked to the start (can be used on a non-seekable filehandle)

$header can be in array, hash or buffer format

=cut

sub write_header {
  my $Self = shift;
  my $fh = shift;
  my $header = shift;

  my $buf = $Self->_make_header($header);

  syswrite($fh, $buf);
}

=item $index->append_record($record)

Appends the record (can be hash, array or buf) to the current file.  Needs the filehandle to be seekable.  Uses "Exists" from the header to find the position, so don't mess it up!

Also seeks back to the header and rewrites it with exists incremented by one.

=cut

sub append_record {
  my $Self = shift;
  my $record = shift;

  my $Exists = $Self->header('Exists');

  $Self->reset($Exists);
  $Self->write_record($Self->{handle}, $record);

  # extend the header:
  # XXX - sysflags
  my $header = $Self->header();
  $header->{Exists}++;
  $Self->rewrite_header($header);
}

sub rewrite_header {
  my $Self = shift;
  my $header = shift || $Self->header();

  sysseek($Self->{handle}, 0, 0);
  $Self->write_header($Self->{handle}, $header);

  $Self->reset(); # remove any cache and update the seek pointer
}

=item $index->rewrite_record($record, $num)

Rewrite the record at position given by $num with the record (hash, array or buf) passed.

=cut

sub rewrite_record {
  my $Self = shift;
  my $record = shift;
  my $num = @_ ? shift : ($Self->{msgno} - 1);

  $Self->reset($num);

  $Self->write_record($Self->{handle}, $record);

  $Self->{msgno}++;
}

=item $index->write_record($fh, $record, $num)

Write the record to the new filehandle $fh.  If $num is not given then it doesn't need to be seekable.

XXX - $num support not done yet
=cut

sub write_record {
  my $Self = shift;
  my $fh = shift;
  my $record = shift;
  my $num = shift; # XXX - seek?

  my $buf = $Self->_make_record($record);

  syswrite($fh, $buf);
}

=item $index->merge_indexes($target, @extras)

XXX - broken anyway.  The purpose of this function is to allow multiple index files to combined into one (say, an expunged file and an index file)

=cut

sub merge_indexes {
  my $Self = shift;
  my $target = shift;
  my @extras = shift;

  # copy the current header first
  my $targetpos = tell($target);
  my $header = $Self->header();
  # reset some stuff
  $header->{Exists} = 0;
  $header->{LastAppenddate} = 0;
  $header->{LastUid} = 0;
  $header->{QuotaUsed} = 0;
  $header->{Deleted} = 0;
  $header->{Answered} = 0;
  $header->{Flagged} = 0;
  $header->{HighestModseq} = 0;
  $Self->write_header($target, $header);

  my @all = ($Self, @extras);

  my @records = map { $_->next_record() } @all;

  my $nextuid = -1;

  while ($nextuid) {
    my $this;
    my $higheruid;

    # read the first record of all lists
    foreach my $n (0..$#all) {
      next unless $records[$n];
      if ($records[$n]{Uid} == $nextuid) {
        # algorithm: keep most recently modified
        if (not $this or $this->{LastModified} < $records[$n]{LastModified}) {
          $this = $records[$n]{LastModified};
        }
        # step forwards
        $records[$n] = $all[$n]->next_record();
      }
      # find the minimum now
      if (not $higheruid or $higheruid > $records[$n]{Uid}) {
        $higheruid = $records[$n]{Uid};
      }
    }

    # write out the best record found
    if ($this) {
      $Self->write_record($target, $this);
      $header->{Exists}++;
      # XXX - to make everything else work, we probably need to reconstruct or
      # put the entire logic here!
    }

    # move along
    $nextuid = $higheruid;
  }

  # move back to the start of this file and re-write the header
  seek($target, $targetpos, 0);
  $Self->write_header($target, $header);
}

=item $index->header_dump()

=item $index->record_dump()

=item $index->header_longdump()

=item $index->record_longdump()

=item $index->header_undump()

=item $index->record_undump()

Dump the headers and records in either space separated fields or named lines with a blank line between for long.  

The "undump" option is able to parse the space separated format, allowing pipe to a standard unix tool to 
process the records, and then re-parse them back into a binary index file.

=cut

sub header_dump {
  my $Self = shift;
  my $array = $Self->header_array();
  return join(' ', @$array);
}

sub header_longdump {
  my $Self = shift;
  my $array = $Self->header_array();
  my @data;
  my $frm = $Self->{format}{HeaderFields};
  foreach my $field (0..$#$frm) {
    my $name = $frm->[$field][0];
    my $val = $array->[$field];
    $val = sprintf("%08x", $val) if $name =~ m/Crc$/;
    push @data, "$name: $val";
  }
  return join("\n", @data, '');
}

sub header_undump {
  my $Self = shift;
  my $string = shift;
  my @items = split ' ', $string;
  return \@items;
}

sub record_dump {
  my $Self = shift;
  my $array = $Self->record_array();
  return join(' ', @$array);
}

sub record_longdump {
  my $Self = shift;
  my $array = $Self->record_array();
  my @data;
  my $frm = $Self->{format}{RecordFields};
  foreach my $field (0..$#$frm) {
    my $name = $frm->[$field][0];
    my $val = $array->[$field];
    $val = sprintf("%08x", $val) if $name =~ m/Crc$/;
    push @data, "$name: $val";
  }
  return join("\n", @data, '');
}

sub record_undump {
  my $Self = shift;
  my $string = shift;
  my @items = split ' ', $string;
  return \@items;
}

# INTERNAL METHODS

sub _make_header {
  my $Self = shift;
  my $ds = shift;

  my $ref = ref($ds);

  # check what sort of format it is:

  # scalar - already a buffer
  return $ds unless $ref;

  # array
  return $Self->_header_a2b($ds) if ref($ds) eq 'ARRAY';

  # must be hash
  return $Self->_header_h2b($ds);
}

sub _make_record {
  my $Self = shift;
  my $ds = shift;

  my $ref = ref($ds);

  # check what sort of format it is:

  # scalar - already a buffer
  return $ds unless $ref;

  # array
  return $Self->_record_a2b($ds) if ref($ds) eq 'ARRAY';

  # must be hash
  return $Self->_record_h2b($ds);
}

####################
# Header Conversions

sub _header_b2h {
  my $Self = shift;
  my $buf = shift;
  return undef unless $buf;

  my $array = $Self->_header_b2a($buf);
  my $hash = $Self->_header_a2h($array);

  return $hash;
}

sub _header_b2a {
  my $Self = shift;
  my $buf = shift;
  return undef unless $buf;

  my @array = unpack($Self->{format}{HeaderPack}, $buf);

  # check checksum match!
  if ($Self->{version} >= 11) {
    my $Header = $Self->{format}{HeaderFields}[$Self->{format}{HeaderNames}{HeaderCrc}];
    my $crc = crc32(substr($buf, 0, $Header->[4]));
    if ($array[$Header->[3]] != $crc) {
      $Self->{checksum_failure} = 1;
      warn "Header CRC Failure $array[$Header->[3]] != $crc";
      die "Header CRC Failure $array[$Header->[3]] != $crc"
        if $Self->{strict_crc};
    }
  }

  return \@array;
}

sub _header_h2b {
  my $Self = shift;
  my $hash = shift;
  return undef unless $hash;

  my $array = $Self->_header_h2a($hash);
  my $buf = $Self->_header_a2b($array);

  return $buf;
}

sub _header_a2b {
  my $Self = shift;
  my $array = shift;
  return undef unless $array;

  my $buf = pack($Self->{format}{HeaderPack}, @$array);

  if ($Self->{version} >= 11) {
    my $Header = $Self->{format}{HeaderFields}[$Self->{format}{HeaderNames}{HeaderCrc}];
    my $crc = crc32(substr($buf, 0, $Header->[4]));
    substr($buf, $Header->[4]) = pack('N', $crc);
  }

  return $buf;
}

sub _header_a2h {
  my $Self = shift;
  my $array = shift;
  return undef unless $array;

  my %res;
  my $frm = $Self->{format}{HeaderFields};
  for (0..$#$frm) {
    $res{$frm->[$_][0]} = $array->[$_];
  }

  return \%res;
}

sub _header_h2a {
  my $Self = shift;
  my $hash = shift;
  return undef unless $hash;

  my @array;
  my $frm = $Self->{format}{HeaderFields};
  for (0..$#$frm) {
    $array[$_] = $hash->{$frm->[$_][0]};
  }

  return \@array;
}

####################
# Record conversions

sub _record_h2b {
  my $Self = shift;
  my $hash = shift;
  return undef unless $hash;

  my $array = $Self->_record_h2a($hash);
  my $buf = $Self->_record_a2b($array);

  return $buf;
}

sub _record_a2b {
  my $Self = shift;
  my $array = shift;
  return undef unless $array;

  my $buf = pack($Self->{format}{RecordPack}, @$array);

  if ($Self->{version} >= 11) {
    my $Record = $Self->{format}{RecordFields}[$Self->{format}{RecordNames}{RecordCrc}];
    my $crc = crc32(substr($buf, 0, $Record->[4]));
    substr($buf, $Record->[4]) = pack('N', $crc);
  }

  return $buf;
}

sub _record_b2h {
  my $Self = shift;
  my $buf = shift;
  return undef unless $buf;

  my $array = $Self->_record_b2a($buf);
  my $hash = $Self->_record_a2h($array);

  return $hash;
}

sub _record_b2a {
  my $Self = shift;
  my $buf = shift;
  return undef unless $buf;

  my @array = unpack($Self->{format}{RecordPack}, $buf);

  # check checksum match!
  if ($Self->{version} >= 11) {
    my $Record = $Self->{format}{RecordFields}[$Self->{format}{RecordNames}{RecordCrc}];
    my $crc = crc32(substr($buf, 0, $Record->[4]));
    if ($array[$Record->[3]] != $crc) {
      $Self->{checksum_failure} = 1;
      warn "Record CRC Failure ($Self->{msgno}) $array[$Record->[3]] != $crc";
      die "Record CRC Failure ($Self->{msgno}) $array[$Record->[3]] != $crc"
        if $Self->{strict_crc};
    }
  }

  return \@array;
}

sub _record_a2h {
  my $Self = shift;
  my $array = shift;
  return undef unless $array;

  my %res;
  my $frm = $Self->{format}{RecordFields};
  for (0..$#$frm) {
    $res{$frm->[$_][0]} = $array->[$_];
  }

  return \%res;
}

sub _record_h2a {
  my $Self = shift;
  my $hash = shift;
  return undef unless $hash;

  my @array;
  my $frm = $Self->{format}{RecordFields};
  for (0..$#$frm) {
    $array[$_] = $hash->{$frm->[$_][0]};
  }

  return \@array;
}

=item AUTHOR AND COPYRIGHT

Bron Gondwana <brong@xxxxxxxxxxx> - Copyright 2008 FastMail.FM

Licenced under the same terms as Cyrus IMAPd.

=cut

1;
#!/usr/bin/perl -w

use strict;
use warnings;
use Cyrus::IndexFile;

# Bucket definitions {{{
my @exists_buckets = (
  [100000, "> 100,000"],
  [ 30000, "20,000 - 100,000"],
  [ 10000, "10,000 - 30,000"],
  [  3000, "3,000 - 10,000"],
  [  1000, "1,000 - 3,000"],
  [   300, "300 - 1,000"],
  [   100, "100 - 300"],
  [    30, "30 - 100"],
  [    10, "10 - 30"],
  # rest can be explicit
);

my @size_buckets = (
  [10000000, "> 10 MB"],
  [ 3000000, "3 - 10 MB"],
  [ 1000000, "1 - 3 MB"],
  [  300000, "300 KB - 1 MB"],
  [  100000, "100 - 300 KB"],
  [   30000, "30 - 100 KB"],
  [   10000, "10 - 30 KB"],
  [    3000, "3 - 10 KB"],
  [    1000, "1 - 3 KB"],
  [     300, "300 B - 1 KB"],
  [     100, "100 - 300 B"],
  # rest can be explicit
);

my @folders_buckets = (
  [1000, "> 1000"],
  [ 300, "300 - 1000"],
  [ 100, "100 - 300"],
  [  30, "30 - 100"],
  [  10, "10 - 30"],
  # rest can be explicit
);

# }}}

my $conf = shift || "/etc/imapd.conf";

my %paths;
open(FH, "<$conf") || die "Error, can't find file $conf";
while (<FH>) {
  if (m/^(meta)?partition-([^:]+):\s*(.*)/) {
    my $type = $1 || 'data';
    my $name = $2;
    my $path = $3;

    $paths{$name}{$type} = $path;
  }
}

my $stats = {};
foreach my $name (keys %paths) {
  my $metapath = $paths{$name}{meta} || $paths{$name}{data};
  find_stats($stats, $metapath);
}

massage($stats);

print_stats($stats);

sub find_stats {
  my $stats = shift;
  my $dir = shift;
  my $basepath = shift || '';

  if (-e "$dir/cyrus.index") {
    my $user = find_user($basepath);
    folder_stats($stats, $user, "$dir/cyrus.index");
  }

  my @subdirs;
  if (opendir(DH, $dir)) {
    while (my $item = readdir(DH)) {
      next if $item =~ m/\.$/; # skip data files AND specials
      push @subdirs, $item if -d "$dir/$item";
    }
  }

  foreach my $subdir (@subdirs) {
    my $basepath = $basepath ? "$basepath/$subdir" : $subdir;
    find_stats($stats, "$dir/$subdir", $basepath);
  }
}

sub find_user {
  my $basepath = shift;

  if ($basepath =~ m{^domain/(?:./)?([^/]+)/(?:./)user/([^/]+)}) {
    return "$2\@$1";
  }

  if ($basepath =~ m{^(?:./)?user/([^/]+)}) {
    return $1;
  }

  return 'NO_USER'; # count everything else together, including DELETED...
}

sub folder_stats {
  my $stats = shift;
  my $user = shift;
  my $indexfile = shift;

  my $Index = Cyrus::IndexFile->new_file($indexfile);
  my $exists = $Index->header('Exists');

  # folder counts - need to keep a per-user total for per-user stats
  $stats->{users}{$user}++;
  $stats->{counts}{exists_bucket($exists)}++;

  my $gaps = 0;
  my $last = 0;
  while (my $record = $Index->next_record_hash()) {
    $stats->{sizes}{size_bucket($record->{'Size'})}++;

    # track gaps
    if ($record->{Uid} > $last + 1) {
      $gaps++;
    }
    $last = $record->{Uid};
  }
  $gaps++ if ($last < $Index->header('LastUid'));

  $stats->{gappy}{gaps_bucket($gaps, $exists)}++;
}

sub exists_bucket {
  my $exists = shift;
  foreach my $item (@exists_buckets) {
    return $item->[1] if $exists >= $item->[0];
  }
  return $exists;
}

sub size_bucket {
  my $size = shift;
  foreach my $item (@size_buckets) {
    return $item->[1] if $size >= $item->[0];
  }
  return $size;
}

sub folders_bucket {
  my $folders = shift;
  foreach my $item (@folders_buckets) {
    return $item->[1] if $folders >= $item->[0];
  }
  return $folders;
}

sub gaps_bucket {
  my $gaps = shift;
  my $exists = shift;

  return 'NONE' if $gaps == 0;
  return 'ALL' unless $exists;
  my $ratio = $gaps / $exists;
  return 'LOW' if $ratio < 0.1;
  return 'MEDIUM' if $ratio < 0.3;
  return 'HIGH';
}

sub massage {
  my $stats = shift;
  my $users = delete $stats->{users};
  # split up the users!
  foreach my $user (keys %$users) {
    $stats->{folders}{folders_bucket($users->{$user})}++;
  }
}

sub print_stats {
  my $stats = shift;

  print "STATS for $conf\n";
  print "Partitions: " . scalar(keys %paths) . "\n";
  print "Users: " . sum_a(values %{$stats->{folders}}) . "\n";
  print "Folders: " . sum_a(values %{$stats->{counts}}) . "\n";
  print "\n";
  print "Folders per user:\n";
  foreach my $item (sort keys %{$stats->{folders}}) {
    printf(" %-20s %s\n", $item, $stats->{folders}{$item});
  }
  print "\n";
  print "Messages per folder:\n";
  foreach my $item (sort keys %{$stats->{counts}}) {
    printf(" %-20s %s\n", $item, $stats->{counts}{$item});
  }
  print "\n";
  print "Message sizes:\n";
  foreach my $item (sort keys %{$stats->{sizes}}) {
    printf(" %-20s %s\n", $item, $stats->{sizes}{$item});
  }
  print "\n";
  print "Ratio of gaps between UIDs: (expunge tracking)\n";
  foreach my $item (sort keys %{$stats->{gappy}}) {
    printf(" %-20s %s\n", $item, $stats->{gappy}{$item});
  }
  print "\n";
}

sub sum_a {
  my $sum = 0;
  foreach my $item (@_) {
    $sum += $item;
  }
  return $sum;
}
----
Cyrus Home Page: http://cyrusimap.web.cmu.edu/
Cyrus Wiki/FAQ: http://cyrusimap.web.cmu.edu/twiki
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html