Re: Logging

On 04/19/2011 12:25 AM, Colin McCabe wrote:

> I'm curious how much you saved by disabling logging. 

To be honest, I have no idea.

> Also, how are you measuring performance?

My measurements are very primitive and unscientific (which is why I couldn't
answer your previous question). I have a two-node cluster with rule data
{ min_size 2 } and rule metadata { min_size 2 }. My ceph.conf says
osd data = /mnt/osd, which is a dedicated ext4 partition on each node. One
monitor is running on one of the nodes and one client is running on my
workstation.
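
For reference, the relevant bits look roughly like this (a sketch from memory
rather than a verbatim copy of my ceph.conf and crushmap, so the exact layout
may differ):

 [osd]
         osd data = /mnt/osd

 rule data {
         min_size 2
         ...
 }
 rule metadata {
         min_size 2
         ...
 }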

So what I do is mount ceph (pointing the client at the monitor), run
'watch "df -m; echo; uptime"' on both nodes and start copying files to ceph
with rsync. I have a news spool of just over a million small (1-3 KB) files,
which is perfect for the task. The rsync command is

 rsync -vva --progress --bwlimit=N newsspool 127.0.0.1:/mnt/ceph/

where N is a limit in KB/s, ranging from a maximum of 500 all the way down
to 5. This gives me something like

===
Every 2.0s: df -m; echo; uptime                        Tue Apr 19 01:19:12 2011

Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sda1                15019      3768     10489  27% /
tmpfs                      993         0       993   0% /dev/shm
/dev/sda2                 7510      1342      5787  19% /var
/dev/sda5                15019      1777     12480  13% /var/log
/dev/mapper/sda6        232003      5571    213174   3% /mnt/osd

 01:19:12 up 1 day,  2:25,  1 user,  load average: 0.91, 1.12, 1.75
===

on each node and a fair idea of what the client is doing at the same time.
The reason for using 127.0.0.1 as the target is that delta transfers and
bwlimit do not apply to local targets, so

rsync -vva --progress --bwlimit=N newsspool /mnt/ceph/

which in principle is equivalent, would just run full speed and re-copy
everything that's not already on the target.

The load average is my primary measure. When I see a load of 16 on a dual-CPU
system, I know that things are getting out of hand before anything even
breaks. Then I lower bwlimit and start again.
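
If I were to script that instead of doing it by hand, it would look roughly
like this (a sketch; the actual limits I use vary from run to run):

 for N in 500 250 100 50 25 10 5; do
     rsync -vva --progress --bwlimit=$N newsspool 127.0.0.1:/mnt/ceph/ && break
 done

i.e. restart with a lower limit every time the previous run has to be killed,
and stop once a run completes.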

The relation between the OSD partitions on the two nodes (/dev/mapper/sda6 in
the example above) is another interesting factor. As long as the load stays
under 100%, the partitions on both nodes grow in almost perfect sync. When the
load exceeds 100%, one node starts lagging behind the other. If that continues
long enough, the lagging node falls out completely while the other node keeps
growing. I've seen differences of almost 300% (11 GB on node02 and 4 GB on
node01) which won't go away until a full ceph restart followed by a few hours
of patiently waiting for the replay to finish.

Interestingly, the node with the monitor that actually receives the data
(node01) is the one unable to write it to its own disk, while the other
node (node02) grows way ahead. I'm not sure yet, but the likely cause is
that the network speed of node01 and the disk speed of node02 both exceed
the disk speed of node01. In any case, a load above 100% is not a reason
for concern in and of itself, but the nodes' storage growing out of sync
is. If that goes far enough, the nodes are unable to synchronise again
without an unmount and a full restart of ceph, and in the worst case the
data gets corrupted.
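
A quick way to keep an eye on that divergence from the workstation is
something like the following (node names as in my setup, password-less ssh
assumed):

 watch 'for n in node01 node02; do ssh $n "df -m /mnt/osd | tail -1"; done'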

Now, having run rsync for a while, I kill it and start it again. This will
cause it to compare all files that are already on the target, skip those
that have identical size and near-identical mtime, and only copy files that
are either missing or different. Assuming, for example, that I successfully
copied 200,000 files in the first run, the second run will compare those
200,000 files and start copying again from the 200,001st file onwards.

This is where I could really see the difference between atime and noatime on
the partition underlying the OSD. bwlimit does not apply to the size and mtime
reads, so the client reads file attributes at full blast. With atime on the
underlying partition this caused the nodes to jump to loads of 5 or more. With
noatime, they showed a load of around 0.1 and the whole compare operation
completed in a small fraction of the time previously required.
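
For completeness, the change in question on the OSD partition is just (device
name as in the df output above):

 mount -o remount,noatime /dev/mapper/sda6 /mnt/osd

or, permanently, the noatime option in /etc/fstab:

 /dev/mapper/sda6  /mnt/osd  ext4  noatime  0  2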

Z


