On Sun, Dec 5, 2010 at 8:01 PM, Ravi Pinjala <pstatic@xxxxxxxxx> wrote:
> Hi Greg,
>
> I've tried to isolate the problem, and gotten a log from just writing
> to a single file. The MDS was the only thing that produced any
> significant logs, even with --debug_{mon,mds,osd} = 20.
>
> http://p-static.net/mds.beta.log
>
> This is from waiting until all other activity on the cluster quiesces
> (by watching gkrellm and ceph -w), then adding logging using "ceph mds
> injectargs", then creating and deleting a few files (echo "data" >
> foo; rm foo - that sort of thing), then disabling logging. gkrellm
> reported several bursts of I/O totaling around 1MB (though this is a
> complete guess, since gkrellm doesn't count disk bandwidth over time,
> only immediate data), which seems high. (Again, this is with no other
> activity on the fs at all).

Okay, I don't see anything unusual from glancing through the logs. I think what you're seeing is just the effect of Ceph's heavy journaling and replication. The metadata server has its own journal on the OSD cluster, which allows it to make metadata changes safe in a streaming write rather than by modifying random data all over the cluster (it's a lot faster in general to do this), but it eventually needs to flush that back into the filesystem proper. In general it'll do this continuously in the background as the journal grows, but if there are no metadata ops going on then it'll eventually just put the journaled updates on-disk in their proper locations. So that's two writes for every metadata op.

Besides the metadata journaling, the OSDs generally run a local journal themselves, for similar reasons. That's four (2+2) writes. Besides the journaling, by default everything is replicated twice. So that's 8 ((2+2)*2) writes. This is a fair bit of write amplification, but since the writes are small and serve to dramatically reduce latency, it's worth it! I'm pretty sure this is what you're seeing as you watch your cluster.

> I have another issue, possibly related, with metadata operations being
> really laggy. While rsyncing into my cluster, large files saturate the
> (gigE) connection, but small files go really slowly (tens of kB/s).
> This is consistent with metadata ops having a high overhead.

Hmm. Metadata ops in Ceph (and distributed filesystems in general) are more expensive than on a local FS, but I'd be surprised if it was hitting an rsync this hard. Have you checked that the performance is significantly different from what you get running an rsync to normal local FSes? I generally see that startup time for the rsync (checking metadata, computing hashes, etc.) dominates transfer time on small files, so that they report <30KB/s transfers even though the actual exchange of data is fast.

-Greg
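
(A quick back-of-the-envelope sketch of the (2+2)*2 arithmetic above. This is purely illustrative Python, not anything from the Ceph codebase; the constants are just the defaults the message describes.)

    # Illustrative only: the per-metadata-op write amplification described
    # above. These constants are the defaults mentioned in the message, not
    # values read from a real Ceph configuration.
    mds_writes = 2          # MDS journal entry, plus the later flush to its final location
    osd_journal_writes = 2  # each of those also passes through an OSD's local journal
    replicas = 2            # default replication level

    physical_writes = (mds_writes + osd_journal_writes) * replicas
    print(physical_writes)  # -> 8 small on-disk writes for one metadata op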
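
(And a similarly rough sketch of why rsync reports such low throughput on small files, per the last paragraph above. The 100 ms per-file overhead is an assumed figure for illustration only, not a measurement.)

    # Illustrative arithmetic: a fixed per-file cost dominates the reported
    # throughput for tiny files, even when the link itself is fast.
    file_size = 4 * 1024             # bytes in a small file
    link_bandwidth = 112 * 1024**2   # rough gigE payload rate, bytes/sec
    per_file_overhead = 0.1          # assumed seconds of metadata/hash/startup work per file

    transfer_time = file_size / link_bandwidth               # ~0.00003 s, negligible
    reported = file_size / (per_file_overhead + transfer_time)
    print(round(reported / 1024))    # ~40 KB/s reported, the same ballpark as "tens of kB/s"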