Unexpected disk write activity with btrfs OSDs

Hi,

I've just noticed some odd behaviour with our btrfs OSDs. We monitor the
amount of disk writes on each device with a granularity of 10s (every
10s the monitoring system collects the total number of sectors written
and write IOs performed since boot and computes both B/s and IO/s).
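For concreteness, the sampling described above could look roughly like this (a minimal sketch, not our actual monitoring code; the field layout is that of /proc/diskstats, where sectors are always 512 bytes):

```python
# Minimal sketch of the 10s sampling described above (assumed code,
# not the actual monitoring system). /proc/diskstats fields after the
# device name are: reads, reads merged, sectors read, ms reading,
# writes, writes merged, sectors written, ...; a sector is 512 bytes.

def disk_write_counters(dev, diskstats_text):
    """Return cumulative (write_ios, bytes_written) for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 9 and fields[2] == dev:
            return int(fields[7]), int(fields[9]) * 512
    raise KeyError(dev)

def rates(prev, cur, interval_s):
    """Compute (B/s, IO/s) from two cumulative samples interval_s apart."""
    return ((cur[1] - prev[1]) / interval_s,
            (cur[0] - prev[0]) / interval_s)
```

Sampling every 10s and differencing the counters gives exactly the B/s and IO/s figures discussed below.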

With only residual write activity on our storage network (~450kB/s total
for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
each OSD once replication, double writes due to the journal, and the
number of OSDs are factored in):
- Disks with btrfs OSDs show a spike of activity every 30s (two 10s
intervals with nearly zero activity, then one interval with ~120MB of
writes in total). The averages are 4MB/s and 100 IO/s.
- Disks with xfs OSDs (journal on a separate partition but on the same
disk) don't show these spikes, and the averages are far lower: 160kB/s
and 5 IO/s. This is not far off what is expected from the whole
cluster's write activity.
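The arithmetic behind that ~120kB/s estimate can be sketched as follows; the replica count of 2 and the ~15 OSDs are assumptions for illustration, as neither figure is stated above:

```python
# Hypothetical figures: the replica count and OSD count are assumptions;
# only the ~450kB/s client rate comes from the measurements above.
client_write_rate_kb = 450     # total client writes for the cluster
replicas = 2                   # assumed replication size
journal_factor = 2             # each OSD write hits journal + data
num_osds = 15                  # assumed number of OSDs

per_osd_kb = client_write_rate_kb * replicas * journal_factor / num_osds
print(per_osd_kb)  # 120.0
```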

There's a 30s setting on our platform that matches this period:
filestore max sync interval

I changed it to 60s with
ceph tell osd.* injectargs "'--filestore-max-sync-interval 60'"
and the amount of writes was lowered to ~2.5MB/s.

I changed it to 5s (the default) with
ceph tell osd.* injectargs "'--filestore-max-sync-interval 5'"
and the amount of writes to the device rose to an average of 10MB/s
(which, given our 10s sampling interval, appeared constant).
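Combining the three measurements (all numbers from above), the write volume per sync cycle grows with the interval but far less than linearly, which is why longer intervals lower the average rate:

```python
# Average device write rate (MB/s) observed at each
# filestore max sync interval (s), from the measurements above.
observed = {5: 10.0, 30: 4.0, 60: 2.5}

# MB written per sync cycle = rate * interval
per_sync_mb = {interval: rate * interval for interval, rate in observed.items()}
print(per_sync_mb)  # → {5: 50.0, 30: 120.0, 60: 150.0}
```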

During these tests the activity on disks hosting XFS OSDs didn't change
much.

So it seems filestore syncs generate far more disk activity on btrfs
OSDs than on XFS OSDs (journal activity included for both).

Note that autodefrag is disabled on our btrfs OSDs. We use our own
defragmentation scheduler which, in the case of our OSDs, limits the
amount of defragmented data to ~10MB per minute in the worst case.
Usually (during low write activity, which was the case here) it
triggers a single file defragmentation every 2 minutes, which amounts
to a 4MB write as we only host RBDs with the default order value. So
defragmentation shouldn't be an issue here.
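Putting numbers on those limits (all figures from this paragraph):

```python
# Defragmentation write rates implied by the scheduler limits above
# (using 1MB = 1000kB for simplicity).
worst_case_kb_s = 10 * 1000 / 60    # 10MB/min worst case -> ~167 kB/s
typical_kb_s = 4 * 1000 / 120       # one 4MB object per 2 min -> ~33 kB/s
```

Even the worst case is well below the ~4MB/s average measured on the btrfs disks, so defragmentation can't account for the spikes.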

This doesn't seem to generate too much stress when filestore max sync
interval is 30s (according to apply latencies, our btrfs OSDs are
faster than the xfs OSDs holding the same amount of data), but at 5s
the btrfs OSDs are far slower than our xfs OSDs, with 10x the average
apply latency (we didn't let this run for more than 10 minutes as it
began to make some VMs wait too long for IOs).

Does anyone know if this is normal and why it is happening?

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


