On Fri, Dec 21, 2012 at 11:37 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Fri, 21 Dec 2012, Michael Chapman wrote:
>> I'll remove them properly. Thanks for your help. Do you have any
>> suggestions on the second (mon IO) issue I'm seeing?
>
> Whoops, missed it:
>
>> >> A second issue I have been having is that my reads and writes are
>> >> very bursty, going from 8MB/s to 200MB/s when doing a dd from a
>> >> physical client over 10GbE. It seems to be waiting on the mon most
>> >> of the time, and iostat shows long IO wait times for the disk the
>> >> mon is using. I can also see it writing ~40MB/s constantly to disk
>> >> in iotop, though I don't know if this is random or sequential. I
>> >> see a lot of "waiting for sub ops" messages, which I thought might
>> >> be a result of the IO wait.
>> >>
>> >> Is that a normal amount of activity for a mon process? Should I be
>> >> running the mon processes off more than just a single SATA disk to
>> >> keep up with ~30 OSD processes?
>
> Is the ceph-mon daemon running on its own disk (or /), separate from
> the osds? My first guess is that this could be a sync(2) issue.

It's on /. Is that going to be bad? My first thought was that maybe too
much logging was bringing it down, but there are very few iops in the
log directory during use.

> The ceph-mon daemon is running on the same host as some of the osds?
> If you have an older kernel (pre-3.0), or you are running argonaut and
> have an old glibc, there is no syncfs(2) syscall, and multiple osds
> can drag each other down by doing lots of commits. That often leads to
> the bursty writes.

We're running 12.04, so we should have a new enough kernel (3.2), and
Ceph 0.55.1.

I originally had the osds and mons on the same hosts, but I removed all
the OSDs on one mon host, and 2 of the three mons in the cluster, to
try to isolate why it's running so slowly. It's currently one blade
with a single RAID 1 pair and 12 cores hosting a single mon, and a few
blades with attached disks running OSD processes.

> Having the osd journals stored as files also contributes. Using a
> separate partition (or, ideally, part of an SSD or caching raid array)
> helps quite a bit.
>
> In general, 30 osds is very small, and should cause no significant
> monitor load. But the monitor is doing lots of fsync(), which can
> interfere with other workloads if it's on the same disk.

That was what I thought. I might try rebuilding from scratch and see if
it still underperforms.

What is confusing me is that the number of iops we see in the osdmap
directory isn't actually that high, and yet iostat shows ~100-200ms
iowait when we start to utilise the cluster. I'd put it down to faulty
hardware, but it was the same across all three mons.

Should a single SATA disk be able to handle the load of one mon, or
should I be looking at SAS/SSD options?

> sage

--
Michael Chapman
Cloud Computing Services
ANU Supercomputer Facility
Room 318, Leonard Huxley Building (#56), Mills Road
The Australian National University
Canberra ACT 0200 Australia
Tel: +61 2 6125 7106
Web: http://nci.org.au
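
P.S. To make the syncfs(2) point above concrete: sync(2) flushes dirty
data for every mounted filesystem on the box, so co-located daemons can
stall each other whenever any one of them commits, while syncfs(2) only
flushes the filesystem behind a given file descriptor. A minimal sketch,
assuming a kernel >= 2.6.39 and a glibc new enough (>= 2.14) to provide
the wrapper; the OSD data path below is just an example:

/* sketch: whole-machine flush vs. per-filesystem flush */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sync() flushes every mounted filesystem, so one daemon committing
     * its store can stall unrelated daemons sharing the host. */
    sync();

    /* syncfs() flushes only the filesystem containing fd, so a commit
     * stays on that daemon's own disk. */
    int fd = open("/var/lib/ceph/osd/ceph-0", O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (syncfs(fd) < 0)
        perror("syncfs");
    close(fd);
    return 0;
}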
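
P.P.S. On the journal point, moving each journal off a file in the OSD's
data directory and onto its own partition is a ceph.conf change, roughly
like the sketch below (section name, hostname and device are only
placeholders; check the docs for your release before copying it):

[osd.0]
        host = blade01
        ; journal on a dedicated partition (ideally on an SSD) instead
        ; of a file inside the osd data directory
        osd journal = /dev/sdb1

That keeps the journal's sequential writes from competing for seeks with
the filestore (and anything else) on the same disk.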