Re: long blocking with writes on rbds

>> We are running one OSD per VM. All data is replicated across three VMs.
>
> That doesn't add up to 6 OSDs, as per your "ceph -s" output.

Yes it does. 6 VMs, 6 OSDs. Each pool is allocated to one of two 3-node sub-clusters.

> AWS c3.large supposedly comes with 2 locally attached SSDs.

It comes with as many storage devices as you want to attach. We use one to boot and one for the OSD.


> And unless you really, REALLY require different pools, you'll be much
> happier with just one, or as few as possible.

Why? In your previous message you told me that performance hinges primarily on the amount of data, not on how many pools or RBDs it is divided into. Even a 2x slowdown would not explain the terrible performance we've seen.

> The calculator and the suggestions on the documentation page suggest 512
> PGs/PGPs for 6 OSDs with a replication of 3 and a target per OSD of 200
> (double the "default", but a good idea for small clusters).
>
> That's with one pool; with 2 evenly sized pools it would be 256 per
> pool, and so forth.

This is very close to what we've done; see below. The performance problems persist.
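
For anyone following along, I assume the arithmetic behind those numbers is the usual rule of thumb from the Ceph docs: total PGs = OSDs * target per OSD / replicas, rounded up to a power of two and split across pools. A quick sketch (the function name is just for illustration):

    import math

    def suggested_pgs(osds, target_per_osd, replicas, pools=1):
        # Rule-of-thumb PG count: cluster-wide total, split evenly across
        # pools, then rounded up to the next power of two.
        total = osds * target_per_osd / replicas    # 6 * 200 / 3 = 400
        per_pool = total / pools
        return 2 ** math.ceil(math.log2(per_pool))

    print(suggested_pgs(6, 200, 3))            # 512 with one pool
    print(suggested_pgs(6, 200, 3, pools=2))   # 256 per pool with two even pools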


> Unless you have changed the crush map, a pool will be spread out amongst
> all the PGs and all the OSDs.

We have changed the crush map. Each pool is spread among three of our six OSDs. I included our crush map in my first post.
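
For context, the relevant piece looks roughly like this (a sketch with hypothetical bucket/rule names, not our literal map; that's in the first post):

    # hypothetical rule pinning a pool's replicas to one 3-host sub-cluster
    rule subcluster_a {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take subcluster_a          # root bucket holding 3 of the 6 hosts
        step chooseleaf firstn 0 type host
        step emit
    }

Each pool then gets pointed at the matching rule with something like "ceph osd pool set <pool> crush_ruleset 1".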

> With 23 pools (again, reduce this to 1 or 2 if possible) they should have
> about 23 PG/PGPs per pool (if evenly sized), not 4 as your ratio up there
> suggests.


I don't think so. We allocate pools gradually throughout the lifecycle of our application. With 250 GB of storage per replica, a 5 GB pool should be allocated a proportional number of PGs. In other words, we expect to eventually have 50 pools of 5 GB each. Given that we have 3 OSDs per pool and three replicas, the calculator says we should have 125 PGs in total, distributed evenly over the 50 pools. We rounded up to 4 PGs per pool. What's wrong with that?
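
Concretely, the split we did looks roughly like this (a sketch, taking the calculator's 125-PG total for one 3-OSD, 3-replica set as given):

    import math

    total_pgs = 125    # calculator output for one 3-OSD, 3-replica sub-cluster
    pools = 50         # eventual pool count: 250 GB / 5 GB per pool
    per_pool = total_pgs / pools                    # 2.5 PGs per pool
    rounded = 2 ** math.ceil(math.log2(per_pool))   # round up to a power of two -> 4
    print(per_pool, rounded)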

Regardless, does anyone believe that this could cause the bizarre performance issues we've seen?

> Your performance issues are most likely related to your platform, as in
> actual OSD (SSD?) speed, network speed, things unique to AWS.

This seems highly unlikely. We get very good performance without ceph: requisitioning and manipulating block devices through LVM happens instantaneously. We expect ceph to be somewhat slower because of its distributed nature, but we've seen operations block for up to an hour, which is clearly beyond the pale. Furthermore, as the performance measurements I posted show, raw read/write speed is not the bottleneck: ceph is simply waiting.

So, does anyone else have any ideas why mkfs (and other operations) takes so long?

Jeff
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
