On Thu, 09 Apr 2015 10:11:37 -0400 Jeff Epstein wrote:

> On 04/09/2015 03:14 AM, Christian Balzer wrote:
> > Your 6 OSDs are on a single VM from what I gather?
> > Aside from being a very small number for something that you seem to
> > be using in some sort of production environment (Ceph gets faster the
> > more OSDs you add), where is the redundancy, HA in that?
>
> We are running one OSD per VM. All data is replicated across three VMs.
>
That doesn't add up to 6 OSDs, as per your "ceph -s" output.
An AWS c3.large supposedly comes with 2 locally attached SSDs.

> > The number of your PGs and PGPs need to have at least a semblance of
> > being correctly sized, as others mentioned before.
> > You want to re-read the Ceph docs about that and check out the PG
> > calculator:
> > http://ceph.com/pgcalc/
>
> My choice of pgs is based on this page. Since each pool is spread
> across 3 OSDs, 100 seemed like a good number. Am I misinterpreting
> this documentation?
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
Vastly, as all those numbers are TOTALs, or in other words, assuming
just one pool. And unless you really, REALLY require different pools,
you'll be much happier with just one, or with as few as possible.

The calculator and the suggestions on the documentation page suggest
512 PGs/PGPs for 6 OSDs with a replication of 3 and a target per OSD of
200 (double the "default", but a good idea for small clusters).
That's with one pool; with 2 evenly sized pools it would be 256 per
pool, and so forth.

> > Since RBDs are sparsely allocated, the actual data used is the key
> > factor. But you're adding the pool removal overhead to this.
>
> How much overhead does pool removal add?
>
A good question for a Ceph developer, not me.
I wouldn't be surprised if it doubled things.

> > Both, and the fact that you have overloaded the PGs by nearly a
> > factor of 10 (or 20 if you're actually using a replica of 3 and not
> > 1) doesn't help one bit.
> > And let's clarify what objects are in the Ceph/RBD context, they're
> > the (by default) 4MB blobs that make up an RBD image.
>
> I'm curious how you reached your estimation of overloading. According
> to the pg calculator you linked to, given that each pool occupies only
> 3 OSDs, the suggested number of pgs is around 100. Can you explain?
>
Unless you have changed the crush map, a pool will be spread out
amongst all the PGs and all the OSDs.
I can only urge you to re-read the documentation and notes on the
pgcalc page (some quick back-of-the-envelope numbers are at the bottom
of this mail).

Christian

> >> - Somewhat off-topic, but for my own curiosity: Why is deleting data
> >> so slow, in terms of ceph's architecture? Shouldn't it just be a
> >> matter of flagging a region as available and allowing it to be
> >> overwritten, as would a traditional file system?
> >>
> > Apples and oranges, as RBD is block storage, not a FS.
> > That said, a traditional FS is local and updates an inode or
> > equivalent bit.
> > For Ceph to delete an RBD image, it has to go to all cluster nodes
> > with OSDs that have PGs that contain objects of that image. Then
> > those objects have to be deleted on the local filesystem of the OSD
> > and various maps updated cluster-wide. Rinse and repeat until all
> > objects have been dealt with.
> > Quite a bit more involved, but that's the price you have to pay when
> > you have a DISTRIBUTED storage architecture that doesn't rely on a
> > single item (like an inode) to reflect things for the whole system.
> Thank you for explaining.
>
> Jeff


-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
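
P.S.: To make the PG arithmetic above a bit more concrete, here is a
minimal sketch of the usual rule of thumb (total PGs = OSDs * target
PGs per OSD / replica count, rounded up to the next power of two),
plugged with the numbers from this thread. It's an illustration of the
guideline, not of what pgcalc actually runs internally:

# Minimal sketch of the PG sizing rule of thumb (see http://ceph.com/pgcalc/),
# not pgcalc's actual code.

def suggested_total_pgs(num_osds, target_pgs_per_osd, replica_count):
    """Total PGs for the whole cluster, rounded up to the next power of two."""
    raw = num_osds * target_pgs_per_osd / replica_count
    total = 1
    while total < raw:
        total *= 2
    return total

total = suggested_total_pgs(num_osds=6, target_pgs_per_osd=200, replica_count=3)
print(total)       # 6 * 200 / 3 = 400, rounded up -> 512 (single pool)
print(total // 2)  # two evenly sized pools -> 256 PGs per pool

Which is where the 512 (one pool) and 256 (two even pools) figures
above come from.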
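P.P.S.: And to give a feel for why deleting an RBD image (or a pool
full of them) is more involved than flagging an extent as free: the
image is striped into (by default) 4MB RADOS objects, and every
allocated object has to be found via its PG and removed on each OSD
holding a replica. A rough back-of-the-envelope, using a hypothetical
100GB image as the example:

# Back-of-the-envelope: how many RADOS objects an RBD image turns into, and
# hence how many per-object deletions removing it implies. The 4MB object
# size is the RBD default; the 100GB image size is just a made-up example.

OBJECT_SIZE = 4 * 1024**2   # 4MB default RBD object size
image_size = 100 * 1024**3  # hypothetical 100GB image

max_objects = image_size // OBJECT_SIZE
print(max_objects)          # 25600 objects (fewer if sparsely written)

# Each existing object has to be located via its PG and deleted on every OSD
# holding a replica, plus the relevant maps updated. With 3x replication:
print(max_objects * 3)      # up to 76800 object removals cluster-wide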