Re: long blocking with writes on rbds

On 04/09/2015 03:14 AM, Christian Balzer wrote:

> Your 6 OSDs are on a single VM from what I gather?
> Aside from being a very small number for something that you seem to be
> using in some sort of production environment (Ceph gets faster the more
> OSDs you add), where is the redundancy, HA in that?

We are running one OSD per VM. All data is replicated across three VMs.

> The number of your PGs and PGPs needs to have at least a semblance of
> being correctly sized, as others mentioned before.
> You want to re-read the Ceph docs about that and check out the PG
> calculator:
> http://ceph.com/pgcalc/

My choice of PGs is based on this page. Since each pool is spread across 3 OSDs, 100 seemed like a good number. Am I misinterpreting this documentation?
http://ceph.com/docs/master/rados/operations/placement-groups/
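
For what it's worth, here is the rule I was trying to apply, written out as a quick sketch. The ~100 PGs per OSD target and the power-of-two rounding are my reading of the calculator, so treat the numbers as illustrative; if the resulting total is meant to be shared across every pool on the same OSDs rather than applied per pool, that would explain a lot.

def suggested_total_pgs(num_osds, replica_count, target_per_osd=100):
    # (num_osds * ~100) / replica_count, rounded up to a power of two,
    # is how I have been reading http://ceph.com/pgcalc/
    raw = num_osds * target_per_osd / replica_count
    pgs = 1
    while pgs < raw:
        pgs *= 2
    return pgs

print(suggested_total_pgs(num_osds=6, replica_count=3))  # -> 256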

> Since RBDs are sparsely allocated, the actual data used is the key factor.
> But you're adding the pool removal overhead to this.

How much overhead does pool removal add?

> Both, and the fact that you have overloaded the PGs by nearly a factor of
> 10 (or 20 if you're actually using a replica of 3 and not 1) doesn't help
> one bit.
> And let's clarify what objects are in the Ceph/RBD context: they're the
> (by default) 4MB blobs that make up an RBD image.

I'm curious how you reached your estimation of overloading. According to the PG calculator you linked to, given that each pool occupies only 3 OSDs, the suggested number of PGs is around 100. Can you explain?
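
Here is my own back-of-the-envelope attempt to reproduce a factor of 10, with a made-up pool count since I don't know which figures you are using:

# The 20 concurrent pools below is a guess for illustration, not a
# measurement from our cluster.
num_osds = 6
replica = 3
pools = 20
pgs_per_pool = 100

pg_copies_per_osd = pools * pgs_per_pool * replica / num_osds
print(pg_copies_per_osd)  # -> 1000.0, versus the ~100 per OSD usually aimed for

Is that roughly the calculation you had in mind?
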
>> - Somewhat off-topic, but for my own curiosity: Why is deleting data so
>> slow, in terms of ceph's architecture? Shouldn't it just be a matter of
>> flagging a region as available and allowing it to be overwritten, as
>> would a traditional file system?

> Apples and oranges, as RBD is block storage, not an FS.
> That said, a traditional FS is local and updates an inode or equivalent
> bit.
> For Ceph to delete an RBD image, it has to go to all cluster nodes with
> OSDs that have PGs that contain objects of that image. Then those objects
> have to be deleted on the local filesystem of the OSD and various maps
> updated cluster wide. Rinse and repeat until all objects have been dealt
> with.
> Quite a bit more involved, but that's the price you have to pay when you
> have a DISTRIBUTED storage architecture that doesn't rely on a single item
> (like an inode) to reflect things for the whole system.

Thank you for explaining.
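
For my own notes, a back-of-the-envelope count of what a single image removal involves, assuming the default 4MB objects and a hypothetical 100GB image (illustrative numbers only):

# Rough scale of "delete one RBD image" in a replicated cluster.
image_size_mb = 100 * 1024   # hypothetical 100GB image
object_size_mb = 4           # default RBD object size
replica = 3

objects = image_size_mb // object_size_mb
print(objects)            # -> 25600 RADOS objects to track down
print(objects * replica)  # -> 76800 on-disk copies to remove, plus the map updates

Seen that way, it is clearly a lot more than flagging a region as free.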

Jeff
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



