Re: Reducing I/O when increasing number of PGs

On Wed, Jan 22, 2014 at 3:50 PM, bf <bf31415@xxxxxxxxx> wrote:
>
>
> Gregory Farnum <greg@...> writes:
>
>>
>> On Wed, Jan 22, 2014 at 9:13 AM, Caius Howcroft
>> > I want to double the number of pgs available for a pool, however I
>> > want to reduce as much as possible the resulting I/O storm (I have
>> > quite a bit of data in these pools).
>> >
>> > What is the best way of doing this? Is it using pgp_num? For example:
>> >
>> > increase pg_num from X to 2X
>> > while pgp_num < pg_num:
>> >   increase pgp_num by 10%
>> >   wait for health_ok
>> >
>> > Or is there a better way, like setting the number of simultaneous
>> > operations?
>>
>> Doing it this way certainly lets you control the pain a little more.
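
For what it's worth, that loop is easy to script against the ceph CLI.
Here's a rough sketch, not a tested tool -- the pool name, target PG
count, step size, and poll interval below are all placeholders you'd
want to tune for your cluster:

#!/usr/bin/env python
# Rough sketch: split a pool's PGs gradually. Pool name, target, step
# size, and poll interval are placeholders -- adjust for your cluster.
import subprocess
import time

POOL = "data"        # hypothetical pool name
TARGET_PGS = 2048    # hypothetical target, e.g. 2x the current pg_num
STEP = 128           # how much remapping to trigger per round

def ceph(*args):
    """Run a ceph CLI command and return its stdout as a string."""
    return subprocess.check_output(("ceph",) + args).decode().strip()

def wait_for_health_ok(poll=30):
    """Block until 'ceph health' reports HEALTH_OK."""
    while not ceph("health").startswith("HEALTH_OK"):
        time.sleep(poll)

# Create the new (still empty) PGs in one go; this part is cheap.
ceph("osd", "pool", "set", POOL, "pg_num", str(TARGET_PGS))
wait_for_health_ok()

# Walk pgp_num up in steps so only a slice of the data remaps at a time.
pgp = int(ceph("osd", "pool", "get", POOL, "pgp_num").split()[-1])
while pgp < TARGET_PGS:
    pgp = min(pgp + STEP, TARGET_PGS)
    ceph("osd", "pool", "set", POOL, "pgp_num", str(pgp))
    wait_for_health_ok()

The idea is the same as the pseudocode quoted above: bump pg_num once
(creating the new, empty PGs is cheap), then walk pgp_num up in small
steps and let the cluster settle back to HEALTH_OK between steps, so only
a slice of the data is moving at any one time.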
>
> I'm new to Ceph, so maybe this is obvious, but as the data is being shuffled
> to account for the larger number of PGs, is it still possible to reliably
> access any object in the pool?

Yes.

> I expect the answer is yes, but I'm not sure of the logic behind how this
> works. Does a client see the pre-shuffle set of maps while the shuffling is
> in progress, with a new set of maps only provided by the MON processes once
> the shuffling is complete? That would seem to imply objects might be in both
> their old and new PGs during the transition, and once the shuffle is done,
> objects in the 'old' PGs get removed in the background?

In general, CRUSH is expected to do small transitions which keep the
set similar even when part of it is remapped (so there is almost
always an overlap of OSDs which store the data in both sets), and if
an OSD which is serving as primary gets a read for data it doesn't yet
have, it will immediately recover the object in question from the OSDs
which previously were responsible for storing it. (OSDs do not remove
data which does not map to them until the data's primary instructs
them to.)
As an important optimization, the OSDMap can also specify an override
"pg_temp" mapping which says "map this PG to this primary and these OSDs
no matter what CRUSH says", and if an OSD becomes primary for something
it doesn't yet have the data for, it will invoke that override until it
does have the data.
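
If you want to watch that in action while you're splitting PGs, compare
each PG's "up" set (what CRUSH currently says) with its "acting" set (who
is actually serving I/O, including any pg_temp override). The sketch below
lists the PGs where an override is in effect; it parses the JSON from
`ceph pg dump`, and the exact layout of that output can vary between
releases:

#!/usr/bin/env python
# Sketch: list PGs whose acting set (serving I/O right now, including any
# pg_temp override) differs from the "up" set that CRUSH computes.
# Field names come from `ceph pg dump --format json` and may vary by release.
import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]).decode())
# Older releases put the stats at the top level, newer ones under pg_map.
stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

for pg in stats:
    if pg["up"] != pg["acting"]:
        print("%s up=%s acting=%s state=%s"
              % (pg["pgid"], pg["up"], pg["acting"], pg["state"]))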

> Again, maybe an obvious question, but as I understand it, the set of OSDs
> that are in a PG must adhere to the diversity/redundancy rules defined at
> the pool level -- is that right?

Yeah.
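If you want to double-check which rule a given pool is using (and
therefore which failure-domain constraints its PGs have to satisfy), you
can cross-reference `ceph osd dump` with `ceph osd crush rule dump`.
Rough sketch only -- field names shift a bit between releases
(crush_ruleset vs crush_rule, for instance):

#!/usr/bin/env python
# Sketch: show each pool's replica count and the CRUSH rule it uses; the
# rule's choose/chooseleaf steps are what enforce the failure-domain
# diversity. Field names may vary slightly between Ceph releases.
import json
import subprocess

def ceph_json(*args):
    return json.loads(subprocess.check_output(
        ("ceph",) + args + ("--format", "json")).decode())

rules = {r.get("ruleset", r.get("rule_id")): r
         for r in ceph_json("osd", "crush", "rule", "dump")}

for pool in ceph_json("osd", "dump")["pools"]:
    # Older releases call this field crush_ruleset, newer ones crush_rule.
    rule_key = pool.get("crush_ruleset", pool.get("crush_rule"))
    rule = rules.get(rule_key, {})
    print("pool=%s size=%s rule=%s"
          % (pool["pool_name"], pool["size"], rule.get("rule_name")))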

>
> Let's say for some pool I want to define 123 PGs. Does Ceph then do all the
> heavy lifting of identifying the corresponding OSDs per PG that satisfy the
> pool's diversity/redundancy requirements -- up to identifying OSD sets for
> all 123 PGs? Is it possible that multiple PGs may have the exact same set
> of OSDs -- due to the pool's diversity/redundancy requirements and limited
> servers/racks/rows, etc.?

Yes, Ceph does all the heavy lifting. Multiple PGs with the same OSDs
can happen (e.g., if you only have two OSDs, all PGs will be on both),
but it behaves about as well as is possible within the configuration
you give it.
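
If you're curious how often that happens on your cluster, you can count
how many PGs in a pool land on each distinct OSD set. Another rough
sketch against `ceph pg dump` -- the pool id is a placeholder, and PG ids
are of the form <poolid>.<seq>:

#!/usr/bin/env python
# Sketch: count how many PGs of one pool map onto each distinct OSD set.
# The pool id is a placeholder; PG ids look like "<poolid>.<seq>".
# Field names come from `ceph pg dump --format json` and may vary by release.
import collections
import json
import subprocess

POOL_ID = "3"  # hypothetical pool id -- see `ceph osd lspools`

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]).decode())
stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

per_set = collections.Counter(
    tuple(pg["up"]) for pg in stats if pg["pgid"].startswith(POOL_ID + "."))

for osds, count in per_set.most_common():
    print("OSDs %s hold %d PG(s)" % (list(osds), count))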
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com