Re: Strange issue with CRUSH

Looking at the output, even pool 19 has a pretty small number of PGs for that many OSDs:

+----------------------------------------------------------------------------+
| Pool ID: 19 |
+----------------------------------------------------------------------------+
| Participating OSDs: 1056 |
| Participating PGs: 16404 |
+----------------------------------------------------------------------------+
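
As a back-of-the-envelope check (my arithmetic, not readpgdump's output): the
mean of 46.6 further down implies 3x replication, so each OSD is only getting
about 47 PG copies from this pool.  That's well under the commonly cited
target of roughly 100 PGs per OSD, and the fewer placements each OSD gets, the
lumpier the distribution tends to look:

# Rough sketch, not part of readpgdump: average PG copies per OSD for pool 19.
pgs = 16404    # participating PGs reported above
size = 3       # assumed replication size (it matches the 46.6 mean)
osds = 1056    # participating OSDs
print("PG copies per OSD: %.1f" % (pgs * size / float(osds)))    # ~46.6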

And as you say, the distribution looks a little better than a totally random distribution:

| OSDs in All Roles (Up) |
| Expected PGs Per OSD: Min: 20, Max: 71, Mean: 46.6, Std Dev: 12.7 |
| Actual PGs Per OSD: Min: 24, Max: 69, Mean: 46.6, Std Dev: 6.9 |
| 5 Most Subscribed OSDs: 791(69), 977(69), 211(68), 536(67), 37(65) |
| 5 Least Subscribed OSDs: 1074(24), 1042(28), 215(29), 139(30), 205(30) |

But there's still a lot of variance between the most and least subscribed OSDs. It's worse if you look at OSDs acting in a primary role (i.e., servicing reads):

| OSDs in Primary Role (Up) |
| Expected PGs Per OSD: Min: 0, Max: 29, Mean: 15.5, Std Dev: 7.4 |
| Actual PGs Per OSD: Min: 5, Max: 32, Mean: 15.5, Std Dev: 3.8 |
| 5 Most Subscribed OSDs: 606(32), 211(30), 1065(27), 956(26), 228(25) |
| 5 Least Subscribed OSDs: 317(5), 550(5), 215(6), 473(6), 19(7) |
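
In case it helps with cross-checking those numbers outside of readpgdump,
here's a minimal sketch of counting placements straight from
"ceph pg dump --format=json" on stdin.  The field names ("pg_stats", "up") are
from memory, so adjust them if your JSON differs, and it simply treats the
first OSD in the up set as the primary (ignoring primary affinity):

#!/usr/bin/env python
# Count PG placements per OSD from `ceph pg dump --format=json` on stdin.
# Not readpgdump itself, just a quick cross-check.
import json
import sys
from collections import Counter

dump = json.load(sys.stdin)

all_roles = Counter()   # osd id -> PGs it holds in any role (up set)
primaries = Counter()   # osd id -> PGs for which it is the up primary

for pg in dump["pg_stats"]:
    up = pg["up"]
    all_roles.update(up)
    if up:
        primaries[up[0]] += 1   # first OSD in the up set serves as primary

print("5 most subscribed (all roles): %s" % all_roles.most_common(5))
print("5 most subscribed (primary):   %s" % primaries.most_common(5))

Piping "ceph pg dump --format=json" into that (saved as, say, count_pgs.py)
should land close to the "Actual" and "Most Subscribed" figures above.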

It may be worth increasing the PG count for that pool at least!
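
For what it's worth, that's just "ceph osd pool set <pool> pg_num <target>"
followed by the matching pgp_num bump once the new PGs have been created (the
data doesn't actually rebalance until pgp_num catches up), and it's usually
gentler to step it up in increments since splitting PGs moves data around.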

Mark


On 07/13/2015 11:11 AM, Gleb Borisov wrote:
Hi,

Forget about the exponential distribution. It was kind of the raving of a
madman :) It seems that it's really uniform.


I ran the tool mentioned above and saved the output to a gist:
https://gist.github.com/anonymous/d228fe9340825f33310b

We have one big pool for rgw (19) and several smaller pools (control pools
and a few for testing), and we also have two roots (default with 1056 OSDs
and ssd_default with 30 OSDs).

It seems that our distribution is slightly better than expected in your
code.

Thanks.

 > On Mon, Jul 13, 2015 at 6:20 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
 >>
 >> FWIW,
 >>
 >> It would be very interesting to see the output of:
 >>
 >> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
 >>
 >> if you see something that looks anomalous.  I'd like to make sure that
 >> I'm detecting issues like this.
 >>
 >> Mark
 >>
 >>
 >> On 07/09/2015 06:03 PM, Samuel Just wrote:
 >>>
 >>> I've seen some odd teuthology in the last week or two which seems to be
 >>> anomalous rjenkins hash behavior as well.
 >>>
 >>> http://tracker.ceph.com/issues/12231
 >>> -Sam
 >>>
 >>> ----- Original Message -----
 >>> From: "Sage Weil" <sweil@xxxxxxxxxx <mailto:sweil@xxxxxxxxxx>>
 >>> To: "Gleb Borisov" <borisov.gleb@xxxxxxxxx
<mailto:borisov.gleb@xxxxxxxxx>>
 >>> Cc: ceph-devel@xxxxxxxxxxxxxxx <mailto:ceph-devel@xxxxxxxxxxxxxxx>
 >>> Sent: Thursday, July 9, 2015 3:06:00 PM
 >>> Subject: Re: Strange issue with CRUSH
 >>>
 >>> On Fri, 10 Jul 2015, Gleb Borisov wrote:
 >>>>
 >>>> Hi Sage,
 >>>>
 >>>> Sorry for mailing you in person; I realize that you're quite busy at
 >>>> Red Hat, but I wanted you to have a look at an issue with the CRUSH map.
 >>>
 >>>
 >>> No problem. I hope you don't mind I've added ceph-devel to the cc list.
 >>>
 >>>> I've described the very first insights here:
 >>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/002897.html
 >>>>
 >>>> We have continued our research and found that the distribution of PG
 >>>> count per OSD is very strange, and after digging into the CRUSH source
 >>>> code we found the rjenkins1 hash function.
 >>>>
 >>>> After some testing we realized that rjenkins1's value distribution is
 >>>> exponential, and this could be causing our imbalance.
 >>>
 >>>
 >>> Any issue with rjenkins1's hash function is very interesting and
 >>> concerning.  Can you describe your analysis and what you mean by the
 >>> distribution being exponential?
 >>>
 >>>> What do you think about adding an additional hashing algorithm to CRUSH?
 >>>> It seems that it could improve the distribution.
 >>>
 >>>
 >>> I am definitely open to adding new hash functions, especially if the
 >>> current ones are flawed.  The current hash was created by making ad hoc
 >>> combinations of rjenkins' mix function with various numbers of
 >>> arguments--hardly scientific or methodical.  We did an analysis a couple
 >>> years back and found that it effectively modeled a uniform distribution,
 >>> but if we missed something or were wrong we should definitely correct it!
 >>>
 >>> In any case, the important step is to quantify what is wrong with the
 >>> current hash so that we can ensure any new one is not flawed in the same
 >>> way.
 >>>
 >>> Thanks-
 >>> sage
 >>>
 >>>
 >>>> We have also tried to generate some synthetic crushmaps (other bucket
 >>>> types, more OSDs per host, more/fewer hosts per rack, different counts of
 >>>> racks, linear osd ids, random osd ids, etc.), but didn't find any
 >>>> combination with a better distribution of PGs across OSDs.
 >>>>
 >>>> Thanks, and once more, sorry for bothering you in person.
 >>>> --
 >>>> Best regards,
 >>>> Gleb M Borisov
 >>>>
 >>>>




--
Best regards,
Gleb M Borisov