On Sat, May 12, 2007 11:46 am, tedd wrote:
> Sorry to get back to this so late, but I had some other pressing
> matters.

No worries.

It's not like I'm likely to disappear from here anytime soon :-)

>>Thanks to Tedd for answering the question I asked, I think, even
>>though I was asking the wrong question. :-)
>
> No problem, but you did ask the right question. You touched on
> something I think you intuitively knew, but have been sidetracked by
> an easy "solution".
>
> At 9:40 AM -0500 5/3/07, Richard Lynch wrote:
>>But as I realized last night, the data is ALREADY in that "curve" and
>>by simply breaking down in even increments from MIN to MAX, the
>>"curve" works itself out correctly.
>
> Sort of.
>
> If you are content with dividing the top 100 things into strict
> groups of 20 for a tag cloud distribution, then fine. However, the
> "20 items per group" rule is not defined in terms of the group's
> distribution, which would be a better representation of the data.
> Keep in mind you are trying to show which items are the most popular
> in a representative way.

I'm not dividing them into groups of 20.

I'm taking the min/max of the top 100, and dividing the SCALE into 5
equal chunks.

The scores themselves are weighted already, with only one or two in
the top 1/5th, a handful in the 2nd 1/5th, a goodly number in the 3rd
1/5th, a lot in the 4th 1/5th, and a buttload are down in that last
1/5th.

In other words, I took the Top 100, and graphed them on normal
cartesian graph paper --

What I was originally trying to do was graph them on logarithmic
paper.

> It's difficult to explain, so I'll show you:
>
> http://sperling.com/a/stdev/
>
> Each group (color -- could be tags) falls within a division based
> upon the standard deviation (SD) of the population. The cyan group is
> within one SD of the most popular -- the yellow group is within two
> SD of the most popular and so on.
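
For anyone following along, the two bucketing schemes being compared
can be sketched roughly as below. This is a minimal illustration in
Python, not the code behind either site. The tag names, scores, and
function names are all made up. equal_width_bucket cuts the MIN..MAX
scale into 5 equal chunks, and sd_bucket counts how many standard
deviations a score sits below the most popular one, which is only one
reading of the description above.

```python
from statistics import pstdev


def equal_width_bucket(score, lo, hi, buckets=5):
    """Return 0..buckets-1 by cutting the scale [lo, hi] into equal chunks."""
    if hi == lo:
        return 0  # every score is identical, so there is only one level
    idx = int((score - lo) / (hi - lo) * buckets)
    return min(idx, buckets - 1)  # the top score belongs in the top chunk


def sd_bucket(score, hi, sd, buckets=5):
    """Return 0..buckets-1 by whole standard deviations below the top score."""
    if sd == 0:
        return 0
    idx = int((hi - score) / sd)
    return min(idx, buckets - 1)  # lump everything further out together


def tag_levels(scores, buckets=5):
    """Assign each tag an equal-width level based on the min/max of the set."""
    lo, hi = min(scores.values()), max(scores.values())
    return {tag: equal_width_bucket(s, lo, hi, buckets)
            for tag, s in scores.items()}


# Hypothetical long-tail scores, purely for illustration.
scores = {"php": 980, "mysql": 610, "apache": 400, "regex": 220,
          "css": 150, "xml": 90, "ftp": 60, "pdf": 40}

levels = tag_levels(scores)

# Same data, grouped by distance from the most popular tag in SD units.
sd = pstdev(scores.values())
top = max(scores.values())
sd_levels = {tag: sd_bucket(s, top, sd) for tag, s in scores.items()}

if __name__ == "__main__":
    print(levels)
    print(sd_levels)
```

With these made-up scores, the equal-width version puts one tag in the
top chunk and five of the eight in the bottom chunk, without ever
fixing how many items go in each group. Note the two functions number
their groups in opposite directions: level 4 is the most popular in
the equal-width version, while group 0 is the most popular in the SD
version.
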
>
> All members of each color group have more in common with each other
> than with those outside their color group. If you will note, the
> numbers of each color group change due to distribution of the
> population. Using a strict "20 items per group" rule does not reflect
> that. So, if you arbitrarily assign members of the population to a
> group based solely on a strict division, then you are not accurately
> representing the tag cloud.

If I took the first 20, second 20, etc., yeah, that would be way wrong
as well.

I didn't do that.

I just scaled and offset my "grid" upon which to graph them in
cartesian space so it runs from MIN(top100) to MAX(top100), and then
let the chips fall where they may on graphing.

> Do you see what I mean?

Yes -- I think we ended up with pretty much the same result...

Well, not the *same*, but very similar shaped curves anyway.

But my way was "easier" as I just let the natural distribution of the
data on standard graph paper take care of distributing the points
where they belonged.

I suppose there is some merit to forcing the Standard Distribution
instead of living with whatever the "real" data is. But I'm happier
living with the Reality of the data than applying a Standard
Distribution to data which, according to some experts, isn't even a
Standard Distribution at all, but a "long tail" or some other terms
they bandy about that mean the same thing as "long tail" as I
understand it...

I don't claim my way is "right" -- just that it "works" and is dead
easy and is data-driven rather than conforming to some statistical
model which may or may not be the correct model in the first place.

If somebody NEEDS a Standard Distribution, for sure use Tedd's stuff,
cuz that is what that is.

If you're just trying to "graph" the data you have, whatever it may
be, just graph it, scaled and offset, and see what kind of curve you
have.

PS
I'll post the actual tag cloud page link once it's out of QA and not
hidden from search engines behind HTTP Basic Auth.
RSN, but definitely not until after php|tek: http://phparch.com/tek

--
Some people have a "gift" link here.
Know what I want?
I want you to buy a CD from some indie artist.
http://cdbaby.com/browse/from/lynch
Yeah, I get a buck. So?

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php