Re: store_avg_object_size -- default needs updating?

On 02/25/2013 07:19 PM, Linda W wrote:
> Alex Rousskov wrote:
>> On 02/18/2013 04:01 PM, Linda W wrote:
>>> Has anyone looked at their average cached object size
>>> lately?
>>>
>>> At one point, I assume due to measurements, squid
>>> set a default to 13KB / item.
>>>
>>> About 6 or so years ago, I checked mine out:
>>> (cd /var/cache/squid;
>>> cachedirs=( $(printf "%02X " {0..63}) )
>>> echo $[$(du -sk|cut -f1)/$(find ${cachedirs[@]} -type f |wc -l)]
>>> )
>>> --- got 47K, or over 3x the default.
>>>
>>> Did it again recently:
>>> 310K/item average.
>>>
>>> Is the average size of web items going up, or are these numbers
>>> peculiar to my users' browsing habits (or auto-update programs from
>>> Windows going through the cache, etc.)?
>>
>> According to stats collected by Google in May 2010, the mean size of a
>> GET response was about 7KB:
>> https://developers.google.com/speed/articles/web-metrics
>>
>> Note that the median GET response size was less than 3KB. I doubt things
>> have changed that much since then.
> ---
> I'm pretty sure that google's stats would NOT be representative
> of the net as a whole.  Google doesn't serve content -- the service
> indexes content -- and indices of content are going to be significantly
> smaller than the content being indexed, especially when pictures or
> other non-text files are included.

I think you misunderstood what those "google stats" are about. They are
not about Google servers or services. They are collected from
Google-unrelated web sites around the world [that Google robots visit to
index them].
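
(An aside on the measurement itself: below is a minimal cleanup of the
one-liner quoted above, with its assumptions spelled out. The cache
path and the 64 first-level directories match the quoted script; a
stock squid.conf cache_dir line creates 16, so adjust both to whatever
your cache_dir says:)

    #!/bin/bash
    # Estimate the mean size, in KB, of objects in a Squid disk cache.
    cache=/var/cache/squid                # assumed cache_dir path
    cd "$cache" || exit 1
    l1dirs=( $(printf "%02X " {0..63}) )  # 00..3F, as in the quoted script
    kb=$(du -sk . | cut -f1)              # total KB used (swap.state included)
    n=$(find "${l1dirs[@]}" -type f | wc -l)  # count of cached objects
    echo $(( kb / n ))                    # integer mean object size, KB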



>> Google stats are biased because they are collected by Googlebot.
>> However, if you look at fresh HTTP archive stats, they seem to give a
>> picture closer to 2010 Google stats than to yours:
>> http://httparchive.org/trends.php#bytesTotal&reqTotal
>>
>> (I assume you need to divide bytesTotal by reqTotal to get mean response
>> size of about 14KB).
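
(Illustratively: if the trends page showed, say, bytesTotal ~ 1260 KB
per page and reqTotal ~ 90 requests per page, the mean would be
1260/90 = 14 KB per response. Those two values are made-up round
numbers chosen to land on 14, not figures quoted from the site.)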


> 	But I'll betcha they don't have any download sites on their
> top list. 


You lost that bet (they do have download sites), but you are right (they
do not count what you consider "downloads"):

>> http://httparchive.org/about.php#listofurls
>> How is the list of URLs generated?
>> 
>> Starting in November 2011, the list of URLs is based solely on the
>> Alexa Top 1,000,000 Sites (zip).

For example, download.com (#200), filehippo.com (#600), last.fm (#908),
and iso.com (#20619) are on that Alexa list (FWIW). However:

>> What are the limitations of this testing methodology (using lists)?
>> 
>> The HTTP Archive examines each URL in the list, but does not crawl
>> the website's other pages. Although these lists of websites (Fortune
>> 500 and Alexa Top 500 for example) are well known, the entire
>> website doesn't necessarily map well to a single URL.

so they probably do not get to download very large objects during
their crawl.


> Add in 'downloads.suse.org' and see how the numbers tally.

Alexa rates suse.org as #20'046'765 in the world so it was not included
in the 1M "top sites" HTTP archive sample, but it is probably not
popular enough to significantly affect statistics of an average Squid.


> Seriously -- stats that cut off anything > 4M are going to strongly
> bias things.

Yes, of course. For example, Google stats do not show any responses
exceeding 35MB, which means they were not downloading really large files.

If you find an unbiased source of information we can use to adjust
default averages, please post.
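
(Meanwhile, for anyone whose measured average is far from the 13KB
default, the knob is an ordinary squid.conf directive. The 47 KB below
is only one of the figures measured above, used as an example:)

    # squid.conf: store_avg_object_size only tunes Squid's estimate of
    # how many objects the cache can hold (it sizes the store index);
    # it does not change what gets cached.
    store_avg_object_size 47 KB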


Thank you,

Alex.


