Re: Crawling and indexing hardware

"Krishna Srinivas" <krishna@xxxxxxxxxxxxx> · Fri, 9 May 2008 15:37:40 +0530

On Fri, May 9, 2008 at 2:33 PM, Marcus Herou <marcus.herou@xxxxxxxxxxxxx> wrote:
> Oooops. Didn't think of that with AFR. However I think Lucene always creates
> new files when documents are flushed to disk, so on a per-commit basis there
> will be low impact. But the scenario you're talking about will most definitely
> kick in when optimization of the index occurs. Hundreds of smaller files
> are aggregated into bigger, more compact files. Since Lucene cannot hold all
> smaller files in memory it will flush parts of the merge in "log" files
> which will trigger the case you're talking about.
>
> So basically the absolute worst case possible using GlusterFS with AFR would
> be to use it with a webserver access log, right?
>
> I think I will go for AFR when it comes to the billion small files, since
> they are almost never updated, but is there a smart way of updating big files
> in GlusterFS?

What do you mean by a smart way? Are you referring to the unsmart way
self-heal happens now, or just to write()s?
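
To make the two cases in that question concrete, here is a back-of-the-envelope
sketch. It is not GlusterFS code and makes no claim about how AFR actually
behaves. It only compares the bytes that would cross the network under two
hypothetical policies: (a) each write() payload is forwarded to the other
replicas, and (b) the whole file is copied to the other replicas every time it
changes. The function names and the workload are made up for illustration.

```python
def bytes_if_writes_forwarded(write_sizes, replicas):
    """Policy (a), assumed: every write() payload goes to each other replica."""
    return sum(write_sizes) * (replicas - 1)


def bytes_if_whole_file_copied(initial_size, write_sizes, replicas):
    """Policy (b), assumed: after every append the entire file is re-copied."""
    total = 0
    size = initial_size
    for n in write_sizes:
        size += n                        # the file grows by the appended bytes
        total += size * (replicas - 1)   # the full file is sent to each other replica
    return total


if __name__ == "__main__":
    GIB = 1024 ** 3
    appends = [1] * 10  # ten one-byte appends to a 1 GiB file (hypothetical workload)
    print("forwarded writes :", bytes_if_writes_forwarded(appends, replicas=2))
    print("whole-file copies:", bytes_if_whole_file_copied(GIB, appends, replicas=2))
```

Under policy (b) an append-heavy file such as an access log pays for its full
size on every change, which is why it matters which of the two cases is meant.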

>> Do you plan to do any AFR (automatic file replication) ?  If so,
>> consider that even a one-byte change to your "big index files" will
>> cause the /entire/ file to be AFR'd between all participating nodes.

Marcus, what do you mean by this?

Krishna

>
> Perhaps Gluster is a bad choice for Lucene indexing and I really need to go
> for having many cheap boxes with local disks instead.
>
> Kindly
>
> //Marcus
>
>
>
> On Fri, May 9, 2008 at 10:37 AM, Daniel Maher
> <dma+gluster@xxxxxxxxx>
> wrote:
>
>> On Wed, 7 May 2008 20:06:40 +0200 "Marcus Herou"
>> <marcus.herou@xxxxxxxxxxxxx> wrote:
>>
>> > 1.  Big index files ~x Gig each
>> > 2.  Many small files in a huge number of directories.
>>
>> Do you plan to do any AFR (automatic file replication) ?  If so,
>> consider that even a one-byte change to your "big index files" will
>> cause the /entire/ file to be AFR'd between all participating nodes.
>>
>> > Finally what tools would be suited to testing zillions of small files ?
>> > Bonnie++ ? Fewer big files ? Still Bonnie++ or perhaps IOZone ?
>>
>> IOZone is an interesting tool, assuming you can interpret the
>> results. :P  I have been using Bonnie++ and FFSB extensively over the
>> past couple of weeks to stress-test / benchmark Gluster.  Both have the
>> advantage of producing easily interpretable results, and FFSB is highly
>> configurable, depending on what sort of tests you'd like to run (read /
>> write / both, small / large files, lots / few files, etc..).
>>
>> The following page contains some sample FFSB configs to work from :
>> http://tastic.brillig.org/~jwb/zfs-xfs-ext4.html
>> (see "Step 8".)
>>
>> Cheers !
>>
>> --
>> Daniel Maher <dma AT witbe.net>
>>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.herou@xxxxxxxxxxxxx
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>
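
For the "zillions of small files" question quoted above, a quick sanity check
before reaching for Bonnie++ or FFSB might look like the sketch below. It is
not a substitute for either tool: it only times the creation of many small
files spread over many directories. The function name, directory layout, file
counts, and sizes are all assumptions chosen for illustration, and the
temporary directory is a placeholder for a path on the mount under test.

```python
import os
import tempfile
import time


def create_small_files(root, n_dirs, files_per_dir, size_bytes):
    """Create n_dirs * files_per_dir files of size_bytes each under root.

    Returns (file_count, elapsed_seconds).
    """
    payload = b"x" * size_bytes
    start = time.perf_counter()
    count = 0
    for d in range(n_dirs):
        dir_path = os.path.join(root, f"d{d:05d}")
        os.makedirs(dir_path, exist_ok=True)
        for f in range(files_per_dir):
            with open(os.path.join(dir_path, f"f{f:05d}"), "wb") as fh:
                fh.write(payload)
            count += 1
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder location: pass dir=<path on the filesystem under test> to
    # TemporaryDirectory to exercise a real mount instead of the local temp dir.
    with tempfile.TemporaryDirectory() as root:
        n, secs = create_small_files(root, n_dirs=10, files_per_dir=100,
                                     size_bytes=4096)
        print(f"{n} files created in {secs:.3f}s")
```

Running the same parameters against a local disk and against the mount under
test gives a rough first comparison. Page-cache effects and the lack of fsync
mean the numbers are only indicative.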