Re: who uses Lustre in production with virtual machines?

Les Mikesell <lesmikesell@xxxxxxxxx> · Thu, 05 Aug 2010 12:38:36 -0500

On 8/5/2010 12:12 PM, Emmanuel Noobadmin wrote:
>
>> What you want is difficult to accomplish even in a local file system.  I
>> think it would be unreasonably expensive (both in speed and cost) to put
>> your entire data store on something that provides both replication and
>> transactional guarantees.   I'd like to be convinced otherwise,
>> though...   Is it a requirement that you can recover your transactional
>> state after a complete power loss or is it enough to have reached the
>> buffers of a replica system?
>
> For the local side, I can rely on ACID compliant database engines such
> as InnoDB on MySQL to maintain transactional integrity.

If you are going to do that, why not also rely on the database engine's
replication, which is aware of the transactions?  Databases rely on
filesystem write ordering and fsync() actually working - things that
aren't always reliable locally, much less when clustered.
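
To make the fsync() dependency concrete, here is a minimal Python sketch of
what "committed" means to an application writing a log record. The function
name durable_write and the file name are hypothetical, and the directory
fsync assumes a POSIX system such as Linux. It only shows the calls involved;
whether the data really is on stable storage afterwards still depends on the
filesystem, the block layer and the drive honoring the flush.

```python
import os


def durable_write(path, data):
    """Write data to path; return only after fsync() has reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush the file's data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also fsync the containing directory so the new directory entry
    # survives a crash (assumption: POSIX semantics, not portable to Windows).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A database engine does the equivalent before it acknowledges a commit, which
is why any layer underneath that acknowledges a flush early undermines the
guarantee.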

> For DRBD and gluster, if I'm not mistaken, unless I deliberately set
> otherwise, a write must have at least reached the replica buffers
> before it's considered as committed. So this scenario is unlikely to
> arise, thus I don't see this as a problem with using them as a machine
> replication service, as compared to the unknown delay of using zfs
> send/receive to replicate.

But there are lots of ways things can go wrong, and clustering just adds 
to them.  What happens when your replica host dies?  Or the network to 
it, or the disk where you expect the copy to land?  And if you don't 
wait for a sync to disk, what happens if these things break after the
remote has accepted the buffer copy?

> While I'm using DB as an example, the same issue applies to the VM
> disk image.

The DB will offer a more optimized alternative. A VM image won't.  But 
can you afford to wait for transactional guarantees on all that data 
that mostly doesn't matter?

> The upper layer cannot be told a write is done until it's
> been at least sent out to the replica system. The way I see it, under
> DRBD or gluster replicate I would have a consistency issue only if the
> replica dies after receiving the write, followed by the primary dying
> after receiving the ack AND reporting the result to the user AND both
> drives in its mirror dying. I know it's not possible to
> guarantee 100% but I can live with this kind of probability as
> compared to a several seconds delay where several transactions/changes
> could have taken place before a replica receives an update.

So how long do you wait if it is the replica that breaks?  And how do 
you recover/sync later?
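
The two questions above can be put in code form. The following is a toy
Python model only, not how DRBD or gluster actually work: the Replica and
Primary classes, the "degraded" result and the dirty set are all
hypothetical names chosen for illustration. It shows the choice a
synchronous replicator has to make when the ack does not arrive (here it
carries on alone and remembers what the replica missed) and the resync that
has to happen later.

```python
class Replica:
    """Hypothetical in-memory stand-in for a remote replica."""

    def __init__(self):
        self.blocks = {}
        self.alive = True

    def write(self, block_no, data):
        if not self.alive:
            # Stands in for an ack that never arrives within the timeout.
            raise TimeoutError("no ack from replica")
        self.blocks[block_no] = data


class Primary:
    """Writes locally, then waits for the replica before reporting success."""

    def __init__(self, replica):
        self.blocks = {}
        self.replica = replica
        self.dirty = set()  # blocks the replica is known to be missing

    def write(self, block_no, data):
        self.blocks[block_no] = data
        try:
            self.replica.write(block_no, data)
        except TimeoutError:
            # Policy choice: keep running unreplicated and track the gap.
            # The alternative is to block or fail the write.
            self.dirty.add(block_no)
            return "degraded"
        self.dirty.discard(block_no)
        return "replicated"

    def resync(self):
        """Copy every block the replica missed; raises if it is still down."""
        for block_no in sorted(self.dirty):
            self.replica.write(block_no, self.blocks[block_no])
        self.dirty.clear()
```

While the primary is in the "degraded" state it has exactly the single-copy
exposure that replication was meant to remove, and the resync competes with
live traffic for the same disks.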

>> I don't see how you can have transactional replication if the servers
>> don't have to stay in sync, or how you can avoid being slowed down by
>> the head motion of a good drive being replicated to a new mirror.
>> There's just some physics involved that don't make sense.
>
> Sorry for the confusion, I don't mean there would be no slowdown, nor
> do I expect the underlying fs to be responsible for transactional
> replication. That's the job of the DBMS; I just need the fs replication
> not to fail in such a way that it could cause transactional integrity
> issues as noted in my reply above.

That's a lot to ask.  I'd like to be convinced it is possible.

-- 
   Les Mikesell
    lesmikesell@xxxxxxxxx
_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
http://lists.centos.org/mailman/listinfo/centos