On Mon, 2004-01-12 at 07:35, Steve McIntyre wrote:
> On Mon, Jan 05, 2004 at 10:35:00AM +0000, Me @ Plasmon wrote:
> >Has there been any progress on the issue reported by Andrew Patterson
> >in
> >
> >http://lists.sistina.com/pipermail/linux-lvm/2003-July/014422.html
> >
> >???
>
> Apparently not. Should I expect a response here, or ask on the main
> kernel list?

We worked through this problem with Heinz on LVM 1.07. One of our kernel
hackers (along with Heinz) came up with several possible solutions to a
race condition in the snapshot code. They are presented below.

We found that option #1 worked, but sometimes it could take a long time
to create multiple snapshots of the same LV under extremely heavy load.
The first snapshot is fine, but the second and later ones can take
hours. Option #2 imposed a huge performance penalty (90-99%) on writes
to the original LV. I don't remember option #3 helping at all, and we
never tried #4. We settled on option #1 and solved the long snapshot
creation time by stopping I/O while creating the snapshot.

The LVM snapshot problem is the same one I had with the nfs export hash
table (a classic reader/writer problem). Read and write semaphores are
used to allow concurrent read access to the snapshot hash table while
allowing exclusive write access. However, lvm_snapshot_remap_block ->
lvm_find_exception_table does the hash table optimization (i.e. writes
to the hash table) regardless of whether the read or the write semaphore
is held. The snapshot hash table has 128K entries, so there's more
potential for performance improvement there.

Here's the call tree for lvm_find_exception_table:

lvm_find_exception_table
- lvm_snapshot_remap_block
-- __remap_snapshot
-- _remap_snapshot
-- lvm_map

lvm_map, _remap_snapshot, and __remap_snapshot all call
lvm_snapshot_remap_block.

1) lvm_map

lvm_map()
{
        down_read(&lv->lv_lock)
        ...
        if (lvm_snapshot_remap_block(...))
        ...
        up_read(&lv->lv_lock)
}

This is bad, because lvm_snapshot_remap_block will call
lvm_find_exception_table, which could write to the hash table while only
holding the read semaphore during the optimization step.

2) _remap_snapshot + __remap_snapshot

lvm_map()
{
        down_read(&lv->lv_lock)
        ...
        _remap_snapshot()
        {
                down_read(&lv->lock)    // again!
                r = lvm_snapshot_remap_block()
                up_read(&lv->lock)

                if (!r)
                        __remap_snapshot()
                        {
                                down_write(&lv->lock)  // good idea, but we already hold read!
                                if (!lvm_snapshot_remap_block()) {}
                                up_write(&lv->lock)
                        }
        }
        up_read(&lv->lv_lock)
}

This is bad, because down_read(&lv->lock) is called twice unnecessarily.
Also, lvm_snapshot_remap_block can still write to the hash table while
only the read semaphore is held.

I don't know how lvm_map->_remap_snapshot->__remap_snapshot works at
all. down_write should block until all the readers have released the
semaphore; since this kernel process already holds the read semaphore,
it should never get the chance to acquire the write semaphore.

Options to fix (included as attachments and inline):

1) Don't do the hash table optimization in lvm_find_exception_table.
We tried this last night: no crashes, but slow. It doesn't fix the "take
the write semaphore while holding the read semaphore" problem in
__remap_snapshot().

--- lvm-snap.c~ Mon Aug 25 15:28:50 2003
+++ lvm-snap.c-erik     Thu Aug 28 13:33:28 2003
@@ -130,11 +130,13 @@ static inline lv_block_exception_t *lvm_
                exception = list_entry(next, lv_block_exception_t, hash);
                if (exception->rsector_org == org_start &&
                    exception->rdev_org == org_dev) {
+#if 0
                        if (i) {
                                /* fun, isn't it? :) */
                                list_del(next);
                                list_add(next, hash_table);
                        }
+#endif
                        ret = exception;
                        break;
                }
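For what it's worth, here is a small user-space demo of the race that the
#if 0 above removes. This is my own illustration, not LVM code: pthread
rwlocks stand in for the kernel rw-semaphores, the list helpers mimic the
kernel's list_del/list_add, and all the names (node, reader, etc.) are made
up. Two "readers" hold only the shared lock, yet both do the move-to-front,
so the unlink/relink steps can interleave and corrupt the bucket.

/* race-demo.c: build with "gcc -pthread race-demo.c" (user space only). */
#include <pthread.h>
#include <stdio.h>

struct node { struct node *prev, *next; int key; };

static struct node head = { &head, &head, -1 };   /* circular bucket head */
static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void list_del(struct node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
}

static void list_add(struct node *n, struct node *h)
{
        n->next = h->next;
        n->prev = h;
        h->next->prev = n;
        h->next = n;
}

static void *reader(void *arg)   /* a "lookup" that also rewrites the list */
{
        struct node *n = arg;
        for (int i = 0; i < 1000000; i++) {
                pthread_rwlock_rdlock(&lock);   /* shared lock only ...        */
                list_del(n);                    /* ... but these two lines are */
                list_add(n, &head);             /* writes: the actual race     */
                pthread_rwlock_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        static struct node a = { .key = 1 }, b = { .key = 2 };
        pthread_t t1, t2;

        list_add(&a, &head);
        list_add(&b, &head);
        pthread_create(&t1, NULL, reader, &a);
        pthread_create(&t2, NULL, reader, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* racy runs can lose a node here, loop forever, or crash outright */
        for (struct node *p = head.next; p != &head; p = p->next)
                printf("key %d\n", p->key);
        return 0;
}

Change rdlock to wrlock and the final walk always prints both keys; with the
shared lock the bucket can eventually lose a node, which is the kind of
corruption option #1 avoids by never modifying the list during a lookup.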
2) Clean up the locking: always take the write semaphore before going
into lvm_snapshot_remap_block, bearing in mind that we already hold the
read semaphore:

--- lvm.c~      Fri Aug 29 13:56:26 2003
+++ lvm.c-erik  Fri Aug 29 14:08:30 2003
@@ -1183,6 +1183,7 @@ static void __remap_snapshot(kdev_t rdev
 {
        /* copy a chunk from the origin to a snapshot device */
+       up_read(&lv->lv_lock);
        down_write(&lv->lv_lock);
 
        /* we must redo lvm_snapshot_remap_block in order to avoid a
@@ -1192,6 +1193,7 @@ static void __remap_snapshot(kdev_t rdev
                lvm_write_COW_table_block(vg, lv);
 
        up_write(&lv->lv_lock);
+       down_read(&lv->lv_lock);
 }
 
 static inline void _remap_snapshot(kdev_t rdev, ulong rsector,
@@ -1200,9 +1202,11 @@ static inline void _remap_snapshot(kdev_
        int r;
 
        /* check to see if this chunk is already in the snapshot */
-       down_read(&lv->lv_lock);
-       r = lvm_snapshot_remap_block(&rdev, &rsector, pe_start, lv);
        up_read(&lv->lv_lock);
+       down_write(&lv->lv_lock);
+       r = lvm_snapshot_remap_block(&rdev, &rsector, pe_start, lv);
+       up_write(&lv->lv_lock);
+       down_read(&lv->lv_lock);
 
        if (!r)
                /* we haven't yet copied this block to the snapshot */
@@ -1340,9 +1344,16 @@ static int lvm_map(struct buffer_head *b
                        goto out;
 
                if (lv->lv_access & LV_SNAPSHOT) {      /* remap snapshot */
+                       up_read(&lv->lv_lock);
+                       down_write(&lv->lv_lock);
                        if (lvm_snapshot_remap_block(&rdev_map, &rsector_map,
-                                                    pe_start, lv) < 0)
+                                                    pe_start, lv) < 0) {
+                               up_write(&lv->lv_lock);
+                               down_read(&lv->lv_lock);
                                goto bad;
+                       }
+                       up_write(&lv->lv_lock);
+                       down_read(&lv->lv_lock);
                } else if (rw == WRITE || rw == WRITEA) {       /* snapshot origin */
                        lv_t *snap;
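For anyone puzzled by the up_read/down_write/up_write/down_read dance in
the patch above: an rw-semaphore can't be upgraded in place, so the read
lock has to be dropped before the write lock is taken, and the remap check
has to be redone once the write lock is actually held. A rough sketch of
the pattern the patched paths boil down to (the helper name is made up, it
is not a function in the patch, and error handling is omitted):

/* sketch only: not in the patch, just the locking pattern it uses */
static int remap_under_write_lock(kdev_t *rdev, ulong *rsector,
                                  ulong pe_start, lv_t *lv)
{
        int r;

        /* caller (lvm_map) holds lv->lv_lock for read */
        up_read(&lv->lv_lock);          /* can't upgrade: release first ... */
        down_write(&lv->lv_lock);       /* ... then queue for exclusive     */

        /* the gap above is unlocked, so the lookup must be repeated here,
           under the write semaphore, rather than trusted from before */
        r = lvm_snapshot_remap_block(rdev, rsector, pe_start, lv);

        up_write(&lv->lv_lock);
        down_read(&lv->lv_lock);        /* give the caller back its read lock */
        return r;
}

That unlocked gap is also why __remap_snapshot's comment says it "must redo
lvm_snapshot_remap_block": another writer may have remapped the chunk before
the write semaphore is finally granted.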
3) Clean up the locking: grab the write semaphore in lvm_map and don't
take any other semaphores. This isn't recommended, and I'm sure Heinz
will cry when he sees it. It defeats much of the purpose of having
separate read and write semaphores, and lvm_map will hold the write
semaphore for a long time. This will kill concurrent access to a volume
and will probably hurt performance.

--- lvm.c~      Fri Aug 29 13:56:26 2003
+++ lvm.c-erik  Fri Aug 29 14:16:01 2003
@@ -1183,7 +1183,6 @@ static void __remap_snapshot(kdev_t rdev
 {
        /* copy a chunk from the origin to a snapshot device */
-       down_write(&lv->lv_lock);
 
        /* we must redo lvm_snapshot_remap_block in order to avoid a
           race condition in the gap where no lock was held */
@@ -1191,7 +1190,6 @@ static void __remap_snapshot(kdev_t rdev
            !lvm_snapshot_COW(rdev, rsector, pe_start, rsector, vg, lv))
                lvm_write_COW_table_block(vg, lv);
 
-       up_write(&lv->lv_lock);
 }
 
 static inline void _remap_snapshot(kdev_t rdev, ulong rsector,
@@ -1200,9 +1198,7 @@ static inline void _remap_snapshot(kdev_
        int r;
 
        /* check to see if this chunk is already in the snapshot */
-       down_read(&lv->lv_lock);
        r = lvm_snapshot_remap_block(&rdev, &rsector, pe_start, lv);
-       up_read(&lv->lv_lock);
 
        if (!r)
                /* we haven't yet copied this block to the snapshot */
@@ -1253,7 +1249,7 @@ static int lvm_map(struct buffer_head *b
 
        lv_t *lv = vg_this->lv[LV_BLK(minor)];
 
-       down_read(&lv->lv_lock);
+       down_write(&lv->lv_lock);
        if (!(lv->lv_status & LV_ACTIVE)) {
                printk(KERN_ALERT
                       "%s - lvm_map: ll_rw_blk for inactive LV %s\n",
@@ -1327,7 +1323,7 @@ static int lvm_map(struct buffer_head *b
                if (_defer_extent(bh, rw, rdev_map,
                                  rsector_map, vg_this->pe_size)) {
 
-                       up_read(&lv->lv_lock);
+                       up_write(&lv->lv_lock);
                        return 0;
                }
 
@@ -1365,13 +1361,13 @@ static int lvm_map(struct buffer_head *b
 out:
        bh->b_rdev = rdev_map;
        bh->b_rsector = rsector_map;
-       up_read(&lv->lv_lock);
+       up_write(&lv->lv_lock);
        return 1;
 
 bad:
        if (bh->b_end_io)
                buffer_IO_error(bh);
-       up_read(&lv->lv_lock);
+       up_write(&lv->lv_lock);
        return -1;
 } /* lvm_map() */

4) In lvm_snapshot_remap_block, somehow determine whether we hold the
read semaphore and magically turn it into a write semaphore before doing
the optimization. I'm not going to attempt this one.

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Andrew Patterson                        Voice: (970) 898-3261
Hewlett-Packard Company                 Email: andrew@fc.hp.com