Re: bluestore bug#15724: bluefs _replay file with link count 0:

On 7/19/16, 11:40 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:

>You can see the asserts in the destructor of File (~File). If we have
>any valid refcount, I think boost asserts when the object is deleted.
>These are intrusive pointers.
  
Ah, you are correct, I misunderstood your point.  You were saying that
asserts would fire if readers/writers exist at File object destruction
time.  I thought you were saying that asserts would fire if
readers/writers exist when the file's link count goes to zero; those are
the asserts I could not find.
 
>
>Will update if I have a successful recreate of the same scenario.
 
Very good, thanks!

Kevan
 
>
>Varada
>
>On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
>> Varada,
>>
>>> If we are using FileWriter, we will (should) have a reference on the
>>> fnode. I don't think that will be deleted from the file_map. Will
>>> wait for your run.
>>>
>>
>>> open_for_read and open_for_write will take a reference on the file,
>>> and we will increment num_readers and num_writers on the file
>>> reference. We have asserts to catch any readers/writers in the case
>>> of file deletion.
>>
>> I had hoped that, but can't find them in the code, unless they were
>> added recently.  See routine _drop_link(): when the link count hits
>> zero, it proceeds immediately with the inode/extent deletion.  It does
>> have an assert there checking file->num_reading, but (in my copy) no
>> checks for num_readers and num_writers.  There are also no checks of
>> the 'nref' in RefCountedObject, from which File inherits.  If you do
>> find any of these elsewhere that provide protection, please let me
>> know, I'd like to understand this code better.
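>>
>> For concreteness, the kind of protection I was hoping to find in
>> _drop_link() looks roughly like this (a sketch only, not the real
>> code; the field names are from my reading of my copy of BlueFS):
>>
>>   void BlueFS::_drop_link(FileRef file)   // sketch
>>   {
>>     assert(file->refs > 0);
>>     --file->refs;
>>     if (file->refs == 0) {
>>       assert(file->num_reading == 0);  // this check exists in my copy
>>       assert(file->num_readers == 0);  // these two are the checks
>>       assert(file->num_writers == 0);  // I could not find
>>       // ... release the inode number and disk extents ...
>>       file->deleted = true;
>>     }
>>   }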
>>
>> Thanks for your help digging into this, and trying to reproduce it.
>> I'll update the bug shortly.
>>
>> Kevan
>>
>>
>> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>
>>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>>>> Varada,
>>>>
>>>> I added the following assert in _truncate() but haven't yet had it
>>>> fire in our test cluster.  If you or others also want to give it a
>>>> try, that would be great.  It seems like a useful assert.
>>>>
>>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>>>> index dbfd59b..759d1af 100644
>>>> --- a/src/os/bluestore/BlueFS.cc
>>>> +++ b/src/os/bluestore/BlueFS.cc
>>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>>>             << " file " << h->file->fnode << dendl;
>>>>
>>>> +  // If ino is not in filemap, we are truncating a file that is not allocated.
>>>> +
>>>> +  auto p = file_map.find(h->file->fnode.ino);
>>>> +  if (p == file_map.end()) {
>>>> +    assert(0 == "_truncate(): truncating a file that is not allocated");
>>>> +  }
>>>> +
>>>>    // truncate off unflushed data?
>>>>    if (h->pos < offset &&
>>>>        h->pos + h->buffer.length() > offset) {
>>> If we are using FileWriter, we will (should) have a reference on the
>>> fnode. I don't think that will be deleted from the file_map. Will
>>> wait for your run.
>>>> The test case which seemed to make the problem occur more frequently
>>>> for us consists of mounting a cephfs filesystem, then starting up 32
>>>> instances of ior on each of 4 client machines.  The ior command
>>>> looks like this:
>>>>
>>>> ior -a POSIX -o /mnt/cephfs/test -b 8m -i 10
>>>>
>>>>
>>>> In the meantime I have been looking further at the code, and I think
>>>> I see the problem-causing sequence of events.
>>>>
>>>> BlueFS implements unlink() differently than the POSIX model.  In
>>>> POSIX, the inode and disk extents for a file are not released until
>>>> both the link count reaches zero and the inode reference count (file
>>>> open count) reaches zero.  BlueFS, however, releases the inode and
>>>> disk extents when the link count (file->refs) reaches zero, even if
>>>> there are threads that still have the file open.
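>>>>
>>>> Schematically (my paraphrase, not real code from either system; the
>>>> names inode->nlink, inode->open_count, and
>>>> release_inode_and_extents() are made up for illustration):
>>>>
>>>>   // POSIX-style: release only when BOTH counts reach zero
>>>>   if (inode->nlink == 0 && inode->open_count == 0)
>>>>     release_inode_and_extents(inode);
>>>>
>>>>   // BlueFS _drop_link(): release as soon as the link count is zero
>>>>   if (--file->refs == 0)
>>>>     release_inode_and_extents(file);   // even if still open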
>>> open_for_read and open_for_write will take a reference on the file,
>>> and we will increment num_readers and num_writers on the file
>>> reference. We have asserts to catch any readers/writers in the case
>>> of file deletion.
>>>> Routine _drop_link() does the inode and disk extent releases, and it
>>>> sets a File variable called 'deleted' to true, but note that it does
>>>> not otherwise clear the released inode number or disk extent fields
>>>> in the File.  It is therefore the responsibility of every thread
>>>> with a FileRef for that file to first obtain a lock (to prevent
>>>> races with _drop_link()), then look at the value of 'deleted', and
>>>> only continue using the File if 'deleted' is false.
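>>>>
>>>> In other words, every FileRef holder needs something like this
>>>> before touching the fnode (a sketch only; the lock and flag names
>>>> are as I read them in my copy of the code):
>>>>
>>>>   std::lock_guard<std::mutex> l(lock); // lock _drop_link() runs under
>>>>   if (file->deleted)
>>>>     return 0;      // unlinked; the fnode's ino/extents are stale
>>>>   // only now is it safe to use (or log) file->fnode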
>>>>
>>>> Function BlueRocksWritableFile::Close() does not do this.  The
>>>> routine does not grab a lock; it calls GetPreallocationStatus(), and
>>>> if last_allocated_block is > 0, it calls fs->truncate() (which does
>>>> get a lock), which calls _truncate().  Routine _truncate() lowers
>>>> the file size and calls op_file_update(), which posts an update
>>>> transaction entry to the log for a file that is already unlinked.
>>>> This is how we end up with an inode in the transaction log that has
>>>> disk extents but a link count of zero.
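>>>>
>>>> The call chain as I read it, paraphrased from my copy of
>>>> BlueRocksEnv.cc (treat this as a sketch, not the exact source):
>>>>
>>>>   rocksdb::Status BlueRocksWritableFile::Close()
>>>>   {
>>>>     // ...
>>>>     size_t block_size;
>>>>     size_t last_allocated_block;
>>>>     GetPreallocationStatus(&block_size, &last_allocated_block);
>>>>     if (last_allocated_block > 0) {
>>>>       // trims preallocated space -- but never checks file->deleted
>>>>       fs->truncate(h, h->pos);  // takes the lock, calls _truncate()
>>>>     }
>>>>     return rocksdb::Status::OK();
>>>>   }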
>>> Yes, I also found the same code path, which truncates the file.
>>> Trying to write some unit test cases to run all these ops in a
>>> concurrent way to find the problem.
>>>> Adding a check of the 'deleted' variable after the lock is taken in
>>>> _truncate() would work, but it bothers me a bit to see
>>>> BlueRocksWritableFile::Close() using the File struct as if it is
>>>> still valid.  Folks with more BlueFS experience can decide if that
>>>> is a problem or not.  They might also be able to describe situations
>>>> where a file could be unlinked while it is still open; that might
>>>> lead to better test cases which fire the assert more quickly.
>>> Added some asserts to find the issue, using a test with multiple
>>> threads.  Still trying to reproduce the issue.  I believe we should
>>> have a reference to the file.  Will verify that anyway.
>>>> I'm also a bit concerned about the integrity of the BlueFS
>>>> filesystem, should the log have to be replayed.  This problem was
>>>> only seen because the inode number wasn't reused in the log after it
>>>> was released.  What if a subsequent log transaction reallocates the
>>>> inode?  The "file with link count 0" assert would not catch it.  I
>>>> guess that subsequent reallocation should overlay the bad
>>>> FileRef->fnode with a new fnode, so maybe it's harmless?
>>> Hopefully.  Will try to add some debug around the reference count of
>>> the file and verify the theory.
>>>
>>>> It would be nice to have my assert() trigger to provide the "smoking
>>>> gun".  However, if I can't duplicate the problem in the next day or
>>>> two, I will probably just patch _truncate() to return immediately if
>>>> 'deleted' is true, and move on.
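>>>>
>>>> That patch would look something like this (an untested sketch, in
>>>> the same style as the assert diff above):
>>>>
>>>> --- a/src/os/bluestore/BlueFS.cc
>>>> +++ b/src/os/bluestore/BlueFS.cc
>>>> @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
>>>> +  if (h->file->deleted) {
>>>> +    // the file was unlinked while still open; do not log an
>>>> +    // op_file_update for an inode whose extents were already freed
>>>> +    return 0;
>>>> +  }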
>>>>
>>>> What do you think: should I post this history to bug 15724, create
>>>> a separate bug, or do something else?
>>> Please post your findings to 15724; if we find a different problem
>>> we can file a different ticket.
>>>
>>> Varada
>>>> Thanks, Kevan
>>>>
>>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>>>> Kevan
>>>> Rehm" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of krehm@xxxxxxxx>
>>>> wrote:
>>>>
>>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>>>>
>>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>>>>> Varada, see below.
>>>>>>>
>>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I have made a bit of progress.  Still looking for ideas/advice
>>>>>>>>> while I keep digging.
>>>>>>>>> First, the ceph version I forgot to mention last time.
>>>>>>>>>
>>>>>>>>> # ceph -v
>>>>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As a refresher, here is the beginning of the problem
>>>>>>>>> transaction which will result in an inode with 0 links.  There
>>>>>>>>> are actually three inodes in this transaction which will all
>>>>>>>>> end up having 0 links: inodes 109, 102, and 113.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000: txn(seq 77752 len 35269 crc 1796157826)
>>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink  db/000154.log
>>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 109
>>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink  db/000153.log
>>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 102
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink  db/000151.log
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 113
>>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink  db/000150.log
>>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 111
>>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update  file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev 1 extents [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 = 0x7fb66aafc880 (new)
>>>>>>>> When _get_file() is called with an inode number, if the file is
>>>>>>>> not found in file_map a new file is created and returned, hence
>>>>>>>> that (new) in the log file.  This set of events can happen in
>>>>>>>> the case of a rename; you can see the same in BlueFS::rename().
>>>>>>>> In that case, we use the same inode number.
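>>>>>>>>
>>>>>>>> For reference, _get_file() is essentially this (quoted from
>>>>>>>> memory, so treat it as a sketch):
>>>>>>>>
>>>>>>>>   FileRef BlueFS::_get_file(uint64_t ino)
>>>>>>>>   {
>>>>>>>>     auto p = file_map.find(ino);
>>>>>>>>     if (p == file_map.end()) {
>>>>>>>>       FileRef f = new File;
>>>>>>>>       file_map[ino] = f;
>>>>>>>>       dout(30) << __func__ << " ino " << ino << " = " << f
>>>>>>>>                << " (new)" << dendl;
>>>>>>>>       return f;
>>>>>>>>     }
>>>>>>>>     dout(30) << __func__ << " ino " << ino << " = " << p->second
>>>>>>>>              << dendl;
>>>>>>>>     return p->second;
>>>>>>>>   }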
>>>>>>> I'm not sure I understand; I do not believe this is a rename
>>>>>>> case.  For this to be a rename, there would have to be at least
>>>>>>> one op_dir_link() call for each op_dir_unlink() call in the
>>>>>>> sequence above.  The only code sequence I can find where an
>>>>>>> op_dir_unlink() and an op_file_remove() appear together without a
>>>>>>> corresponding op_dir_link() is when a file is being deleted.
>>>>>>>
>>>>>>> The (new) above makes my point exactly.  The op_file_remove()
>>>>>>> call above freed the block allocation for the inode and released
>>>>>>> the inode from the filemap, showing that file->refs had to be
>>>>>>> zero at that time (routine _drop_link).  Yet the following
>>>>>>> op_file_update() call is performing a truncate on that same
>>>>>>> inode, which is not in the filemap.  So the caller has to be
>>>>>>> holding a file structure whose refs field is greater than zero,
>>>>>>> which implies that the caller's file structure has to be stale.
>>>>>>> The fact that the op_file_update file contents for inode 113 are
>>>>>>> exactly the same as the previous op_file_update contents prior to
>>>>>>> the op_file_remove(), except for a reduction in file size, proves
>>>>>>> I believe that the file struct used by the op_file_update()
>>>>>>> caller is stale.
>>>>>>>
>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update  file(ino 102 size 68557043 mtime 2016-07-12 17:05:54.615190 bdev 1 extents [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 = 0x7fb66aafbb90 (new)
>>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update  file(ino 109 size 64421913 mtime 2016-07-12 22:41:33.028030 bdev 1 extents [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 = 0x7fb66aafbe60 (new)
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The prior transaction log entry for each of the three inodes
>>>>>>>>> was another op_file_update() call.  The only difference between
>>>>>>>>> the two updates for ino 113 was that the file size went from
>>>>>>>>> 71359362 -> 71354938, 4424 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>> Are the extents the same for both events?
>>>>>>> Yes, please compare this line with the one higher above in this
>>>>>>> email; you can see that only the file size is different.  That is
>>>>>>> true for all three of these inodes.
>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000: op_file_update  file(ino 113 size 71359362 mtime 2016-07-12 17:05:53.738683 bdev 1 extents [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 = 0x7fb66aafbb90
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same story for inode 109: the file size changed from 71354939
>>>>>>>>> -> 64421913, 6933026 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000: op_file_update  file(ino 109 size 71354939 mtime 2016-07-12 22:41:33.028030 bdev 1 extents [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 = 0x7fb66aafc910
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Same story for inode 102: the file size changed from 71359405
>>>>>>>>> -> 68557043, 2802362 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000: op_file_update  file(ino 102 size 71359405 mtime 2016-07-12 17:05:54.615190 bdev 1 extents [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 = 0x7fb66aafbe60
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is only one caller of op_file_update() in which the size
>>>>>>>>> of the file is reduced but all other fields stay the same: that
>>>>>>>>> is BlueFS::_truncate(), which in turn is only called by
>>>>>>>>> BlueFS::truncate(), which is only called by
>>>>>>>>> BlueRocksWritableFile::Close().
>>>>>>>>>
>>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the
>>>>>>>>> three inodes seem to be coming from BlueRocksEnv::DeleteFile(),
>>>>>>>>> but I'm having difficulty finding where the call is coming
>>>>>>>>> from.
>>>>>>>> Any syscall making a dir rename, file remove, or directory
>>>>>>>> remove will result in BlueFS::rename or unlink or corresponding
>>>>>>>> calls, and results in op_dir_* or op_file_* operations.  rocksdb
>>>>>>>> wal_manager or compaction can trigger these calls during that
>>>>>>>> process.  Will take a closer look at the logs, if I can get
>>>>>>>> something from them.
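>>>>>>>>
>>>>>>>> Roughly, the chain I have in mind (a sketch of my understanding,
>>>>>>>> not yet verified against the logs):
>>>>>>>>
>>>>>>>>   rocksdb (wal_manager/compaction deletes an old .log file)
>>>>>>>>     -> BlueRocksEnv::DeleteFile()
>>>>>>>>       -> BlueFS::unlink()  -> logs op_dir_unlink
>>>>>>>>         -> _drop_link()    -> logs op_file_remove when refs == 0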
>>>>>>>>> So the problem appears to be that the file is first deleted,
>>>>>>>>> then it is closed, and a byproduct of the close operation on a
>>>>>>>>> stale fnode is that another op_file_update() entry gets logged,
>>>>>>>>> recreating the deleted inode with a stale fnode, resulting in
>>>>>>>>> the 0 links failure.
>>>>>>>> Once the file is deleted, all the underlying extents will be
>>>>>>>> cleared.  Unless there is a race in handling the files at the
>>>>>>>> application level, BlueFS doesn't have any control over them.
>>>>>>>> If a file (handle) is being reused after deletion, that seems to
>>>>>>>> be a bug.  Will take a look at the logs to confirm your theory.
>>>>>>>> But here, this seems to be more of a rename use case; one more
>>>>>>>> thing to verify is the extents owned by the inodes mentioned.
>>>>>>> No, I don't believe this is a rename, see above.
>>>>>>>
>>>>>>> I'll note that every occurrence of this problem has been preceded
>>>>>>> by an OSD segfault.  And this problem is racy; if I increase
>>>>>>> debug logging for bluefs or bluestore, then the problem goes
>>>>>>> away.  I have 50+ bluestore segfaults and asserts in my cluster
>>>>>>> logs; I suspect many could be due to this problem, because the
>>>>>>> same range of blocks is being freed more than once.  If some
>>>>>>> other file should perform an allocation between the block-freeing
>>>>>>> in op_file_remove() and the second freeing in op_file_update(),
>>>>>>> then we could easily have multiple files with the same block
>>>>>>> allocation at the same time.
>>>>>>>
>>>>>>> I think this problem can be caught fairly easily; an assert
>>>>>>> could be added to _truncate().  The _truncate() routine is only
>>>>>>> called for files which are open, which means that the inode has
>>>>>>> to be in the filemap.  Code could be added at the front of
>>>>>>> _truncate() to look up the inode in the filemap, and if it is not
>>>>>>> there, then assert.  That would provide a stack trace showing the
>>>>>>> caller using the file struct with the stale reference count.  (It
>>>>>>> may be a few days before I have a chance to try this myself.)
>>>>>> Sure.  Could you please let me know the steps to hit the issue?
>>>>>> I am doing only a few-minute runs to check the performance and
>>>>>> some unit test cases.  I will add the necessary asserts if I can
>>>>>> reproduce the issue, and fix it.
>>>>> Thanks for the help!  I will have to get back to you on Monday, as
>>>>> there is another person on our team who runs the benchmarks that
>>>>> trigger this problem; I don't know how to run his benchmarks
>>>>> myself.
>>>>>
>>>>> Kevan
>>>>>
>>>>>> Varada
>>>>>>> Regards, Kevan
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
