On 7/19/16, 11:40 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:

>You can see the asserts in the destructor of File (~File). If we have
>any valid refcount, I think Boost asserts when the object is deleted.
>These are intrusive pointers.

Ah, you are correct, I misunderstood your point. You were saying that
asserts would fire if readers/writers exist at File object destruction
time. I thought you were saying that asserts would fire if
readers/writers exist when the file's link count goes to zero; those
are the asserts I could not find.

>Will update if I have a successful recreate of the same scenario.

Very good, thanks!

Kevan

>Varada
>
>On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
>> Varada,
>>
>>> If we are using FileWriter, we will (should) have a reference on the
>>> fnode. I don't think that will be deleted from the file_map. Will
>>> wait for your run.
>>>
>>> open_for_read and open_for_write will take a reference on the file,
>>> and we will increment num_readers and num_writers on the file
>>> reference. We have asserts to catch any readers/writers in the case
>>> of file deletion.
>>
>> I had hoped that, but can't find them in the code, unless they were
>> added recently. See routine _drop_link(): when the link count hits
>> zero, it proceeds immediately with the inode/extent deletion. It
>> does have an assert there checking file->num_reading, but (in my
>> copy) no checks for num_readers and num_writers. There are also no
>> checks of the 'nref' in RefCountedObject, from which File inherits.
>> If you do find any of these elsewhere that provide protection,
>> please let me know, I'd like to understand this code better.
>>
>> Thanks for your help digging into this, and trying to reproduce it.
>> I'll update the bug shortly.
>>
>> Kevan
>>
>> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>
>>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>>>> Varada,
>>>>
>>>> I added the following assert in _truncate() but haven't yet had it
>>>> fire in our test cluster. If you or others also want to give it a
>>>> try, that would be great. It seems like a useful assert.
>>>>
>>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>>>> index dbfd59b..759d1af 100644
>>>> --- a/src/os/bluestore/BlueFS.cc
>>>> +++ b/src/os/bluestore/BlueFS.cc
>>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>>>             << " file " << h->file->fnode << dendl;
>>>>
>>>> +  // If ino is not in filemap, we are truncating a file that is not allocated.
>>>> +
>>>> +  auto p = file_map.find(h->file->fnode.ino);
>>>> +  if (p == file_map.end()) {
>>>> +    assert(0 == "_truncate(): truncating a file that is not allocated");
>>>> +  }
>>>> +
>>>>    // truncate off unflushed data?
>>>>    if (h->pos < offset &&
>>>>        h->pos + h->buffer.length() > offset) {
>>> If we are using FileWriter, we will (should) have a reference on the
>>> fnode. I don't think that will be deleted from the file_map. Will
>>> wait for your run.
>>>> The test case which seemed to make the problem occur more
>>>> frequently for us consists of mounting a cephfs filesystem, then
>>>> starting up 32 instances of ior on each of 4 client machines.
>>>> The ior command looks like this:
>>>>
>>>>   ior -a POSIX -o /mnt/cephfs/test -b 8m -i 10
>>>>
>>>> In the meantime I have been looking further at the code, and I
>>>> think I see the problem-causing sequence of events.
>>>>
>>>> BlueFS implements unlink() differently from the POSIX model. In
>>>> POSIX, the inode and disk extents for a file are not released
>>>> until both the link count reaches zero and the inode reference
>>>> count (file open count) reaches zero. BlueFS, however, releases
>>>> the inode and disk extents when the link count (file->refs)
>>>> reaches zero, even if there are threads that still have the file
>>>> open.
>>> open_for_read and open_for_write will take a reference on the file,
>>> and we will increment num_readers and num_writers on the file
>>> reference. We have asserts to catch any readers/writers in the case
>>> of file deletion.
>>>> Routine _drop_link() does the inode and disk extent releases, and
>>>> it sets a File variable called 'deleted' to true, but note that it
>>>> does not otherwise clear the released inode number or disk extent
>>>> fields in the File. It is therefore the responsibility of every
>>>> thread with a FileRef for that file to first obtain a lock (to
>>>> prevent races with _drop_link()), then look at the value of
>>>> 'deleted', and only continue using the File if 'deleted' is false.
>>>>
>>>> Function BlueRocksWritableFile::Close() does not do this. The
>>>> routine does not grab a lock; it calls GetPreallocationStatus(),
>>>> and if last_allocated_block is > 0, it calls fs->truncate() (which
>>>> does get a lock), which calls _truncate(). Routine _truncate()
>>>> lowers the file size and calls op_file_update(), which posts an
>>>> update transaction entry to the log for a file that is already
>>>> unlinked. This is how we end up with an inode in the transaction
>>>> log that has disk extents but a link count of zero.
>>> Yes, I also found the same code path, which truncates the file.
>>> Trying to write some unit test cases to run all these ops in a
>>> concurrent way to find the problem.
>>>> Adding a check of the 'deleted' variable after the lock is taken
>>>> in _truncate() would work, but it bothers me a bit to see
>>>> BlueRocksWritableFile::Close() using the File struct as if it is
>>>> still valid. Folks with more BlueFS experience can decide if that
>>>> is a problem or not. They might also be able to describe
>>>> situations where a file could be unlinked while it is still open;
>>>> that might lead to better test cases which fire the assert more
>>>> quickly.
>>> Added some asserts to find the issue using a test with multiple
>>> threads. Still trying to reproduce the issue. I believe we should
>>> have a reference to the file. Will verify that anyway.
>>>> I'm also a bit concerned about the integrity of the BlueFS
>>>> filesystem, should the log have to be replayed. This problem was
>>>> only seen because the inode number wasn't reused in the log after
>>>> it was released. What if a subsequent log transaction reallocates
>>>> the inode? The "file with link count 0" assert would not catch
>>>> it. I guess that subsequent reallocation should overlay the bad
>>>> FileRef->fnode with a new fnode, so maybe it's harmless?
>>> Hopefully. Will try to add some debug around the reference count
>>> of the file and verify the theory.
>>>
>>>> It would be nice to have my assert() trigger to provide the
>>>> "smoking gun".
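>>>> For concreteness, the 'deleted' check I have in mind would look
>>>> roughly like this (an untested sketch against my copy of
>>>> BlueFS.cc; the exact context lines and placement may differ):
>>>>
>>>> --- a/src/os/bluestore/BlueFS.cc
>>>> +++ b/src/os/bluestore/BlueFS.cc
>>>> @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>>>             << " file " << h->file->fnode << dendl;
>>>>
>>>> +  // _drop_link() set 'deleted' when the link count hit zero;
>>>> +  // using the stale fnode past this point would log an
>>>> +  // op_file_update() for an unlinked file, resurrecting the
>>>> +  // inode in the replay log.  The caller (truncate()) already
>>>> +  // holds the lock, so the check does not race with _drop_link().
>>>> +  if (h->file->deleted) {
>>>> +    dout(10) << __func__ << " file already deleted, ignoring" << dendl;
>>>> +    return 0;
>>>> +  }
>>>> +
>>>>    // truncate off unflushed data?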
>>>> However, if I can't duplicate the problem in the next day or two,
>>>> I will probably just patch _truncate() to return immediately if
>>>> 'deleted' is true, and move on.
>>>>
>>>> What do you think, should I post this history to bug 15724, create
>>>> a separate bug, anything else?
>>> Please post your findings to 15724; if we find a different problem
>>> we can file a different ticket.
>>>
>>> Varada
>>>> Thanks, Kevan
>>>>
>>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf
>>>> of Kevan Rehm" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>>>> krehm@xxxxxxxx> wrote:
>>>>
>>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>>>>
>>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>>>>>> Varada, see below.
>>>>>>>
>>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I have made a bit of progress. Still looking for ideas/advice
>>>>>>>>> while I keep digging.
>>>>>>>>>
>>>>>>>>> First, the ceph version I forgot to mention last time:
>>>>>>>>>
>>>>>>>>> # ceph -v
>>>>>>>>> ceph version 11.0.0-196-g85bb43e
>>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>>>>>
>>>>>>>>> As a refresher, here is the beginning of the problem
>>>>>>>>> transaction which will result in an inode with 0 links. There
>>>>>>>>> are actually three inodes in this transaction which will all
>>>>>>>>> end up having 0 links: inodes 109, 102, and 113.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000: txn(seq 77752 len 35269 crc 1796157826)
>>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink db/000154.log
>>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 109
>>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink db/000153.log
>>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 102
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink db/000151.log
>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 113
>>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000: op_dir_unlink db/000150.log
>>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_remove 111
>>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update file(ino 113 size 71354938 mtime 2016-07-12 17:05:53.738683 bdev 1 extents [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 = 0x7fb66aafc880 (new)
>>>>>>>> When _get_file() is called with an inode number, if the file is
>>>>>>>> not found in file_map, a new file is created and returned;
>>>>>>>> hence that (new) in the log file. This set of events can happen
>>>>>>>> in the case of a rename; you can see the same in
>>>>>>>> BlueFS::rename(). In that case, we use the same inode number.
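>>>>>>>> Roughly, _get_file() behaves like this during replay (a
>>>>>>>> paraphrased sketch from memory, not the verbatim source):
>>>>>>>>
>>>>>>>> FileRef BlueFS::_get_file(uint64_t ino)
>>>>>>>> {
>>>>>>>>   auto p = file_map.find(ino);
>>>>>>>>   if (p == file_map.end()) {
>>>>>>>>     FileRef f = new File;           // unknown ino: make a fresh File
>>>>>>>>     file_map[ino] = f;
>>>>>>>>     dout(30) << __func__ << " ino " << ino << " = " << f
>>>>>>>>              << " (new)" << dendl;  // the "(new)" in your excerpt
>>>>>>>>     return f;
>>>>>>>>   }
>>>>>>>>   return p->second;
>>>>>>>> }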
>>>>>>> I'm not sure I understand; I do not believe this is a rename
>>>>>>> case. For this to be a rename, there would have to be at least
>>>>>>> one op_dir_link() call for each op_dir_unlink() call in the
>>>>>>> sequence above. The only code sequence I can find where an
>>>>>>> op_dir_unlink() and an op_file_remove() appear together without
>>>>>>> a corresponding op_dir_link() is when a file is being deleted.
>>>>>>>
>>>>>>> The (new) above makes my point exactly. The op_file_remove()
>>>>>>> call above freed the block allocation for the inode and
>>>>>>> released the inode from the filemap, showing that file->refs
>>>>>>> had to be zero at that time (routine _drop_link). Yet the
>>>>>>> following op_file_update() call is performing a truncate on
>>>>>>> that same inode, which is not in the filemap. So the caller
>>>>>>> has to be holding a file structure whose refs field is greater
>>>>>>> than zero, which implies that the caller's file structure has
>>>>>>> to be stale. The fact that the op_file_update file contents
>>>>>>> for inode 113 are exactly the same as the previous
>>>>>>> op_file_update contents prior to the op_file_remove(), except
>>>>>>> for a reduction in file size, proves, I believe, that the file
>>>>>>> struct used by the op_file_update() caller is stale.
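>>>>>>> For reference, here are the relevant File fields as I read them
>>>>>>> in my copy of the tree (paraphrased; the exact types may differ):
>>>>>>>
>>>>>>> struct File : public RefCountedObject {
>>>>>>>   bluefs_fnode_t fnode;    // ino, size, mtime, extents
>>>>>>>   int refs = 0;            // link count, dropped by _drop_link()
>>>>>>>   bool deleted = false;    // set by _drop_link() when refs hits 0
>>>>>>>   int num_readers = 0;     // bumped by open_for_read()
>>>>>>>   int num_writers = 0;     // bumped by open_for_write()
>>>>>>>   int num_reading = 0;     // the one asserted in _drop_link()
>>>>>>> };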
>>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update file(ino 102 size 68557043 mtime 2016-07-12 17:05:54.615190 bdev 1 extents [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 = 0x7fb66aafbb90 (new)
>>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000: op_file_update file(ino 109 size 64421913 mtime 2016-07-12 22:41:33.028030 bdev 1 extents [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 = 0x7fb66aafbe60 (new)
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> The prior transaction log entry for each of the three inodes
>>>>>>>>> was another op_file_update() call. The only difference between
>>>>>>>>> the two updates for ino 113 was the file size, which went from
>>>>>>>>> 71359362 -> 71354938, 4424 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>> Are the extents the same for both events?
>>>>>>> Yes; please compare this line with the one higher above in this
>>>>>>> email, and you can see that only the file size is different.
>>>>>>> That is true for all three of these inodes.
>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000: op_file_update file(ino 113 size 71359362 mtime 2016-07-12 17:05:53.738683 bdev 1 extents [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 = 0x7fb66aafbb90
>>>>>>>>>
>>>>>>>>> Same story for inode 109: the file size changed from 71354939
>>>>>>>>> -> 64421913, 6933026 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>>>
>>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000: op_file_update file(ino 109 size 71354939 mtime 2016-07-12 22:41:33.028030 bdev 1 extents [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 = 0x7fb66aafc910
>>>>>>>>>
>>>>>>>>> Same story for inode 102: the file size changed from 71359405
>>>>>>>>> -> 68557043, 2802362 fewer bytes; all other fields are
>>>>>>>>> identical.
>>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000: op_file_update file(ino 102 size 71359405 mtime 2016-07-12 17:05:54.615190 bdev 1 extents [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 = 0x7fb66aafbe60
>>>>>>>>>
>>>>>>>>> There is only one caller of op_file_update() in which the size
>>>>>>>>> of the file is reduced but all other fields stay the same:
>>>>>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>>>>> BlueFS::truncate(), which is only called by
>>>>>>>>> BlueRocksWritableFile::Close().
>>>>>>>>>
>>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the
>>>>>>>>> three inodes seem to be coming from
>>>>>>>>> BlueRocksEnv::DeleteFile(), but I'm having difficulty finding
>>>>>>>>> where the call is coming from.
>>>>>>>> Any call that renames a directory, removes a file, or removes a
>>>>>>>> directory will result in BlueFS::rename or unlink or the
>>>>>>>> corresponding calls, and hence in op_dir_* or op_file_*
>>>>>>>> operations. The rocksdb wal_manager or compaction can trigger
>>>>>>>> these calls during that process. Will take a closer look at the
>>>>>>>> logs, to see if I can get something from them.
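>>>>>>>> As a quick way to scan those logs, something like the following
>>>>>>>> could flag the suspect pattern (an op_file_update replayed after
>>>>>>>> an op_file_remove of the same ino). This is a hypothetical
>>>>>>>> helper, not part of the ceph tree; the line patterns are taken
>>>>>>>> from your excerpts, and a legitimately reallocated ino would
>>>>>>>> show up as a false positive, so hits still need manual
>>>>>>>> inspection:
>>>>>>>>
>>>>>>>> // bluefs-log-scan.cc
>>>>>>>> // Build: g++ -std=c++11 -o bluefs-log-scan bluefs-log-scan.cc
>>>>>>>> #include <fstream>
>>>>>>>> #include <iostream>
>>>>>>>> #include <regex>
>>>>>>>> #include <set>
>>>>>>>> #include <string>
>>>>>>>>
>>>>>>>> int main(int argc, char **argv)
>>>>>>>> {
>>>>>>>>   if (argc < 2) {
>>>>>>>>     std::cerr << "usage: " << argv[0] << " <osd-debug-log>\n";
>>>>>>>>     return 1;
>>>>>>>>   }
>>>>>>>>   std::ifstream in(argv[1]);
>>>>>>>>   std::regex remove_re("op_file_remove\\s+(\\d+)");
>>>>>>>>   std::regex update_re("op_file_update\\s+file\\(ino\\s+(\\d+)");
>>>>>>>>   std::set<std::string> removed;  // inos whose extents were freed
>>>>>>>>   std::string line;
>>>>>>>>   while (std::getline(in, line)) {
>>>>>>>>     std::smatch m;
>>>>>>>>     if (std::regex_search(line, m, remove_re)) {
>>>>>>>>       removed.insert(m[1]);
>>>>>>>>     } else if (std::regex_search(line, m, update_re) &&
>>>>>>>>                removed.count(m[1])) {
>>>>>>>>       // An update logged for an already-removed ino.
>>>>>>>>       std::cout << "update-after-remove ino " << m[1] << ": "
>>>>>>>>                 << line << "\n";
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>   return 0;
>>>>>>>> }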
>>>>>>>>> So the problem appears to be that the file is first deleted,
>>>>>>>>> then it is closed, and a byproduct of the close operation on a
>>>>>>>>> stale fnode is that another op_file_update() entry gets
>>>>>>>>> logged, recreating the deleted inode with a stale fnode,
>>>>>>>>> resulting in the 0 links failure.
>>>>>>>> Once the file is deleted, all the underlying extents will be
>>>>>>>> cleared. Unless there is a race in handling the files at the
>>>>>>>> application level, BlueFS doesn't have any control over them.
>>>>>>>> If a file (handle) is being reused after deletion, that seems
>>>>>>>> to be a bug. Will take a look at the logs to confirm your
>>>>>>>> theory. But here, this seems to be more of a rename use case;
>>>>>>>> one more thing to verify is the extents owned by the inodes
>>>>>>>> mentioned.
>>>>>>> No, I don't believe this is a rename; see above.
>>>>>>>
>>>>>>> I'll note that every occurrence of this problem has been
>>>>>>> preceded by an OSD segfault. And this problem is racy; if I
>>>>>>> increase debug logging for bluefs or bluestore, then the problem
>>>>>>> goes away. I have 50+ bluestore segfaults and asserts in my
>>>>>>> cluster logs; I suspect many could be due to this problem,
>>>>>>> because the same range of blocks is being freed more than once.
>>>>>>> If some other file should perform an allocation between the
>>>>>>> block-freeing in op_file_remove() and the second freeing in
>>>>>>> op_file_update(), then we could easily have multiple files with
>>>>>>> the same block allocation at the same time.
>>>>>>>
>>>>>>> I think this problem can be caught fairly easily: an assert
>>>>>>> could be added to _truncate(). The _truncate() routine is only
>>>>>>> called for files which are open, which means that the inode has
>>>>>>> to be in the filemap. Code could be added at the front of
>>>>>>> _truncate() to look up the inode in the filemap, and if it is
>>>>>>> not there, then assert. That would provide a stack trace showing
>>>>>>> the caller using the file struct with the stale reference count.
>>>>>>> (It may be a few days before I have a chance to try this
>>>>>>> myself.)
>>>>>> Sure. Could you please let me know the steps to hit the issue? I
>>>>>> am doing only few-minute runs to check the performance and some
>>>>>> unit test cases. I will add the necessary asserts if I can
>>>>>> reproduce the issue, and fix it.
>>>>> Thanks for the help! I will have to get back to you on Monday, as
>>>>> there is another person on our team who runs the benchmarks that
>>>>> trigger this problem; I don't know how to run his benchmarks
>>>>> myself.
>>>>>
>>>>> Kevan
>>>>>
>>>>>> Varada
>>>>>>> Regards, Kevan