On 12/19/2013 04:00 PM, Alexandre Oliva wrote:
> On Dec 18, 2013, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>
>> This probably wouldn't be too hard to get working properly,
>
> For some value of properly ;-)
>
> The current state of affairs is that the parent attribute only gets
> updated when the log segment is about to be expired.  Worst case, using
> the proposed setxattr extension will force it to be updated earlier.
> How could that end up being a bad thing?  It's not like we even use the
> parent attribute for anything while the inode remains in the mds
> journal.
>
> So we have the following possibilities of divergence:
>
> a) the inode is created or moved, and then someone calls
> setxattr(parent), and the file remains in place until the inode gets
> expired from the journal.  the parent attribute will be updated at the
> time of the setxattr request, but it won't ever be used before the inode
> gets expired from the journal, at which point it would have been updated
> to the same value.
>
> b) the inode is absent from the journal, and someone calls
> setxattr(parent), and then moves the inode to a different location.  the
> parent attribute will be updated (a nop unless the attribute is missing
> or wrong) at the time of the setxattr request, and then the move
> operation will cause the attribute to be overwritten at the time the
> inode is about to be expired from the journal
>
> c) the inode is moved, then setxattr(parent)ed, then moved again, before
> the initial move gets expired from the journal.  the setxattr will be
> performed at the time it is requested, and it will be correct at that
> point; when the first inode move is expired from the journal, the parent
> attribute may or may not be updated (I'm not sure), but if it is, then
> we're back to the original behavior, and anyway, this incorrect value
> won't ever be used as long as the subsequent move remains in the journal
>
>
> Did I miss any case?
>

I think you are right.
Setting the parent xattr directly won't compromise the backtrace.

>
> Now, I've just run into another scenario in which this parent-setting
> is useful.  I had to resort to --reset-journal (for reasons unknown), but
> any files and directories created recently, whose create operations
> hadn't been expired from the journal yet, won't get a parent attribute
> from ceph unless I actually moved them about to force an update.  This
> means caps on them won't recover properly until I find out what they are
> and perform corrective action.
>
> Moving a bunch of objects is somewhat tricky, because if the mds
> restarts just at the wrong time, the move operation will seem to fail
> because the new mds won't recover that transaction correctly, precisely
> because the object is absent from the journal and missing the parent
> attribute.  This sort of problem will often get a client stuck, or
> signal an error that may or may not indicate the operation failed.
>
> Plus, if I have to do that move dance on a large number of objects, odds
> are the mds will get slow enough that a standby-replay mds will decide
> it's dead and take over, and then fail to recover the ongoing
> operations.  See where I'm going? :-)
>
> Having some means to update the internal bookkeeping parent attribute
> without actually touching the inodes, not even their ctimes, is a plus
> for this case.
>
>
> So now it's not just really old ceph nodes and a wish to have accurate
> information in the parent nodes, it's recovering from a --reset-journal
> required by some other failure I couldn't figure out.

Next time you encounter log corruption, please open a new issue at
http://tracker.ceph.com/

Regards
Yan, Zheng

>
> (hmm... if I have 2*N replicas of PGs in the metadata pool and demand N
> replicas to be up for the PG to be deemed complete, if I shut down the N
> replicas that are up after they get an update and bring up the other N
> replicas, they will know they're out of date, right?
> IIUC that's what the down state is about, although I'm not sure where
> the OSDs get the info from to decide to enter that state; I've always
> assumed it was from pg versions known by the monitors)
>
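
For what it's worth, cases a) to c) and the --reset-journal scenario can be sanity-checked with a toy model. This is only a sketch, not Ceph code: ToyInode and all of its fields and methods are hypothetical names made up for illustration. It assumes that lookups prefer the journal while any entry for the inode is still journaled, and (pessimistically, for case c) that expiring a journal entry writes back the location that entry recorded, even when that location is already stale.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyInode:
    """Hypothetical stand-in for an inode; not Ceph's data structure."""
    location: str                          # where the inode really lives now
    parent_attr: Optional[str] = None      # stored "parent" attribute
    journal: List[str] = field(default_factory=list)  # unexpired entries, oldest first

    def move(self, new_location: str) -> None:
        # A create or move is journaled; the stored attribute is not touched.
        self.location = new_location
        self.journal.append(new_location)

    def set_parent(self) -> None:
        # Proposed setxattr(parent): write the current location right away.
        self.parent_attr = self.location

    def expire_oldest(self) -> None:
        # Assumption: expiry writes back what the expired entry recorded.
        self.parent_attr = self.journal.pop(0)

    def reset_journal(self) -> None:
        # Models --reset-journal: journaled entries are simply lost.
        self.journal.clear()

    def lookup_parent(self) -> Optional[str]:
        # Assumption: the journal wins while the inode is still in it.
        return self.journal[-1] if self.journal else self.parent_attr


def consistent(inode: ToyInode) -> bool:
    return inode.lookup_parent() == inode.location


def case_a() -> bool:
    inode = ToyInode("/a", journal=["/a"])      # freshly created, still journaled
    inode.set_parent()
    ok = consistent(inode)
    inode.expire_oldest()                       # rewrites the same value
    return ok and consistent(inode)


def case_b() -> bool:
    inode = ToyInode("/a", parent_attr="/a")    # absent from the journal
    inode.set_parent()                          # a nop here
    inode.move("/b")
    ok = consistent(inode)
    inode.expire_oldest()                       # overwrites the attribute
    return ok and consistent(inode)


def case_c() -> bool:
    inode = ToyInode("/a", parent_attr="/a")
    inode.move("/b")
    inode.set_parent()                          # correct at this point
    inode.move("/c")
    inode.expire_oldest()                       # attribute goes stale: "/b"
    stale = inode.parent_attr != inode.location
    ok = consistent(inode)                      # but the journal still wins
    inode.expire_oldest()                       # attribute becomes "/c"
    return stale and ok and consistent(inode)


def reset_case() -> bool:
    inode = ToyInode("/new", journal=["/new"])  # created, never expired
    inode.reset_journal()
    broken = not consistent(inode)              # no attribute, no journal entry
    inode.set_parent()                          # repair without moving the inode
    return broken and consistent(inode)
```

Under these assumptions all four functions return True: the stale attribute in case c) is never the value a lookup would use, and after a journal reset an explicit set_parent() restores the attribute without moving the inode.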