On 11 Oct 2021 at 03:19, Dave Chinner wrote:
> On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> >
>> >>
>> >> Ok. The above solution looks logically correct. I haven't been able to come up
>> >> with a scenario where the solution wouldn't work. I will implement it and see
>> >> if anything breaks.
>> >
>> > I think I can poke one hole in it - I missed the fact that if we
>> > upgrade at inode read time, and then we modify the inode without
>> > modifying the inode core (can we even do that - metadata mods should
>> > at least change timestamps right?) then we don't log the format
>> > change or the NREXT64 inode flag change and they only appear in the
>> > on-disk inode at writeback.
>> >
>> > Log recovery needs to be checked for correct behaviour here. I think
>> > that if the inode is in NREXT64 format when read in and the log
>> > inode core is not, then the on disk LSN must be more recent than
>> > what is being recovered from the log and should be skipped. If
>> > NREXT64 is present in the log inode, then we logged the core
>> > properly and we just don't care what format is on disk because we
>> > replay it into NREXT64 format and write that back.
>>
>> xfs_inode_item_format() logs the inode core regardless of whether
>> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
>> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
>> should not result in a scenario where the corresponding
>> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
>
> Except that log recovery might be replaying lots of inode changes
> such as:
>
>	log inode
>	commit A
>	log inode
>	commit B
>	log inode
>	set NREXT64
>	commit C
>	writeback inode
>	<crash before log tail moves>
>
> Recovery will then replay commit A, B and C, in which case we *must
> not recover the log inode* in commit A or B because the LSN in the
> on-disk inode points at commit C. Hence replaying A or B will result
> in the on-disk inode going backwards in time and hence resulting in
> an inconsistent state on disk until commit C is recovered.
>
>> i.e. there is no need to compare LSNs of the checkpoint
>> transaction being replayed and that of the disk inode.
>
> Incorrect: we -always- have to do this, regardless of the change
> being made.
>
>> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
>> field, then we can safely conclude that the ondisk inode has to be updated to
>> reflect this change
>
> We can't assume that. This makes an assumption that NREXT64 is
> only ever a one-way transition. There's nothing in the disk format that
> prevents us from -removing- NREXT64 for inodes that don't need large
> extent counts.
>
> Yes, the -current implementation- does not allow going back to small
> extent counts, but the on-disk format design still needs to allow
> for such things to be done as we may need such functionality and
> flexibility in the on-disk format in the future.
>
> Hence we have to ensure that log recovery handles both set and reset
> transitions from the start. If we don't ensure that log recovery
> handles reset conditions when we first add the feature bit, then
> we are going to have to add a log incompat or another feature bit
> to stop older kernels from trying to recover reset operations.
>

Ok. I had never considered the possibility of transitioning an inode back
into the 32-bit data fork extent count format.
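
The ordering problem in the commit A/B/C sequence quoted above can be
modelled in userspace. What follows is only a rough sketch, not kernel code:
struct disk_inode, struct log_inode, replay_one() and replay_all() are
made-up names for illustration, and a plain integer comparison stands in for
XFS_LSN_CMP(), which really compares a cycle/block pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for illustration; not kernel types. */
struct disk_inode {
	int64_t	lsn;		/* LSN stamped on the inode at writeback */
	int	nrext64;	/* current state of the NREXT64 flag */
};

struct log_inode {
	int64_t	lsn;		/* LSN of the checkpoint holding this item */
	int	nrext64;	/* flag state captured when the inode was logged */
};

/*
 * Replay one logged inode unless the on-disk copy is already newer.
 * Assumption: integer ordering models the LSN ordering.
 * Returns 1 if the item was replayed, 0 if it was skipped.
 */
static int replay_one(struct disk_inode *dip, const struct log_inode *lip)
{
	if (dip->lsn > lip->lsn)
		return 0;	/* replaying would move the inode back in time */
	dip->nrext64 = lip->nrext64;
	return 1;
}

/* Replay items in log order and return how many were applied. */
static int replay_all(struct disk_inode *dip, const struct log_inode *items,
		      size_t n)
{
	int replayed = 0;

	for (size_t i = 0; i < n; i++)
		replayed += replay_one(dip, &items[i]);
	return replayed;
}
```

With an on-disk inode written back at LSN 3 (after commit C) and items logged
at LSNs 1, 2 and 3, replay_all() applies only the last item, so the inode
never regresses to the state captured in commit A or B.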
With this new requirement, I now understand the reasoning behind comparing
the ondisk inode's LSN and the checkpoint transaction's LSN.

As you have mentioned earlier, comparing LSNs is required not only for the
change introduced in this patch, but also for any other change in the value
of any of the inode's fields. Without such a comparison, the inode can
temporarily end up in an inconsistent state during log replay. To that end,
the following code snippet from xlog_recover_inode_commit_pass2() skips
playing back xfs_log_dinode entries when the ondisk inode's LSN is greater
than the checkpoint transaction's LSN,

	if (dip->di_version >= 3) {
		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);

		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
			trace_xfs_log_recover_inode_skip(log, in_f);
			error = 0;
			goto out_owner_change;
		}
	}

However, if the commits in the sequence below belong to three different
checkpoint transactions having the same LSN,

	log inode
	commit A
	log inode
	commit B
	set NREXT64
	log inode
	commit C
	writeback inode
	<crash before log tail moves>

then the above code snippet won't prevent the inode from becoming temporarily
inconsistent, because commits A and B will still be replayed. To handle this,
we should probably go with the additional rule of "Replay log inode if both
the log inode and the ondisk inode have the same value for NREXT64 bit".

With that additional rule in place, the following sequence will result in a
consistent inode state even if all three checkpoint transactions have the
same LSN,

	log inode
	commit A
	set NREXT64
	log inode
	commit B
	clear NREXT64
	log inode
	commit C
	writeback inode
	<crash before log tail moves>

i.e. commit B won't be replayed, since its log inode has NREXT64 set while
the ondisk inode does not.

Please let me know if my understanding is incorrect.

> IOWs, the only determining factor as to whether we should replay an
> inode is the LSN of the on-disk inode vs the LSN of the transaction
> being replayed.
> Feature bits in either the on-disk or log inode are
> not reliable indicators of whether a dynamically set feature is
> active or not at the time the inode item is being replayed...
>
>> >> > FWIW, I also think doing something like this would help make the
>> >> > code be easier to read and confirm that it is obviously correct when
>> >> > reading it:
>> >> >
>> >> >	__be32		di_gid;		/* owner's group id */
>> >> >	__be32		di_nlink;	/* number of links to file */
>> >> >	__be16		di_projid_lo;	/* lower part of owner's project id */
>> >> >	__be16		di_projid_hi;	/* higher part owner's project id */
>> >> >	union {
>> >> >		__be64	di_big_dextcnt;	/* NREXT64 data extents */
>> >> >		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
>> >> >		struct {
>> >> >			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
>> >> >			__be16	di_flushiter;	/* V2 inode incremented on flush */
>> >> >		};
>> >> >	};
>> >> >	xfs_timestamp_t	di_atime;	/* time last accessed */
>> >> >	xfs_timestamp_t	di_mtime;	/* time last modified */
>> >> >	xfs_timestamp_t	di_ctime;	/* time created/inode modified */
>> >> >	__be64		di_size;	/* number of bytes in file */
>> >> >	__be64		di_nblocks;	/* # of direct & btree blocks used */
>> >> >	__be32		di_extsize;	/* basic/minimum extent size for file */
>> >> >	union {
>> >> >		struct {
>> >> >			__be32	di_big_aextcnt;	/* NREXT64 attr extents */
>> >> >			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
>> >> >		};
>> >> >		struct {
>> >> >			__be32	di_nextents;	/* !NREXT64 data extents */
>> >> >			__be16	di_anextents;	/* !NREXT64 attr extents */
>> >> >		};
>> >> >	};
>>
>> The two structures above result in padding and hence result in a hole being
>> introduced.
>> The entire union above can be replaced with the following,
>>
>>	union {
>>		__be32	di_big_aextcnt;	/* NREXT64 attr extents */
>>		__be32	di_nextents;	/* !NREXT64 data extents */
>>	};
>>	union {
>>		__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
>>		__be16	di_anextents;	/* !NREXT64 attr extents */
>>	};
>
> I don't think this makes sense. This groups by field rather than
> by feature layout. It doesn't make it clear at all that these
> variables both change definition at the same time - they are either
> a {di_nexts, di_anexts} pair or a {di_big_aexts, pad} pair. That's the
> whole point of using anonymous structs here - it defines and
> documents the relationship between the layouts when certain features
> are set rather than relying on people to parse the comments
> correctly to determine the relationship....

Ok. I will need to check if there are alternative ways of arranging the
fields to accomplish the goal stated above. I will think about this and get
back as soon as possible.

--
chandan