Hi Xavi,
On Tue, Jun 18, 2019 at 12:28 PM Xavi Hernandez <jahernan@xxxxxxxxxx> wrote:
Hi Kotresh,

On Tue, Jun 18, 2019 at 8:33 AM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

Hi Xavi,

Reply inline.

On Mon, Jun 17, 2019 at 5:38 PM Xavi Hernandez <jahernan@xxxxxxxxxx> wrote:

Hi Kotresh,

On Mon, Jun 17, 2019 at 1:50 PM Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:

Hi All,

The ctime feature is enabled by default from release gluster-6. But as explained in bug [1], there is a known issue with legacy files, i.e., files that were created before the ctime feature was enabled. These files do not have the "trusted.glusterfs.mdata" xattr, which maintains the time attributes. So, on accessing those files, the xattr gets created with the latest time attributes. This is not correct because all the time attributes (atime, mtime, ctime) get updated instead of only the required ones.

There are a couple of approaches to solve this.

1. On accessing the files, let posix update the time attributes from the backend file on the respective replicas. This obviously results in inconsistent "trusted.glusterfs.mdata" xattr values within the replica set. AFR/EC should heal this xattr as part of metadata heal upon accessing this file. It can choose to replicate from any subvolume. Ideally we should consider the highest time from the replica and treat it as the source, but I think that should be fine, as replica time attributes are mostly in sync, with a max difference in the order of a few seconds, if I am not wrong.

But client side self heal is disabled by default because of performance reasons [2]. If we choose to go with this approach, we need to consider enabling at least client side metadata self heal by default. Please share your thoughts on enabling the same by default.

2. Don't let posix update the legacy files from the backend. On lookup cbk, let the utime xlator update the time attributes from the statbuf received, synchronously.

Both approaches are similar, as both result in updating the xattr during lookup. Please share your inputs on which approach is better.

I prefer the second approach. The first approach is not feasible for EC volumes because self-heal requires that k bricks (on a k+r configuration) agree on the value of this xattr; otherwise it considers the metadata damaged and needs manual intervention to fix it. During upgrade, the first r bricks will be upgraded without problems, but trusted.glusterfs.mdata won't be healed because r < k. In fact, this xattr will be removed from the new bricks because the majority of bricks agree on the xattr not being present. Once the r+1 brick is upgraded, it's possible that posix sets different values for trusted.glusterfs.mdata, which will cause self-heal to fail.

The second approach seems better to me if guarded by a new option that enables this behavior.
The utime xlator should only update the mdata xattr if that option is set, and that option should only be settable once all nodes have been upgraded (controlled by op-version). In this situation, the first lookup on a file where utime detects that mdata is not set will require a synchronous update. I think this is good enough because it will only happen once per file. We'll need to consider cases where different clients do lookups at the same time, but I think this can be easily solved by ignoring the request if mdata is already present.

Initially there were two issues.

1. Upgrade issue with EC volumes, as described by you.

This is solved with the patch [1]. There was a bug in ctime posix where it was creating the xattr even when ctime is not set on the client (during the utimes system call). With patch [1], the behavior is that the utimes system call will only update the "trusted.glusterfs.mdata" xattr if it is present; otherwise it won't create it. The new xattr creation should only happen during entry operations (i.e., create, mknod and others). So there won't be any problems with upgrade. I think we don't need a new option dependent on op-version, if I am not wrong.

If I'm not missing something, we cannot allow creation of the mdata xattr even for create/mknod/setattr fops. Doing so could cause the same problem if some of the bricks are not upgraded and do not support mdata yet (or they have ctime disabled by default).
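The "same problem" here is the EC agreement rule described earlier in the thread: k bricks of a k+r configuration must agree on the xattr value. A toy model may make the upgrade sequence easier to follow. This is only a sketch: `heal_decision`, the 4+2 layout and the "t1"/"t2"/"t3" values are made up for illustration, and this is not how the EC xlator is actually implemented.

```python
from collections import Counter

ABSENT = None  # stands for "brick has no trusted.glusterfs.mdata xattr"


def heal_decision(values, k):
    """Toy model: ("heal", value) if at least k bricks agree on the xattr
    value (ABSENT counts as a value), otherwise ("damaged", None)."""
    value, count = Counter(values).most_common(1)[0]
    if count >= k:
        return ("heal", value)
    return ("damaged", None)


K, R = 4, 2  # hypothetical 4+2 dispersed volume

# First r bricks upgraded: 2 bricks carry mdata, 4 do not.
# k bricks agree on "absent", so the xattr is removed from the new bricks.
step_r = ["t1", "t2", ABSENT, ABSENT, ABSENT, ABSENT]
print(heal_decision(step_r, K))

# r+1 bricks upgraded, each with its own timestamps: only 3 bricks agree
# on "absent", no value reaches k, so the metadata is treated as damaged.
step_r_plus_1 = ["t1", "t2", "t3", ABSENT, ABSENT, ABSENT]
print(heal_decision(step_r_plus_1, K))
```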
Yes, that's right, even create/mknod and other fops won't create the xattr if the client doesn't set ctime (this holds for older clients). I have commented in the patch [1]. All other fops where the xattr gets created have the check that, if ctime is not set, the xattr is not created. It was missed only in the utime syscall, and hence caused the upgrade issues.
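As a rough illustration of the rule after patch [1], as I read it from this thread: the sketch below is not the posix xlator code. The function names, the dict standing in for on-disk xattrs, and the fop labels are all hypothetical. How an existing xattr is handled when the client does not set ctime is not covered in the thread, so the sketch leaves it out.

```python
MDATA = "trusted.glusterfs.mdata"
# The mail says "create, mknod and others"; this set is not exhaustive.
ENTRY_FOPS = {"create", "mknod"}


def may_create_mdata(fop, client_sets_ctime):
    """Creation is allowed only for entry fops from clients that set ctime."""
    return client_sets_ctime and fop in ENTRY_FOPS


def utimes_update(xattrs, times):
    """utimes refreshes an existing mdata xattr but never creates one."""
    if MDATA not in xattrs:
        return False
    xattrs[MDATA] = times
    return True


legacy_file = {}  # file created before the ctime feature was enabled
print(utimes_update(legacy_file, "t1"), legacy_file)

print(may_create_mdata("create", client_sets_ctime=False))  # older client
print(may_create_mdata("create", client_sets_ctime=True))
```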
2. After upgrade, how do we update the "trusted.glusterfs.mdata" xattr?

This mail thread was for this. Which approach is better here? I understand that, from the EC point of view, the second approach is the best one. The question I had was: can't EC treat 'trusted.glusterfs.mdata' as a special xattr and add the logic to heal it from one subvolume (i.e., remove the requirement of having consistent data on k subvolumes in a k+r configuration)?

Yes, we can do that. But this would require a newer client with support for this new xattr, which won't be possible during an upgrade, where bricks are upgraded before the clients. So, even if we add this intelligence to the client, the upgrade process is still broken. The only consideration here is whether we can rely on the self-heal daemon being on the server side (and thus upgraded at the same time as the server) to ensure that files can really be healed even if other bricks/shd daemons are not yet updated. Not sure if it could work, but anyway I don't like it very much.

The second approach is independent of AFR and EC. So if we choose this, do we need a new option to guard it? If the upgrade steps are to upgrade the servers first and then the clients, we don't need to guard it, I think?

I think you are right for regular clients. Is there any server-side daemon that acts as a client and could use the utime xlator? If not, I think we don't need an additional option here.
No, no other server side daemon has utime xlator loaded.
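For the second approach, the "ignore the request if mdata is already present" idea could look roughly like the sketch below. It is only a model under assumptions: `set_mdata_if_absent` is a hypothetical name, the dicts stand in for the brick's xattrs and the statbuf from lookup, and the check-and-set is assumed to be atomic on the brick.

```python
MDATA = "trusted.glusterfs.mdata"


def set_mdata_if_absent(brick_xattrs, statbuf):
    """Seed mdata from the statbuf times; ignore the request if it exists.

    Assumption: the brick performs this check-and-set atomically.
    """
    if MDATA in brick_xattrs:
        return False
    brick_xattrs[MDATA] = {key: statbuf[key] for key in ("atime", "mtime", "ctime")}
    return True


file_xattrs = {}  # legacy file, no mdata yet
stat_a = {"atime": 100, "mtime": 90, "ctime": 95}  # seen by client A
stat_b = {"atime": 101, "mtime": 90, "ctime": 95}  # seen by client B

# Both clients saw mdata missing during lookup and both send a request.
first = set_mdata_if_absent(file_xattrs, stat_a)
second = set_mdata_if_absent(file_xattrs, stat_b)
print(first, second, file_xattrs[MDATA])
```

Only the first request takes effect, so the synchronous update happens once per file even when lookups race.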
Xavi
Thanks and Regards,
Kotresh H R
_______________________________________________

Community Meeting Calendar:

APAC Schedule - Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/836554017

NA/EMEA Schedule - Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/486278655

Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel