Hi Brian,
>The original stacktrace shows the crash in a readdir request. I'm sure
>there are multiple things going on here (and there are a couple rename
>traces in the vmcore sitting on locks), of course, but where does the
>information about the rename come from?
I tracked this down in the application's source code. Under certain conditions it moves data to a quarantined area (another folder on the same disk). The bug report describes one such condition: a DELETE of an object (which creates an empty file in the directory) followed by a listing of that directory causes the existing data to be MOVEd (os.rename) into the quarantined folder. That os.rename call is the only place in the application that touches the quarantined folder.
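Roughly, the per-DELETE sequence as I read it looks like the sketch below. This is only my own model of that code path; the function and path names are placeholders, not the real application code:

import os

def handle_delete(obj_dir, quarantine_dir, timestamp):
    # DELETE first drops an empty tombstone file into the object directory.
    tombstone = os.path.join(obj_dir, timestamp + '.ts')
    open(tombstone, 'w').close()

    # It then lists the directory; any older data file it finds is MOVEd
    # to the quarantined folder on the same disk.  This os.rename() is the
    # only call in the application that touches the quarantined folder.
    for name in os.listdir(obj_dir):
        if name.endswith('.data'):
            os.rename(os.path.join(obj_dir, name),
                      os.path.join(quarantine_dir, name))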
>I'm not quite following here because I don't have enough context about
>what the application server is doing. So far, it sounds like we somehow
>have multiple threads competing to rename the same file..? Is there
>anything else in this directory at the time this sequence executes
>(e.g., a file with object data that also gets quarantined)?
That behavior is an application bug, but even so it should not trigger a kernel panic. Yes, there are multiple threads competing to DELETE (create an empty file) in the same directory and, at the same time, move the existing file to the quarantined area. I think this race is the root cause of the kernel panic. The scenario is that 10 application workers raise 10 threads doing the same thing at the same moment.
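To make the race concrete, this is roughly how I picture those threads hitting the same directory. Again just a hand-written model with placeholder paths (it assumes both directories already exist), not the real worker code:

import os
import threading

OBJ_DIR = '/srv/disk0/objects/o'           # placeholder object directory
QUARANTINE_DIR = '/srv/disk0/quarantined'  # placeholder quarantine directory

def handle_delete(ident):
    # Each thread drops its own tombstone, then lists the directory and
    # tries to rename whatever data file it sees into the quarantine area.
    open(os.path.join(OBJ_DIR, '%d.ts' % ident), 'w').close()
    for name in os.listdir(OBJ_DIR):
        if name.endswith('.data'):
            try:
                os.rename(os.path.join(OBJ_DIR, name),
                          os.path.join(QUARANTINE_DIR, name))
            except OSError:
                pass    # another thread already moved it

threads = [threading.Thread(target=handle_delete, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

The try/except is only there because all but one of the concurrent renames of the same file will fail with ENOENT at the application level; what the kernel does underneath at that point is exactly what I don't know.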
>Ideally, we'd ultimately like to translate this into a sequence of
>operations as seen by the fs that hopefully trigger the problem. We
>might have to start by reproducing through the application server.
>Looking back at that bug report, it sounds like a 'DELETE' is a
>high-level server operation that can consist of multiple sub-operations
>at the filesystem level (e.g., list, conditional rename if *.ts file
>exists, etc.). Do you have enough information through any of the above
>to try and run something against Swift that might explicitly reproduce
>the problem? For example, have one thread that creates and recreates the
>same object repeatedly and many more competing threads that try to
>remove (or whatever results in the quarantine) it? Note that I'm just
>grasping at straws here, you might be able to design a more accurate
>reproducer based on what it looks like is happening within Swift.
We observe this issue on a production cluster, and it is hard to free up a machine with 100% identical hardware to test on at the moment.
I'll try to work out an approach to reproduce it and will update this mail thread if I manage to.
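If it helps, the kind of fs-level reproducer I have in mind follows your suggestion: one thread keeps recreating the same object's data file while many competing "DELETE" threads drop tombstones, list the directory and rename the data file away. The sketch below is untested and uses placeholder paths; it is only meant to show the approach:

import os
import threading
import time

TEST_DIR = '/mnt/xfstest/obj'            # placeholder: a directory on the XFS mount
QUARANTINE = '/mnt/xfstest/quarantined'  # placeholder quarantine directory

def creator(stop):
    # Keep recreating the same object's data file.
    while not stop.is_set():
        open(os.path.join(TEST_DIR, 'object.data'), 'w').close()
        time.sleep(0.001)

def deleter(stop, ident):
    # Competing DELETEs: drop a tombstone, list the directory, rename
    # any data file into the quarantine directory.
    while not stop.is_set():
        open(os.path.join(TEST_DIR, '%d.ts' % ident), 'w').close()
        for name in os.listdir(TEST_DIR):
            if name.endswith('.data'):
                try:
                    os.rename(os.path.join(TEST_DIR, name),
                              os.path.join(QUARANTINE, '%d-%s' % (ident, name)))
                except OSError:
                    pass    # lost the race to another thread

if __name__ == '__main__':
    for d in (TEST_DIR, QUARANTINE):
        if not os.path.isdir(d):
            os.makedirs(d)
    stop = threading.Event()
    workers = [threading.Thread(target=creator, args=(stop,))]
    workers += [threading.Thread(target=deleter, args=(stop, i)) for i in range(10)]
    for t in workers:
        t.start()
    time.sleep(600)     # let it run for a while and watch for the crash
    stop.set()
    for t in workers:
        t.join()

The idea is to run this directly against a directory on an XFS mount and watch for the readdir crash, without needing the full application stack.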
Thanks // Hugo