On Mon, May 1, 2017 at 11:48 AM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
>>
>> It takes about 10 hours to reproduce the problem.
>>
>
> Out of curiosity, is that 10 hours of removing files or 10 hours of
> repopulating and removing until the problem happens to occur? If the
> latter, roughly how many fill/remove cycles does that entail (tens,
> hundreds, thousands)?

10 hours of repopulating the cluster. Then remove all rbd images with
"rbd rm xxx". Just 1 cycle: fill then remove.

> You could try to populate the fs using Ceph as with your current
> reproducer, particularly since it may use patterns or features that
> could affect this problem (xattrs?) that fio may not induce, and then
> try to directly reproduce the overrun via manual file removals. This
> would be sufficient for debugging because if you can share a metadump
> image of the original fs and appropriate steps to reproduce, we don't
> particularly need to care about how the fs was constructed in the first
> place.
>
> For example, if you have a test that currently populates and depopulates
> the fs through Ceph, something I might try is to update the test to
> generate a metadump image of the fs every time your test cycles from
> populating to depopulating. Once the problem reproduces, you now have a
> metadump image of the original fs that you can restore and use to try to
> reproduce the overrun manually (repeatedly, if nec.).

That's a nice idea to debug it. I'll try. Thanks.
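
For reference, a rough sketch of how the per-cycle metadump step suggested above might be scripted. This is not taken from the thread. The device path, output directory, and file naming are hypothetical placeholders, and it assumes the OSD is stopped and the filesystem is unmounted (or mounted read-only, or frozen) before xfs_metadump runs. By default it only prints the command it would run.

```python
import shlex
import subprocess
from pathlib import PurePosixPath


def metadump_command(device: str, image_dir: str, cycle: int) -> list[str]:
    """Build the xfs_metadump command for one fill/remove cycle."""
    # Hypothetical naming scheme: one image per cycle, e.g. cycle-0001.metadump
    target = PurePosixPath(image_dir) / f"cycle-{cycle:04d}.metadump"
    # -g shows progress; the source fs must not be mounted read-write.
    return ["xfs_metadump", "-g", device, str(target)]


def mdrestore_command(image: str, target: str) -> list[str]:
    """Build the xfs_mdrestore command that restores a metadump to a file or device."""
    return ["xfs_mdrestore", image, target]


def snapshot_metadata(device: str, image_dir: str, cycle: int,
                      dry_run: bool = True) -> list[str]:
    """Capture fs metadata at the switch from populating to depopulating."""
    cmd = metadump_command(device, image_dir, cycle)
    if dry_run:
        # Only show what would be run; nothing touches the device.
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder device and directory; substitute the OSD's real XFS device.
    snapshot_metadata("/dev/sdX1", "/var/tmp/metadumps", cycle=1)
```

Once the overrun reproduces, the image from the last cycle could be restored with the command from mdrestore_command() onto a scratch file or device, mounted, and the removals retried by hand as often as needed.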