Re: xlog_write: reservation ran out

Ming Lin <mlin@xxxxxxxxxx> · Mon, 1 May 2017 13:18:58 -0700

On Mon, May 1, 2017 at 11:48 AM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
>>
>> It takes about 10 hours to reproduce the problem.
>>
>
> Out of curiosity, is that 10 hours of removing files or 10 hours of
> repopulating and removing until the problem happens to occur? If the
> latter, roughly how many fill/remove cycles does that entail (tens,
> hundreds, thousands)?

10 hours of repopulating the cluster, then removing all rbd images with
"rbd rm xxx".
Just 1 cycle: fill, then remove.

> You could try to populate the fs using Ceph as with your current
> reproducer, particularly since it may use patterns or features that
> could affect this problem (xattrs?) that fio may not induce, and then
> try to directly reproduce the overrun via manual file removals. This
> would be sufficient for debugging because if you can share a metadump
> image of the original fs and appropriate steps to reproduce, we don't
> particularly need to care about how the fs was constructed in the first
> place.
>
> For example, if you have a test that currently populates and depopulates
> the fs through Ceph, something I might try is to update the test to
> generate a metadump image of the fs every time your test cycles from
> populating to depopulating. Once the problem reproduces, you now have a
> metadump image of the original fs that you can restore and use to try to
> reproduce the overrun manually (repeatedly, if necessary).

That's a nice idea to debug it.
I'll try.
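
A rough sketch of how the test could take a metadump at the point where it
switches from populating to depopulating. It assumes the stock xfs_metadump
and xfs_mdrestore tools from xfsprogs, called with no options. The device
path, output directory, file naming and helper names are made up for
illustration, and the filesystem has to be unmounted (or mounted read-only or
frozen) while xfs_metadump runs:

```python
import subprocess
from pathlib import Path


def metadump_cmd(device, out_dir, cycle):
    """Build the xfs_metadump command for one fill/remove cycle.

    The "cycle-NNNN.metadump" naming is an assumption for this sketch.
    """
    target = Path(out_dir) / f"cycle-{cycle:04d}.metadump"
    return ["xfs_metadump", str(device), str(target)]


def mdrestore_cmd(image, target):
    """Build the xfs_mdrestore command that restores an image to a
    scratch device or a regular file."""
    return ["xfs_mdrestore", str(image), str(target)]


def snapshot_before_remove(device, out_dir, cycle, run=subprocess.run):
    """Dump metadata right before the test starts removing files.

    The caller is expected to have unmounted or frozen the filesystem.
    Returns the path of the metadump image that was written.
    """
    cmd = metadump_cmd(device, out_dir, cycle)
    run(cmd, check=True)  # raises CalledProcessError if the dump fails
    return cmd[-1]
```

Once the overrun reproduces, the image from that cycle can be restored with
the command from mdrestore_cmd() onto a scratch device and the removals
retried by hand.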

Thanks.