> On Dec 4, 2016, at 3:50 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sun, Dec 04, 2016 at 03:24:50PM -0800, Cyril Peponnet wrote:
>>> On Dec 4, 2016, at 2:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> Which used LVM snapshots to take snapshots of the entire brick.
>>> I don't see any LVM in your config, so I'm not sure what
>>> snapshot implementation you are using here. What are you using
>>> to take the snapshots of your VM image files? Are you actually
>>> using the qemu qcow2 snapshot functionality rather than anything
>>> native to gluster?
>>>
>>
>> Yes, sorry, it was not clear enough: qemu-img snapshots, not
>> native gluster snapshots.
>
> Ok, so that's a fragmentation problem in its own right: both
> internal qcow2 fragmentation and file fragmentation.
>
>>> Also, can you attach the 'xfs_bmap -vp' output of some of these
>>> image files and their snapshots?
>>
>> A snapshot:
>> https://gist.github.com/CyrilPeponnet/8108c74b9e8fd1d9edbf239b2872378d
>> (let me know if you need more; there are around 600 live
>> snapshots sitting here).
>
> 1200 extents, mostly small, almost entirely adjacent. Typical qcow2
> file fragmentation pattern. That's not going to cause your memory
> allocation problems - can you find one that has hundreds of
> thousands of extents?

I found one with 10799109 extents :/ It is 576GB in size (I need to
find out why this one is so big, this is not normal…)…

Could it lead to the issue? I mean, could one badly fragmented file
deadlock the entire FS? (A rough scan for finding such files is
sketched at the end of this mail.)

>
>>>
>>> 56GB of cached file data. If you're getting high-order
>>> allocation failures (which I suspect is the problem) then this
>>> is a memory fragmentation problem more than anything.
>>>
>>>> ----------------------------------------------------------------
>>>> DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
>>>> ----------------------------------------------------------------
>>>> 0/0   RAID0 Optl  RW     Yes     RAWBC -   ON  7.275 TB scratch
>>>> ----------------------------------------------------------------
>>>>
>>>> Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
>>>> Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|B=Blocked|Consist=Consistent|
>>>> R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|AWB=Always WriteBack|
>>>> WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
>>>
>>> IIRC, AWB means that if the cache goes into degraded/offline
>>> mode, you’re vulnerable to corruption/loss on power
>>> failure…
>>
>> Yes, we have a BBU + redundant PSUs to address that.
>
> BBU fails, data center loses power, corruption/data loss still
> occurs. Not my problem, though.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
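
Not something from the thread itself, but for hunting down more files
like that one: a minimal bash sketch, assuming xfsprogs is installed
and that the brick lives under /bricks/scratch (a hypothetical path;
the 100000-extent threshold is likewise an arbitrary pick):

    # walk the brick and report qcow2 files whose extent count exceeds
    # the threshold; xfs_bmap prints one line per extent plus a header
    # line, so the count is off by one, which is harmless at this scale
    find /bricks/scratch -type f -name '*.qcow2' -print0 |
    while IFS= read -r -d '' f; do
        n=$(xfs_bmap "$f" | wc -l)
        [ "$n" -gt 100000 ] && printf '%10d  %s\n' "$n" "$f"
    done | sort -rn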
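
And a sketch of how new images could be made less fragmentation-prone
(again not something proposed in the thread; the paths, image name
and the 16m hint are illustrative examples only):

    # set an extent size hint on the image directory; files created
    # under it inherit the hint, so XFS allocates their space in
    # larger contiguous chunks rather than tiny qcow2-cluster dribbles
    xfs_io -c "extsize 16m" /bricks/scratch/images

    # preallocate the qcow2 metadata at create time so the L1/L2
    # tables are not scattered through the data as the image grows
    qemu-img create -f qcow2 -o preallocation=metadata vm.qcow2 40G

For a file that is already at 10 million extents, xfs_fsr can
defragment individual files in place, though it needs enough
contiguous free space on the filesystem to rewrite them.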