> Can you still see the problem with Citus 7.4?

Hi, Thomas. I actually went back to the cluster with Citus 7.4 and PG 10.4, and modified the parallel param. So far, I haven't seen any server crash.

The main difference between the runs that crashed and the runs that did not is the set of Linux TCP timeout parameters (to release the ports faster). Unfortunately, I cannot "undo" the Linux params and run the stress tests anymore, as this is a multi-million $ cluster and people are doing more useful things on it. I will keep an eye on any parallel execution issue.

On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro <thomas.munro@xxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand.m.stone@xxxxxxxxx> wrote:
> > Just as a follow up. I tried the parallel execution again (in a stress
> > test environment). Now the crash seems gone. I will keep an eye on
> > this for the next few weeks.
>
> Thanks for the report. That's great news, but it'd be good to
> understand why it was happening.
>
> > My theory is that the Citus cluster created and shut down a lot of TCP
> > connections between coordinator and workers. If running on untuned
> > Linux machines, the TCP ports might run out.
>
> I'm not sure how that's relevant, unless perhaps it causes executor
> nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> fix? I wonder if there could be something different about the control
> flow with custom scans, or something about the way Citus worker nodes
> invoke plan fragments, or some error path that I failed to consider...
> It's a clue that all of your worker nodes reliably crashed at the same
> time on the same/similar queries (presumably distributed query
> fragments for different shards), making it seem more like a
> common-or-garden bug rather than some kind of timing-based heisenbug.
> If you ever manage to reproduce it, an explain plan and a back trace
> would be very useful.
>
> > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
>
> Hmm. There weren't any relevant commits to REL_10_STABLE that I can
> think of. And (with the proviso that I know next to nothing about
> Citus) I just cloned https://github.com/citusdata/citus.git and
> skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> nothing is jumping out at me. Can you still see the problem with
> Citus 7.4?
>
> --
> Thomas Munro
> http://www.enterprisedb.com