NFQUEUE stucked packets

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello list!

I have similar situation as described in
http://lists.netfilter.org/pipermail/netfilter-devel/2006-January/023153.html

I have network gateway for 700 users with average 30-40 mbit channel
utilization. For traffic acounting using NFQUEUE for realtime
statistics and user MAC address security.

root@darkstar:/#  uname -a
Linux darkstar 2.6.28.7-smp #1 SMP Mon Mar 9 14:07:23 MSK 2009
i686 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux

Yesterday all day queue overflow in about 10 min.
I write special script to discover this situation.

#!/bin/sh
QUEUE_NUM=10
while [ 1 ]; do
    TOTAL=
    DATE=`date +"%Y-%m-%d %H:%M:%S"`
    CAT=`cat /proc/net/netfilter/nfnetlink_queue | grep $QUEUE_NUM`
    QUEUE_TOTAL=`echo $CAT | cut -f 3 -d ' '`
    echo $CAT $DATE =$QUEUE_TOTAL
    sleep 10
done

And have this log:
=======================
10 26197 89 2 65535 0 40 157532 1 2009-03-08 19:18:00 =89
10 26197 80 2 65535 0 99 306347 1 2009-03-08 19:18:11 =80
10 26197 97 2 65535 0 198 453838 1 2009-03-08 19:18:21 =97
10 26197 120 2 65535 0 254 600438 1 2009-03-08 19:18:31 =120
10 26197 162 2 65535 0 254 740278 1 2009-03-08 19:18:41 =162
10 26197 165 2 65535 0 385 886832 1 2009-03-08 19:18:51 =165
10 26197 173 2 65535 0 477 1030473 1 2009-03-08 19:19:01 =173
10 26197 195 2 65535 0 477 1173828 1 2009-03-08 19:19:11 =195
10 26197 199 2 65535 0 613 1320007 1 2009-03-08 19:19:21 =199
......
10 26197 1314 2 65535 0 150480 8682536 1 2009-03-08 19:30:54 =1314
10 26197 1297 2 65535 0 155210 8762215 1 2009-03-08 19:31:04 =1297
10 26197 1395 2 65535 0 159517 8840470 1 2009-03-08 19:31:14 =1395
10 26197 1431 2 65535 0 164050 8917428 1 2009-03-08 19:31:24 =1431
10 26197 1426 2 65535 0 168789 8997451 1 2009-03-08 19:31:34 =1426
10 26197 1440 2 65535 0 173915 9077845 1 2009-03-08 19:31:44 =1440
10 26197 1469 2 65535 0 179015 9152595 1 2009-03-08 19:31:54 =1469
10 26197 1499 2 65535 0 183489 9226571 1 2009-03-08 19:32:04 =1499
10 26197 1564 2 65535 0 188427 9296313 1 2009-03-08 19:32:14 =1564

After about 10-15 min we have very big delays for each packet.
Restart of accounting program flush the queue and starting again:

10 12083 32 2 65535 0 290 129955 1 2009-03-08 19:32:37 =32
10 12083 123 2 65535 0 404 263677 1 2009-03-08 19:32:47 =123
10 12083 192 2 65535 0 577 395122 1 2009-03-08 19:32:57 =192
10 12083 140 2 65535 0 739 524411 1 2009-03-08 19:33:08 =140
10 12083 193 2 65535 0 1209 654960 1 2009-03-08 19:33:18 =193
10 12083 247 2 65535 0 1465 805822 1 2009-03-08 19:33:28 =247
10 12083 283 2 65535 0 1610 945017 1 2009-03-08 19:33:38 =283
10 12083 402 2 65535 0 1775 1073752 1 2009-03-08 19:33:48 =402
.....
10 12083 1358 2 65535 0 49719 4501398 1 2009-03-08 19:39:19 =1358
10 12083 1383 2 65535 0 54400 4578319 1 2009-03-08 19:39:29 =1383
10 12083 1484 2 65535 0 57611 4651952 1 2009-03-08 19:39:39 =1484
10 12083 1433 2 65535 0 61306 4727467 1 2009-03-08 19:39:49 =1433
10 12083 1478 2 65535 0 64507 4800503 1 2009-03-08 19:39:59 =1478
10 12083 1468 2 65535 0 68569 4873713 1 2009-03-08 19:40:09 =1468
10 12083 1516 2 65535 0 72345 4947701 1 2009-03-08 19:40:19 =1516

At night situation was not such terrible..

10 25572 3 2 65535 0 0 81409 1 2009-03-09 04:56:14 =3
10 25572 18 2 65535 0 0 166434 1 2009-03-09 04:56:24 =18
10 25572 36 2 65535 0 73 250703 1 2009-03-09 04:56:34 =36
10 25572 45 2 65535 0 78 346981 1 2009-03-09 04:56:44 =45
10 25572 17 2 65535 0 89 429374 1 2009-03-09 04:56:54 =17
10 25572 12 2 65535 0 89 513713 1 2009-03-09 04:57:04 =12
10 25572 15 2 65535 0 89 597835 1 2009-03-09 04:57:14 =15
.....
10 25572 1252 2 65535 0 3767514 188608304 1 2009-03-09 11:57:04 =1252
10 25572 1246 2 65535 0 3773969 188693799 1 2009-03-09 11:57:14 =1246
10 25572 1237 2 65535 0 3779614 188777734 1 2009-03-09 11:57:24 =1237
10 25572 1261 2 65535 0 3785212 188860586 1 2009-03-09 11:57:34 =1261
10 25572 1242 2 65535 0 3790644 188941483 1 2009-03-09 11:57:44 =1242
10 25572 1266 2 65535 0 3796270 189021111 1 2009-03-09 11:57:54 =1266
10 25572 1283 2 65535 0 3801408 189105046 1 2009-03-09 11:58:04 =1283
10 25572 1278 2 65535 0 3806845 189183468 1 2009-03-09 11:58:14 =1278
10 25572 1242 2 65535 0 3812415 189265729 1 2009-03-09 11:58:24 =1242
10 25572 1242 2 65535 0 3818292 189343670 1 2009-03-09 11:58:34 =1242

As you see after 189 millions of packet we have 1242 stucked packets
in queue.

Today i try to modify kernel to get more information about this:

diff -rcB nfnetlink_queue.c-orig nfnetlink_queue.c
======= cut here ==========
*** nfnetlink_queue.c-orig      Sat Feb 21 01:41:27 2009
--- nfnetlink_queue.c   Mon Mar  9 17:04:27 2009
***************
*** 169,178 ****
--- 169,181 ----
         queue->queue_total++;
  }

+ static unsigned int max_dequeue_position = 0;
+
  static struct nf_queue_entry *
  find_dequeue_entry(struct nfqnl_instance *queue, unsigned int id)
  {
        struct nf_queue_entry *entry = NULL, *i;
+       int dequeue_position = 0;

        spin_lock_bh(&queue->lock);

***************
*** 180,188 ****
                if (i->id == id) {
                        entry = i;
                        break;
!               }
        }

        if (entry) {
                list_del(&entry->list);
                queue->queue_total--;
--- 183,195 ----
                if (i->id == id) {
                        entry = i;
                        break;
!               } else dequeue_position++;
        }

+       if (max_dequeue_position < dequeue_position) {
+           max_dequeue_position = dequeue_position;
+       };
+
        if (entry) {
                list_del(&entry->list);
                queue->queue_total--;
***************
*** 860,871 ****
  {
        const struct nfqnl_instance *inst = v;

!       return seq_printf(s, "%5d %6d %5d %1d %5d %5d %5d %8d %2d\n",
                          inst->queue_num,
!                         inst->peer_pid, inst->queue_total,
                          inst->copy_mode, inst->copy_range,
                          inst->queue_dropped, inst->queue_user_dropped,
!                         inst->id_sequence, 1);
  }

  static const struct seq_operations nfqnl_seq_ops = {
--- 867,878 ----
  {
        const struct nfqnl_instance *inst = v;

!       return seq_printf(s, "%5d %6d %5d %5d %1d %5d %5d %5d %8d %2d %5d\n",
                          inst->queue_num,
!                         inst->peer_pid, inst->queue_total, inst->queue_maxlen,
                          inst->copy_mode, inst->copy_range,
                          inst->queue_dropped, inst->queue_user_dropped,
!                         inst->id_sequence, 1, max_dequeue_position);
  }

  static const struct seq_operations nfqnl_seq_ops = {


======= end of cut ==========

And after that i have information in /proc about queue_maxlen and
stucked packets in queue.
Log with this patch:
10 3052 55 20000 2 65535 0 3194 1563090 1 0 2009-03-09 14:22:32 =55
10 3052 0 20000 2 65535 0 3750 1680510 1 0 2009-03-09 14:22:42 =0
10 3052 0 20000 2 65535 0 4666 1806095 1 0 2009-03-09 14:22:52 =0
10 3052 36 20000 2 65535 0 5124 1934143 1 0 2009-03-09 14:23:02 =36
10 3052 0 20000 2 65535 0 5124 2067343 1 0 2009-03-09 14:23:12 =0
10 3052 0 20000 2 65535 0 5662 2185490 1 0 2009-03-09 14:23:22 =0
10 3052 3 20000 2 65535 0 5953 2314004 1 0 2009-03-09 14:23:32 =3
......
10 3052 85 20000 2 65535 0 41306 57183821 1 0 2009-03-09 15:38:46 =85
10 3052 3 20000 2 65535 0 41378 57287667 1 0 2009-03-09 15:38:56 =3
10 3052 23 20000 2 65535 0 41378 57383407 1 1 2009-03-09 15:39:06 =23
10 3052 1 20000 2 65535 0 41387 57488941 1 1 2009-03-09 15:39:16 =1
......
10 3052 36 20000 2 65535 0 45185 66286805 1 1 2009-03-09 15:52:39 =36
10 3052 1 20000 2 65535 0 45643 66403946 1 1 2009-03-09 15:52:49 =1
10 3052 20 20000 2 65535 0 46222 66522173 1 2 2009-03-09 15:52:59 =20
10 3052 41 20000 2 65535 0 46556 66642593 1 2 2009-03-09 15:53:09 =41
......
10 3052 73 20000 2 65535 0 169361 111410441 1 2 2009-03-09 16:48:20 =73
10 3052 61 20000 2 65535 0 169800 111517779 1 2 2009-03-09 16:48:30 =61
10 3052 60 20000 2 65535 0 170111 111626724 1 3 2009-03-09 16:48:40 =60
10 3052 62 20000 2 65535 0 170430 111738973 1 3 2009-03-09 16:48:50 =62
......
10 3052 117 20000 2 65535 0 224668 120418980 1 3 2009-03-09 16:59:42 =117
10 3052 3 20000 2 65535 0 224941 120567191 1 3 2009-03-09 16:59:52 =3
10 3052 101 20000 2 65535 0 225575 120724075 1 4 2009-03-09 17:00:02 =101
10 3052 70 20000 2 65535 0 225693 120886263 1 4 2009-03-09 17:00:12 =70
......
10 3052 6 20000 2 65535 0 365069 145741324 1 6 2009-03-09 17:28:48 =6
10 3052 6 20000 2 65535 0 365287 145925306 1 6 2009-03-09 17:28:58 =6
10 3052 99 20000 2 65535 0 365504 146113439 1 6 2009-03-09 17:29:08 =99
10 3052 95 20000 2 65535 0 365667 146296955 1 6 2009-03-09 17:29:18 =95
10 3052 6 20000 2 65535 0 365823 146489625 1 6 2009-03-09 17:29:28 =6
10 3052 97 20000 2 65535 0 365916 146676339 1 6 2009-03-09 17:29:39 =97
......

Today 3 hours of work with very high load (about 40 mbit and in peaks up
to 60 mbit) we have 6 packets stuck in queue...

How to resolve this problem ?
This is hardware bug, SMP bug, users corrupted packets bug, NFQUEUE
bug or anything else ? How i can view this stucked packets in queue to
try understand what is going wrong ?
(userspace don't have any errors, because this code work normally
from 2004 ip_queue ages, and we have this after rewrite application
for NFQUEUE)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Kuzin Andrey  -  kuzinandrey@xxxxxxxxx
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Netfitler Users]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux