Bug #402

Tail drop behavior on fq_codel on ack-mostly streams

Added by Dave Täht 10 months ago. Updated 9 months ago.

Status:New Start date:07/12/2012
Priority:Urgent Due date:
Assignee:Dave Täht % Done:

0%

Category:Networking Spent time: 6.00 hours
Target version:1st Public Cerowrt release

Description

In order to reduce memory requirements I reduced the packet limits for the 24 fq_codel queues to anywhere between 100 and 600 packets depending on type. This, I reasoned, would be enough for most traffic loads.

I missed the case of ack mostly traffic, which as it has a 20x1 ratio vs big packets, requires that the queues be
much larger. On an ACK-heavy load, at high bandwidths, we end up with fq_codel in an "interesting" drop tail state, where it still admits packets but ends up optimizing for "new" flows that aren't actually "new".

I do not know the right values. The fq_codel default of 10k packets is sufficient to run the router out of memory in a variety of scenarios, resulting in a crash.


Related issues

related to Cerowrt - Bug #379: 3.3.4-3 router crashes under heavy load New

History

Updated by Dave Täht 10 months ago

So to add clarity here (see also #401) when I do non-ack mostly streams I run the box out of memory (can't observe this, not necessarily a valid conclusion) and queue depth, after I went back to pure codel I can still crash...

netperf -Y CS5,CS5 -l 60 -H 172.20.1.1 -t TCP_MAERTS &
netperf -Y CS1,CS1 -l 60 -H 172.20.1.1 -t TCP_MAERTS &
#netperf -Y EF,EF -l 60 -H 172.20.1.1 -t TCP_MAERTS &
netperf -Y EF,EF -l 60 -H 172.20.1.1 -t TCP_MAERTS &
#netperf -Y EF,EF -l 60 -H 172.20.1.1 -t TCP_MAERTS &
#netperf -Y EF,EF -l 60 -H 172.20.1.1 -t TCP_MAERTS &
netperf -Y CS0,CS0 -l 60 -H 172.20.1.1 -t TCP_MAERTS &

I may well have an interaction with netperf running on the box too.

root@testbox:~# tc -s qdisc show dev gw11
qdisc mq 1: root 
 Sent 349384011 bytes 231149 pkt (dropped 491, overlimits 0 requeues 26191) 
 backlog 56210b 40p requeues 26191 
qdisc codel 10: parent 1:1 limit 1000p target 5.0ms interval 100.0ms 
 Sent 344597768 bytes 227688 pkt (dropped 457, overlimits 0 requeues 25833) 
 backlog 7570b 5p requeues 25833 
  count 1 lastcount 1 ldelay 354us drop_next 0us
  maxpacket 1514 ecn_mark 0 drop_overlimit 0
qdisc codel 20: parent 1:2 limit 1000p target 10.0ms interval 100.0ms 
 Sent 4260478 bytes 2827 pkt (dropped 0, overlimits 0 requeues 271) 
 backlog 45420b 30p requeues 271 
  count 0 lastcount 0 ldelay 19.6ms drop_next 0us
  maxpacket 1514 ecn_mark 0 drop_overlimit 0
qdisc codel 30: parent 1:3 limit 1000p target 20.0ms interval 100.0ms 
 Sent 525617 bytes 632 pkt (dropped 34, overlimits 0 requeues 87) 
 backlog 3220b 5p requeues 87 
  count 32 lastcount 23 ldelay 16.7ms drop_next 0us
  maxpacket 1514 ecn_mark 0 drop_overlimit 0
qdisc codel 40: parent 1:4 limit 1000p target 5.0ms interval 100.0ms 
 Sent 148 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
  count 0 lastcount 0 ldelay 0us drop_next 0us
  maxpacket 256 ecn_mark 0 drop_overlimit 0
root@testbox:~# Write failed: Broken pipe
d@ida:~$ ~~~

Updated by Dave Täht 10 months ago

partially a heisenbug, when I drive this this hard with netserver, that consumes a ton more ram than usual.

Usually I have a test box and not netperf running on the other side of the router.

Updated by Dave Täht 10 months ago

At the observed wireless bandwidths with packet limit 1200 I see fq_codel peak at around 700 packets and not enter drop tail.

Figuring out what those bandwidths actually are and the interrelationship between the device queuing, EDCA, and fq_codel is going to be the subject of a long month or three. Or longer.

Updated by Dave Täht 9 months ago

  • Target version set to 1st Public Cerowrt release

Also available in: Atom PDF