Bug #332
the new SFQ with low limits and many flows exhibits starvation issues
| Status: | Closed | Start date: | 01/30/2012 | |
|---|---|---|---|---|
| Priority: | Urgent | Due date: | ||
| Assignee: | % Done: | 0% |
||
| Category: | Linux Kernel | Spent time: | 30.00 hours | |
| Target version: | 1st Public Cerowrt release |
Description
I am getting some reports of SFQ's new 'head drop' orientation causing starvation.
More details to come
History
Updated by David Taht over 1 year ago
---------- Forwarded message ----------
From: Jesper Dangaard Brouer <jdb@comx.dk>
Date: Thu, Jan 26, 2012 at 10:38 AM
Subject: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Hi Eric
(Cc Dave Taht)
I think I have found a problem with your SFQ patch:
sch_sfq: dont put new flow at the end of flows (d47a0ac7b66 net-next).
Do you want to discuss this on netdev or offlist?
My preliminary finding is, that is works really nice, until I reach the
queue "limit", then new flow almost cannot get through. With the patch
reverted this is not a problem, although of-cause with significantly
higher latency.
I'm currently running v3.2.1-stable with you and other patches applied.
I will re-run my tests on a clean net-next soon (just need my SFP+ patch
on top)... just to verify the results.
My setup is a HTB root node with a 2Mbit/s bandwidth limit, and a SFQ
qdisc attached, which have limit 10.
(This is on a 10Gbit/s NIC, thats not a problem, right?)
albpd42:~# tc class ls dev eth61
class htb 1: root leaf 42: prio 0 rate 2000Kbit ceil 2000Kbit burst
1600b cburst 1600b
albpd42:~# tc qdisc ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 0 direct_packets_stat 0
qdisc sfq 42: parent 1: limit 10p quantum 1514b
The test that revels the problem is fairly simple:
1. xterm running "ping -nA 10.10.61.2" as non-root user, thus minimal
ping interval is 200msec.
2. On sink 10.10.61.2 run "iperf -s -i10"
3. On source 10.10.61.1, I run: "iperf -c 10.10.61.2 -t 1000 -i 2 -P XXX"
If the number of parallel connection is below 10 (-P 10), then is works
very good, with very low ping times.
If the number of parallel connection is above 10, that is larger than
the SFQ limit on 10, things turn bad... and the ping drops almost all
packets. With the patch reverted, ping packets get through, with a
higher latency, but also with some dropped packets (but no as bad as the
other case).
I don't know the SFQ code well enough to fix the issue my self, so I'm
asking for you guidance...
Another corrolation is, that the problem occurs, when the following
command reach the SFQ limit of 10.
tc class ls dev eth61 | grep 'sfq' | wc -l
Updated by David Taht over 1 year ago
---------- Forwarded message ----------
From: Jesper Dangaard Brouer <jdb@comx.dk>
Date: Thu, Jan 26, 2012 at 11:57 AM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
On Thu, 2012-01-26 at 11:03 +0100, Eric Dumazet wrote:
Le jeudi 26 janvier 2012 à 10:38 +0100, Jesper Dangaard Brouer a écrit :
[...]Another corrolation is, that the problem occurs, when the following command reach the SFQ limit of 10.
tc class ls dev eth61 | grep 'sfq' | wc -l
Hi Jesper
If you have more than 10 flows and a queue limit of 10, then each time a new packet is enqueued, an 'old packet in queue' must be chosen and dropped.
This 'old packet' is chosen, using the "longest flow".
But if all flows have one packet, then what can we do ...
I think this is the exact case, all flows have one packet.
This is also given by the "tc | wc -l" command.
It can be the "ping packet" by luck. Why would it be more important than an other ?
Because it represent a new connection establishment.
- vi +341 net/sched/sch_sfq.c if (d == 1) { /* It is difficult to believe, but ALL THE SLOTS HAVE LENGTH 1. */ x = q->tail->next; slot = &q->slots[x]; q->tail->next = slot->next; q->ht[slot->hash] = SFQ_EMPTY_SLOT; goto drop; }
This part of the code really takes whatever packet, and changing it might solve your problem but is it a real one ?
This is an oversimplified example,
There is no guarantee the 'ping packet' cannot be dropped, with old SFQ or new SFQ.
The probability of dropping is just significantly higher in the new
SFQ.
On Thu, 2012-01-26 at 11:22 +0100, Eric Dumazet wrote:
Le jeudi 26 janvier 2012 à 11:03 +0100, Eric Dumazet a écrit :
By the way, above code implements a head-drop :)
Aha - thats why the ping/new-conn packets are getting dropped in the new
SFQ, as new flows are put a head in the queue!
Mostbly because a tail-drop would be very expensive : We would need to add a double linked list instead of a single link, or scan the whole list to find the element right before the tail.
Yes, we want/need an inexpensive operation here.
I just think its problematic that new flows will suffer.
The real-life case that I'm worried about with this drop behaviour, is
the fairness among flows, e.g. TCP flows. Once we reach the queue
limit, number of TCP flows, no new TCP flows can be established, thus
being unfair to new flows. Guess, I have to find a test showing this...
Updated by David Taht over 1 year ago
---------- Forwarded message ----------
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, Jan 26, 2012 at 3:33 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: jdb@comx.dk
Cc: Dave Taht <dave.taht@gmail.com>
Le jeudi 26 janvier 2012 à 15:04 +0100, Jesper Dangaard Brouer a écrit :
On Thu, 2012-01-26 at 14:50 +0100, Jesper Dangaard Brouer wrote:
Ah. I was not running with a limit that low, but at 50-300.
With a packet limit <= 128, the results should be reproduce-able, as the "depth" is fixed on 128.
I'm using a too old "tc" to adjust the "depth".
Just, git downloaded the latest "tc".
And I think I mean "flows" instead of "depth" in the above statements, sorry! (after reading the just committed manual update by Eric ;-))
I'm still a little confused. How is the "flows" related to the hash table size "divisor"? Should I reduce "divisor" to 256, to save memory, as in my setup I have a sfq per customer (per interface). As, I guess I can live with some collisions with 127 flows?
Hmm, if you reduce hash divisor, you also favor hash collisions.
It all depends on the memory you want to save I would say.
1024 divisor is quite good for 127 flows I guess.
Updated by David Taht over 1 year ago
---------- Forwarded message ----------
From: Jesper Dangaard Brouer <jdb@comx.dk>
Date: Mon, Jan 30, 2012 at 11:17 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
On Thu, 2012-01-26 at 12:12 +0100, Eric Dumazet wrote:
To avoid this problem, you can limit number of flows to N and number of packets to N+1 : This way there is always one flow with 2 packets.
That does help, with packets = flows+1, but its not stable and
"falls-back" into the bad mode (of a single flow winning). I have to
increase it some more to get good results.
E.g. limit 20 and flows 10, allows 10 TCP flows to share the bandwidth,
BUT the ping program does NOT get any packets through (thus, no new
connections can be established, which is bad).
Quite frankly, SFQ should not be limited to 10 packets.
I can provoke the exact same behaviour with the default 127 packets, and
129 TCP flows (thrulay -m 129 10.10.61.2). Only a single flow out of
129 gets any throughput.
Side note:
I noticed that adding SFQ without any params, will give limit 127p and
flows 128.
albpd42:~# ~jdb/git/iproute2/tc/tc -d -s q ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 2 direct_packets_stat 0 ver 3.17
Sent 405048608 bytes 293646 pkt (dropped 74779, overlimits 660160 requeues 0)
backlog 0b 127p requeues 0
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 flows
128/1024 divisor 1024
Sent 12350969 bytes 8748 pkt (dropped 3369, overlimits 0 requeues 0)
backlog 189382b 127p requeues 0
But I cannot create the same manually config by:
albpd42:~# jdb/git/iproute2/tc/tc qdisc del dev eth61 parent 1:2 handle 2:
albpd42:# jdb/git/iproute2/tc/tc qdisc add dev eth61 parent 1:2
handle 2: sfq limit 127 flows 128
albpd42:# ~jdb/git/iproute2/tc/tc -d -s q ls dev eth61
qdisc htb 1: root refcnt 65 r2q 10 default 2 direct_packets_stat 0 ver 3.17
Sent 410090878 bytes 297341 pkt (dropped 75611, overlimits 668278 requeues 0)
backlog 0b 0p requeues 0
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 flows
127/1024 divisor 1024
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
Guess it makes sense that I'm not allowed to create more flows than
packets, but the default config does that... strange.
Updated by David Taht over 1 year ago
---------- Forwarded message ----------
From: Jesper Dangaard Brouer <jdb@comx.dk>
Date: Thu, Jan 26, 2012 at 1:15 PM
Subject: Re: Problem with the SFQ patch: dont put new flow at the end of flows
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
On Thu, 2012-01-26 at 11:57 +0100, Jesper Dangaard Brouer wrote:
The real-life case that I'm worried about with this drop behaviour, is the fairness among flows, e.g. TCP flows. Once we reach the queue limit, number of TCP flows, no new TCP flows can be established, thus being unfair to new flows. Guess, I have to find a test showing this...
The fairness among flows seems to be broken with the patch.
In using the tool "thrulay" as it gives a nice output per TCP
connection. The important column is "Mb/s" (is also have RTT which is
also very useful, when troubleshooting latency/queue issues, but its not
important here).
Command starting 12 TCP flows:
albpd42:~# thrulay -m 12 10.10.61.2
I have only included the sum after 60 sec test.
First test without patch:
- local window = 262142B; remote window = 262142B
- block size = 8192B
- MTU: 1500B; MSS: 1448B; Topology guess: Ethernet/PPP
- MTU = getsockopt(IP_MTU); MSS = getsockopt(TCP_MAXSEG)
- test duration = 60s; reporting interval = 1s
[... cut ...]
#(ID) begin, s end, s Mb/s RTT, ms: min avg max
#( 0) 0.000 60.000 0.085 614.475 10510.368 37648.921
#( 1) 0.000 0.000 0.000 0.000 0.000 0.000
#( 2) 0.000 60.000 0.001 49749.240 49749.240 49749.240
#( 3) 0.000 60.000 0.198 389.384 4483.202 7644.653
#( 4) 0.000 60.000 0.167 401.534 5262.690 16491.171
#( 5) 0.000 60.000 0.201 413.688 4410.953 9893.164
#( 6) 0.000 60.000 0.167 438.016 5293.815 10324.560
#( 7) 0.000 60.000 0.226 468.394 3982.983 5716.139
#( 8) 0.000 60.000 0.218 486.621 4117.844 6094.251
#( 9) 0.000 60.000 0.189 517.020 4790.583 10386.741
#(10) 0.000 60.000 0.209 480.531 4331.842 7406.696
#(11) 0.000 60.000 0.223 504.839 4001.091 5681.422
#(**) 0.000 60.000 1.884 389.384 8411.218 49749.240
Notice the fairshare between the flows.
Also notice that flow id (1) has an "end,s" of 0, I think than means
that it was not able establish the connection.
Second test with the patch:
albpd42:~# thrulay -m 12 10.10.61.2- local window = 262142B; remote window = 262142B
- block size = 8192B
- MTU: 1500B; MSS: 1448B; Topology guess: Ethernet/PPP
- MTU = getsockopt(IP_MTU); MSS = getsockopt(TCP_MAXSEG)
- test duration = 60s; reporting interval = 1s
#(ID) begin, s end, s Mb/s RTT, ms: min avg max
#( 0) 0.000 0.000 0.000 0.000 0.000 0.000
#( 1) 0.000 0.000 0.000 0.000 0.000 0.000
#( 2) 0.000 0.000 0.000 0.000 0.000 0.000
#( 3) 0.000 0.000 0.000 0.000 0.000 0.000
#( 4) 0.000 60.000 1.902 127.884 489.294 583.101
#( 5) 0.000 0.000 0.000 0.000 0.000 0.000
#( 6) 0.000 0.000 0.000 0.000 0.000 0.000
#( 7) 0.000 0.000 0.000 0.000 0.000 0.000
#( 8) 0.000 0.000 0.000 0.000 0.000 0.000
#( 9) 0.000 0.000 0.000 0.000 0.000 0.000
#(10) 0.000 0.000 0.000 0.000 0.000 0.000
#(11) 0.000 0.000 0.000 0.000 0.000 0.000
#(**) 0.000 60.000 1.902 127.884 40.774 583.101
This result show a very unfair sharing. Only one TCP flow are getting
data through! -- I expected to see a more unfair behaviour towards new
flows, but this is a bit extreme... something else must be wrong! Cannot
understand this change should make it that unfair!
I have tried to re-run the tests several times. I have also tested with
iperf, and I get similar results:
[ 14] 0.0-60.0 sec 13.6 MBytes 1.90 Mbits/sec
[ 12] 0.0-60.1 sec 56.0 KBytes 7.64 Kbits/sec
[ 9] 0.0-60.1 sec 16.0 KBytes 2.18 Kbits/sec
[ 7] 0.0-60.2 sec 16.0 KBytes 2.18 Kbits/sec
[ 6] 0.0-60.2 sec 16.0 KBytes 2.18 Kbits/sec
[SUM] 0.0-60.2 sec 14.2 MBytes 1.98 Mbits/sec
[ 5] 0.0-60.4 sec 16.0 KBytes 2.17 Kbits/sec
[ 3] 0.0-60.5 sec 16.0 KBytes 2.17 Kbits/sec
[ 11] 0.0-60.5 sec 16.0 KBytes 2.17 Kbits/sec
[ 8] 0.0-74.5 sec 16.0 KBytes 1.76 Kbits/sec
[ 13] 0.0-79.6 sec 16.0 KBytes 1.65 Kbits/sec
Flow [14] gets 1.9Mbit/s, while the rest gets below 7.64 Kbits/sec.
I don't understand why this small patch changed the fairness so much?!?
Updated by David Taht over 1 year ago
Jasper:
I have duplicated your starvation problem, even with sfqred on cerowrt bql-30.
I had hoped that setting the depth would help. It didn't.
on the 3.3 based laptop:
tc -b <AAA
qdisc del dev eth0 root
qdisc add dev eth0 root handle 1: est 1sec 4sec htb default 1
class add dev eth0 parent 1: classid 1:1 est 1sec 4sec htb rate
4000kbit mtu 1500 quantum 1500
qdisc add dev eth0 parent 1:1 handle a: est 1sec 4sec sfq limit 200
headdrop perturb 60000 flows 500 divisor 16384 redflowlimit 40000
depth 24 min 4500 max 9000 probability 0.20 ecn harddrop
AAA
The cerowrt box is running:
root@OpenWrt:~# tc qdisc show dev se00
qdisc sfq a: root refcnt 2 limit 127p quantum 1514b depth 127 divisor 1024
running 10 iperfs (iperf -t 60 -c 172.30.42.1) from the laptop to the
cerowrt box.
Note the first iperf completely starves out the last 9 connection attempts, so
they don't even start.
[ 4] 0.0-60.5 sec 27.0 MBytes 3.74 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41471
[ 4] 0.0- 1.6 sec 768 KBytes 3.83 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41472
[ 4] 0.0- 0.4 sec 256 KBytes 5.30 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41473
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41474
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41475
[ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41476
[ 4] 0.0- 0.4 sec 256 KBytes 5.43 Mbits/sec
[ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41479
[ 4] 0.0- 1.5 sec 768 KBytes 4.23 Mbits/sec
**
For giggles, I tried it with the same sfqred setting on both sides, with the
same sad result.
For the sake of argument, I'm going to try this with a similar qfq setup.
Updated by David Taht over 1 year ago
Sigh. I went back and installed iperf-mt, rather than iperf, rebuilt a kernel
for cerowrt with CONFIG_NO_HZ and CONFIG_HZ_256 (which is how I
normally build it)
and can't trigger it now.
On Wed, Feb 1, 2012 at 11:37 PM, Dave Taht <dave.taht@gmail.com> wrote:
Jasper:
I have duplicated your starvation problem, even with sfqred on cerowrt bql-30.
I had hoped that setting the depth would help. It didn't.
on the 3.3 based laptop:
tc -b <AAA qdisc del dev eth0 root qdisc add dev eth0 root handle 1: est 1sec 4sec htb default 1 class add dev eth0 parent 1: classid 1:1 est 1sec 4sec htb rate 4000kbit mtu 1500 quantum 1500 qdisc add dev eth0 parent 1:1 handle a: est 1sec 4sec sfq limit 200 headdrop perturb 60000 flows 500 divisor 16384 redflowlimit 40000 depth 24 min 4500 max 9000 probability 0.20 ecn harddrop AAA
The cerowrt box is running:
root@OpenWrt:~# tc qdisc show dev se00 qdisc sfq a: root refcnt 2 limit 127p quantum 1514b depth 127 divisor 1024
running 10 iperfs (iperf -t 60 -c 172.30.42.1) from the laptop to the cerowrt box.
Note the first iperf completely starves out the last 9 connection attempts, so they don't even start.
[ 4] 0.0-60.5 sec 27.0 MBytes 3.74 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41471 [ 4] 0.0- 1.6 sec 768 KBytes 3.83 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41472 [ 4] 0.0- 0.4 sec 256 KBytes 5.30 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41473 [ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41474 [ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41475 [ 4] 0.0- 0.4 sec 256 KBytes 5.41 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41476 [ 4] 0.0- 0.4 sec 256 KBytes 5.43 Mbits/sec [ 4] local 172.30.42.1 port 5001 connected with 172.30.42.17 port 41479 [ 4] 0.0- 1.5 sec 768 KBytes 4.23 Mbits/sec
**
For giggles, I tried it with the same sfqred setting on both sides, with the same sad result.
For the sake of argument, I'm going to try this with a similar qfq setup.
Updated by David Taht over 1 year ago
On Thu, Feb 2, 2012 at 12:05 PM, Jesper Dangaard Brouer <jdb@comx.dk> wrote:
On Thu, 2012-02-02 at 01:44 +0000, Dave Taht wrote:
Sigh. I went back and installed iperf-mt, rather than iperf, rebuilt a kernel for cerowrt with CONFIG_NO_HZ and CONFIG_HZ_256 (which is how I normally build it)
and can't trigger it now.
That make perfect sense to me. By running a multi threaded version of iperf, the flows/packets gets a chance to be scheduled individually, and thus the packets have, a better chance, of not getting "grouped-together" in the qdiscs queue. You further improve this by using CONFIG_NO_HZ.
I would still claim, that this situation can still happen for real-life applications, which have many TCP connections, and have not implemented this in a multi-threaded manor.
Perhaps you mis-understood the 'sigh'. I have little doubt this problem
can occur. I was happy that I could reproduce it with a non-default
setup for me.
I will continue to try to reproduce. Another thought is merely that non-threaded
iperf will refuse a connection entirely...
As for CONFIG_NO_HZ, it was my impression that HTB performed more
accurately and better with that and hires timers on.
Updated by Dave Täht about 1 year ago
I have temporarily restored this patch to cerowrt, because I have some hope htb can be fixed....
Updated by Dave Täht about 1 year ago
- Target version changed from 14 to 1st Public Cerowrt release
I have thus far found it impossible to create bad behavior when sfqred is used properly - flows prime or near prime in the 24-53 range, low min, 200-300 packets, and redflowlimit. I have created a kernel module that can change the behavior from hoq to toq at module load time, and am testing elsewhere.
Pushing forward to head of queue has wonderful effects on mice, dns, and ants.
The behavior I see in shooting elephants, is fine by me.
Updated by Dave Täht about 1 year ago
- Status changed from New to Closed