Bug #379
3.3.4-3 router crashes under heavy load
| Status: | New | Start date: | |
|---|---|---|---|
| Priority: | Urgent | Due date: | |
| Assignee: | | % Done: | 0% |
| Category: | Linux Kernel | Spent time: | 4.50 hours |
| Target version: | 1st Public Cerowrt release | | |
Description
So I started doing long-duration, heavy access tests, driving things with my fastest three boxes, encrypted and unencrypted wireless, and two machines driving through the internal ethernet at gigE speeds.
After about 200 sec of heavy load, the router resets. Also, in the general case it becomes hard to get a connection to the webserver for (for example) streaming audio, ssh won't start due to the loadavg (have to fix that in xinetd), dns starts acting up, and babeld stops transmitting routes.
Now, this is truly abnormal use: 250+Mbit going through the router full time rather than 20Mbit or so.
So far this is ipv4 only. Before now I was mostly testing ipv6, with fairly short (60 second) durations, and mostly over ethernet.
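For a sense of scale, here is a rough back-of-envelope sketch of what that jump in load means in packets per second. The report does not state packet sizes, so the 1500-byte figure is an assumption, framing overhead is ignored, and the function name is made up for illustration.

```python
def packets_per_second(rate_mbit, packet_bytes):
    """Packets per second needed to carry rate_mbit, ignoring framing overhead."""
    return rate_mbit * 1_000_000 / (packet_bytes * 8)


# Assumption: full-size 1500-byte packets; the report gives only bit rates.
heavy = packets_per_second(250, 1500)   # the abnormal test load
normal = packets_per_second(20, 1500)   # the "20Mbit or so" typical load

print(f"heavy load:  {heavy:.0f} packets/sec")
print(f"normal load: {normal:.0f} packets/sec")
print(f"ratio:       {heavy / normal:.1f}x")
```

Smaller packets push the per-packet work (interrupts, routing lookups) up proportionally at the same bit rate.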
Things tried:
0) Reducing the watchdog time to reset to 1 second rather than 5. This seemed to help somewhat, but I'm going to rule it out.
1) Turning off the qdiscs. This was mildly amusing, as I was perturbed by seeing ping times go from .2sec to 20 or 30ms on GigE with pfifo_fast. It's been a long time since these puppies were configured as drop tail systems.
Anyway, still crashes. So I think we can regard the aqm system as solid. Finally.
I do want to turn hoq-sfq off for further tests, tho.
3) Bumping up BQL. Performance improves slightly, still crashes.
4) Not using the wireless at all. So far it's survived 2000 seconds of abuse, pounding 250+Mbit of ipv4 through it, from 5 streams from two boxes.
So things point at some interaction with wireless.
Theories are:
- hostapd can't get enough cpu time to run
- the rngd daemon can't get enough cpu time to run
- we have a memory leak somewhere in the wireless stack
For the longest time I've had this thing running at HZ_256. Anyway, I'm going to leave ethernet loaded up overnight, then try wireless all by itself for a few hours.
I have other fish to fry right now.
Related issues
| related to Cerowrt - Bug #371: route table updates perturbed by alignment errors | New | 04/20/2012 |
| related to Cerowrt - Bug #360: Patchwork [OpenWrt-Devel] ar71xx TCP/IPsec unaligned inst... | Closed | 04/11/2012 |
| related to Cerowrt - Bug #401: major retries in the VI AMPDU queue on ath9k | Closed | 07/08/2012 |
| related to Cerowrt - Bug #402: Tail drop behavior on fq_codel on ack-mostly streams | New | 07/12/2012 |
| related to Cerowrt - Bug #385: more traps | New | 05/09/2012 |
| related to Cerowrt - Bug #386: QFQ locks up | New | 05/09/2012 |
History
Updated by Dave Täht about 1 year ago
- Category set to Linux Kernel
- Assignee set to Dave Täht
- Target version set to 1st Public Cerowrt release
Please note: I'm HAPPY to have ethernet working so well. Now that the traps are gone it should be possible to do decent analysis with oprofile, and see what, if anything, can be sped up.
I expected trouble with wireless, but didn't have any until today... because I was doing the engineer thing and testing wireless by itself (driving it at 90+Mbit), and ethernet by itself, not both together. Sigh.
I can't quite rule out iptables, and I should probably also test ipv6 under this scenario.
Updated by Dave Täht about 1 year ago
Another nice thing is that tcp_rr performance is astoundingly good: 823 transactions per second, even with 8 streams beating up the box.
Updated by Robert Bradley about 1 year ago
- File nf_nat_conntrack_tcp.patch added
If you're thinking this is iptables and unaligned-access related, I did manage to find a couple of unaligned accesses in the nat (not relevant for wired->wireless?) and connection-tracking code for TCP. Is it possible to just rmmod the relevant modules and test without iptables or AQM tracking the connections? That would save time instead of running yet another kernel build with educated guesses at patches.
Updated by Dave Täht about 1 year ago
We're back at the point where we can either oprofile (my preferred method), or just run tests and watch the unaligned traps happen or not. 1/sec is dealable...
As for this problem and patches, there is patch review on the openwrt list going on...
See thread:
http://www.mail-archive.com/openwrt-devel@lists.openwrt.org/msg13520.html
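The "run tests and watch the traps" approach comes down to sampling a counter and turning it into a rate. A minimal sketch, assuming the trap count can be read from somewhere as an integer; the helper name is hypothetical, and `read_counter` is left to the caller because the report does not say where the count comes from:

```python
import time


def trap_rate(read_counter, interval_s=1.0, sleep=time.sleep):
    """Return counter increments per second over one sampling interval.

    read_counter is a caller-supplied function returning the current
    trap count as an int (hypothetical; the source of the count is an
    assumption, not something the report specifies).
    """
    before = read_counter()
    sleep(interval_s)
    after = read_counter()
    return (after - before) / interval_s


# Usage with canned samples instead of a live counter:
samples = iter([100, 103])
rate = trap_rate(lambda: next(samples), interval_s=2.0, sleep=lambda s: None)
print(f"{rate} traps/sec")
```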
Updated by Dave Täht about 1 year ago
oh, and to answer your question more directly, I think we have a new problem here in that it's not been possible to totally saturate this beast on interrupts before now...
Updated by Robert Bradley about 1 year ago
Yes, the real question was, "How related is this to #360 and #371?" Apart from those two hiding it, I'm not sure we can say it is related.
If it's interrupt-related, I'm going to guess that the lack of napi_poll support in the ath9k driver doesn't help. (The Ethernet driver has it, so you should be able to route LAN->WAN with no problems.)
Updated by David Taht about 1 year ago
Felix:
Can napi even work on wireless devices?
On Tue, May 1, 2012 at 10:05 AM, <cerowrt@lists.bufferbloat.net> wrote:
Issue #379 has been updated by Robert Bradley.
Yes, the real question was, "How related is this to #360 and #371?" Apart from those two hiding it, I'm not sure we can say it is related.
If it's interrupt-related, I'm going to guess that the lack of napi_poll support in the ath9k driver doesn't help. (The Ethernet driver has it, so you should be able to route LAN->WAN with no problems.)
Updated by Robert Bradley about 1 year ago
The ieee80211_ops struct seems to think it can. I haven't found a driver that uses it yet - rtl8180 had it (http://www.spinics.net/lists/linux-wireless/msg53741.html) but that got reverted (http://git.itanic.dy.fi/?p=linux-stable;a=patch;h=a6d27d2ac89359f84c1a559b5530967ff671d269).
Updated by David Taht about 1 year ago
Hey John, I was wondering why napi was so hard on wireless?
On Tue, May 1, 2012 at 11:13 AM, <cerowrt@lists.bufferbloat.net> wrote:
Issue #379 has been updated by Robert Bradley.
The ieee80211_ops struct seems to think it can. I haven't found a driver that uses it yet - rtl8180 had it (http://www.spinics.net/lists/linux-wireless/msg53741.html) but that got reverted (http://git.itanic.dy.fi/?p=linux-stable;a=patch;h=a6d27d2ac89359f84c1a559b5530967ff671d269).
Updated by Felix Fietkau about 1 year ago
On 2012-05-01 7:06 PM, Dave Taht wrote:
Felix:
Can napi even work on wireless devices?
It can, mac80211 has support for it, but it's not implemented in the driver yet.
- Felix
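For readers unfamiliar with NAPI, the toy model below sketches why it matters under interrupt saturation: the first packet raises an interrupt, the driver masks further receive interrupts, and a poll routine then handles packets in batches up to a budget, unmasking only once the ring is drained. This is an illustration only, not ath9k or mac80211 code; the class, method names, budget, and burst sizes are all made up.

```python
from collections import deque


class ToyNic:
    """Toy receive path; models the NAPI idea only, not real driver code."""

    def __init__(self, napi, budget=16):
        self.napi = napi
        self.budget = budget
        self.rx_ring = deque()
        self.irq_enabled = True
        self.interrupts = 0
        self.delivered = 0

    def packet_arrives(self, pkt):
        self.rx_ring.append(pkt)
        if self.irq_enabled:
            self.interrupts += 1
            if self.napi:
                self.irq_enabled = False  # mask RX interrupts, defer to poll()
            else:
                self._deliver(1)          # classic path: one interrupt per packet

    def poll(self):
        """One poll pass: handle at most `budget` packets."""
        done = self._deliver(self.budget)
        if done < self.budget:            # ring drained: leave polling mode
            self.irq_enabled = True
        return done

    def _deliver(self, limit):
        n = 0
        while self.rx_ring and n < limit:
            self.rx_ring.popleft()
            n += 1
        self.delivered += n
        return n


def run(napi, bursts=100, burst_size=8):
    """Feed bursts of packets; return (interrupts taken, packets delivered)."""
    nic = ToyNic(napi)
    for b in range(bursts):
        for i in range(burst_size):
            nic.packet_arrives((b, i))
        while nic.napi and not nic.irq_enabled:
            nic.poll()
    return nic.interrupts, nic.delivered


print("per-packet interrupts:", run(napi=False))
print("NAPI-style polling:   ", run(napi=True))
```

In this model the same packets are delivered either way, but the polled path takes one interrupt per burst instead of one per packet.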
Updated by Dave Täht about 1 year ago
See #385. When it happens, which is REALLY RARE, it's pretty catastrophic to the streams.
Everything resets.
Easily hit with a hammer on a patch, which I'll do if I get enough energy. Shouldn't do that tho.
I also note that I've been unable to crash it with the 3.3.5-3 build + codel, which cheers me up, relatively. I haven't tried sfqred.
Updated by Dave Täht 11 months ago
- Priority changed from Normal to Urgent
This is going to require some serial port debugging and brain cells I do not have at present.
Updated by Adam Gensler 10 months ago
Hi Dave!
I've been watching your progress here for a while. I've recently started some work using OpenWRT (I know, not the same project), and I'm seeing crashes there as well, also under heavy load. It runs for about 400 seconds under very heavy load and then just reboots. The serial console shows no output when the reload occurs; the box simply starts rebooting. Specifics:
I'm using an alix 2D13 board. I have several and the problem is seen on all of them.
I'm using very vanilla OpenWRT builds with a near-default configuration. In fact, I've turned off dnsmasq and the firewall. I have no QoS configured. This is literally just port-to-port routing, nothing else.
I have two of the three ports connected to a Smartbits test device. I'm sending 64-byte packets between eth0 and eth1. eth2 is not connected. The tests start at 10% interface load and run for 90 seconds. If the test passes (<0.01% packet loss), it increases the load. I see the following behavior:
10% load passes
55% load fails
32.5% load fails
21.5% load fails
60 seconds into the next test the device reboots. This is repeatable every time.
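The load sequence above looks like a bisection between the last passing and last failing load, as is common in RFC 2544-style throughput tests. A minimal sketch under that assumption follows; the function name, the 100% upper bound, the 1% resolution, and the 15% pass threshold are all hypothetical, and the real tester's algorithm is not documented here.

```python
def search_max_load(passes, low=10.0, high=100.0, resolution=1.0):
    """Bisect for the highest load (% of line rate) that still passes.

    passes(load) runs one trial and returns True on success.
    Returns (best_passing_load, trials); best is None if `low` fails.
    """
    trials = []
    ok = passes(low)
    trials.append((low, ok))
    if not ok:
        return None, trials
    best = low
    while high - best > resolution:
        mid = (best + high) / 2
        ok = passes(mid)
        trials.append((mid, ok))
        if ok:
            best = mid
        else:
            high = mid
    return best, trials


# Hypothetical device that keeps loss under the limit only up to 15% load.
best, trials = search_max_load(lambda load: load <= 15.0)
for load, ok in trials:
    print(f"{load:8.4f}% load {'passes' if ok else 'fails'}")
print(f"highest passing load found: {best}%")
```

With these assumptions the first trials are 10%, 55%, 32.5%, and 21.25%, which is close to the reported sequence (the report lists 21.5% for the last one).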
I guess what I'm getting at is I think perhaps this is an OpenWRT / kernel problem and not specific to CeroWRT. In fact, there's a recently opened ticket on OpenWRT that describes something similar, though details there are sparse:
https://dev.openwrt.org/ticket/11882
I'm going to build a few images of OpenWRT to see if I can narrow down where it started crashing. Previous builds used to pass my entire test suite (64-byte to 1500-byte packets). Now I can't get past the 4th iteration of the first frame size. I suspect the move to kernel 3.3 in 31753 for the alix2 target is the cause. If you're interested, I'll report back my findings.
Updated by Robert Bradley 10 months ago
Adam: that link was quite interesting! Digging into it, if I understand correctly, the claim is that the lack of read/write memory barriers in via-rhine.c is the cause of their crashes. I noticed that in CeroWRT, the ag71xx wired driver has rmb()/wmb() calls, but the ath9k wireless driver does not. It may be worth modifying the ath9k_ioread32 and ath9k_iowrite32 functions to add these as a test, although I would expect this to only affect multi-core systems.
Updated by Robert Bradley 10 months ago
Continuing my previous post, here is the patch to add IO memory barriers to ath9k. There may well be other points where memory barriers would make sense too if this is really the cause.