Bug #379

3.3.4-3 router crashes under heavy load

Added by David Taht on Apr 30, 2012. Updated on Jun 24, 2014.
Status: Closed. Priority: Urgent. Assignee: Dave Täht.

Description

So I started doing long-duration, heavy-access tests, driving things
with my fastest three boxes, encrypted and unencrypted wireless, and
two machines driving through the internal ethernet at GigE speeds.

After about 200 sec of heavy load, the router resets. Also, in the
general case it becomes hard to get a connection to the webserver for
(for example) streaming audio, ssh won’t start due to the loadavg
(have to fix that in xinetd), dns starts acting up, and babeld
stops transmitting routes.

Now, this is truly abnormal use - 250+Mbits going through the router
full time rather than 20Mbit or so.

So far this is ipv4 only. I was mostly testing ipv6 and fairly short
(60 second) durations and mostly ethernet before now.

Things tried:

0) reducing the watchdog time to reset to 1 second rather than 5. This
seemed to help somewhat, but I’m going to rule it out.

1) turning off the qdiscs. This was mildly amusing as I was perturbed
by seeing ping times go from .2sec to 20 or 30ms on GigE with
pfifo_fast. It’s been a long time since these puppies were configured
as drop tail systems.

Anyway, still crashes. So I think we can regard the aqm system as
solid. Finally.

I do want to turn hoq-sfq off for further tests, tho.

3) bumping up BQL. Performance improves slightly, still crashes.

4) Not using the wireless at all. So far it’s survived 2000 seconds of
abuse, pounding 250+Mbit of ipv4 through it, in 5 streams from two
boxes.

So things point at some interaction with wireless.

theories are:

hostapd can’t get enough cpu time to run
the rngd daemon can’t get enough cpu time to run
we have a memory leak somewhere in the wireless stack
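
One way to probe the memory-leak theory is to sample free memory on the router during a run and look at the trend. The Python sketch below is only an illustration: the function names and sample values are made up, and it assumes the usual `MemFree:  <n> kB` line format of `/proc/meminfo`. A steadily falling MemFree is a hint rather than proof of a leak, since caches and buffers also consume free memory.

```python
import re


def parse_memfree_kb(meminfo_text):
    """Extract the MemFree value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemFree:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemFree line found")
    return int(match.group(1))


def slope_kb_per_sec(samples):
    """Least-squares slope of (seconds, free_kb) samples.

    A clearly negative slope over a long run suggests memory is being
    consumed and not returned.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


# Hypothetical samples: (seconds since start, MemFree in kB).
example_samples = [(0, 30000), (10, 29000), (20, 28000)]
example_slope = slope_kb_per_sec(example_samples)
```

On the router itself the samples would come from reading `/proc/meminfo` every few seconds while the load test runs, ideally logged off-box so they survive the reset.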

For the longest time I’ve had this thing running at HZ_256. Anyway I’m
going to leave ethernet loaded up overnight, then try wireless all by
itself for a few hours.

I have other fish to fry right now.

Attachments

  • nf_nat_conntrack_tcp.patch (text/x-patch; 1.1 kiB) Unaligned access fixes for nf_conntrack_proto_tcp and ipv4/nf_nat_proto_tcp Robert Bradley May 1, 2012

History

Updated by Dave Täht on Apr 30, 2012.
Please note: I’m HAPPY to have ethernet working so well. Now that the traps are gone it should be possible to do decent analysis with oprofile, and see what, if anything, can be sped up.

I expected trouble with wireless, but didn’t have any until today… because I was doing the engineer thing and testing wireless by itself (driving it at 90+Mbit), and ethernet by itself, not both together. Sigh.

I can’t quite rule out iptables, and I should probably also test ipv6 under this scenario.

Updated by Dave Täht on Apr 30, 2012.
Another nice thing is that tcp_rr performance is astoundingly good: 823 transactions per second, even with 8 streams beating up the box.
Updated by Robert Bradley on May 1, 2012.
If you’re thinking this is iptables and unaligned-access related, I did manage to find a couple of unaligned accesses in the nat (not relevant for wired->wireless?) and connection-tracking code for TCP. Is it possible to just rmmod the relevant modules and test without iptables or AQM tracking the connections? That would save time instead of running yet another kernel build with educated guesses at patches.
Updated by Dave Täht on May 1, 2012.
We’re back at the point where we can either oprofile (my preferred method), or just run tests and watch the unaligned traps happen or not. 1/sec is dealable…

As for this problem and patches, there is patch review on the openwrt list going on…

See thread:

http://www.mail-archive.com/openwrt-devel@lists.openwrt.org/msg13520.html

Updated by Dave Täht on May 1, 2012.
oh, and to answer your question more directly, I think we have a new problem here in that it’s not been possible to totally saturate this beast on interrupts before now…
Updated by Robert Bradley on May 1, 2012.
Yes, the real question was, “How related is this to #360 and #371?” Apart from those two hiding it, I’m not sure we can say it is related.

If it’s interrupt-related, I’m going to guess that the lack of napi_poll support in the ath9k driver doesn’t help. (The Ethernet driver has it, so you should be able to route LAN->WAN with no problems.)

Updated by David Taht on May 1, 2012.
Felix:

Can napi even work on wireless devices?

On Tue, May 1, 2012 at 10:05 AM, cerowrt@lists.bufferbloat.net wrote:
>
> Issue #379 has been updated by Robert Bradley.
> Bug #379: 3.3.4-3 router crashes under heavy load
> https://www.bufferbloat.net/issues/379
> Author: David Taht
> Status: New
> Priority: Normal
> Assignee: Dave Täht
> Category: Linux Kernel
> Target version: 1st Public Cerowrt release

Updated by Robert Bradley on May 1, 2012.
The ieee80211_ops struct seems to think it can. I haven’t found a driver that uses it yet - rtl8180 had it (http://www.spinics.net/lists/linux-wireless/msg53741.html) but that got reverted (http://git.itanic.dy.fi/?p=linux-stable;a=patch;h=a6d27d2ac89359f84c1a559b5530967ff671d269).
Updated by David Taht on May 1, 2012.
Hey John, I was wondering why napi is so hard on wireless?


Updated by Felix Fietkau on May 1, 2012.
On 2012-05-01 7:06 PM, Dave Taht wrote:
> Felix:
>
> Can napi even work on wireless devices?
It can, mac80211 has support for it, but it’s not implemented in the
driver yet.

- Felix
Updated by Dave Täht on May 9, 2012.
See #385. When it happens, which is REALLY RARE, it’s pretty catastrophic to the streams.

Everything resets.

Easily hit with a hammer on a patch, which I’ll do if I get enough energy. Shouldn’t do that tho

I also note that I’ve been unable to crash it with the 3.3.5-3 build + codel, which cheers me up, relatively. I haven’t tried sfqred.

Updated by Dave Täht on Jul 12, 2012.
This is going to require some serial port debugging and brain cells I do not have at present.
Updated by Adam Gensler on Jul 19, 2012.
Hi Dave!

I’ve been watching your progress here for a while. I’ve recently started some work using OpenWRT (I know, not the same project), and I’m seeing crashes there as well, also under heavy load. It runs about 400 seconds under very heavy load and then just reboots. The serial console shows no output when the reload occurs; the box simply starts rebooting. Specifics:

I’m using an alix 2D13 board. I have several and the problem is seen on all of them.

I’m using very vanilla OpenWRT builds with a near-default configuration. In fact, I’ve turned off dnsmasq and the firewall. I have no QoS configured. This is literally just port-to-port routing, nothing else.

I have two of the three ports connected to a Smartbits test device. I’m sending 64 byte packets between eth0 and eth1. eth2 is not connected. The tests start at 10% interface load and run for 90 seconds. If the test passes (<0.01% packet loss), it increases the load. I see the following behavior:

10% load passes
55% load fails
32.5% load fails
21.5% load fails
60 seconds into the next test the device reboots. This is repeatable every time.
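
The load steps reported above look like a bisection between the last passing and the last failing load. A minimal Python sketch of that pattern follows. The 100% ceiling, the 1% stopping resolution, and the `passes` callback are assumptions for illustration, not details of the Smartbits tool. Note that pure bisection gives 21.25% as the fourth step, where the report above lists 21.5%.

```python
def search_loads(passes, start=10.0, ceiling=100.0, resolution=1.0):
    """Bisect for the highest passing load.

    passes:     callable taking a load percentage, returning True on pass
                (hypothetical stand-in for one 90-second test run).
    start:      first load to try, in percent.
    ceiling:    assumed upper bound of the search, in percent.
    resolution: stop once the pass/fail window is this narrow.

    Returns (loads_tried_in_order, highest_passing_load).
    """
    tried = []
    lo, hi = 0.0, ceiling
    load = start
    while True:
        tried.append(load)
        if passes(load):
            lo = load   # raise the floor to the last passing load
        else:
            hi = load   # lower the ceiling to the last failing load
        if hi - lo <= resolution:
            break
        load = (lo + hi) / 2
    return tried, lo


# Simulated device that only survives loads up to 10%.
tried, best = search_loads(lambda load: load <= 10.0)
```

With the simulated device above, the first four loads tried are 10, 55, 32.5, and 21.25 percent, which is close to the sequence reported.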

I guess what I’m getting at is I think perhaps this is an OpenWRT / kernel problem and not specific to CeroWRT. In fact, there’s a recently opened ticket on OpenWRT that describes something similar, though details there are sparse:

https://dev.openwrt.org/ticket/11882

I’m going to build a few images of OpenWRT to see if I can narrow down where it started crashing. Previous builds used to pass my entire test suite (64 byte to 1500 byte packets). Now I can’t get past the 4th iteration of the first frame size. I suspect the move to kernel 3.3 in 31753 for the alix2 target is the cause. If you’re interested I’ll report back my findings.

Updated by Robert Bradley on Aug 6, 2012.
Adam: that link was quite interesting! Digging into it, if I understand correctly, the claim is that the lack of read/write memory barriers in via-rhine.c is the cause of their crashes. I noticed that in CeroWRT, the ag71xx wired driver has rmb()/wmb() calls, but the ath9k wireless driver does not. It may be worth modifying the ath9k_ioread32 and ath9k_iowrite32 functions to add these as a test, although I would expect this to only affect multi-core systems.
Updated by Robert Bradley on Aug 6, 2012.
Continuing my previous post, here is the patch to add IO memory barriers to ath9k. There may well be other points where memory barriers would make sense too if this is really the cause.
Updated by Dave Täht on Jun 24, 2014.

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.