Bug #195

ag71xx ethernet appears to be doing many unaligned transfers

Added by Dave Täht almost 2 years ago. Updated about 1 year ago.

Status:Closed Start date:06/04/2011
Priority:Normal Due date:
Assignee:Felix Fietkau % Done:

100%

Category:- Spent time: 76.50 hours
Target version:1st Public Cerowrt release

Description

do_ade accounts for over 22% of the runtime on stuff that stresses the ethernet driver.

this is trapping on unaligned accesses. from somewhere.


Related issues

related to Cerowrt - Feature #220: echo 1 > /proc/sys/net/ipv4/tcp_low_latency gives good re... Closed 08/02/2011
related to Cerowrt - Bug #216: Infiinite retry for ping Closed 07/28/2011
related to Cerowrt - Bug #352: Major differences between ipv6 and ipv4 performance Closed

History

Updated by Dave Täht almost 2 years ago

  • Priority changed from Normal to Urgent

Updated by Dave Täht almost 2 years ago

  • Target version set to 1st Public Cerowrt release

I have not had time to run oprofile on the code of late, but when I was testing iperf at 100+Mbit per second on ethernet, over ethernet only, (not over wireless) it was spending 22% of it's time in an unaligned instruction trap.

I was going to try using netperf, instead, to see if it was a kernel problem rather than a userspace problem.

oprofile is enabled in the current build and installable as a package. I forgot to enable the unaligned trap perf monitor, however in the rc1 build.

(This is not a blocker for rc1 of cerowrt). The router runs at wire speeds at 100Mbit, even with only 4 buffers in the ethernet driver (however there's a problem with setting that that low, via ethtool, at gigE speeds, that locks up the port entirely. I'll file a bug on that shortly, with how to duplicate that exactly, see also the email on the lists)

ALSO:

Lots of iptables rules, using multiport matches (code in my Diffserv repo) also slowed the router down far more than expected. The new firewall code doesn't use multiport matches (although they could, I think, or I have the new syntax wrong), and I haven't had time to verify what happens with lots of ports in the latest cerowrt builds (I am opening more ports than usual, so that testing tools can abuse more stuff, like rsync)

Updated by Dave Täht almost 2 years ago

Would certainly like an oprofiling expert to look at this one, again, using netperf vs iperf to see if it is userspace or not.

I again failed to include the unaligned trap perf counter in rc4 however.

Updated by Dave Täht almost 2 years ago

  • Status changed from New to In Progress
  • Assignee set to Felix Fietkau

Felix has made enormous progress today in fixing the unaligned instruction traps.

<nbd> dtaht: i have another idea about the alignment issue [08:00]
<nbd> dtaht: i can try declaring the ip/tcp/udp header struct as __packed
<dtaht> hi [08:01]
<dtaht> just pulled into the train station
<nbd> i think gcc automatically generates unaligned access code then
<dtaht> don't know how bad the wifi will be
<dtaht> um, usually
<dtaht> __packed means that a byte and a 16 bit struct and a byte will get
packed together
<dtaht> which scares me mildly in this case
<nbd> gcc will not generate any implicit padding [08:02]
<nbd> but it'll also assume that the entire struct is unaligned
<dtaht> it will?
<nbd> yes, i only figured it out by accident
<nbd> when i was profiling ath9k, i noticed the descriptor access functions
were showing up with high percentage points [08:03]
<dtaht> ok. I guess I'm going to need more pedestals for godlike behavior...
<nbd> then i looked a the assembly and noticed that it didn't do aligned 32
bit access
<dtaht> hmm
<nbd> when i did __packed __aligned(4), it started doing 32 bit access again
<dtaht> this sounds like either a brilliant hack, or an instant system crash.
[08:04]
<dtaht> I mean, normal people don't go NEAR the ip header code. EVER.
<nbd> system crash is pretty much impossible with this
<dtaht> but I concede you are not normal in any respect
<dtaht> the thing is
<nbd> because the struct was carefully written to not rely on implicit padding
<nbd> because implicit padding is highly arch specific
<dtaht> I'm on a train, and wifi is spotty
<dtaht> do you need me to get you setup on huchra, or can I reuse andrews
build environent, or what? [08:05]
<nbd> i don't need huchra right now
<nbd> i'm compiling on my laptop anyway
<dtaht> ok, well, I do want to be able to get the same results as you [08:06]
<dtaht> did you think andrews's fixes to the TU stuff was worthwhile?
<nbd> i don't think it'll have significant effects at this point [08:07]
<nbd> especially not with the max retries override
<dtaht> I guess my core question is, can I get a rollup patch from you that is
sane, of all this stuff, so I don't have to replicate the bits flying
in loose formation?
<dtaht> ok.... (That was a dubious sounding ok....)
<nbd> this TU stuff is just a heuristic for determining the number of retries
[08:08]
<nbd> minimum retries per stage is 2 anyway
<dtaht> actually, better, could andrew and I get a rollup patch - on the bug
report 216 - of where you are at. Hopefully the next train I get on
will be less laggy.
<dtaht> and I can get a build done
<nbd> so if you cap the maximum number of retries, then there is only little
effect that this stuff can have
<dtaht> hell, might even just flash a router on the train
<dtaht> and take a stab at what the heck is wrong with STA mode
  • dtaht chortles
    <nbd> wow
    <nbd> cpu load went from 75% to 57% [08:09]
    <dtaht> for 'normal' firewall at 100Mbit?
    <nbd> yes
    <nbd> and i only changed two lines ;)
    <nbd> well, i might need to change a third line for udp ;)
    <dtaht> dude, if you care to rewrite the rules a bit to use the interface+
    syntax a little smarter, and get multiport working right, I think
    you'll finally be able to drive this thing at 300Mbit for real, from
    the gige port. [08:10]
    <dtaht> I would note I would be REALLY SCARED of that change and want to slam
    a lot of verification streams through it....
    <nbd> i'm 95% confident that this change won't break anything [08:11]
    <nbd> well, 99%
    <nbd> i knew an idea like this would show up if i waited long enough for it ;)
    [08:12]
    <dtaht> heh
  • dtaht applaudes
    <dtaht> (how do you spell applaud in french? German?) [08:13]
    <nbd> applaudieren
    <nbd> (= to applaud)
  • dtaht applaudieren
    <dtaht> um, to what other protocols could this apply? Across the board?
    <dtaht> does it work with ipv6? [08:14]
    <nbd> i added ip, udp and tcp
    <nbd> ipv6 can be fixed the same way too
    <nbd> but there are still some remaining unaligned exceptions
    <dtaht> s/can/needs to be/
    <nbd> so i can probably improve things even more if i track them down
    <dtaht> I am setting up a testlab at isc with native ipv6, next week, I hope
  • dtaht makes puppy dog eyes at nbd to prioritize making ipv6 just as happy as
    ipv4 [08:15]
    <nbd> already did that [08:16]
    <nbd> there's still 7 unaligned exceptions per packet [08:17]
    <nbd> at least with icmp
    <nbd> ah, nice. debugfs allows me to force the kernel to print a stack trace
    on every unaligned exception [08:20]
    <nbd> good thing i compiled my kernel with debug info
    <dtaht> way to go
    <nbd> ok, so the checksum still needs fixing
    <dtaht> csum_partial? [08:25]
    <dtaht> I'd seen multiple people hack that up to do the right thing
    <dtaht> never understood why there wasnt a gnerica csum_partial_unaligned
    people could use
  • dtaht notes train is sucking up tcp [08:26]
    <dtaht> I have to switch trains shortly
<nbd> they didn't add an unaligned version, because there are only very few
devices that have both crappy ethernet chips with alignment limitations
and inefficient unaligned access
<nbd> so they were probably hoping to never need crap like that
<dtaht> nbd: still, felix, I've seen a modified csum_partial go by at least 3
times for a couple arches, and it bugs me it's not in there...
[08:32]
<nbd> right now i'm fixing ip_fast_csum
<nbd> which is responsible for 4 out of 7 unaligned exceptions [08:33]
<npmapn> I've read about the progress you've made. [08:34]
<npmapn> This is amazing!
<nbd> two other exceptions are in ipv4_pkt_to_tuple
<nbd> also easy to fix
<dtaht> yea, I immediately spotted that csum_partial thing last time. Been
there. Done that. :) Didn't realize it was also doing that many other
places that were busted....
<dtaht> it has been a very insanely productive couple days. Meeting andrew was
amazing.... Then the guy that wrote the original mac80211
surfaced... And nbd is back and restored to life, after chasing girls
for weeks in Indonesia... :) [08:35]
<npmapn> That always helps. ;)
<nbd> btw. this is the diffstat of my alignment change (including checksum
fix): [08:41]
<nbd> 6 files changed, 23 insertions(+), 17 deletions(-)
<nbd> currently building it for a quick test run
<dtaht> sounds worth polishing and pushing to mainline
<nbd> no
<nbd> definitely not
<nbd> it makes everything not suffering from this issue perform worse [08:42]
<nbd> and there's no good way around that
<nbd> so it'll stay an openwrt specific hack
<dtaht> ok
<dtaht> judging from the results you got on the first pass, I'm expecting to
be really impressed with this next pass
<dtaht> :) [08:44]
<npmapn> This set of patches will stay in the ar71xx patches directory, I
suppose.
<nbd> well, the first one already took care of most of the issues
<nbd> so the second patch will only be a minor improvement on top of that
<dtaht> well, csum_partial back in the old days used to be a nightmare
<dtaht> but I'm REALLY dating myself in even talking about it [08:45]
<dtaht> (we're talking an embedded board I worked on in the mid 80s)
<nbd> :)
<dtaht> racal-interlan something or other
<dtaht> offloaded the entire tcp stack into like 320k on a pci board
<dtaht> no... It was ISA - 16 bit [08:46]
<dtaht> but at the time, it was impossible to run tcp/ip well under dos
  • dtaht pulls out a few gray hairs
    <dtaht> ok, I gotta switch trains now [08:47]
    <dtaht> add some notes to the unaligned bug report? The patch maybe?
    <dtaht> will be back online in 90 minutes
    <nbd> yay, no more unaligned exceptions in the hotpath [08:48]
    <dtaht> got a number for cpu usage at 100mbit? [08:49]
    <dtaht> got 2 minutes before I get off train
    <nbd> 55%
    <dtaht> can you test and drive gige?
    <nbd> later
    <nbd> my current test endpoint only does 100
    <dtaht> I have 3 routers on me [08:50]
    <dtaht> It would be gas to run 'em on the next train
    <dtaht> gotta go
  • dtaht Applaus!!
    <npmapn_> 40-50% CPU usage at 92mbps with PPPoE wired<>wired [10:32]
    <npmapn_> tx ring buffe for eth0 / eth1 : 8 [10:33]
    <npmapn_> buffer *
    <nbd> what did you get before for the same test?
    <npmapn_> Let me check in the log, I remember I have run the same test before.
    [10:34]
    <npmapn_> 88-92mbps at 60-75% CPU usage wired->wired [10:35]
    <dtaht> wow with 8 buffers?
    <dtaht> cooooool
    <dtaht> I think the theory that 64+ buffers are required for 100+Mbit
    performance has been thoroughly debunked. [10:36]
  • dtaht wonders what 4 buffers would do
    <npmapn_> dtaht: It would be just about the same.
    <nbd> depends
    <nbd> npmapn_: what's the txqueue length?
    <npmapn_> 500 for eth0/eth1/pppoe-wan
    <nbd> so limiting the buffers to 8 doesn't really do much then [10:37]
    <npmapn_> 2.6.39 is MUCH better at handling larger tx queues
    <nbd> if you still have 500 buffers in the network stack
    <dtaht> txqueuelen Is 16 by default
    <dtaht> it's the latency and qos that starts working again < 32 total
    unmanaged buffers
    <dtaht> 16 on cerowrt
    <dtaht> also, incidentally
    <dtaht> tcp_low_latency parameter seems to help [10:38]
    <dtaht> it's in proc somewhere
    <npmapn_> /proc/sys/net/ipv4/tcp_low_latency
    <npmapn_> I get 75% CPU usage with 200 for txqueuelen for all three
    interfaces. [10:39]
    <npmapn_> 50-65% with the same txqueuelen and tcp_low_latency enabled. [10:40]
    <dtaht> realllly [10:42]
    <dtaht> I was thinking I'd make that being on as the default.
    <dtaht> nmapn [10:43]
    <dtaht> what is your actuall bandwidth at that level
    <dtaht> and what are your latency under load results
    <dtaht> (e.g. Pinging while doing all that)
    <npmapn_> dtaht: If you're away from home on various wifi networks, please
    give yeah tcp congestion a shot.
    <dtaht> heh. I'm on a train now
    <npmapn_> Let me check. Should I ping something like my DNS or something far
    away? [10:44]
    <dtaht> is that god enough
    <dtaht> something on the path that is not what you are getting the big fat
    test stream from
    <npmapn_> yeah tcp? It should help you a lot.
    <dtaht> 8.8.8.8 for example.
    <npmapn_> reply time from 8.8.8.8 only with IRC open: 34-38ms [10:46]
  • dtaht starts a drum roll
    <npmapn_> reply time from 8.8.8.8 with 92mbps downstream traffic: 38-42 with
    one reply of 52 ms [10:47]
    <dtaht> AAAAAWESEOM [10:50]
    <dtaht> with txqueuelen of? And tx queue of? [10:51]
    <npmapn_> txqueuelen 500 for all 3 interfaces and tx buffer of 8
    <dtaht> any qos at all?
    <dtaht> heh [10:52]
    <npmapn_> Yes, the one you've got from me. Let me remove it and try again.
    <dtaht> going through a tunnel now
    <dtaht> that's why the long txqueue is working so well.
    <dtaht> (or so I guess)
    <npmapn_> Yes, that's right. [10:55]
    <dtaht> with your qos turned off, ping
    goes to hell [10:56]
    <dtaht> ? [10:57]
    <npmapn_> It looks like my ISP is
    doing something right. QoS
    doesn't make it any worse.
    <npmapn_> I mean lack of QoS doesn't
    make it any worse.
    <npmapn_> I removed all qdiscs,
    checked that I've got only
    pfifo_fast on all interfaces
    and I also ran iptables -F
    -t mangle. [10:58]
    <dtaht> my next big test is hard to setup [11:03]
    <dtaht> but you'd want 8-10 big streams going
    <npmapn_>
    http://hackingnasdaq.blogspot.com/2010/01/myth-of-procsysnetipv4tcplowlatency.html
    <dtaht> and see what happens
    <dtaht> NOT MEASURED ON A SLOW, bandwidth constrained router [11:04]
    <dtaht> would love to have him repeat his tests on cerowrt
    <npmapn_> openwrt trunk is rock solid with these parameters. [11:08]
    <npmapn_> I can give cerowrt a shot in the weekend. [11:09]
    <npmapn_> This is a good test scenario because PPPoE adds some overhead and
    upstream hardware is compensating for this somehow. [11:10]
    <dtaht> ij [11:26]
    <dtaht> I will have a new build by then I hope

<dtaht> felix, can you assemble your patch set to date and get it to me and
andrew?

Updated by Dave Täht almost 2 years ago

felix pushed today's work into openwrt head.

Updated by David Taht almost 2 years ago

---------- Forwarded message ----------
From: Dave Taht <>
Date: Tue, Aug 2, 2011 at 11:03 PM
Subject: some ar71xx tests [#195]
To: Andrew McGregor <>, Felix Fietkau <>

I built a version of cerowrt from openwrt head, in andrew's dir with netperf
installed by default...

Going with the defaults:

netperf -H the_other_router from_one_router

I get 143Mbit

Going router-router over gigE (lan to wan), with tcp_low_latency turned on,
on both sides, with:

Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380  16384  16384    60.00     218.20

so tcp_low_latency looks like a win with westwood+

My laptop doesn't do gigE, so can't test that here.

All these tests are with the default firewall rules for cerowrt, and nat
turned off on the one master router... and ethernet buffers of 4, and
txqueuelen of 8.

For giggles, I did a test with cubic:

root@OpenWrt:/proc/sys/net/ipv4# fg
netperf -l 60 -H 172.30.42.33
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380  16384  16384    60.00     228.26

And last, with cubic, and firewall rules turned off:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
gw.home.lan (172.30.42.33) port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380  16384  16384    60.00     257.75

I'll try diffserv again in the morning.

Updated by Dave Täht almost 2 years ago

interestingly, the tcp stream going from laptop (100Mbit) - router (gige) router only has about a ~5ms latency for ping

the udp_stream test:

64 bytes from 172.31.42.33: icmp_req=525 ttl=64 time=41.0 ms
64 bytes from 172.31.42.33: icmp_req=526 ttl=64 time=37.3 ms
64 bytes from 172.31.42.33: icmp_req=527 ttl=64 time=33.0 ms

and crashes and burns with a -l 120 - never ending

with -l 10

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.33 (172.30.42.33) port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec

126976 65507 10.01 1842 0 96.47
112640 10.01 0 0.00

Updated by David Taht almost 2 years ago

Just to keep bug tracking..

I pulled from 27917.

I don't see any commits that could have caused this breakage, except maybe
that I'm building against 2.6.39.3, and may have differences in my config
file from yours that are important.

could I encourage you to hop on huchra?

Also I'm curious if YOUR ethernet build showed this:

T
root@OpenWrt:~# dmesg | grep eth0
eth0: Atheros AG71xx at 0xb9000000, irq 4
eth0: unable to find MII bus on device 'rtl8366s'
eth0: Atheros AG71xx at 0xba000000, irq 5
eth0: unable to find MII bus on device 'rtl8366s'

I thought I was building from the same .config as you with support for the
1000hz and no_hz options - the above looks like a faster clock tick problem
to me... but you should have seen that too.

So I can revert to 2.6.39.2, revert all that

and/or restart from scratch from my patch set (Which only touches one tiny
bit of the eth driver and NOTHING of the wireless driver) But with a little
data perhaps that won't be neccessary.

I'll stick around here a while longer, cleaning up, and getting a router
from thursday online...

After I get some sleep.

On Fri, Aug 5, 2011 at 7:00 PM, Andrew McGregor <>wrote:

Huh? I generated it based on openwrt head r27912

On 5/08/2011, at 5:01 PM, Dave Taht wrote:

so this patch duplicates some, but not all, of what is now in openwrt head? It only partially applies.

Ripping it out again, and going off to hopefully find the problem in the ethernet driver instead....

On Fri, Aug 5, 2011 at 4:00 PM, Dave Taht <> wrote:

---------- Forwarded message ---------- From: Andrew McGregor <> Date: Fri, Aug 5, 2011 at 3:57 PM Subject: Resend of patch To: Dave Taht <>

Ok, this patch applies on top of Felix' last set of changes, and produces really spectacular results: about 2x better latency on 11n than before we both started on it at basically no performance cost, and fairly decent and consistent latency on 11g at a performance cost so small I can't measure it.

It may be worth fiddling a little more with the tunables, but this is pretty good for now.

(btw: drop this file in the packages/mac80211 directory)

Updated by David Taht almost 2 years ago

Felix told me that your patch had 1/3 stuff he'd done already.

---------- Forwarded message ----------
From: Andrew McGregor <>
Date: Fri, Aug 5, 2011 at 8:59 PM
Subject: Re: Resend of patch
To: Dave Taht <>

I don't know if ethernet was working, I never checked.

I was using 2.6.39.2 however, so I suspect an upstream change is causing the
patch to not apply. It's not big, you could put it in by hand easily
enough. The last chunk you could also change the other case in that if
statement to something smaller, although that won't happen on a WNDR3700

About to get on a plane to NZ, so I'll be out of circulation for around 36
hours.

On 5/08/2011, at 6:25 PM, Dave Taht wrote:

I pulled from 27917.

I don't see any commits that could have caused this breakage, except maybe
that I'm building against 2.6.39.3, and may have differences in my config
file from yours that are important.

could I encourage you to hop on huchra?

Also I'm curious if YOUR ethernet build showed this:

T
root@OpenWrt:~# dmesg | grep eth0
eth0: Atheros AG71xx at 0xb9000000, irq 4
eth0: unable to find MII bus on device 'rtl8366s'
eth0: Atheros AG71xx at 0xba000000, irq 5
eth0: unable to find MII bus on device 'rtl8366s'

I thought I was building from the same .config as you with support for the
1000hz and no_hz options - the above looks like a faster clock tick problem
to me... but you should have seen that too.

So I can revert to 2.6.39.2, revert all that

and/or restart from scratch from my patch set (Which only touches one tiny
bit of the eth driver and NOTHING of the wireless driver) But with a little
data perhaps that won't be neccessary.

I'll stick around here a while longer, cleaning up, and getting a router
from thursday online...

After I get some sleep.
On Fri, Aug 5, 2011 at 7:00 PM, Andrew McGregor <>wrote:

Huh? I generated it based on openwrt head r27912

On 5/08/2011, at 5:01 PM, Dave Taht wrote:

so this patch duplicates some, but not all, of what is now in openwrt head? It only partially applies.

Ripping it out again, and going off to hopefully find the problem in the ethernet driver instead....

On Fri, Aug 5, 2011 at 4:00 PM, Dave Taht <> wrote:

---------- Forwarded message ---------- From: Andrew McGregor <> Date: Fri, Aug 5, 2011 at 3:57 PM Subject: Resend of patch To: Dave Taht <>

Ok, this patch applies on top of Felix' last set of changes, and produces really spectacular results: about 2x better latency on 11n than before we both started on it at basically no performance cost, and fairly decent and consistent latency on 11g at a performance cost so small I can't measure it.

It may be worth fiddling a little more with the tunables, but this is pretty good for now.

(btw: drop this file in the packages/mac80211 directory)

Updated by David Taht almost 2 years ago

I would really like it if someone from a different time zone and sleeping
schedule
were to download the latest wndr3700v2 development build from here:

http://huchra.bufferbloat.net/~andrewm/cerowrt/

I hope this build finishes up the last of the truly epic adventure that has
taken place on bug:

http://www.bufferbloat.net/issues/216

If it works, do play with netperf 2.5.0 on the wireless interface.

I note that the fixes to #195 seem to have (maybe) messed up the ethernet
interface - or the change to a faster clock or tickless or 2.6.39.3 vs
.2.... OR It may be the router I have on me!

So let us know on this email (bug reporting interface is cc'd)

That's all the news from tent #34. Good night, and good luck.

Updated by Dave Täht almost 2 years ago

  1. The unrecognisable ethernet port problem exists on more than one router, so it wasn't my hardware, as I hoped
  2. For my next series of tests, I am first reducing the clock to 100HZ, and staying tickless. My best guess as to the source of this problem is something hard coded in the init routine for the phy.

I shan't document what I'll do next as it's pretty obvious the bisecting I'd need to do, nor, do I know, what the real effect of a 1000HZ vs 100HZ clock is on a tickless system, anymore.

Updated by Dave Täht almost 2 years ago

I so love it when I guess right.

100HZ clock + tickless + 2.6.39.3... just works.

I have a backlog of other patches that needed to land and a ton of src to clean up, but we're back in business.

Updated by Dave Täht almost 2 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to Normal
  • % Done changed from 0 to 80

OK, so the ethernet seems a great deal faster than it was, and I will start doing comprehensive testing vs the factory firmware by the weekend.

"With the right eyeballs, all bugs are shallow"

Updated by Yuri Bene almost 2 years ago

Dave Täht wrote:

OK, so the ethernet seems a great deal faster than it was, and I will start doing comprehensive testing vs the factory firmware by the weekend.

Is there an OpenWRT build containing the patch that I can try?

Updated by Dave Täht over 1 year ago

  • Status changed from Feedback to Closed
  • Target version changed from 1st Public Cerowrt release to 14
  • % Done changed from 80 to 100

We are getting absolutely spectacular ethernet performance out of this puppy now.

Updated by Dave Täht about 1 year ago

  • Target version changed from 14 to 1st Public Cerowrt release

Also available in: Atom PDF