Bug #195

ag71xx ethernet appears to be doing many unaligned transfers

Added by Dave Täht on Jun 4, 2011. Updated on Apr 21, 2012.
Closed Normal Felix Fietkau

Description

do_ade accounts for over 22% of the runtime on stuff that stresses the ethernet driver.

this is trapping on unaligned accesses. from somewhere.

History

Updated by Dave Täht on Jul 14, 2011.
Updated by Dave Täht on Jul 16, 2011.
I have not had time to run oprofile on the code of late, but when I was testing iperf at 100+Mbit per second on ethernet, over ethernet only, (not over wireless) it was spending 22% of it’s time in an unaligned instruction trap.

I was going to try using netperf, instead, to see if it was a kernel problem rather than a userspace problem.

oprofile is enabled in the current build and installable as a package. I forgot to enable the unaligned trap perf monitor, however in the rc1 build.

(This is not a blocker for rc1 of cerowrt). The router runs at wire speeds at 100Mbit, even with only 4 buffers in the ethernet driver (however there’s a problem with setting that that low, via ethtool, at gigE speeds, that locks up the port entirely. I’ll file a bug on that shortly, with how to duplicate that exactly, see also the email on the lists)

ALSO:

Lots of iptables rules, using multiport matches (code in my Diffserv repo) also slowed the router down far more than expected. The new firewall code doesn’t use multiport matches (although they could, I think, or I have the new syntax wrong), and I haven’t had time to verify what happens with lots of ports in the latest cerowrt builds (I am opening more ports than usual, so that testing tools can abuse more stuff, like rsync)

Updated by Dave Täht on Jul 31, 2011.
Would certainly like an oprofiling expert to look at this one, again, using netperf vs iperf to see if it is userspace or not.

I again failed to include the unaligned trap perf counter in rc4 however.

Updated by Dave Täht on Aug 2, 2011.
Felix has made enormous progress today in fixing the unaligned instruction traps.

dtaht: i have another idea about the alignment issue [08:00]
dtaht: i can try declaring the ip/tcp/udp header struct as __packed
hi [08:01]
just pulled into the train station
i think gcc automatically generates unaligned access code then
don’t know how bad the wifi will be
um, usually
__packed means that a byte and a 16 bit struct and a byte will get
packed together
which scares me mildly in this case
gcc will not generate any implicit padding [08:02]
but it’ll also assume that the entire struct is unaligned
it will?
yes, i only figured it out by accident
when i was profiling ath9k, i noticed the descriptor access functions
were showing up with high percentage points [08:03]
ok. I guess I’m going to need more pedestals for godlike behavior…
then i looked a the assembly and noticed that it didn’t do aligned 32
bit access
hmm
when i did _packed_aligned(4), it started doing 32 bit access again
this sounds like either a brilliant hack, or an instant system crash.
[08:04]
I mean, normal people don’t go NEAR the ip header code. EVER.
system crash is pretty much impossible with this
but I concede you are not normal in any respect
the thing is
because the struct was carefully written to not rely on implicit padding
because implicit padding is highly arch specific
I’m on a train, and wifi is spotty
do you need me to get you setup on huchra, or can I reuse andrews
build environent, or what? [08:05]
i don’t need huchra right now
i’m compiling on my laptop anyway
ok, well, I do want to be able to get the same results as you [08:06]
did you think andrews’s fixes to the TU stuff was worthwhile?
i don’t think it’ll have significant effects at this point [08:07]
especially not with the max retries override
I guess my core question is, can I get a rollup patch from you that is
sane, of all this stuff, so I don’t have to replicate the bits flying
in loose formation?
ok…. (That was a dubious sounding ok….)
this TU stuff is just a heuristic for determining the number of retries
[08:08]
minimum retries per stage is 2 anyway
actually, better, could andrew and I get a rollup patch - on the bug
report 216 - of where you are at. Hopefully the next train I get on
will be less laggy.
and I can get a build done
so if you cap the maximum number of retries, then there is only little
effect that this stuff can have
hell, might even just flash a router on the train
and take a stab at what the heck is wrong with STA mode

  • dtaht chortles
    wow
    cpu load went from 75% to 57% [08:09]
    for ‘normal’ firewall at 100Mbit?
    yes
    and i only changed two lines ;)
    well, i might need to change a third line for udp ;)
    dude, if you care to rewrite the rules a bit to use the interface+
    syntax a little smarter, and get multiport working right, I think
    you’ll finally be able to drive this thing at 300Mbit for real, from
    the gige port. [08:10]
    I would note I would be REALLY SCARED of that change and want to slam
    a lot of verification streams through it….
    i’m 95% confident that this change won’t break anything [08:11]
    well, 99%
    i knew an idea like this would show up if i waited long enough for it ;)
    [08:12]
    heh
  • dtaht applaudes
    (how do you spell applaud in french? German?) [08:13]
    applaudieren
    (= to applaud)
  • dtaht applaudieren
    um, to what other protocols could this apply? Across the board?
    does it work with ipv6? [08:14]
    i added ip, udp and tcp
    ipv6 can be fixed the same way too
    but there are still some remaining unaligned exceptions
    s/can/needs to be/
    so i can probably improve things even more if i track them down
    I am setting up a testlab at isc with native ipv6, next week, I hope
  • dtaht makes puppy dog eyes at nbd to prioritize making ipv6 just as happy as
    ipv4 [08:15]
    already did that [08:16]
    there’s still 7 unaligned exceptions per packet [08:17]
    at least with icmp
    ah, nice. debugfs allows me to force the kernel to print a stack trace
    on every unaligned exception [08:20]
    good thing i compiled my kernel with debug info
    way to go
    ok, so the checksum still needs fixing
    csum_partial? [08:25]
    I’d seen multiple people hack that up to do the right thing
    never understood why there wasnt a gnerica csum_partial_unaligned
    people could use
  • dtaht notes train is sucking up tcp [08:26]
    I have to switch trains shortly

they didn’t add an unaligned version, because there are only very few
devices that have both crappy ethernet chips with alignment limitations
and inefficient unaligned access
so they were probably hoping to never need crap like that
nbd: still, felix, I’ve seen a modified csum_partial go by at least 3
times for a couple arches, and it bugs me it’s not in there…
[08:32]
right now i’m fixing ip_fast_csum
which is responsible for 4 out of 7 unaligned exceptions [08:33]
I’ve read about the progress you’ve made. [08:34]
This is amazing!
two other exceptions are in ipv4_pkt_to_tuple
also easy to fix
yea, I immediately spotted that csum_partial thing last time. Been
there. Done that. :) Didn’t realize it was also doing that many other
places that were busted….
it has been a very insanely productive couple days. Meeting andrew was
amazing…. Then the guy that wrote the original mac80211
surfaced… And nbd is back and restored to life, after chasing girls
for weeks in Indonesia… :) [08:35]
That always helps. ;)
btw. this is the diffstat of my alignment change (including checksum
fix): [08:41]
6 files changed, 23 insertions(+), 17 deletions(-)
currently building it for a quick test run
sounds worth polishing and pushing to mainline
no
definitely not
it makes everything not suffering from this issue perform worse [08:42]
and there’s no good way around that
so it’ll stay an openwrt specific hack
ok
judging from the results you got on the first pass, I’m expecting to
be really impressed with this next pass
:) [08:44]
This set of patches will stay in the ar71xx patches directory, I
suppose.
well, the first one already took care of most of the issues
so the second patch will only be a minor improvement on top of that
well, csum_partial back in the old days used to be a nightmare
but I’m REALLY dating myself in even talking about it [08:45]
(we’re talking an embedded board I worked on in the mid 80s)
:)
racal-interlan something or other
offloaded the entire tcp stack into like 320k on a pci board
no… It was ISA - 16 bit [08:46]
but at the time, it was impossible to run tcp/ip well under dos

  • dtaht pulls out a few gray hairs
    ok, I gotta switch trains now [08:47]
    add some notes to the unaligned bug report? The patch maybe?
    will be back online in 90 minutes
    yay, no more unaligned exceptions in the hotpath [08:48]
    got a number for cpu usage at 100mbit? [08:49]
    got 2 minutes before I get off train
    55%
    can you test and drive gige?
    later
    my current test endpoint only does 100
    I have 3 routers on me [08:50]
    It would be gas to run ‘em on the next train
    gotta go
  • dtaht Applaus![]()
    40-50% CPU usage at 92mbps with PPPoE wired<>wired [10:32]
    tx ring buffe for eth0 / eth1 : 8 [10:33]
    buffer *
    what did you get before for the same test?
    Let me check in the log, I remember I have run the same test before.
    [10:34]
    88-92mbps at 60-75% CPU usage wired->wired [10:35]
    wow with 8 buffers?
    cooooool
    I think the theory that 64+ buffers are required for 100+Mbit
    performance has been thoroughly debunked. [10:36]
  • dtaht wonders what 4 buffers would do
    dtaht: It would be just about the same.
    depends
    npmapn_: what’s the txqueue length?
    500 for eth0/eth1/pppoe-wan
    so limiting the buffers to 8 doesn’t really do much then [10:37]
    2.6.39 is MUCH better at handling larger tx queues
    if you still have 500 buffers in the network stack
    txqueuelen Is 16 by default
    it’s the latency and qos that starts working again < 32 total
    unmanaged buffers
    16 on cerowrt
    also, incidentally
    tcp_low_latency parameter seems to help [10:38]
    it’s in proc somewhere
    /proc/sys/net/ipv4/tcp_low_latency
    I get 75% CPU usage with 200 for txqueuelen for all three
    interfaces. [10:39]
    50-65% with the same txqueuelen and tcp_low_latency enabled. [10:40]
    realllly [10:42]
    I was thinking I’d make that being on as the default.
    nmapn [10:43]
    what is your actuall bandwidth at that level
    and what are your latency under load results
    (e.g. Pinging while doing all that)
    dtaht: If you’re away from home on various wifi networks, please
    give yeah tcp congestion a shot.
    heh. I’m on a train now
    Let me check. Should I ping something like my DNS or something far
    away? [10:44]
    is that god enough
    something on the path that is not what you are getting the big fat
    test stream from
    yeah tcp? It should help you a lot.
    8.8.8.8 for example.
    reply time from 8.8.8.8 only with IRC open: 34-38ms [10:46]
  • dtaht starts a drum roll
    reply time from 8.8.8.8 with 92mbps downstream traffic: 38-42 with
    one reply of 52 ms [10:47]
    AAAAAWESEOM [10:50]
    with txqueuelen of? And tx queue of? [10:51]
    txqueuelen 500 for all 3 interfaces and tx buffer of 8
    any qos at all?
    heh [10:52]
    Yes, the one you’ve got from me. Let me remove it and try again.
    going through a tunnel now
    that’s why the long txqueue is working so well.
    (or so I guess)
    Yes, that’s right. [10:55]
    with your qos turned off, ping
    goes to hell [10:56]
    ? [10:57]
    It looks like my ISP is
    doing something right. QoS
    doesn’t make it any worse.
    I mean lack of QoS doesn’t
    make it any worse.
    I removed all qdiscs,
    checked that I’ve got only
    pfifo_fast on all interfaces
    and I also ran iptables -F
    -t mangle. [10:58]
    my next big test is hard to setup [11:03]
    but you’d want 8-10 big streams going

    http://hackingnasdaq.blogspot.com/2010/01/myth-of-procsysnetipv4tcplowlatency.html and see what happens
    NOT MEASURED ON A SLOW, bandwidth constrained router [11:04]
    would love to have him repeat his tests on cerowrt
    openwrt trunk is rock solid with these parameters. [11:08]
    I can give cerowrt a shot in the weekend. [11:09]
    This is a good test scenario because PPPoE adds some overhead and
    upstream hardware is compensating for this somehow. [11:10]
    ij [11:26]
    I will have a new build by then I hope

felix, can you assemble your patch set to date and get it to me and
andrew?

Updated by Dave Täht on Aug 2, 2011.
felix pushed today’s work into openwrt head.
Updated by David Taht on Aug 2, 2011.
———- Forwarded message ———-
From: Dave Taht dave.taht@gmail.com
Date: Tue, Aug 2, 2011 at 11:03 PM
Subject: some ar71xx tests [#195]
To: Andrew McGregor andrewmcgr@gmail.com, Felix Fietkau nbd@openwrt.org

I built a version of cerowrt from openwrt head, in andrew’s dir with netperf
installed by default…

Going with the defaults:

netperf -H the_other_router from_one_router

I get 143Mbit

Going router-router over gigE (lan to wan), with tcp_low_latency turned on,
on both sides, with:

Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10\^6bits/sec

87380 16384 16384 60.00 218.20

so tcp_low_latency looks like a win with westwood+

My laptop doesn’t do gigE, so can’t test that here.

All these tests are with the default firewall rules for cerowrt, and nat
turned off on the one master router… and ethernet buffers of 4, and
txqueuelen of 8.

For giggles, I did a test with cubic:

root@OpenWrt:/proc/sys/net/ipv4# fg
netperf -l 60 -H 172.30.42.33
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10\^6bits/sec

87380 16384 16384 60.00 228.26

And last, with cubic, and firewall rules turned off:

MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
gw.home.lan (172.30.42.33) port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10\^6bits/sec

87380 16384 16384 60.00 257.75

I’ll try diffserv again in the morning.

Updated by Dave Täht on Aug 2, 2011.
interestingly, the tcp stream going from laptop (100Mbit) - router (gige) router only has about a ~5ms latency for ping

the udp_stream test:

64 bytes from 172.31.42.33: icmp_req=525 ttl=64 time=41.0 ms
64 bytes from 172.31.42.33: icmp_req=526 ttl=64 time=37.3 ms
64 bytes from 172.31.42.33: icmp_req=527 ttl=64 time=33.0 ms

and crashes and burns with a l 120 never ending

with -l 10

MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 172.30.42.33 (172.30.42.33) port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10\^6bits/sec

126976 65507 10.01 1842 0 96.47
112640 10.01 0 0.00

Updated by David Taht on Aug 5, 2011.
Just to keep bug tracking..

I pulled from 27917.

I don’t see any commits that could have caused this breakage, except maybe
that I’m building against 2.6.39.3, and may have differences in my config
file from yours that are important.

could I encourage you to hop on huchra?

Also I’m curious if YOUR ethernet build showed this:

T
root@OpenWrt:~# dmesg | grep eth0
eth0: Atheros AG71xx at 0xb9000000, irq 4
eth0: unable to find MII bus on device ‘rtl8366s’
eth0: Atheros AG71xx at 0xba000000, irq 5
eth0: unable to find MII bus on device ‘rtl8366s’

I thought I was building from the same .config as you with support for the
1000hz and no_hz options - the above looks like a faster clock tick problem
to me… but you should have seen that too.

So I can revert to 2.6.39.2, revert all that

and/or restart from scratch from my patch set (Which only touches one tiny
bit of the eth driver and NOTHING of the wireless driver) But with a little
data perhaps that won’t be neccessary.

I’ll stick around here a while longer, cleaning up, and getting a router
from thursday online…

After I get some sleep.

On Fri, Aug 5, 2011 at 7:00 PM, Andrew McGregor andrewmcgr@gmail.comwrote:

Huh? I generated it based on openwrt head r27912

On 5/08/2011, at 5:01 PM, Dave Taht wrote:

so this patch duplicates some, but not all, of what is now in openwrt head?
It only partially applies.

Ripping it out again, and going off to hopefully find the problem in the
ethernet driver instead….


On Fri, Aug 5, 2011 at 4:00 PM, Dave Taht dave.taht@gmail.com wrote:

>
>
> ———- Forwarded message ———-
> From: Andrew McGregor andrewmcgr@gmail.com
> Date: Fri, Aug 5, 2011 at 3:57 PM
> Subject: Resend of patch
> To: Dave Taht dave.taht@gmail.com
>
>
> Ok, this patch applies on top of Felix’ last set of changes, and produces
> really spectacular results: about 2x better latency on 11n than before we
> both started on it at basically no performance cost, and fairly decent and
> consistent latency on 11g at a performance cost so small I can’t measure it.
>
> It may be worth fiddling a little more with the tunables, but this is
> pretty good for now.
>
> (btw: drop this file in the packages/mac80211 directory)
>
>
>
>

Updated by David Taht on Aug 6, 2011.
Felix told me that your patch had 13 stuff he’d done already.

———- Forwarded message ———-
From: Andrew McGregor andrewmcgr@gmail.com
Date: Fri, Aug 5, 2011 at 8:59 PM
Subject: Re: Resend of patch
To: Dave Taht dave.taht@gmail.com

I don’t know if ethernet was working, I never checked.

I was using 2.6.39.2 however, so I suspect an upstream change is causing the
patch to not apply. It’s not big, you could put it in by hand easily
enough. The last chunk you could also change the other case in that if
statement to something smaller, although that won’t happen on a WNDR3700

About to get on a plane to NZ, so I’ll be out of circulation for around 36
hours.

On 5/08/2011, at 6:25 PM, Dave Taht wrote:

I pulled from 27917.

I don’t see any commits that could have caused this breakage, except maybe
that I’m building against 2.6.39.3, and may have differences in my config
file from yours that are important.

could I encourage you to hop on huchra?

Also I’m curious if YOUR ethernet build showed this:

T
root@OpenWrt:~# dmesg | grep eth0
eth0: Atheros AG71xx at 0xb9000000, irq 4
eth0: unable to find MII bus on device ‘rtl8366s’
eth0: Atheros AG71xx at 0xba000000, irq 5
eth0: unable to find MII bus on device ‘rtl8366s’

I thought I was building from the same .config as you with support for the
1000hz and no_hz options - the above looks like a faster clock tick problem
to me… but you should have seen that too.

So I can revert to 2.6.39.2, revert all that

and/or restart from scratch from my patch set (Which only touches one tiny
bit of the eth driver and NOTHING of the wireless driver) But with a little
data perhaps that won’t be neccessary.

I’ll stick around here a while longer, cleaning up, and getting a router
from thursday online…

After I get some sleep.
On Fri, Aug 5, 2011 at 7:00 PM, Andrew McGregor andrewmcgr@gmail.comwrote:

Huh? I generated it based on openwrt head r27912

On 5/08/2011, at 5:01 PM, Dave Taht wrote:

so this patch duplicates some, but not all, of what is now in openwrt head?
It only partially applies.

Ripping it out again, and going off to hopefully find the problem in the
ethernet driver instead….


On Fri, Aug 5, 2011 at 4:00 PM, Dave Taht dave.taht@gmail.com wrote:

>
>
> ———- Forwarded message ———-
> From: Andrew McGregor andrewmcgr@gmail.com
> Date: Fri, Aug 5, 2011 at 3:57 PM
> Subject: Resend of patch
> To: Dave Taht dave.taht@gmail.com
>
>
> Ok, this patch applies on top of Felix’ last set of changes, and produces
> really spectacular results: about 2x better latency on 11n than before we
> both started on it at basically no performance cost, and fairly decent and
> consistent latency on 11g at a performance cost so small I can’t measure it.
>
> It may be worth fiddling a little more with the tunables, but this is
> pretty good for now.
>
> (btw: drop this file in the packages/mac80211 directory)
>
>
>
>

Updated by David Taht on Aug 6, 2011.
I would really like it if someone from a different time zone and sleeping
schedule
were to download the latest wndr3700v2 development build from here:

http://huchra.bufferbloat.net/~andrewm/cerowrt/

I hope this build finishes up the last of the truly epic adventure that has
taken place on bug:

http://www.bufferbloat.net/issues/216

If it works, do play with netperf 2.5.0 on the wireless interface.

I note that the fixes to #195 seem to have (maybe) messed up the ethernet
interface - or the change to a faster clock or tickless or 2.6.39.3 vs
.2…. OR It may be the router I have on me!

So let us know on this email (bug reporting interface is cc’d)

That’s all the news from tent #34. Good night, and good luck.

Updated by Dave Täht on Aug 8, 2011.
  1. The unrecognisable ethernet port problem exists on more than one router, so it wasn’t my hardware, as I hoped
  2. For my next series of tests, I am first reducing the clock to 100HZ, and staying tickless. My best guess as to the source of this problem is something hard coded in the init routine for the phy.

I shan’t document what I’ll do next as it’s pretty obvious the bisecting I’d need to do, nor, do I know, what the real effect of a 1000HZ vs 100HZ clock is on a tickless system, anymore.

Updated by Dave Täht on Aug 8, 2011.
I so love it when I guess right.

100HZ clock + tickless + 2.6.39.3… just works.

I have a backlog of other patches that needed to land and a ton of src to clean up, but we’re back in business.

Updated by Dave Täht on Aug 9, 2011.
OK, so the ethernet seems a great deal faster than it was, and I will start doing comprehensive testing vs the factory firmware by the weekend.

“With the right eyeballs, all bugs are shallow”

Updated by Yuri Bene on Aug 12, 2011.
Dave Täht wrote:
> OK, so the ethernet seems a great deal faster than it was, and I will start doing comprehensive testing vs the factory firmware by the weekend.

Is there an OpenWRT build containing the patch that I can try?

Updated by Dave Täht on Sep 17, 2011.
We are getting absolutely spectacular ethernet performance out of this puppy now.
Updated by Dave Täht on Apr 21, 2012.

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.
RSS feed

Recent Updates

Dec 2, 2024 Wiki page
What Can I Do About Bufferbloat?
Dec 1, 2024 Wiki page
Bufferbloat FAQs
Jul 21, 2024 Wiki page
cake-autorate
Jul 21, 2024 Wiki page
Tests for Bufferbloat
Jul 1, 2024 Wiki page
RRUL Chart Explanation

Find us elsewhere

Bufferbloat Mailing Lists
#bufferbloat on Twitter
Google+ group
Archived Bufferbloat pages from the Wayback Machine

Sponsors

Comcast Research Innovation Fund
Nlnet Foundation
Shuttleworth Foundation
GoFundMe

Bufferbloat Related Projects

OpenWrt Project
Congestion Control Blog
Flent Network Test Suite
Sqm-Scripts
The Cake shaper
AQMs in BSD
IETF AQM WG
CeroWrt (where it all started)

Network Performance Related Resources


Jim Gettys' Blog - The chairman of the Fjord
Toke's Blog - Karlstad University's work on bloat
Voip Users Conference - Weekly Videoconference mostly about voip
Candelatech - A wifi testing company that "gets it".