Bug #436

speed problems with later versions of cero?

Added by David Taht on Nov 21, 2013. Updated on May 14, 2014.
Closed Normal Dave Täht

Description

I’ve now had three private reports of “speed” problems with the latest
cerowrt releases.

All of them take serious uptime to occur. And unfortunately all the
bug report(s) were, “oh, it
got slow after everybody got home/ran a couple days, so I went back to
the prior router”.

That’s not helpful, and could well be new users blaming the change
rather than something else, but it is possible that this is happening,
so… IF that happens to anyone here, I’d like
to request a couple things in roughly this order

tcpdump -i ge00 -s 128 -w /tmp/myproblem.cap # for 10 seconds hit
cntrl-c Send me the file
pinging the box from the outside world # got packet loss?
dmesg > /tmp/dmesg.log # send me the file
logread > /tmp/logread.log # send me te file
run top and see if there is a daemon running away
tell me what ISP’s network you are on…

restart the aqm system and see if it goes away
/etc/init.d/aqm restart

try nfq_codel and/or pie instead of fq_codel

1) I am aware of an old bug with odhcp6 where it would start hammering
the network with
requests for an ipv6 address in a tight loop. I don’t know what
triggers it. (with AQM set right it is hardly noticeable but
“feelable”. However if your bandwidth setting is even mildly wrong,
boom life gets bad and STAYS bad)

If you look at the packet capture it will be filled with dhcpv6
requests if this is what is happening.

Right now, I have a box dropping nearly every other packet for some
reason. It’s not in a place I can easily get at until tomorrow. It’s
been up for 90+ days and only started doing this recently

— X.Y.Z.Q ping statistics —
60 packets transmitted, 42 packets received, 30.0% packet loss
round-trip min/avg/max/stddev = 37.78844.20576.6076.609 ms

It’s my hope it’s the above problem and is occurring now that comcast
is rolling out ipv6

Off the top of my head I don’t remember what you need to disable to
turn it off. Certainly comment out the ge01 entry to turn it off.

2) Random number generation has changed.

rngd is still enabled, so we’re not running out of random numbers, but
we could have other problems… check entropy?

3) There could be a daemon running away (run top)

4) Could have a new instruction trap problem under certain conditions

take a look at the /sys

dmesg
logead

Tony

Attachments

  • boom.txt (text/plain; 401 bytes) Dave Täht Nov 22, 2013
  • wtf.log (text/x-log; 15.7 kiB) Dave Täht Nov 22, 2013
  • wtf.dmesg (application/octet-stream; 16.4 kiB) Dave Täht Nov 22, 2013
  • ge00.log (text/x-log; 1.2 kiB) Dave Täht Nov 22, 2013
  • ifb0.log (text/x-log; 1.1 kiB) Dave Täht Nov 22, 2013
  • odhcp6c_from_hell.png (image/png; 146.7 kiB) screenshot of odhcp6c running away, eating nearly all of cpu Dave Täht Nov 22, 2013

History

Updated by David Taht on Nov 21, 2013.
———- Forwarded message ———-
From: Tony Morel morel@armory.com
Date: Thu, Nov 21, 2013 at 3:01 PM
Subject: Re: second chance to diagnose router failure -> third chance :)
To: Dave Taht dave.taht@gmail.com

Hi Mike,

Looks like the cable modem router bug’s been exercised again,
dropping 30% of pings, and of course noticable in use:

Destination IP: 50.131.246.224
Prior hop IP: 68.86.142.254
Number of pings: 60
PING 50.131.246.224 (50.131.246.224): 56 data bytes
PING 68.86.142.254 (68.86.142.254): 56 data bytes

— 68.86.142.254 ping statistics —
60 packets transmitted, 60 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 26.98534.099/58.088/7.826 ms

— 50.131.246.224 ping statistics —
60 packets transmitted, 42 packets received, 30.0% packet loss
round-trip min/avg/max/stddev = 37.78844.20576.6076.609 ms

Tony

On Thu, Aug 08, 2013 at 04:53:14PM -0700, Tony Morel wrote:
> OK, I first ssh’d to oldnet. I’m attaching current.(log|dmesg)
>
> Is there some fine point about dissociating the reboot
> and logging off root before the reboot? Or something?
> Now I get this:
>
> root@oldnet:~# ssh root@172.26.3.1
> ssh: connect to host 172.26.3.1 port 22: Connection refused
>
> Will that require a power cycle to resolve?
>
> Anyway, reboot cleared the dropped packet problem for now:
>
> \$ hopchk
> End-of-hop IP: 50.131.246.224
> Prior hop IP: 68.86.142.250
> Number of pings: 60
> PING 50.131.246.224 (50.131.246.224): 56 data bytes
> PING 68.86.142.250 (68.86.142.250): 56 data bytes
>
> — 68.86.142.250 ping statistics —
> 60 packets transmitted, 60 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 27.93153.351294.26762.636 ms
>
> — 50.131.246.224 ping statistics —
> 60 packets transmitted, 60 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 34.31664.208315.35470.786 ms
>
> Tony
>
> On Thu, Aug 08, 2013 at 02:39:01PM -0700, Dave Taht wrote:
> > On Thu, Aug 8, 2013 at 2:26 PM, Tony Morel morel@armory.com wrote:
> > > Hi Mike,
> > >
> > > I, too, have been over busy, but finally dug up my notes
> > > last night - but then found johnd, the machine on the cable side
> > > of the network, evidentally powered off while Irene’s out of town.
> > > Is there another way to find my way to the cable router from
> > > deepthought?
> >
> > I dunno, you should be able to ssh to the internal ip addresses for her network.
> >
> > powering down johnd messes up email so far as I know (it was doing rbls)
> >
> > >
> > > Might as well ask, too: will the router let me
> > > scp the logs back to the mother ship?
> >
> > sure.
> >
> > >
> > > Tony
> > >
> > >
> > > On Wed, Jul 31, 2013 at 02:13:20PM -0700, Dave Taht wrote:
> > >> ENOTIME.
> > >>
> > >> do a
> > >>
> > >> logread > /root/current.log
> > >> dmesg > /root/current.dmesg
> > >>
> > >> and just reboot
> > >>
> > >>
> > >>
> > >> On Wed, Jul 31, 2013 at 2:08 PM, Dave Taht dave.taht@gmail.com wrote:
> > >> > wow. All I remember was that I’d port forwarded 222 somewhere…
> > >> >
> > >> >
> > >> > On Wed, Jul 31, 2013 at 1:41 PM, Tony Morel morel@armory.com wrote:
> > >> >> Hi Mike,
> > >> >>
> > >> >> Irene’s router for the cable modem is dropping ~20% packets again today.
> > >> >> If like last time, it will continue doing so until reset.
> > >> >>
> > >> >> You may recall that when we started to look into this last time, we were
> > >> >> briefly missing the router password, power-cycled the router, got the password,
> > >> >> but then no longer had the logs to look over. The reset remedied the
> > >> >> packet dropping.
> > >> >>
> > >> >> This time we have the password and the logs. Seems like something worth
> > >> >> trying to get to the bottom of, no?
> > >> >>
> > >> >> Tony
> > >> >>
> > >> >>
> > >> >> \$ hopchk
> > >> >> End-of-hop IP: 50.131.246.224
> > >> >> Prior hop IP: 68.86.142.250
> > >> >> Number of pings: 60
> > >> >> PING 68.86.142.250 (68.86.142.250): 56 data bytes
> > >> >> PING 50.131.246.224 (50.131.246.224): 56 data bytes
> > >> >>
> > >> >> — 68.86.142.250 ping statistics —
> > >> >> 60 packets transmitted, 60 packets received, 0.0% packet loss
> > >> >> round-trip min/avg/max/stddev = 26.44231.84945.4193.476 ms
> > >> >>
> > >> >> — 50.131.246.224 ping statistics —
> > >> >> 60 packets transmitted, 47 packets received, 21.7% packet loss
> > >> >> round-trip min/avg/max/stddev = 33.61638.65162.8785.068 ms
> > >> >
> > >> >
> > >> >

Updated by Dave Täht on Nov 22, 2013.
It was, indeed, odhcpv6 running away, and totally saturating the network.

I AM puzzled as to why it was noticable at all vs fq_codel, my thought is I should have captured more data than I just did….

I was experiencing 13% packet loss. running at nearly 100% of cpu, too…

the router was runnign 3.7.1, so it predates the official 3.7 release of cerowrt. I have no doubt this bug persisted in later versions and may well persist today.

Also, we had major traps. I am vague on when we fully fixed these, but I’m pretty sure it wasn’t before 3.7.2

root@gw-comcast:/sys/kernel/debug/mips# cat unaligned_instructions
17755733
root@gw-comcast:/sys/kernel/debug/mips# cat unaligned_instructions
17756005
root@gw-comcast:/sys/kernel/debug/mips#

Anyway, after killing off odhcpv6c entirely the system returned to normal

Updated by Dave Täht on Nov 22, 2013.
This box had been up for 56 days, and started exhibiting signs of trouble at least a week ago.

odhcp6c is completely normal on the reboot, blocking in recv… (no ipv6 on this network)

Theory: the packet capture I just took shows “elapsed time” of 655350 ms in the
dhcpv6 field. Perhaps we’ve hit an overflow here, after X time runnning….

root@gw-comcast:~# strace -p 1516
Process 1516 attached
clock_gettime(CLOCK_MONOTONIC, {211, 808127890}) = 0
setsockopt(4, SOL_SOCKET, SO_RCVTIMEO, “\0\0\0000\0\5\300\250”, 8) = 0
recv(4, 0x7ff5897c, 1536, 0) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {260, 190780387}) = 0
read(3, “\t![]()\5”, 4) = 4
read(3, “\204\363oB”, 4) = 4
uname({sys=“Linux”, node=“gw-comcast.hm.armory.com”, …}) = 0
sendmsg(4, {msg_name(28)={sa_family=AF_INET6, sin6_port=htons(547), inet_pton(AF_INET6, “ff02::1:2”, &sin6_addr), sin6_flowinfo=0, sin6_scope_id=if_nametoindex(“ge00”)}, msg_iov(11)=[{”\1\177\373\226\0\10\0\2Qe\0\6\0\4”, 14}, {”\0\27\0\30”, 4}, {“”, 0}, {”\0\1\0\n\0\3\0\1\0\216\362\372\241T”, 14}, {NULL, 0}, {”\0\24\0\0”, 4}, {”\0’\0\33\0\ngw-comcast\2hm\6armory\3com\0”, 31}, {”\0\3\0\f\0\0\0\1\0\0\0\0\0\0\0\0”, 16}, {NULL, 0}, {”\0\31\0\f\0\0\0\1\0\0\0\0\0\0\0\0”, 16}, {NULL, 0}], msg_controllen=0, msg_flags=0}, 0) = 99
setsockopt(4, SOL_SOCKET, SO_RCVTIMEO, “\0\0\0w\0\5_\0”, 8) = 0
recv(4,

Updated by Dave Täht on May 14, 2014.
PArt of bug #438

This is a static export of the original bufferbloat.net issue database. As such, no further commenting is possible; the information is solely here for archival purposes.
RSS feed

Recent Updates

Oct 20, 2023 Wiki page
What Can I Do About Bufferbloat?
Dec 3, 2022 Wiki page
Codel Wiki
Jun 11, 2022 Wiki page
More about Bufferbloat
Jun 11, 2022 Wiki page
Tests for Bufferbloat
Dec 7, 2021 Wiki page
Getting SQM Running Right

Find us elsewhere

Bufferbloat Mailing Lists
#bufferbloat on Twitter
Google+ group
Archived Bufferbloat pages from the Wayback Machine

Sponsors

Comcast Research Innovation Fund
Nlnet Foundation
Shuttleworth Foundation
GoFundMe

Bufferbloat Related Projects

OpenWrt Project
Congestion Control Blog
Flent Network Test Suite
Sqm-Scripts
The Cake shaper
AQMs in BSD
IETF AQM WG
CeroWrt (where it all started)

Network Performance Related Resources


Jim Gettys' Blog - The chairman of the Fjord
Toke's Blog - Karlstad University's work on bloat
Voip Users Conference - Weekly Videoconference mostly about voip
Candelatech - A wifi testing company that "gets it".