Bug #288

Duplicate pings bother me

Added by Dave Täht over 1 year ago. Updated about 1 year ago.

Status:Closed Start date:10/21/2011
Priority:Normal Due date:
Assignee:David Taht % Done:

0%

Category:Performance Spent time: 16.00 hours
Target version:1st Public Cerowrt release

Description

In a recent experiment with cerowrt, some duplicate pings were seen, largely when the network was otherwise idle.

I am really hard pressed to think of circumstances where seeing duplicate pings can happen, given the various forms of ARQ and block acks built into 802.11n. The fact that it was happening on an idle (2 hop) network implies (to me) that single packet transfers may have an issue in retries that multipacket transfers do not.

Things that tweak me on this topic were the recent sack related analysis and dan's work on coverfire

http://www.coverfire.com/archives/2011/10/15/linux-flow-classifier-proto-dst-and-tos/

which pointed to oddball packets being mishandled elsewhere in the stack. The experiment was otherwise successful, showing a distinct improvement in latency when a txqueuelen of 37 was used vs a txqueuelen of 1000, on the router, and for all I know there ARE circumstances in 802.11n where you will see duplicate packets... but interestingly, no dups were seen in the txqueuelen 1000 test.

Further experiments are needed.

d@cruithne:~/org/lincstalk/bbloat$ grep -A 1 -B 1 DUP p1.txt # this was the txqueuelen 37 test:

64 bytes from 172.26.0.3: icmp_seq=67 ttl=63 time=1.237 ms
64 bytes from 172.26.0.3: icmp_seq=67 ttl=63 time=67.926 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=68 ttl=63 time=1.503 ms
--
64 bytes from 172.26.0.3: icmp_seq=104 ttl=63 time=2.624 ms
64 bytes from 172.26.0.3: icmp_seq=104 ttl=63 time=108.333 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=105 ttl=63 time=14.603 ms
--
64 bytes from 172.26.0.3: icmp_seq=115 ttl=63 time=1.543 ms
64 bytes from 172.26.0.3: icmp_seq=115 ttl=63 time=72.641 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=116 ttl=63 time=60.502 ms
--
64 bytes from 172.26.0.3: icmp_seq=194 ttl=63 time=1.948 ms
64 bytes from 172.26.0.3: icmp_seq=194 ttl=63 time=169.254 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=195 ttl=63 time=1.486 ms
--
64 bytes from 172.26.0.3: icmp_seq=229 ttl=63 time=1.585 ms
64 bytes from 172.26.0.3: icmp_seq=229 ttl=63 time=75.449 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=230 ttl=63 time=3.115 ms
--
64 bytes from 172.26.0.3: icmp_seq=259 ttl=63 time=1.574 ms
64 bytes from 172.26.0.3: icmp_seq=259 ttl=63 time=77.107 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=260 ttl=63 time=1.253 ms
--
64 bytes from 172.26.0.3: icmp_seq=376 ttl=63 time=1.761 ms
64 bytes from 172.26.0.3: icmp_seq=376 ttl=63 time=78.557 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=377 ttl=63 time=1.333 ms
--
64 bytes from 172.26.0.3: icmp_seq=497 ttl=63 time=1.681 ms
64 bytes from 172.26.0.3: icmp_seq=497 ttl=63 time=72.170 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=498 ttl=63 time=5.078 ms

bbloat.plotting.tar.gz (7.6 kB) David Taht, 10/21/2011 07:11 pm

plot.ps (24.5 kB) David Taht, 10/21/2011 07:11 pm

dave_taht.vcf (205 Bytes) David Taht, 10/21/2011 07:11 pm

txqueuelen37_vs1000_really_works.ps (24.8 kB) David Taht, 10/21/2011 07:18 pm

dave_taht.vcf (205 Bytes) David Taht, 10/21/2011 07:18 pm

History

Updated by David Taht over 1 year ago

-------- Original Message --------
Subject: Re: plot
Date: Fri, 21 Oct 2011 16:25:16 +0200
From: Fabian Schneider <>
To: David Täht <>
CC: Jim Gettys <>, Kathleen Nichols
<>

Hi,

Obviously I need to learn enough R to be able to do this sort of analysis myself. In the interim, since I'm doing a similar experiment monday (this time varying either classification of ping or the queue-ing algorithm, not the buffer size - I haven't decided which), we'd love to see your script in the hope I can immediately use it in the class...

Ok. lets see. I created something that should ease up the plotting. See attached tar archive and sample plot. To summarize there is a Makefile which plots the data and expects:
- p1.txt and p2.txt as input (which should be the logs from the ping command)
- awk and sed to be installed and in the path
- GNU R to be installed and in the path

What it does is:
1.) extracts the RTT timeseries 'hopefully' covering the download period, by looking for the ping sequence number (SEQ) with maximum RTT (MAX) and considering all RTTs from SEQ-100 to SEQ+100.
2.) replacing all timeouted pings with an RTT value of MAX+100ms, where the 100ms can be configured in the BEGIN clause of the awk script.
3.) plot the two time series.

What it includes:
- extract.timeseries.awk (performs steps 1+2)
- plotit.R (performs step 3)
- Makefile (wrapper)
- p*.txt example input files

As for these plots, what I'd like:

the Y axis to stay in the the same range of 0 - 1000 ms

(for the combined plot i determine the maximum RTT and use that as an upper bound)

the X axis to be 100 seconds long, showing the idle period, the spike, the idle period.

(i opted for a 200 second = 200 ping samples period instead)

the two plots to be directly comparable, using a different color for the ping 'dots', and roughly the same start time, so I can overlay them...

already done. Only one output plot.

1) start time is uncontrolled (but close enough, given the length of transfer)

you can 'tune' the input data and have the input file start at a chosen point in time that is less than 100 seconds before the maximum to achieve this.
(I did not want to get into more sophisticated methods for determining the download period from the data, sofar)

4) tcpdump on at least several stations would be VERY interesting. Dario says you were seeing some interesting duplicate acks - I still haven't verified we've actually FIXED the stack enough to get reliable results.

No what we saw is ping reporting "(DUP!)" behind some lines.

Would you be interested in following up this line of work with a paper w/me/etc?

sure.

best
Fabian

Updated by David Taht over 1 year ago

I just wanted to express how (other than the dup ping issue) utterly
happy I am with the attached txqueuelen37 vs 1000 plot.

txqueuelen 37 REALLY works vs 1000 - and the real sources of the
bottlenecks have moved to the clients, not the AP. On to testing queue
management!

Is 37 the right number? don't know. Is it a better number than 1000 -
you betcha!

Andrew, are you still thinking 100 is better for google's iw10 mice attack?

Updated by Dave Täht over 1 year ago

oh, I forgot to mention what REALLY tweaked me (athough it's not shared by the remainder of the data set and is probably totallllly a matter of mere misfortune) and made me jump at shadows like this one.

first dup is packet 67
second dup is packet 104.

104-67 = 37, which is my txqueuelen.

Updated by Dave Täht over 1 year ago

  • Status changed from New to Closed
  • Target version changed from 14 to Cerowrt-1.0-rc8

Can't reproduce at current txqueuelen

Updated by Jim Gettys over 1 year ago

Seemed fixed. TXquelen == 40 seems to have fixed it..... Bizarre.

Updated by Dave Täht about 1 year ago

  • Target version changed from Cerowrt-1.0-rc8 to 1st Public Cerowrt release

Also available in: Atom PDF