I am really hard pressed to think of circumstances where seeing duplicate pings can happen, given the various forms of ARQ and block acks built into 802.11n. The fact that it was happening on an idle (2 hop) network implies (to me) that single packet transfers may have an issue in retries that multipacket transfers do not.
Things that tweak me on this topic were the recent sack related analysis and dan’s work on coverfire
which pointed to oddball packets being mishandled elsewhere in the stack. The experiment was otherwise successful, showing a distinct improvement in latency when a txqueuelen of 37 was used vs a txqueuelen of 1000, on the router, and for all I know there ARE circumstances in 802.11n where you will see duplicate packets… but interestingly, no dups were seen in the txqueuelen 1000 test.
Further experiments are needed.
d@cruithne:~/org/lincstalk/bbloat\$ grep -A 1 -B 1 DUP p1.txt # this was the txqueuelen 37 test:
64 bytes from 172.26.0.3: icmp_seq=67 ttl=63 time=1.237 ms
64 bytes from 172.26.0.3: icmp_seq=67 ttl=63 time=67.926 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=68 ttl=63 time=1.503 ms
64 bytes from 172.26.0.3: icmp_seq=104 ttl=63 time=2.624 ms
64 bytes from 172.26.0.3: icmp_seq=104 ttl=63 time=108.333 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=105 ttl=63 time=14.603 ms
64 bytes from 172.26.0.3: icmp_seq=115 ttl=63 time=1.543 ms
64 bytes from 172.26.0.3: icmp_seq=115 ttl=63 time=72.641 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=116 ttl=63 time=60.502 ms
64 bytes from 172.26.0.3: icmp_seq=194 ttl=63 time=1.948 ms
64 bytes from 172.26.0.3: icmp_seq=194 ttl=63 time=169.254 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=195 ttl=63 time=1.486 ms
64 bytes from 172.26.0.3: icmp_seq=229 ttl=63 time=1.585 ms
64 bytes from 172.26.0.3: icmp_seq=229 ttl=63 time=75.449 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=230 ttl=63 time=3.115 ms
64 bytes from 172.26.0.3: icmp_seq=259 ttl=63 time=1.574 ms
64 bytes from 172.26.0.3: icmp_seq=259 ttl=63 time=77.107 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=260 ttl=63 time=1.253 ms
64 bytes from 172.26.0.3: icmp_seq=376 ttl=63 time=1.761 ms
64 bytes from 172.26.0.3: icmp_seq=376 ttl=63 time=78.557 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=377 ttl=63 time=1.333 ms
64 bytes from 172.26.0.3: icmp_seq=497 ttl=63 time=1.681 ms
64 bytes from 172.26.0.3: icmp_seq=497 ttl=63 time=72.170 ms (DUP!)
64 bytes from 172.26.0.3: icmp_seq=498 ttl=63 time=5.078 ms
Obviously I need to learn enough R to be able to do this sort of analysis myself. In the interim, since I’m doing a similar experiment monday (this time varying either classification of ping or the queue-ing algorithm, not the buffer size - I haven’t decided which), we’d love to see your script in the hope I can immediately use it in the class…
Ok. lets see. I created something that should ease up the plotting. See
attached tar archive and sample plot. To summarize there is a Makefile
which plots the data and expects:
- p1.txt and p2.txt as input (which should be the logs from the ping command)
- awk and sed to be installed and in the path
- GNU R to be installed and in the path
What it does is:
1.) extracts the RTT timeseries ‘hopefully’ covering the download period, by looking for the ping sequence number (SEQ) with maximum RTT (MAX) and considering all RTTs from SEQ-100 to SEQ+100.
2.) replacing all timeouted pings with an RTT value of MAX+100ms, where the 100ms can be configured in the BEGIN clause of the awk script.
3.) plot the two time series.
What it includes:
- extract.timeseries.awk (performs steps 1+2)
- plotit.R (performs step 3)
- Makefile (wrapper)
- p*.txt example input files
As for these plots, what I’d like:
the Y axis to stay in the the same range of 0 - 1000 ms
(for the combined plot i determine the maximum RTT and use that as an upper bound)
the X axis to be 100 seconds long, showing the idle period, the spike, the idle period.
(i opted for a 200 second = 200 ping samples period instead)
the two plots to be directly comparable, using a different color for the ping ‘dots’, and roughly the same start time, so I can overlay them…
already done. Only one output plot.
1) start time is uncontrolled (but close enough, given the length of transfer)
you can ‘tune’ the input data and have the input file start at a chosen
point in time that is less than 100 seconds before the maximum to
(I did not want to get into more sophisticated methods for determining the download period from the data, sofar)
4) tcpdump on at least several stations would be VERY interesting. Dario says you were seeing some interesting duplicate acks - I still haven’t verified we’ve actually FIXED the stack enough to get reliable results.
No what we saw is ping reporting “(DUP!)” behind some lines.
Would you be interested in following up this line of work with a paper w/me/etc?
txqueuelen 37 REALLY works vs 1000 - and the real sources of the
bottlenecks have moved to the clients, not the AP. On to testing queue
Is 37 the right number? don’t know. Is it a better number than 1000 -
Andrew, are you still thinking 100 is better for google’s iw10 mice attack?
first dup is packet 67
second dup is packet 104.
104-67 = 37, which is my txqueuelen.