Best Practices for Benchmarking CoDel and FQ CoDel (and almost anything else!)

Document version: 1.2, March 10, 2013.

The bufferbloat project has had trouble getting consistent repeatable results from other experimenters, due to a variety of factors. This page attempts to identify the most common omissions and mistakes. There be land mines here. Your data will be garbage if you don't avoid them!

Hardware/Software Traps for the Unwary

Network hardware (even cheap hardware!) has grown "smart", sprouting various "offload" engines that are now, unfortunately, often enabled by default. These tend to do more damage than good except for extreme benchmarking fanatics, primarily on big server machines in data centers. Start by turning them off. We'll write more on this topic soon. The implementers of this "smart" hardware are less "smart" than they think they are.

Transmit and receive rings are now ubiquitous, to get the CPUs out of the business of handling interrupts on a per-packet basis when running at high bandwidths with small packets. But they are often a large source of bufferbloat, as drivers have managed the rings in the most primitive way possible. BQL (Byte Queue Limits) is the first step toward putting sanity back into Ethernet driver transmit ring management.

There are now over 12 BQL enabled Ethernet drivers in Linux 3.6, but only the tg3 and e1000e are well tested on Intel devices, and the ar71xx on CeroWrt. If your driver is NOT BQL enabled, you will need to use HTB to emulate rates correctly.

Beware that other network interfaces have often sprouted similar "smart" hardware (e.g. some DSL hardware, and all USB-to-Ethernet devices); no help for them is yet available. Worse than that, software offloads emulating TSO (GSO) have now appeared, saving a little on interrupt processing at higher bandwidths while bloating up packets at all bandwidths, including lower ones. We would really like to see GRO and GSO disabled entirely at 100Mbit and below. Let packets be PACKETS!

Consequently, you must understand the device drivers you are using.
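
As an example, you can check which driver an interface uses, and which offloads are currently enabled, with ethtool (eth0 here is a placeholder for whatever interface you are testing):

ethtool -i eth0   # driver name and version
ethtool -k eth0   # current offload settings (tso, gso, gro, lro, ...)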

Ethernet BQL (Byte Queue Limits) Driver Enablement

BQL rate limits packet service to the device in terms of bytes on Ethernet. BQL was initially developed and tuned at 1 and 10Gbps. Its estimator is consistently wrong at speeds below 100Mbit, usually settling at double or more what is needed (and see the last note here about how this further damages CoDel behavior). At 100Mbit and below, 3K suffices. At 10Mbit and below, 1514 is as low as you can go, and it should arguably be even lower.
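
The BQL limit can be inspected and capped per transmit queue via sysfs. A minimal sketch, assuming eth0 with a single transmit queue (tx-0) on a 100Mbit link:

echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max

Multi-queue devices expose tx-1, tx-2, and so on, which need the same treatment.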

Common experimental errors: leaving BQL at its autotuned value, setting it too high for the available bandwidth, etc.

We would like BQL to autotune better, but as of Linux 3.10, it does not. Perhaps BQL can be fixed to have a lower limit and just periodically exceed it for larger packets.

We are looking into algorithms other than BQL for other network types, which have often sprouted "smart" hardware similar to Ethernet's, e.g. ADSL/VDSL, which must cover a bandwidth range of 250Kbps to 100Mbps.

Hardware Rate Limiting

Adjusting the Ethernet card to a different line rate is useful, but several other variables inside the stack must be adjusted as well: turn off all network offloads on ALL cards, and set BQL to an appropriate limit for the chosen bandwidth.

The debloat.sh and debloat scripts try their best to eliminate those variables, but do not always succeed.

We have found that 3000 is a reasonable BQL setting for a 100Mbit system and 1514 is reasonable for 10Mbit, if configured with a low-latency kernel. We note that, unlike experimenters leveraging whatever hardware and kernels they have lying around, a router maker should make these choices by default...

On some devices you can also reduce the Ethernet tx ring down to 2 at 10Mbit with no ill effects; however, you should measure this optimization's effect on throughput.
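
As a sketch (assuming an eth0 whose driver supports ring resizing via ethtool), the transmit ring can be inspected and shrunk like this:

ethtool -g eth0        # show current ring sizes
ethtool -G eth0 tx 2   # shrink the tx ring; only sensible at very low rates, e.g. 10Mbit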

Note that most home routers have an internal switch that runs at a fixed rate. Turning down CeroWrt's rate on its internal Ethernet device doesn't work; in tests that try to use hardware rate limiting there, you are driving buffering into the switch rather than into the (n,e)fq_codel algorithms, and most drops happen in the switch, thus invalidating your test.

Leave the switch out of your testbeds. If you must have one, measure its behaviour thoroughly, while under saturating load from two ports going into one. You may be surprised by how much buffering has crept into switches - one GigE switch we've seen has 50ms of buffering (at gigE!). What is more, your switch, which might or might not be properly buffered at 1Gbps, will likely have ten times too much buffering if used at 100Mbps, and 100 times too much buffering when used at 10Mbps.

Software Rate Limiting

Use of HTB to rate limit connections to a given speed is preferred, as HTB buffers up one, and only one, packet. Note that HTB is timer based; default Linux kernels are often compiled with HZ=250 (or even lower), causing burstiness and non-uniform delivery of packets; building your kernel with HZ=1000 will reduce this effect.
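
A minimal HTB + fq_codel shaping sketch, assuming eth0 and a 10Mbit target rate (adjust the rate, and add classes and filters, to suit your setup):

tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 10mbit
tc qdisc add dev eth0 parent 1:10 handle 110: fq_codel

Because the htb default points at class 1:10, traffic that misses your filters still lands on the fq_codel leaf rather than falling through to pfifo_fast.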

Still, other hardware in your setup can bite you - debloat all devices thoroughly (as per the debloat scripts) in a routing - or routing-emulating - setup.

Common experimental error:

ethtool -s eth0 advertise 0x002 # set one interface to 10Mbit

This still allows GRO to happen on other interfaces, and TSO, GSO, GRO, UFO, and LRO offloads to happen on all interfaces, bloating up packet sizes. Turn all offloads off, on all interfaces, always.
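
One way to do that (a sketch; interface names and supported features vary, and errors for features a NIC lacks are simply suppressed here):

for dev in /sys/class/net/*; do
    dev=$(basename $dev)
    [ "$dev" = "lo" ] && continue
    for f in tso gso gro ufo lro; do
        ethtool -K $dev $f off 2>/dev/null
    done
done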

We use multiple variants of HTB and HFSC shaper scripts from Dave Täht, Dan Siemon, and others. Configuring HTB can be tricky, and simple errors will (usually) result in you directing a shaper bin into the pfifo_fast queue rather than where you wanted it.

ceroshaper might be a good starting place for hacking.

Know Your Bottlenecks!

Switches have buffering; sometimes excessive (particularly on improperly configured enterprise switches). And Ethernet flow control may move the bottleneck in your path to somewhere you didn't expect, or cause the available bandwidth to be very different than you expect (particularly in switched networks that mix different Ethernet speeds). Most Linux Ethernet drivers honor flow control by default. Cheap consumer Ethernet switches typically generate pause frames; enterprise switches typically do not. There is no substitute for packet capture, mtr, and wireshark!
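
A quick way to see whether pause frames are in play on a given interface (eth0 is again a placeholder), and to turn flow control off for the duration of a test, is:

ethtool -a eth0                 # show current pause frame (flow control) settings
ethtool -A eth0 rx off tx off   # disable flow control while testing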

Kernel Versions and Configuration is Important

We try to provide a modern, precompiled kernel (usually with several advanced versions of codel and fq_codel derived schedulers) on our website, along with patches and the config used to build it.

There are two differences between this kernel's config and a normal "desktop" or "server" configuration: it is configured for low latency and a high clock interrupt rate. Faster interrupt response makes smaller buffers feasible on a conventional x86 machine.
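
In .config terms that roughly corresponds to something like the following (a sketch of the relevant options, not our full configuration):

# "Preemptible Kernel (Low-Latency Desktop)"
CONFIG_PREEMPT=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000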

The NetEm qdisc does not work in conjunction with other qdiscs.

The Linux network emulator qdisc, "netem", in its current incarnation, although useful for inserting delay and packet loss, cannot be effectively used in combination with other queueing disciplines. If you intend to insert delays or other forms of netem based packet manipulation, an entirely separate machine is required. A combination of netem + any complex qdisc (such as htb + fq_codel or RED, or qfq) WILL misbehave. Don't do it; your data is immediately suspect if you do.

Note: netem has been improving of late...

Tuning txqueuelen on pfifo_fast

The default txqueuelen was set to 1000 in Linux, in 2006. This is arguably wrong, even for gigE. Most network simulations at 100Mbit and below use 100, or even 50 as their default. Setting it below 50 is generally wrong, too, at any speed, except on wireless.
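
Changing it is a one-liner (eth0 as a placeholder); pick a value that matches the speed you are actually testing at:

ip link set dev eth0 txqueuelen 100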

There is no right size for buffering!

Tuning These Algorithms

Having a sane, parameterless algorithm is very important to us. The world (and the current Linux implementation), however, is not yet cooperating as well as we (or the algorithms) would like.

Tuning CoDel for Circumstances it Wasn't Designed for

CoDel is designed to attempt to be "no knobs" at the edge of the Internet (where most people are). At high speeds (10GigE), using a packet limit larger than 1000 is recommended. Also at those speeds, in a data center (not on the open Internet), the target and interval are often reduced to 500us and 20ms respectively by those attempting to use CoDel in those environments.
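
As a sketch (eth0 as a placeholder; the limit value here is an illustrative choice, not a recommendation), a data-center style codel configuration along those lines would look like:

tc qdisc add dev eth0 root codel limit 2000 target 500us interval 20ms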

To date, no one has invented a truly "no knobs" algorithm that works in all environments in the Internet.

Secondly, codel is a "drop strategy", and is meant to be used in conjunction with another qdisc, such as DRR, QFQ, or (as we use it) the several variants of (n,e)fq_codel. While it is available as a standalone qdisc, that is intended primarily for testing variants of the algorithm. A patch for most of the current ns2 model is available.

Tuning fq_codel

By default fq_codel is tuned to run well, and with no parameters, at 10GigE speeds.

However, today's Linux implementation of CoDel is imperfect: there are typically (at least) one or more packets of buffering under the Linux qdisc, in the device driver (or one packet in htb), even if BQL is available. This means that the "head drop" of CoDel's design is not actually a true head drop, but several packets back in the actual queue (since there is no packet loss at the device driver interface), and that CoDel's square root computation is not exactly correct. These effects are vanishingly small at 1Gbps or higher, but at low speeds even one packet of buffering is very significant; today's fq_codel and codel qdiscs do not try to compensate for what can be a significant sojourn time in these packets at low bandwidth. So you might have to "tune" the qdiscs in ways (e.g. the target) that in principle the CoDel algorithm should not require when used at low bandwidths. We hope to get this all straightened out someday soon, but knowing exactly how much buffering is under a qdisc is currently difficult, and it isn't clear when this will happen.

When running it at 1GigE and lower, today it helps to change a few parameters given limitations in today's Linux implementation and underlying device drivers.

The default packet limit of 10000 packets is crazy in any other scenario. It is sane to reduce this to 1000, or less, on anything running at GigE or below. The over-large packet limit leads to bad results during slow start on some benchmarks. Note that, unlike txqueuelen, CoDel-derived algorithms can and DO take advantage of larger queues, so reducing it to, say, 100 impacts new flow start and a variety of other things.

We tend to use ranges of 800-1200 in our testing, and at 10Mbit, currently 600.
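
For example (a sketch, with eth0 as a placeholder):

tc qdisc add dev eth0 root fq_codel limit 1000       # GigE and below
tc qdisc change dev eth0 root fq_codel limit 600     # drop the limit further for a 10Mbit test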

ECN Issues

ECN is enabled by default (in fq_codel). ECN is useful in the data center but far less so today on the open Internet. Current best practice is to turn off ECN on uplinks running at less than 4Mbit if you want good VOIP performance; a single packet at 1Mbps takes 13ms, and a packet drop gets you that latency back.

ECN IS useful on downlinks on a home router, where the terminating hop is only one or two hops away and connected to a system that handles ECN correctly (all current OSes are believed to implement ECN correctly, but this assumption needs greater testing!).
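
Concretely (a sketch; eth0 and eth1 stand in for your WAN-facing and LAN-facing interfaces), that means something like:

tc qdisc add dev eth0 root fq_codel noecn   # slow uplink: take the drop, get the latency back
tc qdisc add dev eth1 root fq_codel ecn     # downlink/LAN side: ECN marking is fine here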

Fq_codel runs well on asymmetric links, such as the commonly available 24.5/5.5 service from a cable modem provider like Comcast (in conjunction with HTB rate limiting set to your provider's rates).

nfq_codel and efq_codel

We now have new versions of fq_codel under test. nfq_codel is an implementation of the experimental ns2 model of codel plus standard fq_codel, and efq_codel is an nfq_codel that takes much better advantage of the quantum size and interleaves small packets better at low bandwidths...

However, it helps to reduce the quantum slightly on efq_codel to improve downlink performance while not compromising upload performance. A ratio of 3x1 (a quantum of 500) appears reasonable for most traffic in a 6x1 scenario. A quantum of 1000 isn't bad either, and works for more kinds of interactive traffic.

Life gets dicey at 12x1, as quantums below 256 disable a few key things efq_codel does to get good VOIP performance (for example). Stay at 500 or above.

Anyway, a commonly used configuration line is this:

tc qdisc add dev your_device root efq_codel quantum 500 limit 1000 noecn

NOTE: An earlier version of this page identified nfq_codel as having the quantum optimization. Also note that "sfq_codel" in the ns2 distribution is packet oriented, while the variants of (e,n)fq_codel have various degrees of byte orientation.

Research continues. Also note that fq_codel in Linux 3.5 did not do fate sharing, and the Linux 3.6 fq_codel reduces cwnd without a packet drop when a local TCP stream overloads the buffer at enqueue.

Lastly, just running fq_codel by itself does not help you very much when the next device in line is overbuffered (as in a home router next to today's cable modems). In that case, use HTB to rate limit your router to below the rate of the next gateway device, and then apply fq_codel. See the note above about the limitations of HTB.

Fq_codel needs more tuning below 4Mbit and on ADSL links. We're working on it.

Work on Making codel and fq_codel Implementations Better Continues

The time packets spend in device drivers is not taken into account in CoDel's control law computation, resulting in the need to tune the CoDel target, particularly at low speeds where even one packet represents significant latency. This is a current implementation limitation, as we have no way to find out how much time is being spent in a device driver; several other possible bugs lurk as well. Let us know what you discover.

Improving BQL and other algorithms for driver ring buffer control is needed. Ideally, being able to run delay-based AQM algorithms across coupled queues (the OS queuing system and the driver) as a unified queue would be best, but today's driver interfaces (and possibly hardware) make this difficult (or impossible).

Simulation Traps with TCP

ns2 does not support ECN, or cubic, or proportional rate reduction. For that matter, it doesn't support Compound TCP, or any of the numerous tweaks to TCP that exist in the field, such as SYN cookies or TFO. There is a built-in assumption in most simulations that there is no tx ring, no MAC-layer buffering at the hardware layer (interfaces to DOCSIS or ATM), and no software rate limiting enforced at the ISP. Also, there is rarely any accounting for ATM overhead.

Most home links use asymmetric rate limiting (e.g. 20Mbit down, 4Mbit up), and most simulations assume bidirectional symmetry. Simulations of 10Mbit links are still too fast for the vast majority of real-world consumer links.

ns3 has similar problems.

It would be useful if more sims came much closer to modeling the real world.