iproute2+tc notes

The iproute2+tc package allows access to the variety of neat new networking features in the 2.2 kernels. Policy routing, NAT, QoS, advanced tunnels, RSVP and Differentiated services, are just a few of the buzzword capabilities unleashed by the ip and tc programs.

Introduction

This is the second version of this document; the first began as a simple command reference for my own use and evolved into a slightly more complex document, including some usage notes and a bunch of links to documents. That version became dated rather rapidly, and so I've redone it from fresh sources. Hopefully the pace of development has slowed somewhat (at least on the user interface side of things), and this version will remain useful a bit longer.

Suggestions, additions, sources, comments, etc to dragon@snafu.freedom.org. Specifically more detailed explanations of the commands listed in the Command Syntax section, for inclusion, would be extremely welcome.

All of the information pulled from linux/ files comes from linux-2.2.14-pre9. The iproute2/ files come from the iproute2-2.2.4-now-ss991023.tar.gz distribution.

Distribution

From the Documentation/Configure.help file of the linux distribution; "QoS and/or fair queueing" item:

  To administer these schedulers, you'll need the user-level utilities
  from the package iproute2+tc at ftp://ftp.inr.ac.ru/ip-routing/

That directory is mirrored at:

ftp://linux.wauug.org/pub/net

ftp://ftp.nc.ras.ru/pub/mirrors/ftp.inr.ac.ru/ip-routing/

ftp://ftp.gts.cz/MIRRORS/ftp.inr.ac.ru/

ftp://ftp.funet.fi/pub/mirrors/ftp.inr.ac.ru/ip-routing/ (STM1 to USA)

ftp://sunsite.icm.edu.pl/pub/Linux/linux-iproute/

ftp://ftp.sunet.se/pub/Linux/ip-routing/

It's probably worth picking up the latest version; if your distribution includes these tools at all they're likely to be a revision or two behind.

Building

Unpack the distribution tarball. Edit the Makefile; you need to tell it where to find the kernel include files with the KERNEL_INCLUDES variable. You might also need to tell it about libraries or something; I didn't have to. A simple "make all" compiles the programs ip/ip, ip/rtmon, and tc/tc. I didn't get but a couple of warning messages during the compile. Copy those programs to /sbin, and you're ready to play.

Keep the unpacked distribution, it's got several examples and utility scripts which may well be useful, especially if you're going to be using the tc program to do QoS stuff. In that case, you may also need to edit a kernel header file. In the tc/ directory of the distribution there's a file named "README.last", that explains the possible values and meanings of PSCHED_CLOCK_SOURCE in the file linux/include/net/pkt_sched.h. The default "PSCHED_JIFFIES" there is probably what you want anyway.

Kernel Configuration

Most of the capabilities of the kernel accessed by the ip and tc programs probably aren't included in your distribution's stock kernel configuration. If you can't get a feature to work, make sure that all of the required kernel configuration values are set properly. I've tried to note what kernel configuration options are required or related in the Command Syntax section. I'm often guessing, so don't be suprised if I've missed something (please do tell me about it, though).

Documentation

If you can't find the answers to your questions in this document or in the documentation listed below, you might try searching through the linux-net mailing list and linux-kernel mailing list archives and/or asking the habitants of those lists.

from the Distribution

In the iproute2/ directory you'll find the README file, which contains build instructions; the RELNOTES file, and the README.iproute2+tc file, which discusses the tc Traffic Control tools. The README.decnet file contains some points relevant to DECnet. There's a README.last file in the iproute2/tc/ directory that discusses the clock source for the QoS code; it's a bit dated but it explains what the options are.

In the iproute2/doc/ directory are several documents; the IP Command Reference (HTML, TeX, and Postscript); Tunnels over IP in Linux-2.2 (HTML, TeX, and Postscript); and IPv6 Flow Labels in Linux-2.2 (HTML, TeX, and Postscript).

from the Kernel

The kernel's Documentation/networking/ has a couple of files; the routing.txt file, and the policy-routing.txt file. These are slim and outdated, but they introduce a couple of important concepts. I've got a partial kernel Configure.help file, mostly networking options, that should be a little easier to read than via make config.

from everywhere else

Timur A. Bolokhov has started an "Advanced routing mini-HOWTO" (HTML,TeX and Postscript), which ain't bad at all. The date on those files from the original site (ftp://post.tepkom.ru/pub/vol2/Linux/docs/) was Sep 29 1998; they might be a little dated, but not so much so as the kernel docs.

Saravanan Radhakrishan has put together a fine collection of links and documentation at the IITC QoS site, dealing with the implementation details of the QoS system and how to attack it from C via netlink. I haven't read it all yet but it should be helpful. Specifically see the Linux-Qos-HOWTO (local copy) and the Netlink HOWTO (local copy).

The README.iproute2+tc file mentions "Terminology and advices about setting CBQ parameters may be found in Sally Floyd papers." Some of those can be found at her site. She 's also got pages listing resources for CBQ, and RED. Programmers looking to interface with the kernel side of the new networking capabilities may find some joy in Andi Kleen's new network man pages.

Juergen Jaeschke pointed me to the Differentiated Services on Linux Internet Draft, which contains some good discussion of the iproute tc traffic control tools.

Werner Almesberger has posted the slides he used for his presentations at the Singapore Linux conference '99 and the Linux Expo '99; these are PostScript files. I've got copies of both, Linux Network Traffic Control Implementation Overview slides from Linux Expo, and the QoS Networking with Linux slides from the Singapore conference. He also has an excellent Linux Traffic Control - Implementation Overview paper on his site (which I've copied).

I've collected some mail list messages that discuss the usage of these tools. One of the mail list messages responding to a request for information on the traffic control toys said "see Cisco Docs"; the IOS v12 Configuration Guides and References and most specifically the Quality of Service Solutions Configuration Guide are likely to be helpful.

Open Resource has an article on linux 2.2, which includes a some discussion of the new networking.

Command Syntax

`ip` command

From file iproute2/ip/ip.c

Usage: ip [ OPTIONS ] OBJECT { COMMAND | help } where OBJECT := { link | addr | route | rule | neigh | tunnel | maddr | mroute | monitor } OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] | -f[amily] { inet | inet6 | dnet | link } | -o[neline] }

See the IP Command Reference.

If it's called as "ipaddr", "iproute", "iprule", "ipneigh", "iplink", "iptunnel", or "ipmonitor" it runs the appropriate subcommand. So you can create a link to "iproute", and use:

# iproute (whatever)

instead of:

# ip route (whatever)

ip link

From file iproute2/ip/iplink.c

Usage: ip link set DEVICE { up | down | arp { on | off } | dynamic { on | off } | multicast { on | off } | txqueuelen PACKETS | name NEWNAME | address LLADDR | broadcast LLADDR | mtu MTU } ip link show [ DEVICE ]

See the IP Command Reference.

ip address

From file iproute2/ip/ipaddress.c

Usage: ip addr {add|del} IFADDR dev STRING ip addr {show|flush} [ dev STRING ] [ scope SCOPE-ID ] [ to PREFIX ] [ FLAG-LIST ] [ label PATTERN ] IFADDR := PREFIX | ADDR peer PREFIX [ broadcast ADDR ] [ anycast ADDR ] [ label STRING ] [ scope SCOPE-ID ] SCOPE-ID := [ host | link | global | NUMBER ] FLAG-LIST := [ FLAG-LIST ] FLAG FLAG := [ permanent | dynamic | secondary | primary | tentative | deprecated ]

See the IP Command Reference.

ip neighbor

From file iproute2/ip/ipneigh.c

Usage: ip neigh { add | del | change | replace } { ADDR [ lladdr LLADDR ] [ nud { permanent | noarp | stale | reachable } ] | proxy ADDR } [ dev DEV ] ip neigh {show|flush} [ to PREFIX ] [ dev DEV ] [ nud STATE ]

See the IP Command Reference.

ip rule

Kernel Config: IP: advanced router, IP: policy routing, IP: use TOS value as routing key, IP: use FWMARK value as routing key, IP: fast network address translation

See the IP Command Reference.

ip route

Kernel Config: IP: advanced router, IP: policy routing, IP: equal cost multipath, IP: large routing tables, IP: fast network address translation

From file iproute2/ip/iproute.c

Usage: ip route { list | flush } SELECTOR ip route get ADDRESS [ from ADDRESS iif STRING ] [ oif STRING ] [ tos TOS ] ip route { add | del | replace | change | append | replace | monitor } ROUTE SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ] [ table TABLE_ID ] [ proto RTPROTO ] [ type TYPE ] [ scope SCOPE ] ROUTE := NODE_SPEC [ INFO_SPEC ] NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ] [ table TABLE_ID ] [ proto RTPROTO ] [ scope SCOPE ] [ metric METRIC ] INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]... NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ] [ rtt NUMBER ] [ rttvar NUMBER ] [ window NUMBER] [ cwnd NUMBER ] [ ssthresh REALM ] [ realms REALM ] TYPE := [ unicast | local | broadcast | multicast | throw | unreachable | prohibit | blackhole | nat ] TABLE_ID := [ local | main | default | all | NUMBER ] SCOPE := [ host | link | global | NUMBER ] FLAGS := [ equalize ] NHFLAGS := [ onlink | pervasive ] RTPROTO := [ kernel | boot | static | NUMBER ]

See the IP Command Reference.

ip tunnel

Kernel Config: IP: tunneling, IP: GRE tunnels over IP, IP: broadcast GRE over IP

From file iproute2/ip/iptunnel.c

Usage: ip tunnel { add | change | del | show } [ NAME ] [ mode { ipip | gre | sit } ] [ remote ADDR ] [ local ADDR ] [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ] [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ] Where: NAME := STRING ADDR := { IP_ADDRESS | any } TOS := { NUMBER | inherit } TTL := { 1..255 | inherit } KEY := { DOTTED_QUAD | NUMBER }

See the IP Command Reference, also see Tunnels over IP in Linux-2.2 and my (old) tunnel notes.

ip maddress

From file iproute2/ip/ipmaddr.c

Usage: ip maddr [ add | del ] MULTIADDR dev STRING ip maddr show [ dev STRING ]

See the IP Command Reference.

ip mroute

From file iproute2/ip/ipmroute.c

Usage: ip mroute show [ PREFIX ] [ from PREFIX ] [ iif DEVICE ]

See the IP Command Reference.

ip monitor

Kernel Config: IP: verbose route monitoring

From file iproute2/ip/ipmonitor.c

Usage: ip monitor [ all | LISTofOBJECTS ]

See the IP Command Reference.

`rtmon` command

Kernel Config: IP: verbose route monitoring

From file iproute2/ip/rtmon.c

Usage: rtmon file FILE [ all | LISTofOBJECTS] LISTofOBJECTS := [ link ] [ address ] [ route ]

`tc` command

Kernel Config: QoS support, QoS and/or fair queueing

From file iproute2/tc/tc.c

Usage: tc [ OPTIONS ] OBJECT { COMMAND | help } where OBJECT := { qdisc | class | filter } OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }

Where it's expecting a number for BPS; it understands some suffixes: kbps (*1024), mbps (*1024*1024), kbit (*1024/8), and mbit (*1024*1024/8). If I'm reading the code correctly; "BPS" means Bytes Per Second; if you give a number without a suffix it assumes you want BITS per second (it divides the number you give it by 8). It also understands bps as a suffix.

Where it's expecting a time value, it seems it understands suffixes of s, sec, and secs for seconds, ms, msec, and msecs for milliseconds, and us, usec, and usecs for microseconds.

Where it wants a size parameter, it assumes non-suffixed numbers to be specified in bytes. It also understands suffixes of k and kb to mean kilobytes (*1024), m and mb to mean megabytes (*1024*1024), kbit to mean kilobit (*1024/8), and mbit to mean megabits (*1024*1024/8).

1Mbit == 128Kbps

qdisc

Kernel Config: QoS support, QoS and/or fair queueing

From file iproute2/tc/tc_qdisc.c

Usage: tc qdisc [ add | del | replace | change | get ] dev STRING [ handle QHANDLE ] [ root | ingress | parent CLASSID ] [ estimator INTERVAL TIME_CONSTANT ] [ [ QDISC_KIND ] [ help | OPTIONS ] ] tc qdisc show [ dev STRING ] [ingress] Where: QDISC_KIND := { [p|b]fifo | tbf | prio | cbq | red | etc. } OPTIONS := ... try tc qdisc add <desired QDISC_KIND> help

From file linux/net/sched/sch_api.c

Short review. ------------- This file consists of two interrelated parts: 1. queueing disciplines manager frontend. 2. traffic classes manager frontend. Generally, queueing discipline ("qdisc") is a black box, which is able to enqueue packets and to dequeue them (when device is ready to send something) in order and at times determined by algorithm hidden in it. qdisc's are divided to two categories: - "queues", which have no internal structure visible from outside. - "schedulers", which split all the packets to "traffic classes", using "packet classifiers" (look at cls_api.c) In turn, classes may have child qdiscs (as rule, queues) attached to them etc. etc. etc. The goal of the routines in this file is to translate information supplied by user in the form of handles to more intelligible for kernel form, to make some sanity checks and part of work, which is common to all qdiscs and to provide rtnetlink notifications. All real intelligent work is done inside qdisc modules. Every discipline has two major routines: enqueue and dequeue. ---dequeue dequeue usually returns a skb to send. It is allowed to return NULL, but it does not mean that queue is empty, it just means that discipline does not want to send anything this time. Queue is really empty if q->q.qlen == 0. For complicated disciplines with multiple queues q->q is not real packet queue, but however q->q.qlen must be valid. ---enqueue enqueue returns number of enqueued packets i.e. this number is 1, if packet was enqueued successfully and <1 if something (not necessary THIS packet) was dropped. Auxiliary routines: ---requeue requeues once dequeued packet. It is used for non-standard or just buggy devices, which can defer output even if dev->tbusy=0. ---reset returns qdisc to initial state: purge all buffers, clear all timers, counters (except for statistics) etc. ---init initializes newly created qdisc. ---destroy destroys resources allocated by init and during lifetime of qdisc. ---change changes qdisc parameters.

qdisc [p|b]fifo

From file iproute2/tc/q_fifo.c

Usage: ... [p|b]fifo [ limit NUMBER ]

From file linux/net/sched/sch_fifo.c

/* 1 band FIFO pseudo-"scheduler" */

qdisc tbf

Kernel Config: QoS support, QoS and/or fair queueing, TBF queue, Rate estimator

From file iproute2/tc/

Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ] [ peakrate KBPS ] [ latency TIME ]

From file linux/net/sched/sch_tbf.c

/* Simple Token Bucket Filter. ======================================= SOURCE. ------- None. Description. ------------ A data flow obeys TBF with rate R and depth B, if for any time interval t_i...t_f the number of transmitted bits does not exceed B + R*(t_f-t_i). Packetized version of this definition: The sequence of packets of sizes s_i served at moments t_i obeys TBF, if for any i<=k: s_i+....+s_k <= B + R*(t_k - t_i) Algorithm. ---------- Let N(t_i) be B/R initially and N(t) grow continuously with time as: N(t+delta) = min{B/R, N(t) + delta} If the first packet in queue has length S, it may be transmited only at the time t_* when S/R <= N(t_*), and in this case N(t) jumps: N(t_* + 0) = N(t_* - 0) - S/R. Actually, QoS requires two TBF to be applied to a data stream. One of them controls steady state burst size, another one with rate P (peak rate) and depth M (equal to link MTU) limits bursts at a smaller time scale. It is easy to see that P>R, and B>M. If P is infinity, this double TBF is equivalent to a single one. When TBF works in reshaping mode, latency is estimated as: lat = max ((L-B)/R, (L-M)/P) NOTES. ------ If TBF throttles, it starts a watchdog timer, which will wake it up when it is ready to transmit. Note that the minimal timer resolution is 1/HZ. If no new packets arrive during this period, or if the device is not awaken by EOI for some previous packet, TBF can stop its activity for 1/HZ. This means, that with depth B, the maximal rate is R_crit = B*HZ F.e. for 10Mbit ethernet and HZ=100 the minimal allowed B is ~10Kbytes. Note that the peak rate TBF is much more tough: with MTU 1500 P_crit = 150Kbytes/sec. So, if you need greater peak rates, use alpha with HZ=1000 :-) */

qdisc cbq

Kernel Config: QoS support, QoS and/or fair queueing, CBQ packet scheduler, Rate estimator

From file iproute2/tc/q_cbq.c

Usage: ... cbq bandwidth BPS avpkt BYTES [ mpu BYTES ] [ cell BYTES ] [ ewma LOG ]

From file linux/net/sched/sch_cbq.c

/* Class-Based Queueing (CBQ) algorithm. ======================================= Sources: [1] Sally Floyd and Van Jacobson, "Link-sharing and Resource Management Models for Packet Networks", IEEE/ACM Transactions on Networking, Vol.3, No.4, 1995 [2] Sally Floyd, "Notes on CBQ and Guaranted Service", 1995 [3] Sally Floyd, "Notes on Class-Based Queueing: Setting Parameters", 1996 [4] Sally Floyd and Michael Speer, "Experimental Results for Class-Based Queueing", 1998, not published. ----------------------------------------------------------------------- Algorithm skeleton was taken from NS simulator cbq.cc. If someone wants to check this code against the LBL version, he should take into account that ONLY the skeleton was borrowed, the implementation is different. Particularly: --- The WRR algorithm is different. Our version looks more reasonable (I hope) and works when quanta are allowed to be less than MTU, which is always the case when real time classes have small rates. Note, that the statement of [3] is incomplete, delay may actually be estimated even if class per-round allotment is less than MTU. Namely, if per-round allotment is W*r_i, and r_1+...+r_k = r < 1 delay_i <= ([MTU/(W*r_i)]*W*r + W*r + k*MTU)/B In the worst case we have IntServ estimate with D = W*r+k*MTU and C = MTU*r. The proof (if correct at all) is trivial. --- It seems that cbq-2.0 is not very accurate. At least, I cannot interpret some places, which look like wrong translations from NS. Anyone is advised to find these differences and explain to me, why I am wrong 8). --- Linux has no EOI event, so that we cannot estimate true class idle time. Workaround is to consider the next dequeue event as sign that previous packet is finished. This is wrong because of internal device queueing, but on a permanently loaded link it is true. Moreover, combined with clock integrator, this scheme looks very close to an ideal solution. */
tristate 'CBQ packet scheduler' CONFIG_NET_SCH_CBQ

qdisc red

Kernel Config: QoS support, QoS and/or fair queueing, RED queue

From file iproute2/tc/q_red.c

Usage: ... red limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS probability PROBABILITY bandwidth KBPS

From file linux/net/sched/sch_red.c

/* Random Early Detection (RED) algorithm. ======================================= Source: Sally Floyd and Van Jacobson, "Random Early Detection Gateways for Congestion Avoidance", 1993, IEEE/ACM Transactions on Networking. This file codes a "divisionless" version of RED algorithm as written down in Fig.17 of the paper. Short description. ------------------ When a new packet arrives we calculate the average queue length: avg = (1-W)*avg + W*current_queue_len, W is the filter time constant (choosen as 2^(-Wlog)), it controls the inertia of the algorithm. To allow larger bursts, W should be decreased. if (avg > th_max) -> packet marked (dropped). if (avg < th_min) -> packet passes. if (th_min < avg < th_max) we calculate probability: Pb = max_P * (avg - th_min)/(th_max-th_min) and mark (drop) packet with this probability. Pb changes from 0 (at avg==th_min) to max_P (avg==th_max). max_P should be small (not 1), usually 0.01..0.02 is good value. max_P is chosen as a number, so that max_P/(th_max-th_min) is a negative power of two in order arithmetics to contain only shifts. Parameters, settable by user: ----------------------------- limit - bytes (must be > qth_max + burst) Hard limit on queue length, should be chosen >qth_max to allow packet bursts. This parameter does not affect the algorithms behaviour and can be chosen arbitrarily high (well, less than ram size) Really, this limit will never be reached if RED works correctly. qth_min - bytes (should be < qth_max/2) qth_max - bytes (should be at least 2*qth_min and less limit) Wlog - bits (<32) log(1/W). Plog - bits (<32) Plog is related to max_P by formula: max_P = (qth_max-qth_min)/2^Plog; F.e. if qth_max=128K and qth_min=32K, then Plog=22 corresponds to max_P=0.02 Scell_log Stab Lookup table for log((1-W)^(t/t_ave). NOTES: Upper bound on W. ----------------- If you want to allow bursts of L packets of size S, you should choose W: L + 1 - th_min/S < (1-(1-W)^L)/W th_min/S = 32 th_min/S = 4 log(W) L -1 33 -2 35 -3 39 -4 46 -5 57 -6 75 -7 101 -8 135 -9 190 etc. */

qdisc sfq

From file iproute2/tc/q_sfq.c

Usage: ... sfq [ perturb SECS ] [ quantum BYTES ]

Kernel Config: QoS support, QoS and/or fair queueing, SFQ queue

From file linux/net/sched/

/* Stochastic Fairness Queuing algorithm. ======================================= Source: Paul E. McKenney "Stochastic Fairness Queuing", IEEE INFOCOMM'90 Proceedings, San Francisco, 1990. Paul E. McKenney "Stochastic Fairness Queuing", "Interworking: Research and Experience", v.2, 1991, p.113-131. See also: M. Shreedhar and George Varghese "Efficient Fair Queuing using Deficit Round Robin", Proc. SIGCOMM 95. This is not the thing that is usually called (W)FQ nowadays. It does not use any timestamp mechanism, but instead processes queues in round-robin order. ADVANTAGE: - It is very cheap. Both CPU and memory requirements are minimal. DRAWBACKS: - "Stochastic" -> It is not 100% fair. When hash collisions occur, several flows are considered as one. - "Round-robin" -> It introduces larger delays than virtual clock based schemes, and should not be used for isolating interactive traffic from non-interactive. It means, that this scheduler should be used as leaf of CBQ or P3, which put interactive traffic to higher priority band. We still need true WFQ for top level CSZ, but using WFQ for the best effort traffic is absolutely pointless: SFQ is superior for this purpose. IMPLEMENTATION: This implementation limits maximal queue length to 128; maximal mtu to 2^15-1; number of hash buckets to 1024. The only goal of this restrictions was that all data fit into one 4K page :-). Struct sfq_sched_data is organized in anti-cache manner: all the data for a bucket are scattered over different locations. This is not good, but it allowed me to put it into 4K. It is easy to increase these values, but not in flight. */

qdisc prio

Kernel Config: QoS support, QoS and/or fair queueing, The simplest PRIO pseudo scheduler

From file iproute2/tc/q_prio.c

Usage: ... prio bands NUMBER priomap P1 P2...

From file linux/net/sched/sch_prio.c

* net/sched/sch_prio.c Simple 3-band priority "scheduler".

qdisc csz

Kernel Config: QoS support, QoS and/or fair queueing, CSZ packet scheduler

From file iproute2/tc/q_csz.c

NOT IMPLEMENTED

From file linux/net/sched/sch_csz.c

/* Clark-Shenker-Zhang algorithm. ======================================= SOURCE. David D. Clark, Scott Shenker and Lixia Zhang "Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism". CBQ presents a flexible universal algorithm for packet scheduling, but it has pretty poor delay characteristics. Round-robin scheduling and link-sharing goals apparently contradict minimization of network delay and jitter. Moreover, correct handling of predictive flows seems to be impossible in CBQ. CSZ presents a more precise but less flexible and less efficient approach. As I understand it, the main idea is to create WFQ flows for each guaranteed service and to allocate the rest of bandwith to dummy flow-0. Flow-0 comprises the predictive services and the best effort traffic; it is handled by a priority scheduler with the highest priority band allocated for predictive services, and the rest --- to the best effort packets. Note that in CSZ flows are NOT limited to their bandwidth. It is supposed that the flow passed admission control at the edge of the QoS network and it doesn't need further shaping. Any attempt to improve the flow or to shape it to a token bucket at intermediate hops will introduce undesired delays and raise jitter. At the moment CSZ is the only scheduler that provides true guaranteed service. Another schemes (including CBQ) do not provide guaranteed delay and randomize jitter. There is a proof (Sally Floyd), that delay can be estimated by a IntServ compliant formula. This result is true formally, but it is wrong in principle. It takes into account only round-robin delays, ignoring delays introduced by link sharing i.e. overlimiting. Note that temporary overlimits are inevitable because real links are not ideal, and the real algorithm must take this into account. ALGORITHM. --- Notations. $B$ is link bandwidth (bits/sec). $I$ is set of all flows, including flow $0$. Every flow $a \in I$ has associated bandwidth slice $r_a < 1$ and $\sum_{a \in I} r_a = 1$. --- Flow model. Let $m_a$ is the number of backlogged bits in flow $a$. The flow is {\em active}, if $m_a > 0$. This number is a discontinuous function of time; when a packet $i$ arrives: \[ m_a(t_i+0) - m_a(t_i-0) = L^i, \] where $L^i$ is the length of the arrived packet. The flow queue is drained continuously until $m_a == 0$: \[ {d m_a \over dt} = - { B r_a \over \sum_{b \in A} r_b}. \] I.e. flow rates are their allocated rates proportionally scaled to take all available link bandwidth. Apparently, it is not the only possible policy. F.e. CBQ classes without borrowing would be modelled by: \[ {d m_a \over dt} = - B r_a . \] More complicated hierarchical bandwidth allocation policies are possible, but unfortunately, the basic flow equations have a simple solution only for proportional scaling. --- Departure times. We calculate the time until the last bit of packet is sent: \[ E_a^i(t) = { m_a(t_i) - \delta_a(t) \over r_a }, \] where $\delta_a(t)$ is number of bits drained since $t_i$. We have to evaluate $E_a^i$ for all queued packets, then find the packet with minimal $E_a^i$ and send it. This sounds good, but direct implementation of the algorithm is absolutely infeasible. Luckily, if flow rates are scaled proportionally, the equations have a simple solution. The differential equation for $E_a^i$ is \[ {d E_a^i (t) \over dt } = - { d \delta_a(t) \over dt} { 1 \over r_a} = { B \over \sum_{b \in A} r_b} \] with initial condition \[ E_a^i (t_i) = { m_a(t_i) \over r_a } . \] Let's introduce an auxiliary function $R(t)$: --- Round number. Consider the following model: we rotate over active flows, sending $r_a B$ bits from every flow, so that we send $B \sum_{a \in A} r_a$ bits per round, that takes $\sum_{a \in A} r_a$ seconds. Hence, $R(t)$ (round number) is a monotonically increasing linear function of time when $A$ is not changed \[ { d R(t) \over dt } = { 1 \over \sum_{a \in A} r_a } \] and it is continuous when $A$ changes. The central observation is that the quantity $F_a^i = R(t) + E_a^i(t)/B$ does not depend on time at all! $R(t)$ does not depend on flow, so that $F_a^i$ can be calculated only once on packet arrival, and we need not recalculate $E$ numbers and resorting queues. The number $F_a^i$ is called finish number of the packet. It is just the value of $R(t)$ when the last bit of packet is sent out. Maximal finish number on flow is called finish number of flow and minimal one is "start number of flow". Apparently, flow is active if and only if $F_a \leq R$. When a packet of length $L_i$ bit arrives to flow $a$ at time $t_i$, we calculate $F_a^i$ as: If flow was inactive ($F_a < R$): $F_a^i = R(t) + {L_i \over B r_a}$ otherwise $F_a^i = F_a + {L_i \over B r_a}$ These equations complete the algorithm specification. It looks pretty hairy, but there is a simple procedure for solving these equations. See procedure csz_update(), that is a generalization of the algorithm from S. Keshav's thesis Chapter 3 "Efficient Implementation of Fair Queeing". NOTES. * We implement only the simplest variant of CSZ, when flow-0 is a explicit 4band priority fifo. This is bad, but we need a "peek" operation in addition to "dequeue" to implement complete CSZ. I do not want to do that, unless it is absolutely necessary. * A primitive support for token bucket filtering presents itself too. It directly contradicts CSZ, but even though the Internet is on the globe ... :-) "the edges of the network" really exist. BUGS. * Fixed point arithmetic is overcomplicated, suboptimal and even wrong. Check it later. */ /* This number is arbitrary */

class

Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API

From file iproute2/tc/tc_class.c

Usage: tc class [ add | del | change | get ] dev STRING [ classid CLASSID ] [ root | parent CLASSID ] [ [ QDISC_KIND ] [ help | OPTIONS ] ] tc class show [ dev STRING ] [ root | parent CLASSID ] Where: QDISC_KIND := { prio | cbq | etc. } OPTIONS := ... try tc class add <desired QDISC_KIND> help

class cbq

From file iproute2/tc/q_cbq.c

Usage: ... cbq bandwidth BPS rate BPS maxburst PKTS [ avpkt BYTES ] [ minburst PKTS ] [ bounded ] [ isolated ] [ allot BYTES ] [ mpu BYTES ] [ weight RATE ] [ prio NUMBER ] [ cell BYTES ] [ ewma LOG ] [ estimator INTERVAL TIME_CONSTANT ] [ split CLASSID ] [ defmap MASK/CHANGE ]

estimator

Kernel Config: QoS support, QoS and/or fair queueing, Rate estimator

From file iproute2/tc/m_estimator.c

Usage: ... estimator INTERVAL TIME-CONST INTERVAL is interval between measurements TIME-CONST is averaging time constant Example: ... est 1sec 8sec\

From file linux/net/sched/estimator.c

This code is NOT intended to be used for statistics collection, its purpose is to provide a base for statistical multiplexing for controlled load service. If you need only statistics, run a user level daemon which periodically reads byte counters. Unfortunately, rate estimation is not a very easy task. F.e. I did not find a simple way to estimate the current peak rate and even failed to formulate the problem 8)8) So I preferred not to built an estimator into the scheduler, but run this task separately. Ideally, it should be kernel thread(s), but for now it runs from timers, which puts apparent top bounds on the number of rated flows, has minimal overhead on small, but is enough to handle controlled load service, sets of aggregates. We measure rate over A=(1<<interval) seconds and evaluate EWMA: avrate = avrate*(1-W) + rate*W where W is chosen as negative power of 2: W = 2^(-ewma_log) The resulting time constant is: T = A/(-ln(1-W)) NOTES. * The stored value for avbps is scaled by 2^5, so that maximal rate is ~1Gbit, avpps is scaled by 2^10. * Minimal interval is HZ/4=250msec (it is the greatest common divisor for HZ=100 and HZ=1024 8)), maximal interval is (HZ/4)*2^EST_MAX_INTERVAL = 8sec. Shorter intervals are too expensive, longer ones can be implemented at user level painlessly.

filter

Kernel Config: QoS support, QoS and/or fair queueing, Packet classifier API

From file iproute2/tc/tc_filter.c

Usage: tc filter [ add | del | change | get ] dev STRING [ pref PRIO ] [ protocol PROTO ] [ estimator INTERVAL TIME_CONSTANT ] [ root | classid CLASSID ] [ handle FILTERID ] [ [ FILTER_TYPE ] [ help | OPTIONS ] ] tc filter show [ dev STRING ] [ root | parent CLASSID ] Where: FILTER_TYPE := { rsvp | u32 | fw | route | etc. } FILTERID := ... format depends on classifier, see there OPTIONS := ... try tc filter add <desired FILTER_KIND> help

filter rsvp

From file iproute2/tc/f_rsvp.c

Usage: ... rsvp ipproto PROTOCOL session DST[/PORT | GPI ] [ sender SRC[/PORT | GPI ] [ classid CLASSID ] [ police POLICE_SPEC ] [ tunnelid ID ] [ tunnel ID skip NUMBER ] Where: GPI := { flowlabel NUMBER | spi/ah SPI | spi/esp SPI | u{8|16|32} NUMBER mask MASK at OFFSET} POLICE_SPEC := ... look at TBF FILTERID := X:Y

From file linux/net/sched/cls_rsvp.h

/* Comparing to general packet classification problem, RSVP needs only sevaral relatively simple rules: * (dst, protocol) are always specified, so that we are able to hash them. * src may be exact, or may be wildcard, so that we can keep a hash table plus one wildcard entry. * source port (or flow label) is important only if src is given. IMPLEMENTATION. We use a two level hash table: The top level is keyed by destination address and protocol ID, every bucket contains a list of "rsvp sessions", identified by destination address, protocol and DPI(="Destination Port ID"): triple (key, mask, offset). Every bucket has a smaller hash table keyed by source address (cf. RSVP flowspec) and one wildcard entry for wildcard reservations. Every bucket is again a list of "RSVP flows", selected by source address and SPI(="Source Port ID" here rather than "security parameter index"): triple (key, mask, offset). NOTE 1. All the packets with IPv6 extension headers (but AH and ESP) and all fragmented packets go to the best-effort traffic class. NOTE 2. Two "port id"'s seems to be redundant, rfc2207 requires only one "Generalized Port Identifier". So that for classic ah, esp (and udp,tcp) both *pi should coincide or one of them should be wildcard. At first sight, this redundancy is just a waste of CPU resources. But DPI and SPI add the possibility to assign different priorities to GPIs. Look also at note 4 about tunnels below. NOTE 3. One complication is the case of tunneled packets. We implement it as following: if the first lookup matches a special session with "tunnelhdr" value not zero, flowid doesn't contain the true flow ID, but the tunnel ID (1...255). In this case, we pull tunnelhdr bytes and restart lookup with tunnel ID added to the list of keys. Simple and stupid 8)8) It's enough for PIMREG and IPIP. NOTE 4. Two GPIs make it possible to parse even GRE packets. F.e. DPI can select ETH_P_IP (and necessary flags to make tunnelhdr correct) in GRE protocol field and SPI matches GRE key. Is it not nice? 8)8) Well, as result, despite its simplicity, we get a pretty powerful classification engine. */

filter u32

From file iproute2/tc/f_u32.c

Usage: ... u32 [ match SELECTOR ... ] [ link HTID ] [ classid CLASSID ] [ police POLICE_SPEC ] [ offset OFFSET_SPEC ] [ ht HTID ] [ hashkey HASHKEY_SPEC ] [ sample SAMPLE ] or u32 divisor DIVISOR Where: SELECTOR := SAMPLE SAMPLE ... SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} } SAMPLE_ARGS FILTERID := X:Y:Z

From file linux/net/sched/cls_u32.c

* The filters are packed to hash tables of key nodes * with a set of 32bit key/mask pairs at every node. * Nodes reference next level hash tables etc. * * This scheme is the best universal classifier I managed to * invent; it is not super-fast, but it is not slow (provided you * program it correctly), and general enough. And its relative * speed grows as the number of rules becomes larger. * * It seems that it represents the best middle point between * speed and manageability both by human and by machine. * * It is especially useful for link sharing combined with QoS; * pure RSVP doesn't need such a general approach and can use * much simpler (and faster) schemes, sort of cls_rsvp.c.

Faster than ipchains, according to Werner Almesburger (as reported in Rusty's Diary, May 22 1999).

filter fw

From file iproute2/tc/f_fw.c

Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ] POLICE_SPEC := ... look at TBF CLASSID := X:Y

From file linux/net/sched/cls_fw.c

* net/sched/cls_fw.c Classifier mapping ipchains' fwmark to traffic class.

filter route

From file iproute2/tc/f_route.c

Usage: ... route [ from REALM | fromif TAG ] [ to REALM ] [ flowid CLASSID ] [ police POLICE_SPEC ] POLICE_SPEC := ... look at TBF CLASSID := X:Y\n

From file linux/net/sched/cls_route.c

1. For now we assume that route tags < 256. It allows to use direct table lookups, instead of hash tables. 2. For now we assume that "from TAG" and "fromdev DEV" statements are mutually exclusive. 3. "to TAG from ANY" has higher priority, than "to ANY from XXX"

police

From file iproute2/tc/m_police.c

Usage: ... police rate BPS burst BYTES[/BYTES] [ mtu BYTES[/BYTES] ] [ peakrate BPS ] [ avrate BPS ] [ ACTION ] Where: ACTION := reclassify | drop | continue

iproute2+tc notes

Contents