IP Traffic Management

The first two ISO network layers are implemented in hardware and in the device driver.  (The more basic your NIC, the more complex the corresponding device driver must be; an advanced NIC does more of the work itself.)  The other layers are implemented in software in the kernel (or in firmware, for routers).  Many protocols are used by an OS; one of these is TCP/IP, the most important, which lives at layers 3 and 4.  So what happens when a packet is received?

The NIC and its driver handle layers 1 and 2.  Some filtering (wrong L2 address, damaged packets) is done here.  (Note that some modern servers use special NPs, or network processors, which handle most of TCP/IP without involving the CPU or the kernel directly.)  The standard kernel networking code knows nothing about the first two ISO-OSI layers except their addresses.

Once the NIC has received the packet it notifies the kernel via an interrupt (IRQ).  The kernel uses the NIC's device driver to transfer the IP packet from the NIC's RAM to the host's main RAM.  The packet is placed at the end of a queue (or buffer), called the rx_ring, which holds the packet until the kernel can process it.  The queue has a maximum size beyond which it cannot grow.

Qu: What happens when this queue gets full, because packets are arriving faster than they can be processed?  Ans: if the queue is full, the packet is dropped.

Once the packet is in the queue, the kernel will process it as soon as possible.  The packet will be examined and parsed into a structure, for easy access to the various header fields.

Similarly for outgoing packets: after the kernel network code (L3) is done, the IP packet is placed in an outgoing queue and the NIC is notified that a packet is available.  When the NIC is ready, the packet is processed by the NIC's device driver (the L2 functions), removed from the queue, and copied into the NIC's RAM.  As you might expect, layer 1 is handled by the NIC.

At some point it occurred to some folks that by changing the order of the packets in the queues and by using smarter code, you could give some types of packets priority over others. And by adding a delay before making an outgoing packet available to a NIC you could implement rate limiting.  Soon the kernel developers were drooling over the possibilities of controlling the flow of packets between the kernel and the NIC.

To allow for traffic management (traffic shaping and traffic control), Linux (and, to some extent, Solaris, but not other *nixes) has an additional layer in the stack between ISO layers 2 and 3, called the queuing layer.

The code implementing this layer controls when an outgoing packet gets sent to the NIC (rate limiting), which NIC is used (when using link aggregation), and which packet is sent to the NIC next (or, for incoming packets, which is passed to the layer 3 code next), among many other tasks.  This is enabled by allowing the SA (system administrator) to create additional queues and to change the functions associated with the service points (add to queue, remove from queue); these functions are called queuing disciplines (or qdiscs).

Qdiscs can examine packets and add a label or tag to them, drop them, or move them to a different queue (which may have a different qdisc, or a similar qdisc with different parameters, attached to it).  This is done by defining packet filters and filtering rules, similar to iptables (but completely independent of it), for the qdiscs to use.  Such tags can also be used by layer 3 code.  Linux allows you to use iptables to mark a packet with a tag, and the kernel routing code can use these tags to select non-standard routing tables (i.e., additional ones you name and define).

Linux uses the tc command to create queues and (queuing layer) packet filters, to assign qdiscs to queues, and to configure them.  To create additional kernel routing tables, and to define the rules for when each table should be used, use the ip command.

The ip command is the modern replacement for the older IP configuration and monitoring commands including ifconfig, route, netstat, etc.  It also provides this newer functionality.  Each of these functions is provided by a separate sub-command, e.g. “ip route” or “ip addr”.
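As a rough cheat sheet (no particular interface is assumed), the older commands map onto ip sub-commands like this:

    # addresses on all interfaces (roughly what "ifconfig -a" showed)
    ip addr show
    # the main routing table (roughly "route -n")
    ip route show
    # per-interface statistics (roughly "netstat -i")
    ip -s link
    # the ARP cache (roughly "arp -n")
    ip neigh show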

There are several types of (network) traffic control:

·        Traffic Shaping controls the rate of transmission of outgoing packets.  Shaping can be used to lower the bandwidth to what is available and to smooth out bursts of traffic.

·        Traffic Scheduling is also called prioritizing and is used to reorder packets (i.e., send them out in a different order than they were generated).  This is used to give priority to interactive traffic including VOIP.

·        Traffic Policing controls the arrival of incoming packets.  There isn’t much you can do with this except tag some packets (so iptables can match on a tag) and possibly reorder them (however the kernel usually processes packets faster than they arrive).

·        Dropping some packets can happen on either ingress or egress.

The tc command can create queues and assign qdiscs to them.  For traffic control you use tc to define filters (similar to iptables) that have matching criteria and an associated action (such as drop, tag with a mark, or move to a different queue).
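As a minimal sketch of that workflow (eth0 and the source address below are placeholders, not part of any real setup), you could attach a qdisc and then a filter that moves matching packets to a lower-priority queue:

    # attach a three-band prio qdisc as the root egress qdisc of eth0
    tc qdisc add dev eth0 root handle 1: prio
    # send traffic from a hypothetical bulk-backup host to the lowest band (1:3)
    tc filter add dev eth0 parent 1: protocol ip prio 5 \
        u32 match ip src 10.0.0.99/32 flowid 1:3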

Linux packet processing steps:

When processing the packets the kernel will:

1.     Manage the handshake with low-level devices (like an Ethernet card or modem), receiving "frames" from them.  The packet is then added to a queue.  The qdisc may process the packet at this point, say by tagging it.

2.     Determine which L3 protocol the packet is, by examining the headers.  Assume it is an IP packet.

3.     Iptables/netfilter is then called, which may drop, modify, or do nothing to the packet.  Note iptables can see tags added to packets by some qdisc.

4.     Build TCP/IP "packets" from "frames".  (Recall that a single IP packet may be split into many frames; the kernel may need to wait until all the frames are received and combine them into a single IP packet before proceeding, or it may not.)  The modifications mentioned in step 3 include NAT processing and packet tagging.

In older Linux kernels there was a setting to control this, /proc/sys/net/ipv4/ip_always_defrag.  To support connection tracking (for the stateful firewall), modern Linux kernels always reassemble IP packets from the fragment frames, and this setting is no longer available.

5.     Apply some rules to decide what to do with the packet.  The kernel may need to complete the initial TCP handshake at this point.  Or the packet may be dropped (with or without an error reply), forwarded to a different interface to be sent to another host, or sent to the right layer 4 code (TCP or UDP) for further processing.

Next the firewall and other kernel filters (e.g., the “rp (reverse path) filter”) examine the packet and decide what to do with it: drop or reject it, mark or tag the packet, modify parts of the packet (e.g., NAT), or some combination of these.  This is where traffic control can be used to delay some packets (rate limiting) or move others to the front of the queue (packet priority).  Note iptables may run twice: once before the routing decision is made and once afterwards.

6.     Pass the data to the right application "socket" (using the port number).

Sending data from an application follows the same steps, only reversed:

1.     Send the data as a UDP datagram, or queue the data into a series of TCP segments ("packets"), each of which is in turn encapsulated in an IP packet.

2.     Process the packets with iptables.  This is also the point where traffic management comes in, delaying the packets, tagging them, and deciding which network interface to send the packet out to.  (Traffic management may use load balancing to determine which gateway to use if your host is multi-homed.)  Note that iptables may run twice on the packets, once before a routing decision is made, and once after the routing decision is made.

3.     Split the IP "packets" into "frames" (e.g., Ethernet or PPP) and append the frames to an outgoing queue associated with the outgoing NIC.

4.     A qdisc may delay some packets or move them to a different queue (e.g., for load balancing across bonded links).  Ultimately they end up in a final transmit queue, waiting for the NIC to become available.

5.     Send the frames using the NIC's driver.
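You can watch these queues at work with tc; the per-qdisc statistics include packets sent, dropped, and over limits (eth0 is just an example device):

    # list every qdisc on eth0 along with its counters
    tc -s qdisc show dev eth0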

Solaris, since version 9, supports "IP Quality of Service", or IPQoS, based on RFC 2475.  Using the ipqosconf command you can define "classes of service" (basically queues) with different traffic-shaping parameters (basically qdiscs).  You then define packet classifiers that cause matching traffic to use one or another defined class of service.

Traffic management (adapted from www.tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.rpdb.multiple-links.html)

Traffic management consists of tagging some packets for special handling, putting the packets onto one of several queues, and using special scheduling/dispatching code, known as queuing disciplines, to manage the packets on each queue.  All this is accomplished with a combination of iptables (to filter and tag the packets) and the tc command (to manage the queues and the various queuing disciplines).  However, the easiest way to get started with traffic management is to use fancy routing table rules.

Policy Routing on Linux

Classic routing algorithms used in the Internet make routing decisions based only on the destination address of packets (and in theory, but not in practice, on the TOS field).  In some circumstances we want to route packets differently depending not only on destination addresses, but also on other packet fields including the source address, IP protocol, transport protocol, ports, or even packet payload.  This task is called policy routing.

The Linux kernel doesn't have a single routing table (I lied).  It has several.  Each routing table contains a set of routes, and a separate set of routing rules determines which table is used for a given packet.  All packets that match the same rule use the same routing table.

Linux can pack routes into several routing tables identified by a number in the range from 1 to 255, or by a name from the file /etc/iproute2/rt_tables (syntax: number name).  The main table (ID 254) is the one used by default if you don't set any policy.  When using the "ip route list" command the main table is listed by default; however, you can add "table <name>" (or "table all") to see the routes in other tables.  Table names are a convenience; internally each table is identified by its number.

Besides main, one other table always exists, the local table (ID 255).   This table consists of routes for local and broadcast addresses.  The kernel maintains this table automatically and the administrator usually should ignore it.

Multiple routing tables are only used when some policy routing has been created.  When making a routing decision for a given packet the kernel must first determine which routing table to use.  Use “ip rule list” to see the various tables defined (and used in at least one rule) and which set(s) of rules to apply.  By using different tables for different types of packets, you can create different routing policies for each.

A routing policy database (or RPDB) selects routes by executing a set of rules that determine which routing table to use.  (Such tables typically have a single default route in them.)  You manage the rules with ip rule {add|del|list|flush}.

The kernel keeps a cache of routes for quick lookup.  After changing the rules you must flush this cache before the new rules take effect consistently.  Use "ip route flush cache".

Each policy routing rule consists of a priority, selector and an action (similar to syslog), also called the type of the rule.  The RPDB is scanned in the order of increasing priority.

The selector of each rule is matched against: the source address, destination address, incoming interface, TOS, and fwmark.  A firewall mark (fwmark) is a tag that can be applied to packets via iptables: you set it with the MARK target and match it with the "mark" match (the related CONNMARK target/match handles per-connection marks).  If the selector matches the packet, the action is performed.

The action is one of: table table-ID (route using the specified table), prohibit (packets are discarded and the ICMP message “communication administratively prohibited” is generated), blackhole (man page says reject but it’s wrong; packets are discarded silently), unreachable (packets are discarded and the ICMP message “host unreachable” is generated), and nat address.  The default action is “table main”.

The action may return success (i.e., the packet was handled).  In that case it either yields a route or a failure indication (the packet gets dropped), and the RPDB lookup terminates.  If the action instead returns failure, the RPDB continues with the next rule.
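For example (the subnet, fwmark value, table number, and priorities here are made up), rules that select a table look like this:

    # packets from this subnet are routed using table 100
    ip rule add from 10.8.0.0/24 table 100 priority 500
    # packets carrying iptables fwmark 6 also use table 100
    ip rule add fwmark 6 table 100 priority 510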

At startup time the kernel configures the default RPDB consisting of three rules:

1.     Priority:  0, Selector: match anything,
Action: lookup routing table local (ID 255).  The local table is a special routing table containing high priority control routes for local and broadcast addresses.  Rule 0 is special. It cannot be deleted or overridden.

2.     Priority:  32766,  Selector: match anything,
Action: lookup routing table main (ID 254).  The main table is the normal routing table containing all non-policy routes.  This rule may be deleted and/or overridden with other ones by the administrator.

3.     Priority: 32767, Selector: match anything,
Action: lookup routing table default (ID 253).  The default table is empty.  It is reserved for some post-processing if no previous default rules selected the packet.  This rule may also be deleted.
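On a host where no policy routing has been configured, "ip rule list" shows exactly these three rules; the output looks something like this:

    0:      from all lookup local
    32766:  from all lookup main
    32767:  from all lookup default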

You can create a new table with a name by adding a line to /etc/iproute2/rt_tables, for example:

    echo "200 cheapCust" >> /etc/iproute2/rt_tables

For example, an ISP might want a certain customer to only use a slower DSL line (maybe they pay less than your other customers), while everyone else uses the faster T3 connection.  You create one routing table for the cheap customer, whose default route points out the DSL link (ppp2 in the example below), while the rest use eth0 via the main table.  Then you add policy rules to say which packets use which routing table.

Now you can add routes to this table this way:

# ip route add default via 195.96.98.253 dev ppp2 \
   table cheapCust

# ip route flush cache

By default this new table isn’t used.  You need to specify a policy with ip rule add to have some packets (that match the rule) use this different routing table.
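For instance (the customer address is made up), a rule like this sends the cheap customer's traffic through the new table:

    # traffic sourced from the cheap customer's address uses table cheapCust
    ip rule add from 10.0.0.42 table cheapCust
    ip route flush cache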

Example: multi-homed routing setup without BGP:

Let us first set some symbolic names.  Let $IF1 be the name of the first interface and $IF2 the name of the second interface.  Then let $IP1 be the IP address associated with $IF1 and $IP2 the IP address associated with $IF2.  Next, let $P1 be the IP address of the gateway at provider 1, and $P2 the IP address of the gateway at provider 2.  Finally, let $P1_NET be the IP network $P1 is in, and $P2_NET the IP network $P2 is in.

One creates two additional routing tables, say T1 and T2.  These are added in /etc/iproute2/rt_tables.  Then you set up routing in these tables as follows:

      ip route add $P1_NET dev $IF1 src $IP1 table T1
      ip route add default via $P1 table T1
      ip route add $P2_NET dev $IF2 src $IP2 table T2
      ip route add default via $P2 table T2

This just builds a route to the gateway and adds a default route via that gateway, as you would do in the case of a single upstream provider, but put the routes in a separate table per provider.  Note that the network route suffices, as it tells you how to find any host in that network, which includes the gateway, as specified above.

Next you set up the main routing table.  It is a good idea to route things to the direct neighbor through the interface connected to that neighbor.  Note the "src" arguments; they make sure the right outgoing IP address is chosen.

        ip route add $P1_NET dev $IF1 src $IP1
        ip route add $P2_NET dev $IF2 src $IP2

Then, pick one ISP to use for the default route (load balancing is shown later):

        ip route add default via $P1

Next, you set up the routing rules.  These actually choose which routing table to route with.  You want to make sure that you route out of a given interface if you already have the corresponding source address:

        ip rule add from $IP1 table T1
        ip rule add from $IP2 table T2

This set of commands makes sure all answers to traffic coming in on a particular interface get answered from the table for that interface.

Now, this is just the very basic setup.  It will work for all processes running on the router itself, and for the local network, if it is masqueraded.  If it is not, then you either have IP space from both providers or you are going to want to masquerade to one of the two providers.  In both cases you will want to add rules selecting which provider to route out from based on the IP address of the machine in the local network.
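A sketch of such rules, assuming a purely hypothetical split of a 192.168.1.0/24 LAN down the middle:

    # lower half of the LAN routes out via provider 1, upper half via provider 2
    ip rule add from 192.168.1.0/25 table T1
    ip rule add from 192.168.1.128/25 table T2
    ip route flush cache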

Load Balancing using routing table rules

The next question is how to balance traffic going out over the two providers.  This is actually not hard if you have already set up split access as above.

Instead of choosing one of the two providers as your default route, you now set up the default route to be a multipath route.  In the default kernel this will balance routes over the two providers.  It is done as follows (once more building on the example in the section on split-access):

   ip route add default scope global \
      nexthop via $P1 dev $IF1 weight 1 \
      nexthop via $P2 dev $IF2 weight 1

This will balance the routes over both providers.  The weight parameters can be tweaked to favor one provider over the other.

Note that balancing will not be perfect, as it is route based, and routes are cached.  This means that routes to often-used sites will always be over the same provider.

Traffic shaping: Queuing

While you can't do much with received packets, you can control how your system sends packets.  Traffic shaping means controlling the flow of packets.  You can drop some, delay some, schedule some to be sent later, send them out in some priority order, or just send 'em as fast as possible.

Earlier we discussed how packets get placed on queues to await processing.  In Linux you can imagine an extra layer, a queuing layer, between ISO layers 2 and 3.  Each queue is managed by some code, known as the queue discipline or line discipline.  A lot of documentation refers to both the queue (of packets) and the code to manage it as a queue discipline, or qdisc for short.  The qdisc allows such functions as adding a packet to a queue, querying the queue ("Is there a packet ready to be sent?", "Is the queue full/empty?"), and removing a packet from the queue.

Queues come in two types: classful and classless.  A classful queue can be thought of as containing sub-queues, each of which may be classful or classless.  A classful queue contains different types or classes of packets, each of which can be processed with different rules.  Each classful qdisc needs to determine to which class it should send a packet; this is done using a classifier, or filter.  iptables can be used to tag or mark packets that match some iptables rules, and the qdisc can read the tag to determine what to do with the packet.  However, these filters are applied using the "tc" command, not "iptables".

Delaying or dropping packets in order to make traffic stay below a configured bandwidth is known as traffic policing.  In Linux, policing can only drop a packet, not delay it (the ingress qdisc itself is not configurable; you can only attach filters and policers to it, and there is only one per interface).  To "shape" incoming traffic that you are not forwarding, use the ingress policer.

The default qdisc on Linux is pfifo_fast.  However, Linux actually comes with several other qdiscs, and you can customize most of them.

The pfifo_fast qdisc queue uses a First In, First Out rule.  However this queue has 3 “bands”.  Within each band, FIFO rules apply.  But as long as there are packets waiting in band 0, band 1 won’t be processed, and if there are band 1 packets waiting, band 2 packets won’t be processed.  The kernel honors the Type of Service flag of packets by putting the packets into the appropriate band. For instance 'minimum delay' packets go into band 0.  A related qdisc is prio.
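You can confirm the three bands and the TOS-to-band mapping (the "priomap") with tc; on an otherwise unconfigured interface (eth0 is assumed) the output looks roughly like this:

    tc qdisc show dev eth0
    # typical output (newer kernels add a few more fields, such as refcnt):
    qdisc pfifo_fast 0: root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1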

The Token Bucket Filter (TBF) is a simple qdisc that only passes packets arriving at a rate not exceeding some administratively set rate, but with the possibility of allowing short bursts in excess of this rate.  TBF is very precise, and network- and processor-friendly.  It should be your first choice if you simply want to slow an interface down!  A simple but useful configuration is this:

tc qdisc add dev ppp0 root tbf rate 220kbit \
   latency 50ms burst 1540

If you have a networking device with a large queue, like a DSL modem or a cable modem, and you talk to it over a fast link, like an Ethernet interface, you will find that uploading absolutely destroys interactivity.  This is because uploading fills the queue in the modem, which is probably huge because that actually helps achieve good upload throughput.  But this is not what you want; you want the queue to stay small so interactivity remains and you can still do other things while sending data.

The line above slows down sending to a rate that does not lead to a queue in the modem — the queue will be in Linux, where we can control it to a limited size.  Change 220kbit to your uplink's actual speed, minus a few percent.  If you have a really fast modem, raise “burst” a bit.

Stochastic Fairness Queuing (SFQ) is a simple implementation of the fair-queuing family of algorithms.  It's less accurate than others, but it requires fewer calculations while being almost perfectly fair, so no single session can take over all the available bandwidth.

The key word in SFQ is conversation (or flow), which mostly corresponds to a TCP session or a UDP stream. Traffic is divided into a pretty large number of FIFO queues, one for each conversation. Traffic is then sent in a round robin fashion, giving each session the chance to send data in turn.
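A minimal SFQ setup (using the same ppp0 uplink as in the TBF example; "perturb 10" reshuffles the flow hashing every 10 seconds so no flow stays unlucky for long):

    tc qdisc add dev ppp0 root sfq perturb 10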

Enqueuing and Dequeuing

Incoming packets get placed on the ingress qdisc, which either sends them to be forwarded, sends them up the TCP/IP stack for processing and eventual delivery to user programs, or simply drops them.  This is traffic policing.  Note that iptables doesn't examine the packet until after it leaves the ingress qdisc.

Outgoing packets get filtered via iptables, routed, filtered again, and then sent to a classifier, which determines which qdisc the packet gets sent to.  At the proper time the packet gets transferred to the NIC for delivery.  By default there is only one egress qdisc installed, pfifo_fast, which always receives the packet.  This is called enqueuing.  The packet then sits in the qdisc, waiting for the kernel to ask for it for transmission over the network interface.  This is called dequeuing.

Classification and Filters (a.k.a. packet classifiers)

When traffic enters a classful qdisc, it needs to be sent to one of the classes within it, which means it needs to be classified.  To determine what to do with a packet, the filters are consulted.  Filters are called by the qdisc code.  The filter(s) attached to that qdisc return a decision, and the qdisc uses it to enqueue the packet into one of the classes.  Each class may in turn consult its own filters to see if further instructions apply; if not, the class enqueues the packet into the qdisc it contains.

Besides containing other qdiscs, most classful qdiscs also perform shaping.  This is useful to perform both packet scheduling and rate control.

The kernel passes a packet to the root of a classful queue.  This qdisc may apply filters to place the packet in a sub-queue.  That qdisc in turn may apply other filters to categorize the packet into a sub-sub-queue.  This process can repeat until the packet is enqueued in its final queue.  When the NIC is ready for a packet, the kernel sends a dequeue command to the root qdisc, which walks the tree until a packet is found to send.  To refer to qdiscs and classes in commands, they are assigned handles, generally written as a pair of numbers separated by a colon: the number to the left identifies the qdisc, and the numbers to the right of the colon identify its classes (the qdisc itself is minor number 0, which is usually omitted).
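For reference, a PRIO qdisc like the "10:" used in the example below could be created this way (eth0 is assumed; PRIO creates three classes, 10:1 through 10:3, by default):

    tc qdisc add dev eth0 root handle 10: prio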

Let's say we have a PRIO qdisc called “10:0” which contains three classes (“10:1”, “10:2”, and “10:3”), and we want to assign all traffic from and to port 22 to the highest priority band, web traffic to the next highest, and all other traffic to the lowest.  The filters would be:

# tc filter add dev eth0 protocol ip parent 10: prio 1 \
     u32 match ip dport 22 0xffff flowid 10:1

# tc filter add dev eth0 protocol ip parent 10: prio 2 \
     u32 match ip sport 80 0xffff flowid 10:2

# tc filter add dev eth0 protocol ip parent 10: prio 3 \
     flowid 10:3

This says: attach to eth0, node 10:, a priority-1, u32-type filter ("u32" is the most common type, and can be used to match on any part of the TCP/IP packet) that matches on IP destination port 22 exactly and sends it to band 10:1.  It then repeats the same for source port 80.  The last command says that anything unmatched so far should go to band 10:3, the lowest priority.

By matching on the source/destination IP address, you can have some destinations/sources have a higher priority than the rest.
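For example (the destination address is arbitrary), to push all traffic headed for one particular host into the top band:

    tc filter add dev eth0 protocol ip parent 10: prio 1 \
        u32 match ip dst 150.151.152.153/32 flowid 10:1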

You can mark packets with iptables and have that mark survive routing across interfaces.  This can be useful to shape traffic on eth1 that came in on eth0.  The mark is called the "fwmark" (firewall mark).  Place a mark like this:

# iptables -A PREROUTING -t mangle -i eth0 -j MARK \
    --set-mark 6

(The number 6 is arbitrary.)

Next filter outgoing packets to eth1 that have mark 6:

# tc filter add dev eth1 protocol ip parent 1:0 prio 1 \
handle 6 fw flowid 1:1

(Note that this is not a u32 match!)  This places all packets with a mark of 6 in the high-priority band.  If you don't want to learn the full tc filter syntax, just use iptables, and only learn to select on fwmark.