WAN QoS Architecture

xtecharc | Route and Switch, Security

As promised, this week I will go over QOS in the WAN. I will cover WANs where the carrier provides QOS and those where it does not. I am not going to rehash the QOS career certification guide (although it is an excellent book). Instead I will go over what is useful in deployments today, drawing on that book, more recent capabilities, and my own experience.

 

Just as a quick refresher, why do you need QOS? The simplest example is voice. If you have a 10Mbps pipe and you are filling all of it with backup traffic from SAN to SAN, your voice packets may be dropped when you try to make a call over that pipe. This causes your voice to break up on the receiver side, which is a really bad experience all around. In a scenario like this, video calls may show no video at all, or completely pixelated video. This is unacceptable in today’s communication world, and we need QOS to tell the network that voice and video packets are more important than email traffic.

 

What do carriers provide?

 

Be careful when you talk to carriers. When asked about their QOS capabilities they will say “we respect your markings.” What does that mean? The DSCP value in the ToS byte is a mutable field in the packet, meaning it can be overwritten at any hop along the way. When a carrier says they “respect your markings” they are simply saying that they won’t overwrite them. They don’t care that they are there, they won’t do anything different because of them, and they may still drop a packet labeled EF (Expedited Forwarding), but they won’t change your marking. Thanks carrier, does me a lot of good :)

 

Many MPLS carriers will provide you various Classes of Service; some call it a Committed Access Rate (CAR). They may give you options on how this is split up. One option might be 50% EF, 15% AF41, 10% AF31, and the remainder as DSCP 0 (best effort); other options use other percentages and other markings. Some carriers have 4 classes, like the one above, and others may have fewer or more. Typically you will not see more than 8, because the MPLS EXP field, which carries the QOS markings, is only 3 bits (2^3 = 8).
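To make the 3-bit limit concrete, here is a minimal Python sketch (an illustration, not carrier configuration). It assumes the commonly used default of copying the top three bits of the DSCP (the old IP precedence) into the EXP field; your carrier may map markings differently, so treat the function name and the mapping as assumptions.

```python
# Standard DSCP code points for the markings discussed above.
DSCP = {"ef": 46, "af41": 34, "af31": 26, "default": 0}


def dscp_to_exp(dscp: int) -> int:
    """Return the top three bits of a six-bit DSCP value.

    Assumption: the carrier copies IP precedence into the 3-bit EXP field.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a six-bit value (0-63)")
    return dscp >> 3


# A 3-bit field can only distinguish 2**3 = 8 classes.
MAX_EXP_CLASSES = 2 ** 3

exp_by_marking = {name: dscp_to_exp(value) for name, value in DSCP.items()}
print(exp_by_marking)  # {'ef': 5, 'af41': 4, 'af31': 3, 'default': 0}
```

Under that assumption the four markings in the example plan land in four distinct EXP values, which is why a 4-class offering fits comfortably inside the 8 available.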

 

You will typically want the CAR on your edge devices in the outbound direction (towards the carrier) to match their CAR towards you, so you can be sure you are in line with what they are doing. Their class of service mostly comes into play as traffic exits their network towards yours. They will make sure not to overrun your inbound circuit with traffic marked as best effort when voice packets need to come through. It is also used in their own internal traffic engineering so that your low-latency traffic gets the best treatment possible.

 

To explain what this means, I will start with the AF numbers. As you can read in the QOS book, an AF (Assured Forwarding) number is a DSCP value that can be used to determine how packets are handled through the network. Certain values are more commonly used: AF31 is typically voice/video signaling, AF41 is typically video, and EF is typically voice packets, but these are just typical markings. As discussed, you can change them and do with them as you please; there is nothing magical about them.

 

I am not going to dive into marking at the access layer, but you should ensure your packets are marked as close to the edge as possible and that those values are carried throughout the network.

 

The following configurations are good examples for any system that uses the Modular QoS CLI (MQC). Routers do, and more and more switches are beginning to. MLS QOS configurations (used on many layer 2 and layer 3 switches) will not be covered. MQC is a far more robust way of configuring QOS, and it gives you a lot more control over bandwidth. I will base my examples on routers. I recommend using routers whenever possible at the edge of your WAN, as they provide a great deal more options in terms of connectivity type and are more appropriate for routing redistribution and the like. As I said, more and more switches are becoming capable of using MQC, which is great, but typically only switches meant as cores have the memory set aside for large routing tables. I also like to separate my WAN from my core functionality, so a router makes a lot of sense in most scenarios.

 

Here is a common configuration that would match the above policy:

 

! Create an access-list which defines network management protocols (SSH and Telnet)
ip access-list extended NETWORK_CONTROL_ACL
 permit tcp any any eq 22
 permit tcp any any eq telnet
 permit tcp any eq 22 any
 permit tcp any eq 23 any
! Match any packets which come in with EF OR CS5 markings
class-map match-any VOICE_CM
 match ip dscp ef
 match ip dscp cs5
! Match any packets which come in with AF41 OR CS4 markings
class-map match-any VIDEO_CM
 match ip dscp af41
 match ip dscp cs4
! Match any packets which come in with AF31 OR CS3 markings
class-map match-any CALL_CONTROL_CM
 match ip dscp af31
 match ip dscp cs3
! Match any packets which match the access-list NETWORK_CONTROL_ACL
class-map match-any CONTROL_CM
 match access-group name NETWORK_CONTROL_ACL
! Queue each class and re-mark it to the DSCP value the carrier expects
policy-map VOICE_FIRST_PM
 class VOICE_CM
  priority percent 50
  set dscp ef
 class VIDEO_CM
  bandwidth percent 15
  set dscp af41
 class CALL_CONTROL_CM
  bandwidth percent 8
  set dscp af31
 class CONTROL_CM
  bandwidth percent 2
  set dscp af31
 class class-default
  fair-queue
  random-detect
interface FastEthernet0/0
 description ISP CONNECTION
 ! Traffic going in the outbound direction towards the ISP
 service-policy output VOICE_FIRST_PM

 

 

So what does this do? The class maps match the traffic, which is pretty straightforward. The policy map uses the class maps and defines the policy they follow. In this case VIDEO_CM is guaranteed 15% of the bandwidth, and CALL_CONTROL_CM and CONTROL_CM together are guaranteed 10% (8% and 2%). VOICE_CM is set to 50% but uses a different command: “priority”. No class map was defined for class-default; it is a built-in system class for everything else. What does this effectively do? If you have a 100Mbps interface you can use all of it for surfing the internet (class-default), but if call control packets come in, I am guaranteeing that at least 8% will be available for them. Beyond that 8% they are best effort. It also means that if one queue is not using its full percentage, other traffic can use that space.
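As a quick sanity check on those percentages, the sketch below works out what each class is guaranteed on a fully congested 100Mbps interface. The percentages come from VOICE_FIRST_PM; the function and variable names are made up for illustration, and the model ignores everything except the configured percentages.

```python
# Percentages taken from the VOICE_FIRST_PM policy map.
POLICY_PCT = {
    "VOICE_CM": 50,         # priority percent 50
    "VIDEO_CM": 15,         # bandwidth percent 15
    "CALL_CONTROL_CM": 8,   # bandwidth percent 8
    "CONTROL_CM": 2,        # bandwidth percent 2
}


def class_rates_kbps(link_kbps: int, policy_pct: dict = POLICY_PCT) -> dict:
    """Per-class rate in kbps when every class is busy at the same time."""
    rates = {name: link_kbps * pct // 100 for name, pct in policy_pct.items()}
    # Whatever is not reserved is left for class-default under full load.
    rates["class-default"] = link_kbps - sum(rates.values())
    return rates


rates = class_rates_kbps(100_000)  # 100Mbps expressed in kbps
for name, kbps in rates.items():
    print(f"{name:16} {kbps / 1000:6.1f} Mbps")
```

These are floors for the bandwidth classes, not ceilings: when a class is idle, its share is available to the others.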

 

How is the “priority” statement different? You can have up to two classes which you define with a priority statement, and you may want to do this for voice and video. But “priority” doesn’t just mean these queues are serviced first whenever they have packets; the command also limits the amount of traffic they can use. If I have a 100M circuit and a 50% priority queue for voice, and my voice starts to take 51Mbps while the interface is congested, the additional 1Mbps will be dropped. The priority command has a built-in policer, and you cannot exceed it. This is VERY important, as your QOS policy can actually be the cause of dropped voice packets.
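The 51Mbps example can be written as a few lines of arithmetic. This is a rate-level sketch only: it ignores burst sizes, and the `congested` flag models the assumption that the built-in policer only bites when the interface is actually congested. The function name is hypothetical.

```python
def priority_queue_result(offered_kbps: int, link_kbps: int,
                          priority_pct: int, congested: bool = True):
    """Return (sent_kbps, dropped_kbps) for traffic in a priority class.

    Assumption: the implicit policer enforces the configured percentage
    only while the interface is congested.
    """
    limit_kbps = link_kbps * priority_pct // 100
    if not congested or offered_kbps <= limit_kbps:
        return offered_kbps, 0
    return limit_kbps, offered_kbps - limit_kbps


# 51Mbps of voice offered to a 50% priority queue on a 100Mbps circuit.
sent, dropped = priority_queue_result(51_000, 100_000, 50)
print(sent, dropped)  # 50000 1000
```

The practical takeaway is to size the priority percentage from your worst-case call volume, since anything above it is at risk exactly when the link is busiest.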

 

With that out of the way, what do you need to do in modern networks? It looks pretty simple: I set up my classes and my policy maps, and apply them. Well, it’s not that simple, and this is where the mistakes are made. Let’s start off with the simple scenario, where the carrier has a CAR.

 

In this scenario the mistake comes when you have a 1G interface and the carrier is handing you a 200Mbps circuit or some other odd number.

 

Using the configurations from above.

 

interface gi0/0
 description 200Mbps WAN Interface
 ! 200Mbps written in kilobits per second
 bandwidth 200000
 service-policy output VOICE_FIRST_PM

 

What is the problem here? Let’s go through it. EF traffic gets priority treatment at 100Mbps (50% of the bandwidth set on the interface), AF41 traffic gets 30Mbps (15% of bandwidth), and AF31 traffic gets 20Mbps (10% across the two classes marked AF31). Class-default uses the rest. All looks good, right? This is the common misunderstanding. The class-default traffic uses the rest of what? The bandwidth command is just a reference for items such as QOS, EIGRP, etc. The traffic exiting the interface doesn’t give a darn what that command is set to; it will be put out on the wire at 1Gbps. What does this mean? Your class-default can run at 1000 - 100 - 30 - 20 = 850Mbps. The router never sees congestion at 200Mbps, so my SAN traffic can still push far more than the circuit can carry, and whatever the carrier drops is dropped without regard to my policy. Guess my SAN traffic will still overrun my voice, as it doesn’t know better. How do you resolve this?

 

You have to nest your policies and shape them. Add this to the configurations:

 

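policy-map SHAPE_OUT_PM
 class class-default
  ! Shape everything to 100% of the interface bandwidth statement
  shape average percent 100
  ! Nest the queuing policy inside the shaper
  service-policy VOICE_FIRST_PM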

and change your interface to this

interface gi0/0
 service-policy output SHAPE_OUT_PM

 

What happened? The shape average command references the bandwidth set on the interface, so the whole interface is now held to 200Mbps and my best effort traffic is restricted too. Problem fixed.
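Here is the same before-and-after arithmetic as a small Python sketch. It assumes a 1Gbps physical interface, a bandwidth statement of 200Mbps, and the four class percentages from VOICE_FIRST_PM; the function name is invented for illustration.

```python
# Reserved percentages from VOICE_FIRST_PM (voice, video, call control, control).
RESERVED_PCT = [50, 15, 8, 2]


def class_default_ceiling_mbps(wire_mbps: int, bandwidth_stmt_mbps: int,
                               shaped: bool) -> int:
    """How much class-default can send while the other classes are full.

    The reserved rates are always computed from the bandwidth statement.
    Without a shaper the egress limit is the physical wire speed; with the
    parent shaper at 100 percent it is the bandwidth statement itself.
    """
    reserved = sum(bandwidth_stmt_mbps * pct // 100 for pct in RESERVED_PCT)
    egress_limit = bandwidth_stmt_mbps if shaped else wire_mbps
    return egress_limit - reserved


unshaped = class_default_ceiling_mbps(1000, 200, shaped=False)
shaped = class_default_ceiling_mbps(1000, 200, shaped=True)
print(unshaped, shaped)  # 850 50
```

With the shaper in place, best effort is squeezed into what is actually left of the 200Mbps circuit rather than what is left of the 1Gbps port.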

 

Now to a more difficult scenario: my MPLS or Metro E provider does not have QOS and I have many sites on the network.

 

In this scenario you need to account for your traffic patterns. Typically only one or two sites will have servers which the entire company accesses. The possibility of all of your sites accessing the servers at once is reasonably high, and you need to size your links at those one or two sites appropriately, but at the same time you will probably maintain some level of oversubscription. 10 sites at 100M each does not mean my head end sites need 1Gbps of bandwidth. There will be site to site traffic such as voice, so not all of this will be coming home. You also need to ensure that your head end site does not overrun your remotes. If the head end has a 500Mbps circuit and it wants to send a large file to a 100Mbps remote site, it can flood the remote pipe while still staying within the 500M QOS policy you put in at the head end using the shaping command. What to do? You can’t have multiple bandwidth statements on the interface.
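The numbers in that paragraph work out as follows. This is just the arithmetic from the text in Python; the helper name is hypothetical.

```python
def oversubscription_ratio(remote_sites_mbps: list, head_end_mbps: int) -> float:
    """Total remote capacity divided by head end capacity."""
    return sum(remote_sites_mbps) / head_end_mbps


remotes = [100] * 10   # 10 remote sites at 100Mbps each
head_end = 500         # head end circuit in Mbps

ratio = oversubscription_ratio(remotes, head_end)
# How far a single head end flow can exceed one remote circuit.
overrun_factor = head_end / remotes[0]

print(f"oversubscription {ratio:.1f}:1, overrun factor {overrun_factor:.1f}x")
```

A 2:1 oversubscription at the head end is a design choice, but a 5x overrun towards a single remote is a problem that only per-site shaping at the head end can prevent.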

 

Hierarchical QOS is the requirement for this scenario; it was released with 12.4.11 and made to work in 12.4.15. With it you can define each site in your QOS strategy and set the bandwidth for the remote site. You can then use the same policy map VOICE_FIRST_PM from above, and the percentages will reference the site’s rate. Before hierarchical QOS came out you had to create a new policy for each site with hard bandwidth numbers set for each queue, including best effort. This was very annoying to keep up with. Below is an example where remote site 1 has IP addresses in 10.2.0.0/16 and a 50M circuit, and site 2 is 10.3.0.0/16 with a 100M circuit.

 

Adding the following configurations:

 

ip access-list extended REMOTE_SITE1_IP_ADDRESSES_ACL
 permit ip any 10.2.0.0 0.0.255.255
ip access-list extended REMOTE_SITE2_IP_ADDRESSES_ACL
 permit ip any 10.3.0.0 0.0.255.255
class-map match-any REMOTE_SITE1_CM
 match access-group name REMOTE_SITE1_IP_ADDRESSES_ACL
class-map match-any REMOTE_SITE2_CM
 match access-group name REMOTE_SITE2_IP_ADDRESSES_ACL
policy-map HIERARCHICAL_PM
 ! Remote site 1: 50M circuit
 ! (bandwidth is entered in kbps, shape average in bits per second)
 class REMOTE_SITE1_CM
  bandwidth 50000
  shape average 50000000
  service-policy VOICE_FIRST_PM
 ! Remote site 2: 100M circuit
 class REMOTE_SITE2_CM
  bandwidth 100000
  shape average 100000000
  service-policy VOICE_FIRST_PM
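To see what "the percentage will reference the site" means in numbers, the sketch below applies the VOICE_FIRST_PM percentages to each site's shaped rate (50M and 100M from the text). It is plain arithmetic with made-up helper names, not a model of the router's scheduler.

```python
# Child policy percentages from VOICE_FIRST_PM.
CHILD_PCT = {"VOICE_CM": 50, "VIDEO_CM": 15, "CALL_CONTROL_CM": 8, "CONTROL_CM": 2}

# Parent shaper rate per remote site, in kbps.
SITE_SHAPE_KBPS = {"REMOTE_SITE1_CM": 50_000, "REMOTE_SITE2_CM": 100_000}


def per_site_rates(sites: dict, child_pct: dict) -> dict:
    """Resolve each child percentage against its own site's shaped rate."""
    return {
        site: {cls: shape_kbps * pct // 100 for cls, pct in child_pct.items()}
        for site, shape_kbps in sites.items()
    }


site_rates = per_site_rates(SITE_SHAPE_KBPS, CHILD_PCT)
for site, classes in site_rates.items():
    print(site, classes)
```

One child policy therefore yields 25Mbps of priority voice towards site 1 and 50Mbps towards site 2, with no per-site hard-coded numbers in the child.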

Shaping each site this way ensures that any one site does not destroy another. Now, I may have many sites trying to send to one site at the same time. For this I must do inbound policing. Unfortunately there is no inbound shaping capability, so when I do policing I am setting aside a portion of my pipe for a specific purpose, and nothing will use it except the traffic allowed. Below is an example for a 3Mbps policy.

 

! ******** INBOUND POLICY 3M
class-map match-any PRIORITY_DATA_IN_CM
 match class-map VOICE_CM
 match class-map CALL_CONTROL_CM
 match class-map CONTROL_CM
policy-map DROP3000_PM
 class PRIORITY_DATA_IN_CM
  police 3000000 conform-action transmit exceed-action transmit violate-action transmit
 class class-default
  police 1536000 conform-action transmit exceed-action drop violate-action drop

 

 

What is going on here? I defined a new class-map that groups the previously defined classes for voice, call control and network control. (I could break this up further so that each one has its own space, but if you aggregate what you believe the maximum would be and use that number, you will be safe.) I then created a policy map that polices PRIORITY_DATA_IN_CM traffic up to the line limit with all actions set to transmit, i.e. not limited. The class-default is limited to about 1.5Mbps, or roughly half the line speed, and anything that exceeds that rate is dropped. Effectively I have taken a 3M line and cut it in half for EVERYTHING except PRIORITY DATA.
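A rate-level sketch of the DROP3000_PM behaviour described above. It deliberately ignores burst sizes and token buckets and only looks at average rates; the function name and the offered loads are invented for the example.

```python
def policed_bps(offered_bps: int, rate_bps: int, exceed_action: str) -> int:
    """Average rate that gets through a single-rate policer.

    With exceed-action transmit nothing is dropped; with exceed-action drop
    the output is capped at the policed rate.
    """
    if exceed_action == "transmit":
        return offered_bps
    return min(offered_bps, rate_bps)


# PRIORITY_DATA_IN_CM: police 3000000, every action is transmit.
priority_out = policed_bps(1_000_000, 3_000_000, "transmit")
# class-default: police 1536000, exceed and violate actions are drop.
default_out = policed_bps(2_500_000, 1_536_000, "drop")

print(priority_out, default_out)  # 1000000 1536000
```

However hard the best effort traffic pushes, it never gets more than 1,536,000 bps inbound, which leaves the rest of the 3M line for priority data.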

 

This should give you a great start to improving your WAN QOS reliability and troubleshooting your voice quality issues.
