I am experiencing low throughput, traffic latency, and performance issues with my AWS Direct Connect connection.
To isolate and diagnose network and application performance issues, complete the following steps:
Note: It's a best practice to set up a dedicated test machine on premises and an Amazon Elastic Compute Cloud (Amazon EC2) instance in an Amazon Virtual Private Cloud (Amazon VPC). Use an EC2 instance of type C5 or larger.
Install and use the iPerf3 tool to benchmark network bandwidth, and cross-check the results with other applications or tools. For more information, see What is iPerf / iPerf3? on the iPerf website.
Amazon Linux installation
$ sudo yum install iperf3 -y
Ubuntu installation
$ sudo apt install iperf3 -y
Amazon EC2 instance (server)
$ iperf3 -s -V
On-premises localhost (client)
$ iperf3 -c <EC2-instance-IP> -P 15 -t 15
$ iperf3 -c <EC2-instance-IP> -P 15 -t 15 -R
$ iperf3 -c <EC2-instance-IP> -w 256K
$ iperf3 -c <EC2-instance-IP> -w 256K -R
$ iperf3 -c <EC2-instance-IP> -u -b 1G -t 15
$ iperf3 -c <EC2-instance-IP> -u -b 1G -t 15 -R
----------------
-P, --parallel n    number of parallel client threads to run. Run multiple threads to achieve the maximum throughput.
-R, --reverse       reverse the direction of the test, so that the EC2 server sends data to the on-premises client. This measures AWS to on-premises throughput.
-u, --udp           use UDP rather than TCP. Because iperf3 doesn't report loss for TCP, UDP tests are helpful to see packet loss along a path.
Example TCP test results:
[ ID] Interval           Transfer     Bitrate         Retr
[SUM]   0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec  18112          sender
[SUM]   0.00-15.00  sec  7.52 GBytes  4.31 Gbits/sec                 receiver
In the preceding example, Transfer is the total amount of data moved, Bitrate is the measured throughput, and Retr is the number of TCP retransmissions.
Example UDP test results:
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-15.00  sec  8.22 GBytes  4.71 Gbits/sec  0.000 ms  0/986756 (0%)        sender
[  5]   0.00-15.00  sec  1.73 GBytes  989 Mbits/sec   0.106 ms  779454/986689 (79%)  receiver
Lost is 0% on the sender side because the sender transmits the maximum number of UDP datagrams. Lost/Total Datagrams on the receiver side shows how many packets were lost and the loss rate. In this example, 79% of the network traffic was lost.
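As a quick check, you can reproduce the receiver-side loss rate from the Lost/Total counters. The following is a minimal sketch with awk, using the counts from the sample output above:

```shell
# Lost/total datagram counts taken from the example UDP results above
lost=779454
total=986689

# Print the loss rate as a percentage
awk -v l="$lost" -v t="$total" 'BEGIN { printf "loss = %.1f%%\n", 100 * l / t }'
# → loss = 79.0%
```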
Note: If the Direct Connect connection uses a virtual private network (VPN) over a public virtual interface (VIF), then run the performance tests without the VPN.
Check the metrics and interface counters
Check the Amazon CloudWatch metrics for the Direct Connect connection, such as ConnectionBpsEgress, ConnectionBpsIngress, ConnectionPpsEgress, ConnectionPpsIngress, and ConnectionErrorCount.
If you use a hosted VIF that shares the total bandwidth with other users, then check with the Direct Connect owner about the connection utilization.
Check the router and firewall at the Direct Connect location for metrics such as CPU and memory usage, interface utilization, and interface errors or drops.
Check the AWS Health Dashboard to make sure that the Direct Connect connection isn't under maintenance.
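You can also pull the connection metrics from the command line. The following is a sketch that assumes a configured AWS CLI and uses a hypothetical connection ID (dxcon-EXAMPLE); adjust the metric name and time window as needed:

```shell
# Hypothetical connection ID; requires AWS CLI credentials with CloudWatch read access
aws cloudwatch get-metric-statistics \
  --namespace AWS/DX \
  --metric-name ConnectionBpsEgress \
  --dimensions Name=ConnectionId,Value=dxcon-EXAMPLE \
  --start-time 2021-10-30T00:00:00Z \
  --end-time 2021-10-30T01:00:00Z \
  --period 300 \
  --statistics Maximum
```

Compare the reported maximum against the provisioned port speed to see whether the connection is saturated.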
Use the Linux MTR command to analyze network performance. On Windows, it's a best practice to turn on Windows Subsystem for Linux (WSL 2) so that you can install MTR on a Linux subsystem, or download WinMTR from the SourceForge website.
Amazon Linux installation
$ sudo yum install mtr -y
Ubuntu installation
$ sudo apt install mtr -y
$ mtr -n -c 100 <destination-IP> --report
$ mtr -n -T -P <port> -c 100 <destination-IP> --report

Run both the ICMP-based and the TCP-based tests in each direction: from the on-premises host to the EC2 instance, and from the EC2 instance to the on-premises host.
Example MTR test results:
#ICMP based MTR results
$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 20:54:39 2021
HOST:                 Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222   0.0%   100    0.7    0.7    0.6    0.9    0.0
  2.|-- ???           100.0   100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2   0.0%   100  266.5  267.4  266.4  321.0    4.8
  4.|-- 10.110.120.1  54.5%   100  357.6  383.0  353.4  423.7   19.6
  5.|-- 192.168.52.10 47.5%   100  359.4  381.3  352.4  427.9   20.6

#TCP based MTR results
$ mtr -n -T -P 80 -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:03:48 2021
HOST:                 Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222   0.0%   100    0.9    0.7    0.7    1.1    0.0
  2.|-- ???           100.0   100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2   0.0%   100  264.1  265.8  263.9  295.3    3.4
  4.|-- 10.110.120.1   8.0%   100  374.3  905.3  354.4  7428.  1210.6
  5.|-- 192.168.52.10  12.0%  100  400.9  1139.  400.4  7624.  1384.3
Each line represents a network device that the data packet passes through on its way from the source to the destination. For more information on how to read MTR test results, see Reading MTR output network diagnostic tool on the ExaVault website.
The preceding example shows a Direct Connect connection with BGP peers 10.110.120.1 and 10.110.120.2. Packet loss is observed at the 4th and 5th hops, which can indicate an issue with the Direct Connect connection or with the remote router 10.110.120.1. Because TCP is prioritized over ICMP on the Direct Connect connection, the TCP MTR results show a lower loss percentage.
The following example shows a local firewall or NAT device with 5% packet loss at the first hop. The packet loss affects all of the subsequent hops, including the destination.
$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:11:22 2021
HOST:                 Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222   5.0%   100    0.8    0.7    0.7    1.1    0.0
  2.|-- ???           100.0   100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2   6.0%   100  265.7  267.1  265.6  307.8    5.1
  4.|-- 10.110.120.1   6.0%   100  265.1  265.2  265.0  265.4    0.0
  5.|-- 192.168.52.10  6.0%   100  266.7  266.6  266.5  267.2    0.0
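For longer reports, you can filter the Loss% column with awk to flag only the hops that show loss. The following is a minimal sketch, using a few sample hop lines modeled on the output above:

```shell
# Sample mtr --report hop lines (values from the example above)
report='  1.|-- 10.0.101.222   0.0%   100    0.7    0.7    0.6    0.9    0.0
  4.|-- 10.110.120.1  54.5%   100  357.6  383.0  353.4  423.7   19.6
  5.|-- 192.168.52.10 47.5%   100  359.4  381.3  352.4  427.9   20.6'

# Print hops whose loss exceeds 1% ($2 is the hop address, $3 the Loss% column)
echo "$report" | awk '$3 + 0 > 1 { print $2, $3 }'
# → 10.110.120.1 54.5%
# → 192.168.52.10 47.5%
```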
Take a packet capture and analyze the results
Take a packet capture on the on-premises localhost and on the EC2 instance. Use the tcpdump or Wireshark utility to capture network traffic for analysis. The following example tcpdump command names the capture file with the timestamp and hostname:
$ tcpdump -i <interface> -s0 -w $(date +"%Y%m%d_%H%M%S").$(hostname -s).pcap port <port>

Use the TCP Throughput Calculator on the Switch website to calculate the network limit, bandwidth-delay product, and TCP buffer size. For more information, see Troubleshooting AWS Direct Connect.
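The bandwidth-delay product (BDP) that the calculator reports can also be estimated by hand: BDP = bandwidth × round-trip time. The following is a minimal sketch that assumes a 1 Gbit/s link and a 60 ms RTT (both values are hypothetical; substitute your own link speed and the RTT measured with MTR):

```shell
# BDP = bandwidth (bits/s) * RTT (s), converted to bytes and then MB
awk 'BEGIN {
  bw  = 1e9      # assumed link speed: 1 Gbit/s
  rtt = 0.060    # assumed round-trip time: 60 ms
  bdp = bw * rtt / 8              # bytes in flight needed to fill the link
  printf "BDP = %.1f MB\n", bdp / 1048576
}'
# → BDP = 7.2 MB
```

If the TCP window or socket buffers are smaller than the BDP, a single TCP stream cannot fill the link, which is why the iperf3 tests above use multiple parallel streams (-P) and a larger window (-w).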