How can I troubleshoot Direct Connect network performance issues?

I am experiencing low throughput, traffic latency, and performance issues with my AWS Direct Connect connection.

Resolution

To isolate and diagnose network and application performance issues, complete the following steps:

Note: It's a best practice to set up a dedicated test machine on premises and a dedicated test instance in an Amazon Virtual Private Cloud (Amazon VPC). For the test instance, use an Amazon Elastic Compute Cloud (Amazon EC2) instance of type C5 or larger.

Review for network or application issues

Install and use the iPerf3 tool to benchmark network bandwidth, and cross-check the results with other applications or tools. For more information, see What is iPerf / iPerf3? on the iPerf website.

    Run the following command to install iPerf3:

Linux/RHEL

$ sudo yum install iperf3 -y

Ubuntu

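$ sudo apt install iperf3 -y

    Run iPerf3 on the server and on the client to measure throughput in both directions:

Amazon EC2 instance (server)

$ iperf3 -s -V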

On-premises localhost (client)

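$ iperf3 -c EC2_INSTANCE_IP -P 15 -t 15
$ iperf3 -c EC2_INSTANCE_IP -P 15 -t 15 -R
$ iperf3 -c EC2_INSTANCE_IP -w 256K
$ iperf3 -c EC2_INSTANCE_IP -w 256K -R
$ iperf3 -c EC2_INSTANCE_IP -u -b 1G -t 15
$ iperf3 -c EC2_INSTANCE_IP -u -b 1G -t 15 -R

Replace EC2_INSTANCE_IP with the IP address of the EC2 instance that runs the iPerf3 server. The preceding commands use the following options:

  • -P, --parallel n: the number of parallel client streams to run. Running multiple parallel streams is critical to reach the maximum throughput.
  • -R, --reverse: reverses the direction of the test so that the EC2 server sends data to the on-premises client. Use this option to measure AWS to on-premises throughput.
  • -u, --udp: uses UDP instead of TCP. Because TCP tests in iPerf3 don't report packet loss, UDP tests help reveal packet loss along the path.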

Example TCP test results:

[ ID] Interval           Transfer     Bitrate         Retry
[SUM]   0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec  18112  sender
[SUM]   0.00-15.00  sec  7.52 GBytes  4.31 Gbits/sec         receiver

The preceding example uses the following terms:

  • Bitrate: the measured throughput or transmission speed.
  • Transfer: the total amount of data exchanged between client and server.
  • Retry: the number of re-transmitted packets. Re-transmission is observed on the sender side.
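To compare several test runs, it can help to pull these fields out of the summary lines programmatically. The following Python sketch is only an illustration: the function name parse_tcp_summary and the returned dictionary keys are hypothetical, and the regular expression assumes the summary layout shown in the preceding example. It also assumes that iPerf3 reports Transfer in 1024-based units and Bitrate in 1000-based units, which is consistent with the sample values.

```python
import re

# Matches an iPerf3 TCP summary line such as:
# [SUM]   0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec  18112  sender
SUMMARY_RE = re.compile(
    r"\[\s*(?:SUM|\d+)\]\s+"
    r"([\d.]+)-([\d.]+)\s+sec\s+"
    r"([\d.]+)\s+([KMGT]?)Bytes\s+"
    r"([\d.]+)\s+([KMGT]?)bits/sec"
    r"(?:\s+(\d+))?\s+(sender|receiver)"
)

# Assumption: Transfer uses binary prefixes, Bitrate uses decimal prefixes.
BINARY = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}
DECIMAL = {"": 1, "K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def parse_tcp_summary(line):
    """Return the fields of one TCP summary line, or None if it doesn't match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    start, end, transfer, t_unit, bitrate, b_unit, retries, role = match.groups()
    return {
        "duration_s": float(end) - float(start),
        "transfer_bytes": float(transfer) * BINARY[t_unit],
        "bitrate_bps": float(bitrate) * DECIMAL[b_unit],
        # Retry is printed on the sender line only.
        "retries": int(retries) if retries is not None else None,
        "role": role,
    }


if __name__ == "__main__":
    sender = parse_tcp_summary(
        "[SUM]   0.00-15.00  sec  7.54 GBytes  4.32 Gbits/sec  18112  sender"
    )
    print(sender)
```

A high Retry count next to a bitrate that is well below the port speed is a hint to continue with the UDP tests and the path checks in the later sections.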

Example UDP test results:

[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-15.00  sec  8.22 GBytes  4.71 Gbits/sec  0.000 ms  0/986756 (0%)  sender
[  5]   0.00-15.00  sec  1.73 GBytes   989 Mbits/sec  0.106 ms  779454/986689 (79%)  receiver

The sender side shows 0% loss because the sender only counts the UDP datagrams that it transmits and can't detect loss along the path. On the receiver side, Lost/Total Datagrams shows the number of datagrams that didn't arrive and the resulting loss rate. In this example, 79% of the datagrams are lost.
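The loss rate is the lost count divided by the total count on the receiver line. The following Python sketch reproduces that arithmetic for the values in the preceding example. The helper names parse_lost_total and udp_loss_percent are hypothetical and aren't part of iPerf3.

```python
def parse_lost_total(field):
    """Split a 'Lost/Total Datagrams' field such as '779454/986689 (79%)'."""
    lost, total = field.split()[0].split("/")
    return int(lost), int(total)


def udp_loss_percent(lost, total):
    """Return the percentage of datagrams that didn't reach the receiver."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * lost / total


if __name__ == "__main__":
    lost, total = parse_lost_total("779454/986689 (79%)")
    # Prints a value that rounds to the 79% that iPerf3 reports.
    print(f"{udp_loss_percent(lost, total):.1f}% of datagrams lost")
```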

Note: If the Direct Connect connection uses an Amazon Virtual Private Network (Amazon VPN) over a public virtual interface (VIF), then run performance tests without the VPN.

Check the metrics and interface counters

Check the following Direct Connect metrics in Amazon CloudWatch:

  • ConnectionErrorCount: Apply the Sum statistic. Non-zero values indicate MAC level errors on the AWS device.
  • ConnectionLightLevelTx and ConnectionLightLevelRx: The optical signal readings must be within the range of -14.4 dBm to 2.50 dBm.
  • ConnectionBpsEgress, ConnectionBpsIngress, VirtualInterfaceBpsEgress, and VirtualInterfaceBpsIngress: Make sure that the bitrate hasn't reached the maximum bandwidth.

If you use a hosted VIF that shares the total bandwidth with other users, then check with the Direct Connect owner about the connection utilization.
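The metric checks listed above can be expressed as a few simple comparisons. The following Python sketch assumes that you have already retrieved the metric values (for example, from the CloudWatch console) and passes them in as plain numbers. The function name check_connection_metrics and the saturation_ratio threshold of 0.9 are assumptions for illustration only; the article itself only says that the bitrate must not reach the maximum bandwidth.

```python
# Optical signal range from the list above, in dBm.
LIGHT_LEVEL_MIN_DBM = -14.4
LIGHT_LEVEL_MAX_DBM = 2.50


def check_connection_metrics(error_count_sum, light_tx_dbm, light_rx_dbm,
                             bps_egress, bps_ingress, port_speed_bps,
                             saturation_ratio=0.9):
    """Return a list of findings. An empty list means no issue was found."""
    findings = []

    # Non-zero ConnectionErrorCount (Sum) indicates MAC level errors.
    if error_count_sum != 0:
        findings.append("ConnectionErrorCount is non-zero")

    # Both light levels must stay inside the documented range.
    for name, level in (("ConnectionLightLevelTx", light_tx_dbm),
                        ("ConnectionLightLevelRx", light_rx_dbm)):
        if not LIGHT_LEVEL_MIN_DBM <= level <= LIGHT_LEVEL_MAX_DBM:
            findings.append(f"{name} is out of range: {level} dBm")

    # Hypothetical threshold: flag traffic at or above 90% of the port speed.
    for name, bps in (("ConnectionBpsEgress", bps_egress),
                      ("ConnectionBpsIngress", bps_ingress)):
        if bps >= saturation_ratio * port_speed_bps:
            findings.append(f"{name} is close to the maximum bandwidth")

    return findings


if __name__ == "__main__":
    # Example values for a 1 Gbps connection; all numbers are made up.
    for finding in check_connection_metrics(12, -2.0, -20.1, 0.95e9, 0.1e9, 1e9):
        print(finding)
```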

Check the router and firewall at the Direct Connect location for the following:

  • CPU, memory, and port utilization, and packet drops and discards.
  • Interface input and output errors such as CRC, frame, collision, and carrier errors. To view these counters, use show interfaces statistics or a similar command.
  • If the error counters keep increasing, then clean or replace the fiber patch lead and the SFP module.

Check the AWS Health Dashboard to make sure that the Direct Connect connection isn't under maintenance.

Run MTR bidirectionally to check the network path

Use the Linux MTR command to analyze network performance. For Windows OS, it's a best practice to turn on WSL 2 so that you can install MTR on a Linux subsystem. Alternatively, download WinMTR from the SourceForge website.

    Run the following command to install MTR:

Amazon Linux/RHEL installation

$ sudo yum install mtr -y

Ubuntu installation

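$ sudo apt install mtr -y

    Run MTR in both directions, first ICMP based and then TCP based. Replace EC2_INSTANCE_IP, LOCALHOST_IP, and OPEN_TCP_PORT with your own values:

On-premises localhost (on-premises to AWS direction)

$ mtr -n -c 100 EC2_INSTANCE_IP --report
$ mtr -n -T -P OPEN_TCP_PORT -c 100 EC2_INSTANCE_IP --report

Amazon EC2 instance (AWS to on-premises direction)

$ mtr -n -c 100 LOCALHOST_IP --report
$ mtr -n -T -P OPEN_TCP_PORT -c 100 LOCALHOST_IP --report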

Example MTR test results:

#ICMP based MTR results
$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 20:54:39 2021
HOST:                  Loss%    Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222    0.0%    100    0.7    0.7    0.6    0.9    0.0
  2.|-- .              100.0    100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2    0.0%    100  266.5  267.4  266.4  321.0    4.8
  4.|-- 10.110.120.1   54.5%    100  357.6  383.0  353.4  423.7   19.6
  5.|-- 192.168.52.10  47.5%    100  359.4  381.3  352.4  427.9   20.6

#TCP based MTR results
$ mtr -n -T -P 80 -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:03:48 2021
HOST:                  Loss%    Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222    0.0%    100    0.9    0.7    0.7    1.1    0.0
  2.|-- .              100.0    100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2    0.0%    100  264.1  265.8  263.9  295.3    3.4
  4.|-- 10.110.120.1    8.0%    100  374.3  905.3  354.4  7428. 1210.6
  5.|-- 192.168.52.10  12.0%    100  400.9  1139.  400.4  7624. 1384.3

Each line in the report represents a hop, that is, a network device that the data packet passes through on the way from the source to the destination. For more information on how to read MTR test results, see Reading MTR output network diagnostic tool on the ExaVault website.

The preceding example shows a Direct Connect connection with the BGP peers 10.110.120.1 and 10.110.120.2. Packet loss is observed at the 4th hop and at the destination (5th hop). This can indicate an issue with the Direct Connect connection or with the remote router 10.110.120.1. Because TCP is prioritized over ICMP on the Direct Connect connection, the TCP-based MTR results show a lower loss percentage.

The following example shows 5% packet loss at the local firewall or NAT device (hop 1). The packet loss impacts all of the subsequent hops, including the destination.

$ mtr -n -c 100 192.168.52.10 --report
Start: Sat Oct 30 21:11:22 2021
HOST:                  Loss%    Snt   Last    Avg   Best   Wrst  StDev
  1.|-- 10.0.101.222    5.0%    100    0.8    0.7    0.7    1.1    0.0
  2.|-- .              100.0    100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2    6.0%    100  265.7  267.1  265.6  307.8    5.1
  4.|-- 10.110.120.1    6.0%    100  265.1  265.2  265.0  265.4    0.0
  5.|-- 192.168.52.10   6.0%    100  266.7  266.6  266.5  267.2    0.0
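The reasoning used in these examples is that loss matters when it starts at one hop and persists through every later hop to the destination. The following Python sketch applies that rule to an MTR report. It's a simplified heuristic, not part of MTR: the names Hop, parse_report, and first_persistent_loss_hop are hypothetical, the regular expression assumes the report layout shown above, and an intermediate hop with 100% loss is treated as a device that doesn't answer probes rather than as the source of the loss.

```python
import re
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    number: int
    host: str
    loss_percent: float
    avg_ms: float


# One hop line of an MTR report: number, host, Loss%, Snt, Last, Avg, ...
HOP_RE = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+\d+\s+[\d.]+\s+([\d.]+)"
)


def parse_report(text: str) -> List[Hop]:
    """Extract the hop lines from the text of an MTR report."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append(Hop(int(m.group(1)), m.group(2),
                            float(m.group(3)), float(m.group(4))))
    return hops


def first_persistent_loss_hop(hops: List[Hop]) -> Optional[Hop]:
    """Return the first hop from which loss continues up to the destination."""
    if not hops or hops[-1].loss_percent == 0:
        return None  # No loss at the destination.
    candidate = None
    # Walk back from the destination while the loss persists.
    for index in range(len(hops) - 1, -1, -1):
        hop = hops[index]
        silent = hop.loss_percent >= 100 and index != len(hops) - 1
        if silent:
            continue  # Assumed to be a device that ignores probes.
        if hop.loss_percent > 0:
            candidate = hop
        else:
            break
    return candidate


REPORT = """\
  1.|-- 10.0.101.222    5.0%    100    0.8    0.7    0.7    1.1    0.0
  2.|-- .              100.0    100    0.0    0.0    0.0    0.0    0.0
  3.|-- 10.110.120.2    6.0%    100  265.7  267.1  265.6  307.8    5.1
  4.|-- 10.110.120.1    6.0%    100  265.1  265.2  265.0  265.4    0.0
  5.|-- 192.168.52.10   6.0%    100  266.7  266.6  266.5  267.2    0.0
"""

if __name__ == "__main__":
    print(first_persistent_loss_hop(parse_report(REPORT)))
```

For the report above, the function points at hop 1, the local firewall or NAT device. For the ICMP report in the earlier example, it points at hop 4, the remote router 10.110.120.1.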

Take a packet capture and analyze the results

Take a packet capture on the localhost and the EC2 instance. Use the tcpdump or Wireshark utility to capture network traffic for analysis. The following example tcpdump command writes the capture to a file that's named with the timestamp and the short host name:

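tcpdump -i INTERFACE -s0 -w $(date +"%Y%m%d_%H%M%S").$(hostname -s).pcap port PORT

Replace INTERFACE and PORT with the network interface and the port that you want to capture.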

Use the TCP Throughput Calculator on the Switch website to calculate the network limit, the bandwidth-delay product (BDP), and the TCP buffer size. For more information, see Troubleshooting AWS Direct Connect.
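The calculation behind such a calculator is short enough to sketch by hand. The bandwidth-delay product is the link bandwidth multiplied by the round-trip time, and a single TCP stream can't exceed the window size divided by the round-trip time. In the following Python sketch, the function names are hypothetical, and the sample inputs are assumptions for illustration only: a 1 Gbps link, a round-trip time of 266 ms (similar to the MTR averages shown earlier), and a 256 KiB window (as in the iperf3 -w 256K tests).

```python
def bandwidth_delay_product_bytes(bandwidth_bps, rtt_seconds):
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_bps * rtt_seconds / 8


def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound for one TCP stream with the given window size."""
    return window_bytes * 8 / rtt_seconds


if __name__ == "__main__":
    rtt = 0.266               # Assumed round-trip time in seconds.
    link_bps = 1e9            # Assumed 1 Gbps connection.
    window = 256 * 1024       # 256 KiB TCP window.

    bdp = bandwidth_delay_product_bytes(link_bps, rtt)
    limit = max_tcp_throughput_bps(window, rtt)
    print(f"BDP: {bdp / 1e6:.2f} MB")
    print(f"Single-stream limit: {limit / 1e6:.2f} Mbit/s")
```

With these assumed inputs, the window is far smaller than the bandwidth-delay product, so a single stream is limited to a few Mbit/s. This is one reason to run the iPerf3 tests with multiple parallel streams or a larger window on long-distance paths.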