Monitoring EdgeConnect SD-Branch WAN Uplink

WAN Health Check Monitoring (HCM) is a critical component of the HPE Aruba Networking Central SD-Branch solution, enabling intelligent uplink monitoring, failover triggering, and dynamic path selection. HCM continuously evaluates network path performance based on connectivity and responsiveness, ensuring optimal traffic forwarding and seamless failover when required.

HCM is designed to monitor WAN uplink health to ensure that traffic is forwarded over the most reliable paths. This is achieved through a health check process in which the gateway periodically probes a designated health check IP address to determine WAN uplink availability.

Path Quality Monitoring Service Overview

WAN Health Check Monitoring (HCM) relies on the ability to continuously assess and validate WAN performance against expected path quality. To support this, the HCM mechanism in SD-Branch and Microbranch deployments leverages the Path Quality Monitoring (PQM) service—a globally distributed probe responder infrastructure. The PQM service consists of more than 30 nodes strategically positioned worldwide, ensuring high availability, geographic coverage, and low-latency responsiveness.

The PQM service is continuously monitored by the HPE Site Reliability Engineering (SRE) team, which operates 24 hours a day, 7 days a week to ensure reliability and performance. In addition to serving as a probe responder, the PQM service provides Session Traversal Utilities for NAT (STUN) services, which are essential for facilitating NAT traversal in SD-Branch and Microbranch deployments.

PQM can receive probe requests using both ICMP and UDP (on port 4500). To maintain security and service integrity, incoming requests are throttled to mitigate the risk of Denial of Service (DoS) attacks. The system applies a higher request threshold for IPs associated with HPE Aruba Networking-managed devices, ensuring that legitimate network monitoring and health-check processes are not restricted.
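The two-tier throttling described above could be modeled as a per-source, fixed-window rate limiter. This is only an illustrative sketch: the class name, the window length, and both limits are hypothetical, and the actual PQM throttling implementation and thresholds are not documented here.

```python
from collections import defaultdict


class ProbeThrottle:
    """Fixed-window request limiter with a higher limit for managed devices."""

    def __init__(self, default_limit=10, managed_limit=100, window_s=1.0):
        # All three values are hypothetical; real thresholds are not published.
        self.default_limit = default_limit
        self.managed_limit = managed_limit
        self.window_s = window_s
        self.managed_ips = set()          # source IPs of managed devices
        self._counts = defaultdict(int)   # (ip, window index) -> accepted requests

    def allow(self, src_ip, now):
        """Return True if a probe from src_ip at time `now` (seconds) is accepted."""
        window = int(now // self.window_s)
        limit = self.managed_limit if src_ip in self.managed_ips else self.default_limit
        key = (src_ip, window)
        if self._counts[key] >= limit:
            return False                  # over the limit for this window
        self._counts[key] += 1
        return True
```

A production limiter would also expire old windows to bound memory; that is omitted here to keep the sketch short.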

Health Check Monitoring Mechanism Overview

When the WAN Health Check is configured to probe the PQM service, the gateway first attempts to resolve the Fully Qualified Domain Name (FQDN) for PQM. Once resolved, the uplink manager stores up to 4 IP addresses from the DNS response. Alternatively, the administrator can statically configure an Uplink Health-Check IP, which will be used in place of PQM.

The HCM mechanism follows a structured, sequential probing process to monitor network reachability and trigger failover when necessary. This ensures that the gateway can dynamically adapt to changing network conditions and maintain reliable connectivity across uplinks.

Once the health-check list is established, the gateway begins probing the first IP address in the list across all active uplinks to determine reachability and path quality. This IP serves as the primary validation point for determining uplink availability.

Probes are sent in batches of 5, with one batch sent every 10 seconds. If the probes to this health-check IP fail on a particular uplink for 3 consecutive batches (30 seconds in total), the gateway removes the default gateway route associated with that uplink. This prevents traffic from being forwarded over a potentially degraded or non-functional path.
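The failure-counting rule above can be sketched as a small per-uplink tracker. The class and field names are hypothetical. The sketch also assumes that a batch counts as failed only when none of its probes is answered, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class UplinkHealth:
    """Tracks consecutive failed probe batches for one uplink."""
    name: str
    fail_threshold: int = 3            # consecutive failed batches before action
    consecutive_failures: int = 0
    default_route_installed: bool = True

    def record_batch(self, replies_received: int) -> None:
        """Record one batch; assumed to fail only when no probe is answered."""
        if replies_received > 0:
            self.consecutive_failures = 0   # any reply resets the streak
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.fail_threshold:
            # Stand-in for removing the uplink's default gateway route.
            self.default_route_installed = False
```

With the documented defaults, calling record_batch(0) three times in a row on the same uplink marks its default route as removed, while a single answered batch in between restarts the count.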

If PQM is not resolved, or Uplink Health-Check IP is not configured, the gateway defaults to Headless Mode. In this mode, the health check is performed using a predefined health-check list consisting of 4 well-known IP addresses associated with global DNS services.
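The order in which the gateway chooses its health-check targets can be summarized in a short sketch. The function name, the mode labels, and the resolver callback are hypothetical. The four fallback addresses are the predefined Headless Mode addresses listed later in this document.

```python
HEADLESS_TARGETS = ["8.8.8.8", "8.8.4.4", "9.9.9.9", "1.1.1.1"]
MAX_PQM_ADDRESSES = 4   # the uplink manager stores up to 4 resolved addresses


def select_health_check_targets(static_ip, resolve_pqm):
    """Return (mode, targets) following the precedence described above.

    static_ip   -- administrator-configured Uplink Health-Check IP, or None
    resolve_pqm -- callable returning a list of PQM addresses; it may return
                   an empty list or raise OSError when DNS resolution fails
    """
    if static_ip:
        return "static", [static_ip]
    try:
        addresses = resolve_pqm()
    except OSError:
        addresses = []
    if addresses:
        return "pqm", addresses[:MAX_PQM_ADDRESSES]
    return "headless", list(HEADLESS_TARGETS)
```

On a host with DNS access, a resolver such as lambda: socket.gethostbyname_ex("pqm.arubanetworks.com")[2] from the Python standard library could be passed as resolve_pqm. It raises socket.gaierror, a subclass of OSError, when the name cannot be resolved.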

The diagram illustrates how the HPE Aruba Networking Branch Gateway monitors the health and performance of multiple uplinks using the Path Quality Monitoring (PQM) service.

  1. Default Gateways: The Branch Gateway is connected to two uplinks (ISP1 and ISP2), each with its own default gateway. These are used as initial paths for outbound traffic.
  2. DNS Resolution: The gateway attempts to resolve the FQDN of the PQM service using DNS over all uplinks. This step is necessary to obtain the IP addresses of the PQM responders.
  3. PQM Probes: Once the PQM IPs are resolved, the gateway sends periodic ICMP or UDP probes (on port 4500) to the PQM service through each uplink. These probes are used to measure path quality metrics including latency, jitter, packet loss, and utilization.
  4. Headless Mode (Fallback): If PQM IP resolution fails or probes are unresponsive, the gateway falls back to sending probes directly to the DNS servers using predefined health-check IPs to verify uplink reachability and maintain connectivity.

Headless Mode – WAN HCM Resilience Mechanism

To enhance network resilience, Headless Mode introduces a fallback mechanism when the PQM service becomes unreachable. This ensures that the gateway remains operational and avoids unnecessary flapping in the unlikely event of a temporary PQM or DNS outage.

If the PQM service fails to respond to probes sent from all uplink interfaces, and the four IP addresses learned over DNS become unreachable, the gateway automatically switches to a predefined list of IP addresses, which includes:

  1. Google DNS: 8.8.8.8

  2. Google Secondary DNS: 8.8.4.4

  3. Quad9 DNS: 9.9.9.9

  4. Cloudflare DNS: 1.1.1.1

Once the gateway enters Headless Mode, it follows a structured sequence to restore PQM-based health checks while ensuring network stability across multiple uplinks. First, the gateway reinstalls all default gateway routes that were removed due to failed PQM resolution. This allows PQM resolution attempts to resume on all uplinks. The gateway does not remove the default routes again as long as at least one predefined health-check IP remains reachable.

The predefined health-check list is probed every minute on each uplink to determine connectivity. During this time, PQM resolution attempts continue. If PQM is successfully resolved and remains reachable for three consecutive probes (three minutes) on at least one of the uplinks, the gateway considers PQM to be stable and transitions back to using PQM as the primary health-check mechanism.
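The return path from Headless Mode can be sketched as a streak counter that is evaluated once per one-minute probe cycle. The names below are hypothetical, and the sketch assumes the three consecutive successes must occur on the same uplink, which is one reading of the text above.

```python
class HeadlessRecovery:
    """Tracks consecutive successful PQM probes per uplink in Headless Mode."""

    def __init__(self, uplinks, required_successes=3):
        self.required_successes = required_successes
        self.streaks = {name: 0 for name in uplinks}
        self.mode = "headless"

    def on_probe_cycle(self, pqm_results):
        """Process one cycle and return the resulting mode.

        pqm_results maps uplink name -> True if PQM resolved and answered.
        """
        if self.mode != "headless":
            return self.mode
        for name in self.streaks:
            if pqm_results.get(name, False):
                self.streaks[name] += 1
            else:
                self.streaks[name] = 0    # a missed cycle restarts the count
        if any(s >= self.required_successes for s in self.streaks.values()):
            self.mode = "pqm"             # one stable uplink is sufficient
        return self.mode
```

For example, with uplinks ISP1 and ISP2, three consecutive cycles in which only ISP2 reaches PQM are enough for the mode to change from "headless" to "pqm".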

On VPNCs, the Health Check Monitoring mechanism is used strictly for observability purposes, and no action is taken if an uplink probe fails. The VPNC does not remove the default gateway route, nor does it initiate Headless Mode when PQM becomes unreachable. If the configured PQM URL cannot be resolved, the health check switches to the default IP list. (In Headless Mode, if all PQM IPs become unreachable on all uplinks, it likewise switches to the default IP list.)

Behavior Across Multiple Uplinks

Headless Mode operates as a system-wide behavior, meaning that the gateway does not exit Headless Mode on a per-uplink basis. Once PQM is successfully resolved on at least one uplink, the gateway will return to normal PQM-based monitoring.

If PQM remains unresolved or unreachable on all uplinks, the gateway stays in Headless Mode, continuing to probe the predefined health-check list until a valid PQM response is received on any uplink.

Handling Partial Uplink Failures

Since each uplink may receive different PQM resolution results, the gateway does not require all uplinks to recover simultaneously before switching back to PQM mode. If PQM is successfully resolved and reachable on any uplink, the gateway exits Headless Mode and will continue probing PQM on all uplinks. However, those uplinks where PQM remains unresolved will not be used for PQM-based health checks until they recover.

If both PQM and Headless Mode probes fail, the gateway assumes a broader network-wide issue and continues probing in Headless Mode indefinitely. If a Backup Uplink is available, it is activated as an alternative path.
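The system-wide decision described in this section and the previous one can be condensed into a single function. This is a simplified sketch with hypothetical names. It takes already-evaluated per-uplink results as input and leaves out the consecutive-probe stability requirement.

```python
def evaluate_uplinks(pqm_ok, headless_ok, backup_available=False):
    """Summarize the gateway-wide decision for one evaluation pass.

    pqm_ok / headless_ok map uplink name -> bool for the latest results
    against PQM and against the predefined health-check list.
    """
    pqm_uplinks = sorted(name for name, ok in pqm_ok.items() if ok)
    if pqm_uplinks:
        # Any uplink with working PQM is enough to leave Headless Mode;
        # uplinks without PQM are excluded until they recover.
        return {"mode": "pqm", "pqm_uplinks": pqm_uplinks, "activate_backup": False}
    all_failed = not any(headless_ok.values())
    return {
        "mode": "headless",               # keep probing the predefined list
        "pqm_uplinks": [],
        "activate_backup": all_failed and backup_available,
    }
```

Passing pqm_ok={"ISP1": True, "ISP2": False} yields PQM mode with only ISP1 used for PQM-based checks, while a pass in which every probe fails on every uplink keeps the gateway in Headless Mode and requests the backup uplink if one exists.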

Configuration

The WAN Health Check feature can be configured directly within Aruba Central by navigating to the gateway's configuration page under the WAN → Health Check tab. The configuration page allows you to tune global parameters for WAN Health Check, which apply across all uplinks on the gateway. This enables consistent monitoring behavior and path evaluation across the WAN uplinks.

The following options are available:

  • Health Check Enablement: Enables or disables health-check monitoring on the gateway.

  • Remote Host Selection: You can choose between probing a custom IP address or using a Fully Qualified Domain Name (FQDN). When using PQM, select the FQDN option and enter pqm.arubanetworks.com as the remote host. The gateway will dynamically resolve this FQDN and use the returned IP addresses for probing.

  • Probe Mode: Select either ICMP or UDP (default is UDP). Probes will be sent using the selected protocol.

  • Probe Interval: Defines how frequently health-check probes are sent, in seconds. The default is 10 seconds.

  • Packet Burst per Probe: Specifies the number of packets sent in a single probe burst. Default is 5.

  • Probe Retries: Sets how many failed probe attempts must occur before marking an uplink as down.

  • Jitter Measurement: Enables jitter evaluation on the path in addition to basic reachability.

For LTE uplinks, a Low Frequency Probe option is available, which is designed to optimize probe behavior on cellular connections. When enabled, it sets the probe interval to 15 seconds with 2 probes per burst. This setting can be toggled under the LTE WAN uplink configuration.
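To make the effect of these settings concrete, the options can be modeled as a small configuration object. This is not an Aruba Central API: the class, field, and function names are hypothetical. The defaults for enablement, probe retries, and jitter measurement are placeholders, because the text states defaults only for probe mode, interval, and burst size.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HealthCheckConfig:
    enabled: bool = True                      # placeholder default
    remote_host: str = "pqm.arubanetworks.com"
    probe_mode: str = "udp"                   # "icmp" or "udp"
    probe_interval_s: int = 10
    packets_per_burst: int = 5
    probe_retries: int = 3                    # placeholder default
    jitter_measurement: bool = False          # placeholder default


def with_low_frequency_probe(cfg):
    """Return the settings applied when Low Frequency Probe is enabled (LTE)."""
    return replace(cfg, probe_interval_s=15, packets_per_burst=2)


def packets_per_hour(cfg):
    """Probe packets sent per hour to one target on one uplink."""
    return (3600 // cfg.probe_interval_s) * cfg.packets_per_burst
```

Under these assumptions, the default profile sends 1800 probe packets per hour per target, while the Low Frequency Probe profile sends 480, which illustrates why the option suits metered cellular links.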

Monitoring

The status of WAN Health Check Monitoring can be viewed in real time from the Aruba Central interface by navigating to the specific branch gateway and opening the Summary tab under the WAN section.

The interface provides an overview of each WAN uplink, including:

  • WAN Status and Availability: Indicates whether the uplink is up, down, or unreachable. Availability is shown as a percentage based on recent probe success rates. WAN Unreachable means that the gateway has removed the default route for that uplink due to health-check failure. The physical interface may still be up, but the path is considered unusable for underlay traffic based on probe results.

  • Health Check IP Status: Displays the reachability of the currently active health-check IP, including the last probe result and timestamp.

  • Performance Metrics: Key indicators such as latency, jitter, packet loss, and MOS (Mean Opinion Score) are plotted over time for each uplink, allowing administrators to identify degradation patterns.

  • Usage and Throughput: Provides live statistics on the amount of traffic sent/received and current throughput per interface.

Detailed performance graphs help visualize SLA metrics across time windows, giving insight into uplink stability and path quality. Metrics such as latency trends, jitter variations, and packet loss history are continuously updated to support proactive network diagnostics.
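As a rough illustration of how such metrics can be derived from raw probe results, the following sketch summarizes one window of round-trip times. The function name and the formulas are assumptions: availability is taken as the share of answered probes, and jitter as the mean absolute difference between consecutive replies. The exact calculations used by Aruba Central, including MOS, are not described in this document.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize one window of probe results.

    rtts_ms -- round-trip times in milliseconds, with None for a lost probe.
    Returns None when the window is empty.
    """
    total = len(rtts_ms)
    if total == 0:
        return None
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (total - len(answered)) / total
    # Assumed jitter definition: mean absolute change between replies.
    deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]
    return {
        "availability_pct": 100.0 - loss_pct,
        "loss_pct": loss_pct,
        "latency_ms": mean(answered) if answered else None,
        "jitter_ms": mean(deltas) if deltas else None,
    }
```

For the window [20.0, 24.0, None, 22.0], this returns 75.0 percent availability, 25.0 percent loss, 22.0 ms latency, and 3.0 ms jitter.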

If PQM is used as the health-check target, the associated FQDN (e.g., pqm.arubanetworks.com) is also displayed in the interface.

This visibility ensures that network teams can monitor WAN path performance in a centralized, intuitive way and respond quickly to any network anomalies.