Advanced Stability and Fault Tolerance Mechanisms of Apache APISIX

When selecting an API gateway, stability and fault tolerance matter as much as functionality, scalability, and security. Both were treated as essential when Apache APISIX was designed in 2019, because a gateway handles internal and external traffic alike and a failure there can cause a significant production incident.

To give researchers a clear picture, let's look at the key stability and fault tolerance features of Apache APISIX.

Separation of Control Plane and Data Plane

Apache APISIX separates the control plane (etcd and the Admin API) from a stateless data plane (the API gateway instances, which can scale on demand). Once a gateway instance has loaded its configuration, it does not depend on the control plane to serve traffic. Even if the control plane experiences anomalies (such as network interruptions or abnormal exits), the data plane continues to operate normally and handles new traffic requests. This separation is the basis of APISIX's high availability.

Figure: APISIX's technical architecture

Data Synchronization Mechanism

The data plane and the control plane stay in sync through etcd's watch mechanism. Each data plane instance acts as an etcd watcher: etcd notifies it of data changes, and the instance updates its configuration and rules accordingly. When an administrator writes configuration to etcd via the Admin API, the data plane quickly receives the change notification and stores the new configuration in memory. Requests are then served from that in-memory copy, so the gateway does not need to fetch configuration from etcd for every incoming request, which reduces system load. One consequence is worth noting: the configuration lives only in memory, so during control plane anomalies you should avoid restarting data plane instances, because a restarted instance loses its in-memory configuration and cannot reload it until etcd is reachable again.
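The watch-and-cache pattern described above can be sketched in a few lines of Python. This is only an illustration, not APISIX's implementation: FakeConfigStore is a hypothetical in-process stand-in for etcd, and DataPlane, apply_pending_events, and handle_request are names invented for this example.

```python
import queue


class FakeConfigStore:
    """Hypothetical stand-in for etcd: keeps keys and notifies watchers."""

    def __init__(self):
        self._data = {}
        self._watchers = []

    def watch(self):
        # Each watcher gets its own queue of (key, value) change events.
        events = queue.Queue()
        self._watchers.append(events)
        return events

    def put(self, key, value):
        # A write (as the Admin API would do) is pushed to every watcher.
        self._data[key] = value
        for events in self._watchers:
            events.put((key, value))


class DataPlane:
    """Gateway instance that serves requests from an in-memory route table."""

    def __init__(self, store):
        self._events = store.watch()
        self.routes = {}

    def apply_pending_events(self):
        # Drain change notifications into the in-memory configuration.
        applied = 0
        while True:
            try:
                key, value = self._events.get_nowait()
            except queue.Empty:
                return applied
            self.routes[key] = value
            applied += 1

    def handle_request(self, path):
        # No call to the store here: lookups use only the in-memory copy.
        return self.routes.get(path, "404 no route")


store = FakeConfigStore()
gateway = DataPlane(store)
store.put("/hello", "upstream-a")
gateway.apply_pending_events()
print(gateway.handle_request("/hello"))
```

The point of the sketch is that handle_request never touches the store, which is why per-request load on etcd stays at zero and why a restart (a fresh, empty routes dictionary) is costly while the store is unavailable.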

Control Plane Anomalies

Network Communication Interruption

If the network between the API gateway and etcd is interrupted, configuration written to etcd via the Admin API does not reach the gateway. The gateway does not exit when it loses its connection to etcd. Instead, it keeps using the configuration already held in memory to handle new traffic requests. Once the connection is restored, the gateway receives the latest configuration and resumes normal synchronization.
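The fallback behavior in this scenario can be illustrated with a small Python sketch. It is a simplification under stated assumptions: ConfigStore and Gateway are hypothetical classes, the reachable flag simulates the network, and sync() pulls a full snapshot, whereas APISIX itself relies on etcd watch notifications.

```python
class ConfigStore:
    """Hypothetical stand-in for etcd with a switchable network link."""

    def __init__(self):
        self.data = {}
        self.reachable = True

    def snapshot(self):
        if not self.reachable:
            raise ConnectionError("config store unreachable")
        return dict(self.data)


class Gateway:
    """Keeps the last known configuration when the store cannot be reached."""

    def __init__(self, store):
        self._store = store
        self.config = {}
        self.connected = False

    def sync(self):
        try:
            self.config = self._store.snapshot()
            self.connected = True
        except ConnectionError:
            # Do not exit and do not clear config: keep serving from memory.
            self.connected = False
        return self.connected

    def route(self, path):
        return self.config.get(path)


store = ConfigStore()
store.data["/orders"] = "orders-v1"
gateway = Gateway(store)
gateway.sync()

store.reachable = False            # simulated network interruption
store.data["/orders"] = "orders-v2"  # written to the store, not yet visible
gateway.sync()
stale = gateway.route("/orders")   # still the previously saved value

store.reachable = True             # connection restored
gateway.sync()
fresh = gateway.route("/orders")   # latest configuration picked up
print(stale, fresh)
```

The design choice to highlight is in the except branch: a failed sync changes only the connection status, never the in-memory configuration.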

etcd Abnormal Crash

If etcd crashes, administrators cannot write configuration via the Admin API until it recovers. This does not affect the gateway's operation: it keeps running and handling traffic requests with the configuration it already holds in memory. From the gateway's point of view, this scenario is the same as a network interruption.

Multi-Node Deployment and Load Balancing

To ensure high availability, it is recommended to deploy multiple gateway instances behind a load balancer (such as an AWS load balancer or F5). These load balancers use health checks to assess the status of each gateway instance. If an instance fails, the load balancer promptly removes it from service, and new gateway nodes can be added to the pool to restore capacity. This multi-node deployment and load balancing strategy helps prevent business interruptions caused by the failure of a single node.
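As a rough illustration of how health checks keep failed instances out of rotation, here is a minimal round-robin balancer in Python. It is not how any particular load balancer product works: LoadBalancer, its methods, the node names, and the probe callback are all hypothetical, and a real probe would typically be an HTTP or TCP check against each gateway.

```python
class LoadBalancer:
    """Round-robin over the gateway nodes that passed the last health check."""

    def __init__(self, nodes, probe):
        self._nodes = list(nodes)
        self._probe = probe          # callable: node -> True if healthy
        self._healthy = list(nodes)
        self._counter = 0

    def add_node(self, node):
        # A new node joins rotation after the next health check run.
        self._nodes.append(node)

    def run_health_checks(self):
        self._healthy = [n for n in self._nodes if self._probe(n)]

    def pick(self):
        if not self._healthy:
            raise RuntimeError("no healthy gateway instances")
        node = self._healthy[self._counter % len(self._healthy)]
        self._counter += 1
        return node


# Simulated health status per node (an assumption for this example).
status = {"gw-1": True, "gw-2": True}
balancer = LoadBalancer(["gw-1", "gw-2"], lambda node: status[node])

status["gw-2"] = False               # gw-2 fails
balancer.run_health_checks()         # gw-2 is removed from rotation

balancer.add_node("gw-3")            # replacement node
status["gw-3"] = True
balancer.run_health_checks()         # gw-3 starts receiving traffic
print(sorted({balancer.pick() for _ in range(4)}))
```

In this sketch a failed node only leaves rotation when run_health_checks is called, which mirrors the interval-based checks most load balancers use: shorter intervals detect failures faster at the cost of more probe traffic.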

Conclusion

In summary, Apache APISIX keeps serving traffic when the control plane and data plane are disconnected. Its separated architecture, watch-based data synchronization, and multi-node deployment strategy maintain high availability through network interruptions and etcd failures. Because the design accounts for these network and component anomalies, APISIX is well suited to handling enterprise-level traffic.