Best Practices for API Degradation in an API Gateway

In modern distributed systems, API degradation serves as a crucial fault tolerance and stability assurance mechanism. When a system faces pressure, failures, or resource constraints, degrading non-core functions or services ensures the overall system's stability and continuity of core business operations. This article explores the scenarios for API degradation and the best practices for implementing degradation at the API gateway level.

Scenarios for API Degradation

System Overload: When the system experiences a surge in requests exceeding the server's processing capacity, degrading non-core APIs is necessary to prevent system crashes and ensure the normal operation of core functions.
Service Failures: In the event of backend service failures or performance degradation, degrading the APIs of those services is necessary to prevent the spread of failures.
Resource Constraints: When system resources such as CPU, memory, or bandwidth are constrained, degrading APIs with high resource consumption is essential to ensure the overall stability of the system.

Best Practices for API Degradation at the Gateway Level

Consider a scenario where an e-commerce platform faces immense traffic pressure during a major sales event like Black Friday. To protect core functions and ensure user experience, the team decides to implement API degradation strategies at the gateway level.

1. Identifying Key APIs

The team identifies critical APIs such as user login, product details retrieval, adding items to the shopping cart, order submission, and payment processing. These APIs are crucial for users to complete the shopping process.

2. Designing Degradation Strategies

For key APIs, circuit breaking is configured. For instance, when the upstream server returns a 500 status code three consecutive times, the circuit breaker opens and the gateway stops forwarding requests to that upstream. If failures continue, the break duration increases automatically, up to a maximum of 300 seconds, until the upstream service recovers.
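
The breaker behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the gateway's actual implementation: the class name, the doubling schedule (2, 4, 8, ... seconds), and the rule that any non-500 response counts as recovery are assumptions made for the example.

```python
import time


class ConsecutiveFailureBreaker:
    """Opens after N consecutive 500s; the open time doubles per trip, up to a cap."""

    def __init__(self, failures_to_open=3, max_open_sec=300, clock=time.monotonic):
        self.failures_to_open = failures_to_open
        self.max_open_sec = max_open_sec
        self.clock = clock              # injectable so the sketch can be tested
        self.consecutive_failures = 0
        self.trips = 0                  # times the breaker has opened without recovery
        self.open_until = 0.0           # requests are rejected before this time

    def allow_request(self):
        """Return True when the breaker is closed and requests may be forwarded."""
        return self.clock() >= self.open_until

    def record(self, status_code):
        """Feed one upstream response status into the breaker."""
        if status_code == 500:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_open:
                self.trips += 1
                # Assumed schedule: 2, 4, 8, ... seconds, capped at max_open_sec.
                open_for = min(2 ** self.trips, self.max_open_sec)
                self.open_until = self.clock() + open_for
                self.consecutive_failures = 0
        else:
            # Assumption: any non-500 response counts as recovery.
            self.consecutive_failures = 0
            self.trips = 0
```

With the defaults, three consecutive 500 responses open the breaker for 2 seconds, three more extend it to 4 seconds, and so on until the 300-second cap is reached.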

For non-key APIs like user reviews, recommendation lists, and ad displays, specific degradation strategies are designed:

User Review API: Return default review lists or empty data to avoid real-time calls to the review system.
Recommendation List API: Utilize cached recommendation data or return a fixed recommendation list to reduce reliance on the recommendation engine.
Ad Display API: Return default ads or empty ad slots to ensure page layout remains unaffected.
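
As a rough illustration of these three fallbacks, the sketch below returns a degraded response body for each non-key API. The API names, the payload shapes, and the `FIXED_RECOMMENDATIONS` and `DEFAULT_ADS` values are hypothetical; a real platform would use its own response formats.

```python
import json

# Hypothetical fallback data, used only for illustration.
FIXED_RECOMMENDATIONS = [{"product_id": "bestseller-1"}, {"product_id": "bestseller-2"}]
DEFAULT_ADS = [{"id": "default-banner", "link": "/promotions"}]


def degraded_response(api_name, recommendation_cache=None):
    """Return a JSON body to serve instead of calling the backend service."""
    if api_name == "reviews":
        # Empty list: the page renders without calling the review system.
        return json.dumps({"reviews": []})
    if api_name == "recommendations":
        # Prefer cached data; fall back to a fixed list when the cache is empty.
        items = recommendation_cache or FIXED_RECOMMENDATIONS
        return json.dumps({"items": items})
    if api_name == "ads":
        # Default ads keep the slot filled so the page layout does not shift.
        return json.dumps({"ads": DEFAULT_ADS})
    raise KeyError(f"no degradation strategy defined for {api_name!r}")
```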

3. Apache APISIX / API7 Enterprise Configuration

For key APIs, enable the api-breaker plugin on the corresponding routes, with the trigger condition set to three consecutive 500 status codes and the maximum break duration set to 300 seconds.
For the user review API, enable the mocking plugin on the corresponding route and set response_example to empty data.
For the recommendation list API, enable the proxy-cache plugin on the corresponding route and choose in-memory response caching.
For the ad display API, enable the mocking plugin on the corresponding route and set response_example to default advertisements, so that the page displays normally and the ads remain clickable.
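
The plugin settings above might look roughly like the following, written here as Python dictionaries that serialize to the JSON a route's `plugins` field expects. The field names follow the Apache APISIX plugin documentation as best understood; the break response code, the recovery rule, the cache zone name, the TTL, and the sample payloads are assumptions and should be checked against the gateway version in use.

```python
import json

# Key APIs: open the breaker after 3 consecutive 500s, cap the break at 300 s.
api_breaker = {
    "api-breaker": {
        "break_response_code": 502,  # assumed status returned while the breaker is open
        "max_breaker_sec": 300,
        "unhealthy": {"http_statuses": [500], "failures": 3},
        "healthy": {"http_statuses": [200], "successes": 1},  # assumed recovery rule
    }
}

# User review API: return empty data without calling the review system.
review_mock = {
    "mocking": {
        "response_status": 200,
        "content_type": "application/json",
        "response_example": json.dumps({"reviews": []}),  # hypothetical payload
    }
}

# Recommendation list API: serve cached responses from memory.
recommendation_cache = {
    "proxy-cache": {
        "cache_strategy": "memory",
        "cache_zone": "memory_cache",  # assumed name; must match a zone defined in the gateway config
        "cache_ttl": 60,               # assumed TTL in seconds
    }
}

# Ad display API: return a default, still clickable ad.
ad_mock = {
    "mocking": {
        "response_status": 200,
        "content_type": "application/json",
        "response_example": json.dumps(
            {"ads": [{"id": "default-banner", "link": "/promotions"}]}  # hypothetical payload
        ),
    }
}


def route_body(plugins):
    """Wrap a plugin configuration in the shape of a route update body."""
    return json.dumps({"plugins": plugins}, indent=2)
```

Calling `route_body(api_breaker)` produces a JSON document that could be applied to a route through the dashboard or the Admin API.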

4. Dynamic Configuration Management

To adapt to changing traffic conditions, the team chooses a gateway that applies configuration changes dynamically, without restarts: Apache APISIX / API7 Enterprise. Based on real-time monitoring data, they can adjust circuit-breaking thresholds, change degradation strategies, and turn degradation switches on or off, selectively degrading non-key APIs when traffic peaks.
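
One way to picture a degradation switch is as a small request to the gateway's Admin API that turns the mocking plugin on or off for a route. The sketch below only builds the request and does not send it. The Admin API address, the key placeholder, the route ID, and the use of `_meta.disable` inside a `PATCH` body are assumptions about a typical APISIX 3.x setup and should be verified against the deployed version.

```python
import json
import urllib.request

ADMIN_URL = "http://127.0.0.1:9180/apisix/admin"  # assumed Admin API address
ADMIN_KEY = "replace-with-your-admin-key"          # placeholder, not a real key


def build_degradation_switch(route_id, degrade):
    """Build a PATCH request that enables or disables mocking on one route."""
    body = {
        "plugins": {
            "mocking": {
                # Assumption: _meta.disable toggles a plugin without removing its config.
                "_meta": {"disable": not degrade},
                "response_example": json.dumps({"reviews": []}),  # hypothetical payload
            }
        }
    }
    return urllib.request.Request(
        url=f"{ADMIN_URL}/routes/{route_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={"X-API-KEY": ADMIN_KEY, "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; `build_degradation_switch("reviews", True)` turns degradation on for a hypothetical route with the ID `reviews`, and passing `False` turns it off again.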

5. Monitoring and Alerting

The team sets up alerting policies within the API7 gateway, or integrates APISIX with their in-house monitoring and alerting systems. API response times, error rates, and call volumes are monitored in real time, and degradation is executed automatically when a degradation threshold or trigger condition is reached. Operations personnel are notified via SMS, email, or other channels so they can address potential issues promptly.
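
A threshold check of this kind can be sketched as a sliding window over recent responses. The window size, the 50% error-rate threshold, the minimum sample count, and the choice to treat any 5xx status as an error are illustrative assumptions, not values taken from either product.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks the error rate over the most recent responses for one API."""

    def __init__(self, size=100, threshold=0.5, min_samples=20):
        self.samples = deque(maxlen=size)  # True for an error, False otherwise
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, status_code):
        # Assumption: any 5xx response counts as an error.
        self.samples.append(status_code >= 500)

    def error_rate(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def should_degrade(self):
        """True once enough samples exist and the error rate reaches the threshold."""
        return (
            len(self.samples) >= self.min_samples
            and self.error_rate() >= self.threshold
        )
```

When `should_degrade()` returns true, the caller would flip the degradation switch for that API and send a notification to the operations team.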

6. Evaluation and Adjustment

Following the implementation of degradation strategies, the team closely monitors system performance and user experience. They evaluate the effectiveness of degradation strategies through user feedback, business data analysis, and monitoring metrics. Based on evaluation results, the team makes timely adjustments to degradation strategies to better balance system load and user satisfaction.

In Conclusion

API degradation is a critical fault tolerance mechanism in modern distributed systems, and implementing it at the API gateway level helps protect both system stability and user experience. By identifying key APIs, designing degradation strategies, configuring gateway plugins, managing configuration dynamically, monitoring and alerting, and evaluating effectiveness, teams can keep core functions available even under pressure.