etcd vs PostgreSQL

Historical Background

PostgreSQL

PostgreSQL was originally developed in 1986 under the leadership of Professor Michael Stonebraker at the University of California, Berkeley. Over the course of several decades of development, PostgreSQL has emerged as the leading open-source relational database management system available today. Its permissive license enables anyone to use, modify, and distribute PostgreSQL freely, regardless of whether it is for private, commercial, or academic research purposes.

PostgreSQL offers robust support for both online analytical processing (OLAP) and online transaction processing (OLTP), boasting powerful SQL query capabilities and a broad range of extensions that allow it to meet nearly all commercial needs. As a result, it has garnered increasing attention in recent years. In fact, PostgreSQL's scalability and high performance enable it to replicate the functionality of virtually any other type of database.

postgres architecture Image source (following CC 3.0 BY-SA license agreement): https://en.wikibooks.org/wiki/PostgreSQL/Architecture

etcd

How did etcd come into existence, and what problem does it solve?

In 2013, the startup team CoreOS developed a product called Container Linux. It's an open-source, lightweight operating system that prioritizes the automation and rapid deployment of application services. Container Linux requires applications to run in containers and provides a cluster management solution, making it convenient for users to manage services as if on a single machine.

To ensure that user services wouldn't experience downtime due to a node restart, CoreOS needed to run multiple replicas. But how would they coordinate between multiple replicas and avoid all replicas becoming unavailable during changes?

To address this issue, the CoreOS team required a coordination service that could store service configuration information and provide distributed locking capabilities, and more. So, what was their approach? They first analyzed the business scenario, pain points, and core objectives. Then, they selected a solution that aligned with their goals, evaluating whether to choose an open-source community solution or develop their own custom tool. This approach is a universal problem-solving method that is often employed when faced with challenging problems, and the CoreOS team followed the same principle.

A coordination service ideally needs to meet the following five goals:

High availability with multiple data replicas
Data consistency with version checking between replicas
Minimal storage capacity: the coordination service should store only critical metadata configuration information for services and nodes belonging to the control plane configuration, rather than user-related data. This approach minimizes the need for data sharding for storage and avoids excessive design.
Functionality for CRUD (create, read, update, and delete), as well as a mechanism for listening to data changes. It should store the status information of services, and when there are changes or abnormalities in the services, it should quickly push the change event to the control plane. This helps improve service availability and reduces unnecessary performance overhead for the coordination service.
Operational simplicity: the coordination service should be easy to operate, maintain, and troubleshoot. An easy-to-use interface can reduce the risk of errors, lower maintenance costs, and minimize downtime.

From the perspective of the CAP Theorem, etcd belongs to the CP (Consistency & Partition Tolerance) system. etcd architecture

As the central component of a Kubernetes cluster, kube-apiserver uses etcd as its underlying storage.

On the one hand, etcd is used for persistence in creating resource objects in a k8s cluster. On the other hand, it is etcd's data watch mechanism that drives the entire cluster's Informer work, enabling continuous container orchestration.

Therefore, from a technical perspective, the core reasons why Kubernetes uses etcd are:

etcd is written in Go language, which is consistent with the k8s technology stack, has low resource consumption, and is extremely easy to deploy.
etcd’s strong consistency, watch, lease, and other features are core dependencies of k8s.

In summary, etcd is a distributed key-value database designed specifically for configuration management and distribution. As a cloud-native software, it offers out-of-the-box usability and high performance, making it superior to traditional databases in this particular area of need.

To make an objective comparison between etcd and PostgreSQL, which are two different types of databases, it is important to evaluate them in the context of the same requirement. Therefore, this article will only discuss the differences between the two in terms of their ability to meet the requirements of configuration management.

Data Model

Different databases have different data models that they present to users, and this factor determines the database's suitability for various scenarios.

Key-value vs SQL

The key-value data model is a popular model in NoSQL, which is also adopted by etcd. How does this model compare to SQL and what are its advantages?

First, let's take a look at SQL.

Relational databases maintain data in tables and provide an efficient, intuitive, and flexible way to store and access structured information.

A table, also known to as a relation, is made up of columns that contain one or more categories of data, and rows, also known as table records, that include a set of data defining the categories. Applications retrieve data by using queries that employ operations such as "project" to identify attributes, "select" to identify tuples, and "join" to combine relations. The relational model for managing databases was developed in 1970 by Edgar Codd, a computer scientist at IBM.

relational database

Image source (complying with CC 3.0 BY-SA licensing agreement): https://en.wikipedia.org/wiki/Associative_entity

Records in a table do not have unique identifiers because tables are designed to accommodate multiple duplicate rows. To enable key-value queries, a unique index must be added to the field that serves as the key in the table. PostgreSQL's default index is btree, which, similar to etcd, can perform range queries on keys.

Structured query language (SQL) is a programming language for storing and processing information in a relational database. A relational database stores information in tabular form, with rows and columns representing different data attributes and the various relationships between the data values. You can use SQL statements to store, update, remove, search, and retrieve information from the database. You can also use SQL to maintain and optimize database performance.

PostgreSQL has expanded SQL with numerous extensions, rendering it a Turing-complete language. This means that SQL can perform any complex operation, facilitating the execution of data processing logic entirely on the server side.

In comparison, etcd is designed as a configuration management tool, with configuration data typically represented as a hash table. This is why its data model is structured as a key-value format, effectively creating a single large global table. CRUD operations can be performed on this table, which has only two fields: a unique key with version information, and an untyped value. As a result, clients must retrieve the full value for further processing.

Overall, the key-value structure of etcd simplifies SQL and is more convenient and intuitive for the specific task of configuration management.

MVCC (Multi-Version Concurrency Control)

MVCC is an essential feature for versioning data in configuration management. It allows for:

Querying historical data
Determining the age of data by comparing versions
Watching data, which requires versioning to enable incremental notifications

Both etcd and PostgreSQL have MVCC, but what are the differences between them?

etcd uses a globally incrementing 64-bit version counter to manage its MVCC system. There is no need to worry about overflow. The counter is designed to handle a vast number of updates, even if they occur at a rate of millions per second. Each time a key-value pair is created or updated, it is assigned a version number. When a key-value pair is deleted, a tombstone is created with a version number reset to 0. This means that every change produces a new version, rather than overwriting the previous one.

Furthermore, etcd retains all versions of a key-value pair and makes them visible to users. The key-value data is never overwritten, and new versions are stored alongside the existing ones. The MVCC implementation in etcd also provides read-write separation, which allows users to read data without locking, making it suitable for read-intensive use cases.

PostgreSQL's MVCC implementation differs from that of etcd in that it is not focused on providing incrementing version numbers, but rather on implementing transactions and different isolation levels transparently to the user. MVCC is an optimistic locking mechanism that enables concurrent updates. Each row in a table has a transaction ID record, with xmin representing the transaction ID of the row creation and xmax representing the transaction ID of the row update.

Transactions can only read data that has already been committed before them.
When updating data, if a version conflict is encountered, PostgreSQL will retry with a matching mechanism to determine whether the update should proceed.

To view an example, please refer to the following link: https://devcenter.heroku.com/articles/postgresql-concurrency

Unfortunately, using transaction IDs for version control of configuration data in PostgreSQL is not possible for several reasons:

Transaction IDs are assigned to all rows involved in the same transaction, meaning version control cannot be applied at the row-level.
Historical queries cannot be performed, and only the latest version of a row can be accessed.
Due to their 32-bit counter nature, transaction IDs are prone to overflow and reset during vacuuming.
It is not possible to implement watch functionality based on transaction IDs.

As a result, PostgreSQL requires alternative methods for version control of configuration data since built-in support is unavailable.

Client Interface

The design of a client interface is a critical aspect when it comes to determining the cost and resource consumption associated with its use. By analyzing the differences between interfaces, one can make informed choices when selecting the most suitable option.

etcd’s kv/watch/lease APIs have proved to be particularly adept at managing configurations. However, how can one implement these APIs in PostgreSQL?

Unfortunately, PostgreSQL does not provide built-in support for these APIs, and encapsulation is necessary to implement them. To analyze their implementation, we will examine the pg_watch_demo project developed by myself: pg_watch_demo.

gRPC/HTTP vs TCP

PostgreSQL follows a multi-process architecture, where each process handles only one TCP connection at a time. It uses a custom protocol to deliver functionality via SQL queries and follows a request-response interaction model (similar to HTTP/1.1, which handles only one request at a time and requires pipelining for processing multiple requests simultaneously). However, given the high resource consumption and relatively low efficiency, a connection pool proxy (such as pgbouncer) is crucial for improving performance, especially in scenarios with high QPS.

On the other hand, etcd is designed on a multi-coroutine architecture in Golang and offers two user-friendly interfaces: gRPC and RESTful. These interfaces are easy to integrate with and are efficient in terms of resource consumption. Additionally, each gRPC connection can handle multiple concurrent queries, which ensures optimal performance.

Defining Data

etcd

message KeyValue {
  bytes key = 1;
  // Revision number when the key was created
  int64 create_revision = 2;
  // Revision number when the key was last modified
  int64 mod_revision = 3;
  // Incrementing counter that increases every time the key is updated.
  // This counter is reset to zero when the key is deleted, and is used as a tombstone.
  int64 version = 4;
  bytes value = 5;
  // The lease object used by the key for TTL. If the value is 0, then there is no TTL.
  int64 lease = 6;
}

PostgreSQL

PostgreSQL needs to use a table to simulate etcd's global data space:

CREATE TABLE IF NOT EXISTS config (
  key text,
  value text,
  -- Equivalent to `create_revision` and `mod_revision`
  -- Here, a big integer incrementing sequence type is used to simulate revision
  revision bigserial,
  -- Tombstone
  tombstone boolean NOT NULL DEFAULT false,
  -- Composite index, search by key first, then by revision
  primary key(key, revision)
);

get

etcd

etcd's get API has a wide range of parameters:

Range queries, for example, setting key as /abc and range_end as /abd will retrieve all the key-value pairs with /abc as the prefix.
Historical queries, specifying revision or a range of mod_revision.
Sorting and limiting the number of returned results.

message RangeRequest {
  ...
  bytes key = 1;
  // Range queries
  bytes range_end = 2;
  int64 limit = 3;
  // Historical queries
  int64 revision = 4;
  // Sorting
  SortOrder sort_order = 5;
  SortTarget sort_target = 6;
  bool serializable = 7;
  bool keys_only = 8;
  bool count_only = 9;
  // Historical queries
  int64 min_mod_revision = 10;
  int64 max_mod_revision = 11;
  int64 min_create_revision = 12;
  int64 max_create_revision = 13;
}

PostgreSQL

PostgreSQL can perform the get function of etcd through SQL, and even provide more complex functionalities. Since SQL itself is a language rather than a fixed-parameter interface, it is highly versatile. Here we show a simple example of retrieving the latest key-value pair. Since the primary key is a combined index, it can be quickly searched by range, resulting in high-speed retrieval.

CREATE FUNCTION get1(kk text)
RETURNS table(r bigint, k text, v text, c bigint) AS $$
    SELECT revision, key, value, create_time
    FROM config
    where key = kk and tombstone = false
    ORDER BY key, revision desc
    limit 1
$$ LANGUAGE sql;

put

etcd

message PutRequest {
  bytes key = 1;
  bytes value = 2;
  int64 lease = 3;
  // whether to respond with the key-value pair data before the update from this `Put` request.
  bool prev_kv = 4;
  bool ignore_value = 5;
  bool ignore_lease = 6;
}

PostgreSQL

Just like in etcd, PostgreSQL does not execute changes in place. Instead, a new row is inserted, and a new revision is assigned to it.

CREATE FUNCTION set(k text, v text) RETURNS bigint AS $$
  insert into config(key, value) values(k, v) returning revision;
$$ LANGUAGE SQL;

delete

etcd

message DeleteRangeRequest {
  bytes key = 1;
  bytes range_end = 2;
  bool prev_kv = 3;
}

PostgreSQL

Similar to etcd, deletion in PostgreSQL does not modify data in place. Instead, a new row is inserted with the tombstone field set to true to indicate that it is a tombstone.

CREATE FUNCTION del(k text) RETURNS bigint AS $$
  insert into config(key, tombstone) values(k, true) returning revision;
$$ LANGUAGE SQL;

watch

etcd

message WatchCreateRequest {
  bytes key = 1;
  // Specifies the range of keys to watch
  bytes range_end = 2;
  // Starting revision for the watch
  int64 start_revision = 3;
  ...
}

message WatchResponse {
  ResponseHeader header = 1;
  ...
  // For efficiency, multiple events can be returned
  repeated mvccpb.Event events = 11;
}

PostgreSQL

PostgreSQL does not come with a built-in watch function, and instead, it requires a combination of triggers and channels to achieve similar functionality. By using pg_notify, data can be sent to all applications that are listening to a specific channel.

-- trigger function for distributing put/delete events
CREATE FUNCTION notify_config_change() RETURNS TRIGGER AS $$
DECLARE
  data json;
  channel text;
  is_channel_exist boolean;
BEGIN
  IF (TG_OP = 'INSERT') THEN
    -- use JSON to encode
    data = row_to_json(NEW);
    -- Extract channel name for distribution from key
    channel = (SELECT SUBSTRING(NEW.key, '/(.*)/'));
    -- If an application is watching the channel, send an event through it
    is_channel_exist = NOT pg_try_advisory_lock(9080);
    IF is_channel_exist THEN
        PERFORM pg_notify(channel, data::text);
    ELSE
        PERFORM pg_advisory_unlock(9080);
    END IF;
  END IF;
  RETURN NULL; -- Result is ignored since this is an AFTER trigger
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER notify_config_change
AFTER INSERT ON config
FOR EACH ROW EXECUTE FUNCTION notify_config_change();

Since the watch feature is encapsulated, client applications must also implement corresponding logic. Using Golang as an example, the following steps must be taken:

Start listening: When listening starts, all notify data will be cached at both the PostgreSQL and Golang channel levels.
Retrieve all data using get_all(key_prefix, revision): This function reads all existing data starting from the specified revision. For each key, only the latest revision data will be returned, with any deleted data automatically removed. If revision is not specified, it returns the latest data for all keys with the given key_prefix.
Watch for new data, including any notifications that may have been cached between the first and second steps, to avoid missing any new data that may occur during this time window. Ignore any revisions that have already been read in step two.

func watch(l *pq.Listener) {
    for {
        select {
        case n := <-l.Notify:
            if n == nil {
                log.Println("listener reconnected")
                log.Printf("get all routes from rev %d including tombstones...\n", latestRev)
   // When reconnecting, resume transmission based on the revision before the disconnection.
                str := fmt.Sprintf(`select * from get_all_from_rev_with_stale('/routes/', %d)`, latestRev)
                rows, err := db.Query(str)
                ...
                continue
            }
            ...
            // maintain a state that records the latest revision it has received
            updateRoute(cfg)
        case <-time.After(15 * time.Second):
            log.Println("Received no events for 15 seconds, checking connection")
            go func() {
                // If no events are received for a prolonged period, check the health of the connection
                if err := l.Ping(); err != nil {
                    log.Println("listener ping error: ", err)
                }
            }()
        }
    }
}

log.Println("get all routes...")
// When initializing, the application should obtain all current key-value pairs and then incrementally monitor updates through watch
rows, err := db.Query(`select * from get_all('/routes/')`)
...
go watch(listener)

transaction

etcd

etcd's transactions are a collection of multiple operations with conditional checks, and the modifications made by the transaction are atomically committed.

Regenerate response

message TxnRequest {
  // Specify the transaction execution condition
  repeated Compare compare = 1;
  // Operations to be executed if the condition is met
  repeated RequestOp success = 2;
  // Operations to be executed if the condition is not met
  repeated RequestOp failure = 3;
}

PostgreSQL

The DO command in PostgreSQL allows for the execution of any command, including stored procedures. It supports multiple languages, including built-in languages like PL/pgSQL and Python. With these languages, any conditional judgments, loops, and other control logic can be implemented, making it more versatile than etcd.

DO LANGUAGE plpgsql $$
DECLARE
     n_plugins int;
BEGIN
    SELECT COUNT(1) INTO n_plugins FROM get_all('/plugins/');
    IF n_plugins = 0 THEN
       perform set('/routes/1', 'foobar');
       perform set('/upstream/1', 'foobar');
       ...
    ELSE
       ...
    END IF;
END;
$$;

lease

etcd

In etcd, it is possible to create a lease object that applications must renew periodically to prevent it from expiring. Each key-value pair can be linked to a lease object, and when the lease object expires, all associated key-value pairs will also expire, automatically deleting them.

message LeaseGrantRequest {
  // TTL of the lease
  int64 TTL = 1;
  int64 ID = 2;
}

// Lease renewal
message LeaseKeepAliveRequest {
  int64 ID = 1;
}

message PutRequest {
  bytes key = 1;
  bytes value = 2;
  // Lease ID, used to implement TTL
  int64 lease = 3;
  ...
}

PostgreSQL

In PostgreSQL, a lease can be maintained through a foreign key. When querying, if there is an associated lease object that has expired, it is considered a tombstone.
Keepalive requests update the last_keepalive timestamp in the lease table.

CREATE TABLE IF NOT EXISTS config (
  key text,
  value text,
  ...
  -- Use a foreign key to specify the associated lease object.
  lease int64 references lease(id),
);

CREATE TABLE IF NOT EXISTS lease (
  id text,
  ttl int,
  last_keepalive timestamp;
);

Performance Comparison

PostgreSQL needs to simulate various APIs of etcd through encapsulation. So how is its performance? Here are the results of a simple test:https://github.com/kingluo/pg_watch_demo#benchmark.

etcd_vs_postgres

The results show that the read and write performance are nearly identical, with PostgreSQL even outperforming etcd. Additionally, the latency from an update occurring to the application receiving the event determines the efficiency of the update distribution, and both PostgreSQL and etcd perform similarly. When tested on the same machine for both the client and the server, the watch latency was less than 1 millisecond.

PostgreSQL, however, has some shortcomings worth mentioning:

The WAL log for each update is larger, resulting in twice as much disk I/O compared to etcd.
It consumes more CPU compared to etcd.
Notify based on channels is a transaction-level concept. When updating the same type of resource, the update is sent to the same channel, and the update requests contend for mutual exclusion locks, resulting in serialized requests. In other words, using channels to implement watch will affect the parallelism of put operations.

This highlights that to achieve the same requirements, we need to invest more in learning and optimizing PostgreSQL.

Storage

The performance is determined by the underlying storage, and how data is stored determines the database's resource requirements for memory, disk, and other resources.

etcd

Architecture diagram of etcd storage:

etcd storage

etcd first writes updates to the write-ahead log (WAL) and flushes them to disk to ensure that the updates are not lost. Once the log is successfully written and confirmed by a majority of nodes, the results can be returned to the client. etcd also asynchronously updates TreeIndex and BoltDB.

To avoid the log from growing infinitely, etcd periodically takes a snapshot of the storage, and logs prior to the snapshot can be deleted.

etcd indexes all keys in memory (TreeIndex), recording the version information of each key, but only keeps a pointer to BoltDB (revision) for the value.

The value corresponding to the key is stored on disk and maintained using BoltDB.

Both TreeIndex and BoltDB use the btree data structure, which is known for its efficiency in lookups and range lookups.

TreeIndex structure diagram:

etcd treeindex

(Image source: https://blog.csdn.net/H_L_S/article/details/112691481, licensed under CC 4.0 BY-SA)

Each key is divided into different generations, with each deletion marking the end of a generation.

The pointer to the value is composed of two integers. The first integer main is the transaction ID of etcd, while the second integer sub represents the update ID of this key within that transaction.

Boltdb supports transactions and snapshots, and it stores the value corresponding to the revision.

etcd boltdb

(Image source: https://blog.csdn.net/H_L_S/article/details/112691481, licensed under CC 4.0 BY-SA)

Example of writing data:

Writing key="key1", revision=(12,1), value="keyvalue5". Note the changes in the red parts of treeIndex and BoltDB:

etcd put

(Image source: https://blog.csdn.net/H_L_S/article/details/112691481, licensed under CC 4.0 BY-SA)

Deleting key="key", revision=(13,1) creates a new empty generation in treeIndex and generates an empty value in BoltDB with key="13_1t".

Here, the t stands for "tombstone". This implies that you cannot read the tombstone because the pointer in treeIndex is (13,1), but in BoltDB, it is 13_1t, which cannot be matched.

etcd delete

(Image source: https://blog.csdn.net/H_L_S/article/details/112691481, licensed under CC 4.0 BY-SA)

It is worth noting that etcd schedules both reads and writes to BoltDB using a single goroutine to reduce random disk I/O and improve I/O performance.

PostgreSQL

Architecture diagram of PostgreSQL storage:

postgres storage

Similar to etcd, PostgreSQL appends updates to a log file first, and waits for the log to be successfully flushed to disk before considering the transaction complete. Meanwhile, the updates are written to the shared_buffer memory.

The shared_buffer is a memory area that is shared by all tables and indexes in PostgreSQL, and it serves as a mapping for these objects.

In PostgreSQL, each table consists of multiple pages, with each page being 8 KB in size and containing multiple rows.

In addition to tables, indexes (such as btree indexes) are also made up of table pages in the same format. However, these pages are special and are interconnected to form a tree structure.

PostgreSQL is equipped with a checkpointer process that periodically flushes all modified table and index page to disk. Prior to each checkpoint, log files can be deleted and recycled to prevent the log from growing indefinitely.

Page structure:

postgres page

(Image source: https://en.wikibooks.org/wiki/PostgreSQL/Page_Layout, licensed under CC 3.0 BY-SA)

Btree index structure:

postgres btree index

(Image source: https://en.wikibooks.org/wiki/PostgreSQL/Index_Btree, licensed under CC 3.0 BY-SA)

To enhance read performance, certain SQL statements in PostgreSQL consider using bitmaps to sequentially read scattered pages, thus improving I/O performance.

EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.07..229.20 rows=101 width=244)
   Recheck Cond: (unique1 < 100)
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 < 100)

Conclusion

PostgreSQL and etcd both prioritize I/O performance in their storage design. etcd even places the index of all keys in memory, and both systems optimize batch operations for sequential disk reads and writes.

As a result, as seen in the performance comparison above, PostgreSQL and etcd demonstrate similar read and write performance.

However, compared to PostgreSQL, etcd requires greater memory capacity and faster disks, as detailed in their Hardware guidelines for administering etcd clusters.

Distributed Computing

Decentralization and data consistency are the hallmark features of etcd, and they are also necessary requirements for cloud-native systems. However, how can traditional databases fulfill these requirements?

etcd

Raft is a popular distributed protocol used by etcd to distribute updates to multiple nodes, ensuring that committed data is confirmed by a majority of nodes.

Raft has rigorously defined roles and its role switching diagram is shown below:

etcd raft state machine

(Image source: https://raft.github.io, licensed under CC 3.0 BY-SA)

By default, all reads and writes are executed on the master node in etcd. While it's easy to understand the need for consistency in writes, it's worth noting the importance of consistency in reads. Consistency in reads ensures that the reads come from committed data and the latest version of the data. It also ensures that each read version is equal to or greater than the previous read version.

To implement consistent reads, the slave nodes obtain the latest version of data from the master node. If the slave node's version is older than that of the master node, it waits for synchronization.

As a result, all read and write tasks in etcd are handled exclusively by the master node, while the distribution mechanism only guarantees the availability of replicas and data consistency, without offering any load balancing capabilities.

PostgreSQL

PostgreSQL originates from a traditional database background and does not include built-in implementations of distributed protocols like Raft. Nevertheless, it does possess the necessary data replication capabilities for clustering. By incorporating third-party Raft components, PostgreSQL can function as a distributed system that operates in the same manner as etcd.

PostgreSQL comes with a variety of basic features, including:

Synchronous commit
Quorum replication
Failover trigger
Hot standby

On the master node, transactions can be configured to require confirmation from multiple nodes for successful submission, with the number of confirmation nodes set to a majority (quorum).

For more information, see: https://www.2ndquadrant.com/en/blog/evolution-fault-tolerance-postgresql-synchronous-commit/

The role of data replication can be switched through a failover trigger, and tools like pg_rewind can remove data that hasn't been confirmed by a majority of nodes in order to rejoin the cluster later.

Hot-standby allows for serializable read, akin to etcd, which permits already committed data to be read on replica nodes, although it doesn’t guarantee to be the latest version.

Below is an example of relevant configurations:

-- set quorum sync replication in postgresql.conf
-- assume you have 5 nodes, then at least 2 standbys must be committed
-- then you could tolerate 2 nodes failures
synchronous_commit on
synchronous_standby_names ="ANY 2 (*)"

-- if master fails, check flushed lsn of each standby
-- promote a standby with max lsn to master
select flushed_lsn from pg_stat_wal_receiver;

PostgreSQL provides full support for clustering on the data plane, and the provision of the Raft component on the control plane enables the creation of a decentralized cluster. As a PostgreSQL worker process, the pg_raft component, which I have provided to multiple commercial clients, offers cluster management functions such as leader election based on the Raft protocol.

Maintenance

etcd is a database designed for specific needs, which means it doesn't require much maintenance, making it one of its selling points.

Meanwhile, PostgreSQL requires less maintenance from DBAs compared to other relational databases due to its well-designed architecture. Like etcd, many of the maintenance tasks in PostgreSQL are automatic and built-in.

Database management involves various routine maintenance tasks, but this discussion will focus on two: compaction and snapshot backup.

Compaction

Maintaining multiple versions of data can lead to a bloated database and decreased read/write efficiency. To address this issue, older versions of data should be deleted when they are no longer needed, and any resulting gaps should be merged through a process called compaction.

etcd

etcd offers compact and defrag operations in its API to support this process.

The compact operation is used to delete all old versions of data before a certain revision. If the revision range includes the latest version, the latest version will be preserved. For example, if the command compact 100 is used and there is a key-value pair with key=foo, revision=87 it will be preserved, but the key-value pair with key=foo, revision=65 will be deleted. In other words, compact does not delete the current data version for each key.

etcd also provides an auto compaction feature that allows users to specify how often to run compact, such as every few hours.

When compact is used, it leaves gaps in the boltdb, which must be consolidated using defrag. However, defrag involves significant I/O and can block read and write operations, so it should be used with caution.

PostgreSQL

On the other hand, PostgreSQL's compaction process is also simple. For example, to delete old data before revision 100, one can use the following SQL command:

with alive as (
    select r as revision from get_all('/routes/')
)
delete from config
where revision < 100 and not exists (
    select 1 from alive where alive.revision = config.revision limit 1
);

If you need to execute compaction regularly, you can use crontab or pg_cron.

As for MVCC cleanup within the database, PostgreSQL has its own VACUUM command (VACUUM FULL equivalent to defrag in etcd), as well as an automated autovacuum feature.

Snapshot

Snapshot backup is a necessary maintenance task for databases, as it can be used for emergency recovery.

etcd

etcd provides an API for creating and restoring snapshots, for example:

$ etcdctl snapshot save backup.db
$ etcdctl --write-out=table snapshot status backup.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 |       10 |          7 | 2.1 MB     |
+----------+----------+------------+------------+
$ etcdctl snapshot restore backup.db

PostgreSQL

PostgreSQL also has very comprehensive backup tools:

pg_basebackup is used to prepare data for new PostgreSQL replica nodes.
pgdump is used to clone the database instance online and select which tables to back up.

In fact, based on WAL and logical replication, PostgreSQL also supports more advanced backup mechanisms. Please see the following link for more information:

www.postgresql.org/docs/current/continuous-archiving.html

www.postgresql.org/docs/current/logical-replication.html

Conclusion

PostgreSQL is a versatile traditional SQL database, while etcd is a specialized distributed KV database.

Compared to a pure data access system like etcd, PostgreSQL has several additional benefits:

Rich authentication mechanisms that can implement complete RBAC and fine-grained access control, support multi-tenancy (multiple database instances), can filter IP addresses, and do not require additional proxies.
SQL has built-in schemas, supports foreign keys, and does not require additional control logic to ensure data completeness.
Supports JSON data types, JSON-based indexes, and various JSON operations, such as indexing routing configurations for routing matching.
Supports data encryption and can access HashiCorp Vault to obtain secrets through FDW (Foreign Data Wrappers).
Logical replication can achieve data synchronization between multiple independent clusters.
Support for stored procedures, which can implement additional functionality, such as implementing upstream slow start.

In terms of functionality, PostgreSQL is a superset of etcd, so PostgreSQL can reproduce the functionality of etcd through its rich built-in features and third-party components, and can also be cloud-native.

While using PostgreSQL to implement etcd's functionality is technically feasible, it is akin to converting an aircraft carrier into a cruiser. However, if there are no requirements beyond etcd's capabilities, this approach can be cost-ineffective due to significant development and maintenance costs.

The most significant advantage of etcd is its out-of-the-box nature, meeting the configuration distribution needs of the cloud-native era. etcd can also serve as a core component for features such as leader election, distributed locks, and task scheduling.