Continuous backup of PostgreSQL in neetoDeploy

By Abhishek T

on August 1, 2023

neetoDeploy is a Heroku alternative for deploying web applications. Currently, neeto uses neetoDeploy for all pull request review applications and staging applications. neetoDeploy is a platform as a service (PaaS) solution that makes it easy to deploy and manage applications, with features like database backup, auto deployment from GitHub, live logs, SSL certificate management, auto scaling and more.

neetoDeploy's PostgreSQL addon has a continuous backup feature. This feature allows you to roll your database back to a specific time in the past. The continuous backup process can be split into two phases: data backup and data recovery. Let's discuss these phases in detail.

Initial setup

Let us create a new PostgreSQL cluster and populate it with some data.

Creating parent database

initdb /usr/local/var/postgresql-parent

Creating some sample data

Start the PostgreSQL server and login using psql. Provide role and password if needed.

pg_ctl -D /usr/local/var/postgresql-parent start
psql -d postgres

Create a new table

postgres=#
CREATE TABLE customers (
  user_id SERIAL PRIMARY KEY,
  username VARCHAR(50),
  email VARCHAR(100),
  registration_date DATE
);

Seed sample data

postgres=#
INSERT INTO customers (username, email, registration_date)
VALUES
('johndoe', 'johndoe@example.com', '2021-01-01'),
('janedoe', 'janedoe@example.com', '2021-02-15'),
('bobsmith', 'bobsmith@example.com', '2021-03-10'),
('sarahlee', 'sarahlee@example.com', '2021-04-05'),
('maxwell', 'maxwell@example.com', '2021-05-20');

Data backup

To perform recovery, we need to have access to the existing data. Hence we must store all the database changes in a separate location, allowing us to utilize this data for recovery.

This is where the Write Ahead Log (WAL) in PostgreSQL comes into play. PostgreSQL follows a mechanism that writes all transactions to the WAL before committing them to the table data files. By preserving the WAL, we can reconstruct the data by traversing the logged transactions. Utilizing the information stored in the WAL, we can systematically reapply the recorded transactions to recreate the data and achieve the desired recovery. Your WAL files are in your PostgreSQL data directory's pg_wal folder.
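The file names in pg_wal follow a fixed scheme that the recovery tooling relies on: each 24-character hexadecimal segment name encodes the timeline ID, the log file number and the segment number (8 characters each). As a quick illustration, the parts can be split off like this (the segment name below is a made-up example, not one from this setup):

```shell
# Split a WAL segment name into its parts. Per PostgreSQL's naming scheme,
# the 24 hex characters are: timeline ID (8), log number (8), segment (8).
# The segment name below is a hypothetical example.
wal_segment="000000010000000A000000FF"

timeline=$(echo "$wal_segment" | cut -c1-8)
log_no=$(echo "$wal_segment" | cut -c9-16)
seg_no=$(echo "$wal_segment" | cut -c17-24)

echo "timeline=$timeline log=$log_no segment=$seg_no"
```

This prints `timeline=00000001 log=0000000A segment=000000FF`. The timeline portion is why the archived file names change after a recovery: PostgreSQL starts a new timeline each time it is promoted.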

WAL archiving

As data modifications occur in the databases, new WAL files are generated while older WAL files are eventually discarded. Therefore, it is crucial to store the WAL files before they are deleted. Here we will use an AWS S3 bucket to store our WAL files. Also, we will use wal-e and envdir to simplify the process.

Install helper modules

python3 -m pip install wal-e envdir

If you face any dependency issues, please refer to the wal-e documentation to fix them.

Store AWS S3 credentials

To use "envdir" with your AWS credentials, store them in a folder. The following code snippet demonstrates how to accomplish this:

mkdir ~/wal-e.env
echo "YOUR_AWS_REGION" > ~/wal-e.env/AWS_REGION
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/wal-e.env/AWS_SECRET_ACCESS_KEY
echo "YOUR_AWS_ACCESS_KEY_ID" > ~/wal-e.env/AWS_ACCESS_KEY_ID
echo "YOUR_AWS_STORAGE_PATH" > ~/wal-e.env/WALE_S3_PREFIX
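If you are curious what envdir actually does with that folder: it runs a command with one environment variable per file, named after the file and holding the file's contents. A rough emulation in plain shell (with placeholder values, not real credentials):

```shell
# Rough emulation of envdir: export each file in the directory as an
# environment variable named after the file. Values are placeholders.
env_dir=$(mktemp -d)
echo "us-east-1" > "$env_dir/AWS_REGION"
echo "s3://example-bucket/wal-e" > "$env_dir/WALE_S3_PREFIX"

for f in "$env_dir"/*; do
  export "$(basename "$f")=$(cat "$f")"
done

echo "$AWS_REGION"       # prints: us-east-1
echo "$WALE_S3_PREFIX"   # prints: s3://example-bucket/wal-e
```

Keeping credentials in files rather than in shell history or the PostgreSQL configuration is the main reason envdir is used here.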

Configure WAL archiving

Now, let's configure PostgreSQL to store the WAL files in your S3 bucket before they are deleted. Open your PostgreSQL configuration file "postgresql.conf", located in your data directory ("/usr/local/var/postgresql-parent"), and make the following changes.

# To enable WAL archiving.
archive_mode = on

# Determines how much information is written to the WAL.
wal_level = replica

# Force PostgreSQL to switch to a new WAL file every 60 seconds.
archive_timeout = 60

# Command for pushing WAL files to the S3 bucket.
archive_command = 'envdir ~/wal-e.env wal-e wal-push %p'

You can find more information about the above settings in the official PostgreSQL documentation on WAL configuration.

Now let's restart the server.

data_directory="/usr/local/var/postgresql-parent"
pg_ctl -D $data_directory restart -l $data_directory/postgresql.log

You can watch the PostgreSQL logs like this.

tail -f /usr/local/var/postgresql-parent/postgresql.log

After starting your server, monitor the PostgreSQL logs. If you don't observe any WAL archive logs even after 60 seconds, it is because a WAL file switch will only occur if there are any modifications. To generate the desired outcome, make updates in the database. Upon doing so, you will be able to see lines similar to the following in your PostgreSQL logs:

Message: 'MSG: completed archiving to a file\nDETAIL: Archiving to "s3://neeto-deploy-backups/local/wal_005/000000020000000000001.lzo" complete at 158.209KiB/s.\nSTRUCTURED: time=2023-05-18T06:22:27.613242-00 pid=263 action=push-wal key=s3://neeto-deploy-backups/local/wal_005/000000020000000000001.lzo prefix=local/ rate=158.209 seg=000000020000000000001 state=complete'
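If you want to script a check that archiving is progressing, one option is to pull the S3 key out of each push-wal log line with sed. A minimal sketch, using a line trimmed from the sample above:

```shell
# Extract the S3 key from a wal-e push-wal log line with sed.
# The sample line is trimmed from the log output shown above.
line='action=push-wal key=s3://neeto-deploy-backups/local/wal_005/000000020000000000001.lzo state=complete'

key=$(echo "$line" | sed -n 's/.*key=\([^ ]*\).*/\1/p')
echo "$key"
```

This prints the full `s3://` key, which you could then compare against the contents of the bucket.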

Base backup

Since we have enabled WAL archiving, all new changes are stored in the bucket as WAL files. However, what about the data we created before configuring WAL archiving? To ensure a complete recovery, we also need that data. Therefore, we need to take a base backup of our database. The following code will initiate the base backup process and push it to the S3 bucket.

envdir ~/wal-e.env wal-e backup-push /usr/local/var/postgresql-parent

On running the above command, you will get output similar to the following.

wal_e.main   INFO     MSG: starting WAL-E
DETAIL: The subcommand is "backup-push".
STRUCTURED: time=2023-05-18T06:24:27.613242-00 pid=41178
wal_e.worker.upload INFO     MSG: begin uploading a base backup volume
DETAIL: Uploading to "s3://neeto-deploy-backups/local/basebackups_005/base_000000010000000000005_00000040/tar_partitions/part_00000000.tar.lzo".
STRUCTURED: time=2023-05-18T06:24:29.527622-00 pid=41178
wal_e.worker.upload INFO     MSG: finish uploading a base backup volume
DETAIL: Uploading to "s3://neeto-deploy-backups/local/basebackups_005/base_000000010000000000005_00000040/tar_partitions/part_00000000.tar.lzo" complete at 1459.68KiB/s.
STRUCTURED: time=2023-05-18T06:24:33.121716-00 pid=41178
NOTICE:  all required WAL segments have been archived

After completing the base backup, you can verify the presence of the base backup folder in your S3 bucket. Now we have the backup and new WAL files are being pushed to our bucket.

Data recovery

Let's consider a scenario where you need to perform a rollback. Suppose you accidentally deleted a table from your database that contained multiple records, making it impractical to recall and reinsert them manually. As a first step, make a note of the time at which the table was deleted. For instance, let's say it occurred on 2023-05-18 11:57:11 UTC. The objective is to roll back to a point before that time. Let's roll back to 2023-05-18 at 11:57:00 UTC.

Let's make a mistake

We are going to drop a table.

postgres=#
DROP TABLE customers;

Note the time the table got deleted. Here it is 2023-05-18 11:57:11 UTC. Now let's discuss how we can restore the deleted table.
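When scripting a rollback, you may want to compute the recovery target instead of picking it by hand, by subtracting a small buffer from the noted deletion time. A sketch using GNU date (the 11-second buffer is arbitrary, and BSD/macOS date needs different flags):

```shell
# Compute a recovery target a few seconds before the mistake.
# Requires GNU date; the buffer size is an arbitrary choice.
deleted_at="2023-05-18 11:57:11"
target=$(date -u -d "$deleted_at UTC - 11 seconds" '+%Y-%m-%d %H:%M:%S')
echo "$target"   # prints: 2023-05-18 11:57:00
```

The resulting string can be dropped straight into the recovery_target_time setting shown later.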

Creating a new data directory

Let's create a new data directory where we can recover the data.

initdb /usr/local/var/postgresql-child

Fetch the base backup

The base backup serves as the starting point, while the remaining data will be regenerated by PostgreSQL using the WAL files. To fetch the base backup, you can use the following code snippet.

envdir ~/wal-e.env wal-e backup-fetch /usr/local/var/postgresql-child LATEST

The "LATEST" tag will pull the latest base backup if multiple base backups are in the bucket. On running the above command you will receive an output similar to the following.

wal_e.main   INFO     MSG: starting WAL-E
DETAIL: The subcommand is "backup-fetch".
STRUCTURED: time=2023-05-18T06:29:22.826910-00 pid=41789
wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
DETAIL: The partition being downloaded is part_00000000.tar.lzo.
HINT: The absolute S3 key is local/basebackups_005/base_000000010000000000000005_00000040/tar_partitions/part_00000000.tar.lzo.
STRUCTURED: time=2023-05-18T06:29:24.674572-00 pid=41789

Configuring recovery

Now, we need to regenerate the remaining data from the WAL files. To do that, make the following changes in the PostgreSQL configuration file.

#postgresql.conf

# Command to fetch each WAL file from the bucket.
restore_command = 'envdir ~/wal-e.env wal-e wal-fetch %f %p'

# Time at which we want to stop recovery.
recovery_target_time = '2023-05-18 11:57:00'

We can stop the parent server since it is no longer needed.

pg_ctl -D /usr/local/var/postgresql-parent stop

To initiate the server in recovery mode and to allow PostgreSQL to regenerate the data using the WAL files, create a file named "recovery.signal" in the new data directory.

touch /usr/local/var/postgresql-child/recovery.signal

Start the child server pointing to the new data directory by running the following command.

data_directory="/usr/local/var/postgresql-child"
pg_ctl -D $data_directory start -l $data_directory/postgresql.log

If you examine the PostgreSQL logs, you will notice that the recovery process is paused at the specified time.

2023-05-18 12:07:41.688 IST [43159] LOG:  restored log file "000000010000000000006" from archive
2023-05-18 12:07:41.692 IST [43159] LOG:  recovery stopping before commit of transaction 743, time 2023-05-18 11:57:11.135927+05:30
2023-05-18 12:07:41.692 IST [43159] LOG:  pausing at the end of recovery
2023-05-18 12:07:41.692 IST [43159] HINT:  Execute pg_wal_replay_resume() to promote.

It's important to note that you won't be able to connect to the server using psql at this point because it is not ready to accept connections. By default, PostgreSQL pauses the recovery process after reaching the recovery target time. To change this behavior and to allow the server to accept connections after reaching the target time, add the following configuration to your PostgreSQL configuration file.

recovery_target_action = 'promote'

After making the above change, restart the server once again by running the following command.

data_directory="/usr/local/var/postgresql-child"
pg_ctl -D $data_directory restart -l $data_directory/postgresql.log

Now, you will observe that the server is ready to accept connections after completing the recovery process. Once the recovery is successfully completed, PostgreSQL automatically deletes the "recovery.signal" file. You can verify the successful deletion of the recovery file from your data directory.

Once the server is up and ready, connect to the server and check if the deleted table is present. You will find that the deleted table is now restored and available in the database.

postgres=#
select * from customers;

 user_id | username |        email         | registration_date
---------+----------+----------------------+-------------------
       1 | johndoe  | johndoe@example.com  | 2021-01-01
       2 | janedoe  | janedoe@example.com  | 2021-02-15
       3 | bobsmith | bobsmith@example.com | 2021-03-10
       4 | sarahlee | sarahlee@example.com | 2021-04-05
       5 | maxwell  | maxwell@example.com  | 2021-05-20
(5 rows)

To learn more about PostgreSQL continuous backup, you can refer to the official PostgreSQL documentation.

Challenges faced

Restoration of PostgreSQL roles

When we restored the PostgreSQL database, the roles and passwords were also restored along with it. However, in the continuous rollback feature we create a new database containing the desired data, so for security we should use a new role and password. We resolved this issue by renaming the restored role and setting a new password for it.

Early recovery completion

Another issue we encountered was the early recovery of the database before reaching the specified target time. Suppose no activity occurs in the database, and we attempt to roll back to the current time. PostgreSQL initiates the recovery process and recovers all the available WAL files. However, it fails to locate the target time since no corresponding WAL file is available for the given timestamp. PostgreSQL raises the following error in this scenario and shuts down the server.

2023-05-18 02:12:15.846 UTC [76] LOG:  last completed transaction was at log time 2023-05-18 02:04:02.755919+00
2023-05-18 02:12:15.846 UTC [76] FATAL:  recovery ended before configured recovery target was reached
2023-05-18 02:12:15.850 UTC [75] LOG:  startup process (PID 76) exited with exit code 1
2023-05-18 02:12:15.850 UTC [75] LOG:  terminating any other active server processes
2023-05-18 02:12:15.853 UTC [75] LOG:  shutting down due to startup process failure
2023-05-18 02:12:15.859 UTC [75] LOG:  database system is shut down

In this case, we can remove the specified target time and restart the server. If we do not provide any target time, PostgreSQL will recover until the last available WAL file, which will contain the latest transaction of our database.
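One way to remove the target time without losing the original setting is to comment it out in place with sed. The sketch below works on a throwaway copy of the configuration; in practice you would point it at your actual postgresql.conf and then restart the server as before:

```shell
# Comment out recovery_target_time so PostgreSQL recovers to the end of the
# available WAL. A temporary copy of the config is used here for illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
restore_command = 'envdir ~/wal-e.env wal-e wal-fetch %f %p'
recovery_target_time = '2023-05-18 11:57:00'
EOF

# -i.bak keeps a backup of the original file alongside the edited one.
sed -i.bak 's/^recovery_target_time/#recovery_target_time/' "$conf"
grep '^#recovery_target_time' "$conf"
```

Commenting out rather than deleting makes it easy to restore the target time for the next rollback.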

Recovery is an asynchronous process

As mentioned above, we must update the restored role and password to new ones.

update-role-and-password() {
  # Create a new temporary user, as we cannot update the current session user.
  psql -d <DB_NAME> \
  -c "CREATE USER tempuser WITH SUPERUSER LOGIN PASSWORD 'tempuser-pwd';"

  # Login as the temporary user and update the parent role and password.
  PGUSER=tempuser PGPASSWORD=tempuser-pwd psql -d <DB_NAME> \
  -c "ALTER ROLE <parent-role> RENAME TO <NEW_ROLE>;
  ALTER ROLE <NEW_ROLE> WITH PASSWORD '<NEW_PASSWORD>';"

  # Login as the new user and remove the temporary user.
  PGUSER=<NEW_ROLE> PGPASSWORD=<NEW_PASSWORD> psql -d <DB_NAME> \
  -c "DROP ROLE tempuser;"
}

# Start the server in recovery mode.
pg_ctl -D /usr/local/var/postgresql-child start

# Call the function to update the role and password.
update-role-and-password

If you try the above code, you will encounter the following error.

ERROR:  cannot execute CREATE ROLE in a read-only transaction

Since the recovery process is asynchronous, the system may attempt to update the role before completing the recovery. It is important to note that the server will be in read-only mode during the recovery, which can lead to the error mentioned.

To solve this issue, it is necessary to determine when the recovery process has completed. This can be achieved by checking for the existence of the "recovery.signal" file in a loop.

wait_until_recovery_is_completed()
{
  recovery_signal_path="/usr/local/var/postgresql-child/recovery.signal"
  while [ -f "$recovery_signal_path" ]; do
    sleep 1  # Wait for 1 second before checking again.
  done
}

pg_ctl -D /usr/local/var/postgresql-child start
wait_until_recovery_is_completed
update-role-and-password
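In practice it is also worth bounding this wait, so a recovery that never finishes does not block the rollback script forever. A sketch with a timeout (the function name, arguments and 300-second default are our own choices, not part of wal-e or PostgreSQL):

```shell
# Wait until the given recovery.signal file disappears, giving up after a
# timeout. Returns 0 on completion, 1 on timeout.
wait_for_recovery_with_timeout() {
  signal_file="$1"
  timeout="${2:-300}"   # default timeout in seconds (our assumption)
  elapsed=0
  while [ -f "$signal_file" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "recovery did not finish within ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For example, `wait_for_recovery_with_timeout /usr/local/var/postgresql-child/recovery.signal 600 || exit 1` would abort the script if recovery takes longer than ten minutes.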
