MySQL / Heatwave: HeatWave MySQL Disaster Recovery with Bi-Directional Replication

HeatWave MySQL is a MySQL database service that is part of Oracle Cloud Infrastructure (OCI). It's a fully managed database service that allows you to run MySQL workloads in the cloud.

Replication with MySQL is Asynchronous in nature. It is uni-directional. HeatWave has Channel to define the in-bound Replication from another "MySQL" or "HeatWave". If we have 2 HeatWave DB Systems in different Region, Replication Channel is a good use to define replication between the Regions.

To setup HeatWave Channel between Regions, please refer to the following URL for more information

Ref : https://docs.oracle.com/en/learn/oci-mysql-dr-copy/index.html

Although Replicatoin is uni-directional, people might want to define Bi-directional Replication between the TWO DB Systems. It is the TWO Replication Channels, where each DB System is defined with a Channel pointing each other.

Having Bi-Directional Replication can be beneficial in most cases. However, disaster may happen which switch back might not be easy.

For example, Data corruption in Primary Database where DB cannot be resumed. Or Network has problem in Primary where data has been updated in Primary but not with DR site. The data content between Primary and DR is not the same. DR could be activated for certain reasons. Switching back to production might have problem to rectify the Data content between the DBs. Or The DR DB has been activated for long time without Production DB alive. By the time, Production DB is resumed, Binlog content in DR has been expired. Replication from DR to Production is therefore not possible. Thereafter, Bi-directional Replication might not be universal fit for all situations. Under certain scenarios, switching back may take another approach of RECREATING the environment.

Allowing Replication between TWO DB Systems. the first part is to ensure the GTIDs between the TWO DB Systems to be exactly the same. DB Systems on either side looks for any missing GTIDs from the SOURCE. Replication fails if source does not have the GTID transaction in binary log.

To ensure GTIDs to be the same between the 2 DB Systems :

1. If there is any EVENT (schedule jobs) running with DB System, it might create transaction to the DB System. To avoid any data changes (GTIDs) with the DB System on either side, the EVENT can be disabled before Backup or Dump operation.

2. Cross Region Backup (or HeatWave Backup) is used to create NEW DB.

3. If the DB System is located as different tenancy where HeatWave Backup is not possible, MySQL Shell (dumpInstance / loadDump) should be used.

a. The New DB System is provisioned and no operation should be done. The GTID is EMPTY.

b. loadDump to the New DB System should be provided with updateGtidSet:"replace". So the exact GTIDs can be applied to the New DB System. loadDump should be applied to Standalone DB System. If HeatWave High Availability is to be deployed, Apply HA after the setup completes.

For more info : https://dev.mysql.com/doc/mysql-shell/9.3/en/mysql-shell-utilities-load-dump.html

"do not use this option when Group Replication is running on the target MySQL instance."

Switch Over and Switch Back : Due to the nature of Asynchronous Replication, HeatWave DB System allows READ-ONLY / READ-WRITE opreation setting. To ensure no data conflict between Source and Target DB Systems, it has to be at most ONLY one READ-WRITE access on the DB Systems. Be careful with EVENT definition, Only one site EVENT is activated. The following diagram shows activation to go thru a READONLY stage for both sites. As such, we ensure and validate the GTIDs in Source and Target. Once it is validated correctly and GTIDs are the same for both DB Systems, we can turn on READ-WRITE access to the specific DB System.

There are few issues which may cause the REPLICATION to stop working.

1. For some reasons, the GTIDs on DR has more than Production (e.g. someone creates unnecessary transaction. EVENT (schedule job) triggers additional records into the tables.

2. People might be seeing DUPLICATED KEY and the Replication Channel stops working. e.g. Definition between DB Systems are not the same, where tables without primary in SOURCE DB System but Primary Key definition in Target DB System.

3. The binlog data on SOURCE has been purged. e.g. The TARGET has been shutdown for a period of time and when the DB System is started up. The binlog on SOURCE has been purged.

4. Relay log space is not enough. e.g. Applier thread issue which turns out to be no way to apply transaction to Target. The relay log will get fill up eventually.

4. Catching up with Replication : One way of speeding up replication but at risk is to disable Crash Recovery. Sometimes, with long lagging of Replication due to other reasons, people may want to sync up data. Disabling Crash Recovery is one option but at risk.

5. Getting Transaction Size error where DR is HA, where the SOURCE can be from other MySQL Source (not heatwave). Application on source DB System might running with big Transaction Data. HeatWave HA runs with Group Replication which it has the max transaction size limit.

The following highlights the steps to create Bi-Directional Replication Channels

1. Creating Replication User on Primary HeatWave MySQL instance. The following Replication User will be replicated to Target Server once REPLICATION channel is started on Target.

-- Create replication user
CREATE USER 'replication_user'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'replication_user'@'%';

2. Create Channel on Target DB System

Fill in the Replication User / Password and Target Server IP/Port
For other fields, they can be DEFAULT (with HeatWave DB Systems as SOURCE).

Wait until the Channel creation completes without Error

3. Create Channel on Source DB System

Fill in the Replication User / Password and Target Server IP/Port
For other fields, they can be DEFAULT (with HeatWave DB Systems as SOURCE).

4. Verify replication

Check the replication status on both instances using SHOW REPLICA STATUS FOR CHANNEL '<channel_name>';
Verify that data is being replicated correctly between the two instances.

Appendix :

1. List EVENTs in the DB System

select event_schema as database_name, event_name, event_definition, event_type, concat(interval_value, ' ', interval_field) as 'interval', case when event_type = 'RECURRING' then starts else execute_at end as execute_time, ends, status from information_schema.events -- where event_schema = 'your database name' order by event_schema, event_name;

2. Using MySQL Shell to dump DB Content from HeatWave. The latest version of MySQL Shell can be used.

MySQL Shell JS> util.dumpInstance( "<folder>", {("", {"compatibility":["strip_restricted_grants", "strip_definers"], "ocimds":true, "targetVersion":"x.y.z"})

Notes : There are couple of options to be considered with "compatibility" if necessary. For details : https://dev.mysql.com/doc/mysql-shell/9.3/en/mysql-shell-utilities-dump-instance-schema.html

a. "ignore_missing_pks" , "create_invisible_pks"

b. "force_non_standard_fks"

3. Using MySQL Shell to loadDump to HeatWave. The latest version of MySQL Shell can be used.

MySQL Shell JS> util.loadDump( "<folder>", {updateGtidSet:"replace", ignoreVersion:true,loadUsers:true, progressFile: "progress.json", resetProgress:true})

Notes : For details with options, https://dev.mysql.com/doc/mysql-shell/9.3/en/mysql-shell-utilities-load-dump.html

a. createInvisiblePKs

b. threads

4. Checking Primary Key for DB Systems when enabling HA for HeatWave : https://docs.oracle.com/en/database/mysql/heatwave-aws/ha-prerequisites.html

SELECT t.table_schema, t.table_name
FROM information_schema.tables t
  LEFT JOIN (SELECT table_schema, table_name 
             FROM information_schema.statistics
             WHERE index_name = 'PRIMARY' 
             GROUP BY table_schema, table_name, index_name
             ) pks 
  ON t.table_schema = pks.table_schema AND t.table_name = pks.table_name 
WHERE pks.table_name IS NULL
  AND t.table_type = 'BASE TABLE' 
  AND t.table_schema NOT IN ('mysql', 'sys', 'performance_schema', 'information_schema');

5. About Event with Replication Environment, "ALTER EVENT <event> enable" activates EVENT on the Server but REPLICA is disabled via Replication. For details please refer to https://dev.mysql.com/doc/refman/9.3/en/replication-features-invoked.html

In the case of CREATE EVENT and ALTER EVENT:

The status of the event is set to REPLICA_SIDE_DISABLED on the replica regardless of the state specified (this does not apply to DROP EVENT).

6. Using MySQL Shell utility.loadDump(... {updateGtidSet:"replace"| "append"}) might create extra GTIDs to the Target. The GTIDs on Source does not have those GTIDs which are those existing DDL statements. Assuming those GTIDs are related to any existing SOURCE transactions. We can execute the following on Source DB System to add extra GTIDs so that when Replication Channel creation as Reverse direction, the extra GTIDs are not requested to reapply to the SOURCE again. Using the "+" sign to indicate adding GTIDs to Purged GTID.

MySQL Shell source JS> call sys.set_gtid_purged("+<the extra GTIDs from the Target>")

MySQL / Heatwave

Tuesday, July 1, 2025

HeatWave MySQL Disaster Recovery with Bi-Directional Replication

No comments:

Post a Comment