Working with a high volume of data is challenging, but migrating and manipulating that data is even harder: you have to be extremely careful to make sure you don’t affect your users.
In this post, we will talk about how we migrated a database that has a mind-boggling amount of data, the issues we faced, and how we detected and resolved them.
Instabug runs on more than two billion devices worldwide and is used by some of the top apps on the market, including Verizon, Gojek, 9GAG, and more. Our users care about the quality of their app's user experience, and they rely on Instabug's Crash Reporting in production. This makes the volume of crashes we receive extremely high and pushes us to keep our crashes service highly reliable.
At the moment, our crashes table consists of billions of records (we rely on MySQL as our primary storage for crashes).
Our use case
We have a data retention period for our plans, after which the data is deleted. As the scale of our crashes database is huge and growing, implementing data retention on that table can be a big headache: finding the rows that have passed their retention period and deleting them takes a lot of time.
So, we decided to partition the crashes table on a "deleted_at" column, so that MySQL can drop a whole partition of expired data almost instantaneously.
In order to partition the crashes table, we needed to add the "deleted_at" column and partition the table on it.
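As an illustrative sketch of that scheme (table, column, and partition names here are hypothetical, and keep in mind that MySQL requires the partitioning column to be part of every unique key on the table):

```sql
-- Illustrative only: partition crashes by deleted_at so that expired
-- data can be removed with a near-instant, metadata-only operation.
ALTER TABLE crashes
  PARTITION BY RANGE COLUMNS (deleted_at) (
    PARTITION p202201 VALUES LESS THAN ('2022-02-01'),
    PARTITION p202202 VALUES LESS THAN ('2022-03-01'),
    PARTITION pmax    VALUES LESS THAN (MAXVALUE)
  );

-- Dropping an expired partition, versus DELETEing millions of rows:
ALTER TABLE crashes DROP PARTITION p202201;
```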
To migrate a table, these are the options available to us:
- "ALTER TABLE" query
This locks the entire table to apply the new schema changes, which is not suitable in our case (more on that shortly).
- Migration tools
There are migration tools that tackle known problems of MySQL schema migration. The most common are:
- Large Hadron Migrator (LHM)
- gh-ost
Now let's talk about how each of these options works, to understand why they are or are not suitable for our requirements.
ALTER TABLE
How it works
In MySQL 5.7 (the version we are using), there are two algorithms for altering a table:
- COPY, which locks the table, makes an empty copy of it, applies the schema changes to the new table, copies all rows into the new table, swaps the old and new tables, and unlocks the table.
- INPLACE, which acquires an exclusive metadata lock on the table during the migration and rebuilds the table in place.
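MySQL 5.7 calls these two algorithms COPY and INPLACE, and both can be requested explicitly; if the requested algorithm cannot be used for a given change, the statement fails instead of silently falling back. A hypothetical sketch (the column is the one we actually wanted to add):

```sql
-- Illustrative only: explicit ALGORITHM/LOCK clauses make MySQL fail
-- fast rather than silently pick a different strategy.
ALTER TABLE crashes
  ADD COLUMN deleted_at DATETIME NULL,
  ALGORITHM=COPY, LOCK=SHARED;

ALTER TABLE crashes
  ADD COLUMN deleted_at DATETIME NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
```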
Why this is not suitable for our database migration
As mentioned above, both algorithms need to acquire a lock that will block any incoming transactions or connections from referring to or using the table. Additionally, at the scale we are working with, it will take a prohibitively long period of time to complete the process.
Since our crashes table is mission-critical, and a hot table that has a lot of insert operations, we had to pass on both of these algorithms.
Large Hadron Migrator (LHM)
We previously used LHM to migrate tables at the scale of millions of records and it worked quite well. So, we decided to try LHM next for migrating our crashes database. Unfortunately, once we started the migration, we faced some issues that made LHM not suitable for our migration requirements.
Before we talk about these issues, let’s find out what LHM is and how it works.
How it works
In a nutshell, LHM is a Rails database migration gem developed by SoundCloud to tackle the large-table migration pains caused by the locking nature of the "ALTER TABLE" algorithms. Unfortunately, it has not been actively maintained since 2018.
LHM consists of four main steps or components:
- Creates a target table, copies the existing schema, and applies changes using ALTER on that table.
- Creates triggers for each action type — insert, update, and delete.
- With these triggers in place, any change to the source table is also applied to the target table, which acts as a replica of the source table.
- Runs "INSERT IGNORE INTO target_table ... SELECT ... FROM source_table" over the source table's rows, which acquires a read lock.
- Copies rows in ranges, pausing for a specified time between ranges to minimize lock and write contention.
- Atomic and Locked Switcher
- LockedSwitcher: locks the table being migrated and issues two ALTER TABLE statements.
- AtomicSwitcher: uses a single atomic RENAME TABLE query and is the favored solution.
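Roughly, the trigger and the atomic switch look like the following (the trigger body, column list, and table names are illustrative, not LHM's generated code):

```sql
-- Mirror every insert on the source table into the target table
-- (LHM creates similar triggers for UPDATE and DELETE as well).
CREATE TRIGGER lhmt_ins_crashes AFTER INSERT ON crashes
FOR EACH ROW
  INSERT IGNORE INTO lhmn_crashes (id, app_id, created_at)
  VALUES (NEW.id, NEW.app_id, NEW.created_at);

-- AtomicSwitcher: one RENAME swaps both tables in a single atomic step.
RENAME TABLE crashes TO lhma_crashes, lhmn_crashes TO crashes;
```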
Why this is not suitable for our database migration
LHM's solution depends mainly on triggers for data consistency, but these triggers have certain drawbacks that can affect our migration. The issues we faced with LHM are:
- Latency in crash creation
Triggers run in the same transaction space as the query that fires them, which slows that query down. So when LHM adds an insert trigger on the source table to insert the same record into the target table, the trigger slows down our crash insertion queries.
- Lock contention
We have multiple indexes on the crashes table, and many concurrent insert queries compete to acquire locks on them (check MySQL's insert intention locks for more info). Once LHM adds its triggers, each insert also competes for locks on the target table's indexes, because the insertions into both tables run in the same transaction space.
We faced a lot of deadlocks: two concurrent queries would try to acquire locks on the same index, resulting in a deadlock that aborted the migration.
- Not suspendible
We were not able to suspend the migration during our high-load hours. Even though LHM provides throttling, the triggers have to keep firing the whole time to avoid losing data consistency.
gh-ost
gh-ost is a triggerless online schema migration solution for MySQL developed by GitHub to tackle the pains of migrating with "pt-online-schema-change", which uses the same trigger-based approach as LHM. GitHub's main issues with pt-online-schema-change were:
- Some migrations cause a high load in the DB cluster.
- Some tables need to be migrated during the weekend because the migration is heavyweight.
- Some tables couldn’t be migrated because of their huge scale.
- They suffered outages on their busiest tables due to migration.
gh-ost's triggerless design takes an asynchronous approach. It does not need triggers because, unlike fb-osc (Facebook's online schema change tool), it does not rely on a changelog table: it acquires the changelog from a different place, the binary log.
What is the binary log?
The binary log is a set of log files that contain information about data modifications made to MySQL. It contains events that describe database changes such as table creation operations or changes to table data, but not for statements that do not modify data such as SELECT or SHOW.
One of the binary log's important purposes is replication. The binary log is used on master servers as a record of the statements to be sent to replica servers. The master server sends the events contained in its binary log to its replicas, which execute those events to make the same data changes that were made on the master.
There are two types of binary logging:
- Statement-based logging
Events containing SQL statements that produce data changes (inserts, updates, deletes).
- Row-based logging
Events describing changes to individual rows.
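Both settings can be inspected from a client session. Note that gh-ost relies on row-based events, so the server (or replica) it reads from must use row-based logging:

```sql
-- Which logging format is in effect: STATEMENT, ROW, or MIXED?
SHOW VARIABLES LIKE 'binlog_format';

-- List the binary log files currently kept by the server.
SHOW BINARY LOGS;
```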
How it works
- Initial setup
- Creates a target table and copies the existing schema, then applies changes using "ALTER TABLE".
- Creates a unique shared key to split the table into chunks and process it.
- Sets up the binary log listener and starts listening to binary log events for source table DML (data manipulation language).
- Copy flow
- Works through the source table's rows range, chunk by chunk, queueing copy tasks onto the target table.
- Reads DML events from the binary logs, queueing apply tasks onto the target table.
- Processes the "copy tasks" queue and the "apply tasks" queue and sequentially applies them to the target table.
- Cut-over and completion
- Locks the source table for writes, working on what remains from the binary log events backlog (keep in mind that this is an asynchronous operation, and we can still have unhandled events even when the table is locked).
- Swaps the source table and the target table.
- Cleans up the tables.
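Putting the steps above together, a gh-ost run for our change might be invoked like this (host, schema name, and flag values are illustrative; the right flag set depends on your topology, so check gh-ost's documentation):

```shell
gh-ost \
  --host=replica.example.com \
  --database=instabug \
  --table=crashes \
  --alter="ADD COLUMN deleted_at DATETIME NULL" \
  --chunk-size=1000 \
  --postpone-cut-over-flag-file=/tmp/gh-ost.postpone.flag \
  --execute
```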
Why gh-ost is suitable for our database migration
Since gh-ost does not use triggers, it avoids overloading the database with concurrent writes to the source and target tables and the lock contention that comes with them.
MySQL does not perform well on multiple concurrent massive writes to a specific table and locking becomes an issue. This is why gh-ost chose to migrate in a single thread and to alternate between the massive row-copy and the ongoing binary log events backlog.
Issues we faced with gh-ost and our solutions
- Not saving the latest migrated chunk
- Issue: While the migration was running, the pod running it was killed for some reason, and we had no record of the last row copied from the source table.
Solution: We edited gh-ost to save the latest row it copied and added an option to start copying from a specific row.
- Slow migration
- Issue: We noticed that copying from the source table was very slow. When we checked gh-ost's code, we discovered that while gh-ost queues both copy tasks and event tasks, it favors the event tasks (binary log events) over the copy tasks, which slows the copy process. This hit us because the rate of insertions into our crashes table is very high, which produces an equally high rate of binary log events.
Solution: We edited gh-ost to favor copy tasks over event tasks.
- Issue: The migration process was still very slow. This time we suspected that it was because of the indices (lock contention).
Solution: We decided to drop the indices and re-add them after the copy tasks were done. After dropping the indices, the migration returned to normal speed, and once the copying finished, we added them back.
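The queue-priority change described above can be modeled with a toy Python sketch (hypothetical names; the real gh-ost interleaves these queues inside its migrator loop and applies each task to the target table):

```python
from collections import deque

def process_tasks(copy_tasks, event_tasks, favor_copy=True):
    """Drain two task queues, always serving the favored queue first.

    Stock gh-ost favors binlog event tasks; our patched version favors
    copy tasks so the row copy is not starved by a high insert rate.
    """
    applied = []
    copy_q, event_q = deque(copy_tasks), deque(event_tasks)
    while copy_q or event_q:
        preferred, fallback = (copy_q, event_q) if favor_copy else (event_q, copy_q)
        queue = preferred if preferred else fallback
        applied.append(queue.popleft())
    return applied

# With favor_copy=True, all pending copy chunks run before binlog events:
order = process_tasks(["copy-1", "copy-2"], ["event-1"], favor_copy=True)
# → ["copy-1", "copy-2", "event-1"]
```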
After the copying was done and the replication lag became extremely low, we switched the two tables and dropped the old crashes table.
Working with a huge volume of data can be very challenging and introduces many issues and complexities that you need to overcome. And while there are many great tools available to help you handle and manipulate your data, there are no one-size-fits-all solutions.
You need to understand exactly how these tools work to find out which of them will fit your needs and what edits and customizations you need to apply to the tool to fulfill your requirements.