Tiering is a powerful feature in Gluster. It divides the available storage into two parts: the hot tier populated by small, fast storage devices like SSDs or a RAMDisk, and the cold tier populated by large, slow devices like mechanical HDDs. By placing the most recently accessed files in the hot tier, Gluster can quickly process requests for clients. The system needs to store a lot of metadata on the files clients use to determine what should stay in cold storage and what should move, or migrate, to hot storage. Gluster stores this in a database on each brick. However, as clients continue to use Gluster, database operations slow down. This results in slow tier migration and overall poor performance.
We looked into shrinking the database file using SQLite's built-in compaction commands: VACUUM, full auto_vacuum, and incremental auto_vacuum. These commands remove database fragmentation and can shrink the actual file.

The patch can be found at http://review.gluster.org/#/c/15031
Gluster is a distributed file system. It allows someone to use a set of storage devices, or bricks, across the network as if they were one large device on their local system. Tiering is a feature in Gluster to improve performance. It divides the bricks into two sets or tiers, hot and cold. The hot tier stores recently used files, while the cold tier stores all other files. The hot tier is usually populated by faster, but smaller disks like SSDs on the order of GBs. The cold tier consists of larger, but slower disks like mechanical HDDs on the order of TBs. By moving recently used files to the hot tier, a client can read and write to the files at a much faster rate than before without changing all the bricks into more expensive SSDs. The tier daemon is in charge of all of this. [6]
The tiering daemon must decide what files to migrate, or move, from the cold tier to the hot tier. When the smaller hot tier runs out of space, the daemon must also decide what files in the hot tier must return to the cold tier. In particular, the daemon needs the access times for every file the client uses. The daemon stores this metadata in databases on each brick.
When the client accesses a file from some brick B, the daemon passes the I/O to the process on B. The I/O then passes through the ChangeTimeRecorder, or CTR. The CTR records all access times in the database on B. When the daemon wants to figure out which files to migrate, it asks the CTR on each brick to query its database for eligible files. In both cases, the CTR uses a database-agnostic library, libgfdb, to interact with the database.
Gluster currently uses SQLite3 as the database. Joseph Elwin Fernandes explains why they chose SQLite3 and how they optimized it for writes in his blog post. In short, he configures SQLite3 to use a Write-Ahead Log (WAL) so writes are recorded quickly; the database consumes those writes and updates itself over time. [7] However, this only solves half the equation for migration. Writes are fast, but they slow down as the database grows in size over time. The graph below shows the database's write performance. For each number of files N, we performed N INSERTs, N UPDATEs, and N DELETEs. We report the time it took to complete each batch of operations.
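For reference, the measurement amounts to a timing loop like the sketch below. This is an illustrative stand-in, not the actual test harness: the schema, file name, and batch shape are assumptions. It simply runs N INSERTs, N UPDATEs, and N DELETEs against a WAL-mode SQLite database and times each batch.

```c
/* Illustrative timing loop: run N INSERTs, UPDATEs, and DELETEs against a
 * WAL-mode SQLite database and time each batch. Schema and file name are
 * made up for this sketch, not taken from Gluster. */
#include <sqlite3.h>
#include <stdio.h>
#include <time.h>

static double timed_batch(sqlite3 *db, const char *sql_fmt, int n)
{
        char    sql[256];
        clock_t start = clock();

        for (int i = 0; i < n; i++) {
                snprintf(sql, sizeof(sql), sql_fmt, i);
                sqlite3_exec(db, sql, NULL, NULL, NULL);
        }
        return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
        sqlite3 *db;
        int      n = 10000;

        sqlite3_open("bench.db", &db);
        sqlite3_exec(db, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS files "
                         "(id INTEGER PRIMARY KEY, atime INTEGER);",
                     NULL, NULL, NULL);

        printf("insert: %.2fs\n",
               timed_batch(db, "INSERT INTO files VALUES (%d, 0);", n));
        printf("update: %.2fs\n",
               timed_batch(db, "UPDATE files SET atime = 1 WHERE id = %d;", n));
        printf("delete: %.2fs\n",
               timed_batch(db, "DELETE FROM files WHERE id = %d;", n));

        sqlite3_close(db);
        return 0;
}
```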
The database also takes up precious space in the more expensive hot tier, and a larger database means the CTR's operations take longer.
Is there a way we can keep the database, but improve its, and therefore migration’s, performance? To answer that question, we explored the idea of compacting the databases in Gluster. To understand what compaction is and what benefits it provides, we must go over what it fixes: fragmentation.
SQLite divides its database file into 4KB blocks called pages. A page is in one of two states: in use, holding data for a table, or free, holding no data and available for reuse.
A SQLite database is fragmented when the data for a single table A is no longer contiguous. That is, either a free page or a page for another table B exists between two pages for A.
Consider Gluster's database files, which have two tables: F for file accesses and H for hard links. The array below represents a database file; each cell is a page used by the database. The database below is not fragmented since all pages for table F are together.
Now say we delete enough entries from table F to free a page. The database is now fragmented by that page.
If we insert data into table F now, the database is no longer fragmented. Why? SQLite will reuse the space freed by the deletion instead of using the completely empty space at the end of the file.
Now say we free the same page, but insert data into table H instead. Again, SQLite will reuse the space from the free page. We still have fragmentation: a page for table H now sits between pages for table F.
However, this opens up the possibility of the following scenario. Suppose we expand the database schema to have three tables: F for file accesses, H for hard links, and S for symbolic links. Next, suppose we have N pages, and the pages in the database file are laid out as below.
Consider the following query:

SELECT * from F

This query should return all rows from table F, and it still will. However, to get all the rows, we need to read in every page with F's data, and those pages are scattered around the file. This is the worst-case read performance caused by fragmentation. If we could move the data for table F around so that all of it was clustered together, then we would be reading sequentially.
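SQLite does not report how interleaved the tables are, but it does expose how many pages are sitting free inside the file. A small check like the one below (an illustration, not part of the Gluster patch; the file name is made up) reads the page_count and freelist_count pragmas to estimate how much space compaction could reclaim.

```c
/* Illustrative helper: report how many of a database's pages are on the
 * free list. Both pragmas are standard SQLite. */
#include <sqlite3.h>
#include <stdio.h>

static int pragma_int(sqlite3 *db, const char *sql)
{
        sqlite3_stmt *stmt = NULL;
        int           value = -1;

        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK &&
            sqlite3_step(stmt) == SQLITE_ROW)
                value = sqlite3_column_int(stmt, 0);
        sqlite3_finalize(stmt);
        return value;
}

int main(void)
{
        sqlite3 *db;

        sqlite3_open("brick.db", &db);
        int total = pragma_int(db, "PRAGMA page_count;");
        int freed = pragma_int(db, "PRAGMA freelist_count;");
        printf("%d of %d pages are free (%.1f%%)\n",
               freed, total, total > 0 ? 100.0 * freed / total : 0.0);
        sqlite3_close(db);
        return 0;
}
```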
Users can remove fragmentation by compacting the database. This technique is not full defragmentation; as we will see in Section 3.2, the database may not end up in a perfect state.
SQLite has a set of commands for this called VACUUMs. There are three types in SQLite3 at the time of writing: VACUUM, full auto_vacuum, and incremental auto_vacuum.
VACUUM will reorganize the database by inserting all the data into a new transient database. This places all used pages from the same tables next to each other and any free pages at the end. This eliminates all fragmentation in the database when called. [1]
However, VACUUM is not a silver bullet.

- During the reorganization, the database cannot process any other transactions, so no client can add new data to the database. Similarly, VACUUM fails if the database is in the middle of a transaction (see the sketch after this list). [3]
- The process can cause the database file to use up to twice the original database's space. This occurs when free pages are not the cause of fragmentation.
- Someone must call VACUUM periodically to prevent fragmentation from building up again.
- Finally, VACUUM takes time even when there is no fragmentation present in the database.
The last two points raise an interesting question: when is the best time to call VACUUM? If we call it too early, we waste more time on compaction than we would lose by letting fragmentation build. If we call it too late, we have already suffered the penalties of fragmentation. The developers of SQLite provide a solution to this question in the form of the auto_vacuum.
The auto_vacuum is a compaction mechanism that uses metadata to ease the work of compaction. auto_vacuum comes in two flavors, full and incremental. Each flavor helps in different ways.
- full auto_vacuum: A full auto_vacuum removes all free pages in the database and truncates the file after every commit. With this, no user has to invoke compaction regularly. We also do not lose time if there are no free pages at the end of the file. [4]
- incremental auto_vacuum: An incremental auto_vacuum is like a full auto_vacuum, but it removes N free pages from the file, where N is user-specified; if N is not specified, all free pages are removed. This causes the file to shrink in size on disk. Despite its name, this version only removes free pages when invoked with a specific pragma, incremental_vacuum(N). [4]
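The pragmas behind both flavors are small enough to show directly. The snippet below is a standalone illustration, not Gluster code: it enables incremental auto_vacuum and later reclaims free pages on demand; substituting FULL enables the other flavor. Note that enabling auto_vacuum on a database that already contains tables only takes effect after a manual VACUUM.

```c
/* Illustrative use of the auto_vacuum pragmas. On a database that already
 * contains tables, the mode change only takes effect after a manual VACUUM. */
#include <sqlite3.h>

int main(void)
{
        sqlite3 *db;

        sqlite3_open("demo.db", &db);

        /* Enable incremental auto_vacuum (use FULL for the other flavor). */
        sqlite3_exec(db, "PRAGMA auto_vacuum = INCREMENTAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "VACUUM;", NULL, NULL, NULL);

        /* ... inserts and deletes happen here, creating free pages ... */

        /* Reclaim up to 100 free pages now; omit the argument to reclaim all. */
        sqlite3_exec(db, "PRAGMA incremental_vacuum(100);", NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
}
```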
Neither flavor of auto_vacuum is a silver bullet either.
Neither one completely eliminates fragmentation. There is no guarantee that data from the same table will be next to each other after the operation completes. In fact, auto_vacuum can make fragmentation worse: if a free page exists between a page for table A and a page for table B, moving or deleting that free page still leaves the database fragmented.
An admin can activate the compaction capability with
gluster volume set <volname> tier-compact <off|on>
Once active, the system will trigger compaction at regular intervals. The admin can change the frequency of compaction on the hot and cold tiers using the following command-line options:
gluster volume set <volname> tier-hot-compact-frequency <int>
gluster volume set <volname> tier-cold-compact-frequency <int>
Recall that the tier daemon handles tier migration for a volume. When a compaction is triggered for a given tier, the daemon sends a compaction IPC to the CTR. The specification for this IPC follows:
IPC: GFDB_IPC_CTR_SET_COMPACT_PRAGMA
INPUT: gf_boolean_t compact_active: Is compaction currently running?
gf_boolean_t compact_mode_switched: Did the user flip the compaction switch on/off?
OUTPUT: 0: Compaction succeeded
Non-zero: Compaction failed
libgfdb now has an abstract method for triggering compaction which the IPC uses to tell the database to compact.
void compact_db (void *db_conn, int compact_type, int old_compact_type);
When gfdb_sqlite receives the call to compact, it must decide which compaction technique to call and when. Since we can turn compaction off and on, there are times when it must call for a manual VACUUM to initiate a pragma change, despite having set up a full auto_vacuum. The steps to make that decision follow; a sketch of this logic appears after the list:
1. Change the auto_vacuum pragma as necessary to match vacuum_type.
2. If the pragma changed from OFF to FULL or INCR, perform a manual VACUUM. Similarly if the pragma changed from FULL or INCR to OFF.
3. If the auto_vacuum pragma is INCR, perform INCREMENTAL_VACUUM(N).
4. Otherwise, perform a manual VACUUM.
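The following is a minimal sketch of that decision logic, assuming a plain sqlite3 connection. The enum, helper function, and error handling are illustrative assumptions, not the actual gfdb_sqlite code from the patch.

```c
/* Minimal sketch of the compaction decision logic described above. */
#include <sqlite3.h>
#include <stdio.h>

enum vacuum_type { VAC_OFF, VAC_MANUAL, VAC_FULL, VAC_INCR };

static int exec_sql(sqlite3 *db, const char *sql)
{
        char *err = NULL;
        int   ret = sqlite3_exec(db, sql, NULL, NULL, &err);

        if (ret != SQLITE_OK) {
                fprintf(stderr, "compaction step failed: %s\n", err);
                sqlite3_free(err);
        }
        return ret;
}

/* Returns 1 if the type uses the auto_vacuum pragma, 0 otherwise. */
static int is_auto(enum vacuum_type t)
{
        return t == VAC_FULL || t == VAC_INCR;
}

int compact_sqlite_db(sqlite3 *db, enum vacuum_type new_type,
                      enum vacuum_type old_type)
{
        /* 1. Change the auto_vacuum pragma to match the requested type. */
        if (new_type == VAC_FULL)
                exec_sql(db, "PRAGMA auto_vacuum = FULL;");
        else if (new_type == VAC_INCR)
                exec_sql(db, "PRAGMA auto_vacuum = INCREMENTAL;");
        else
                exec_sql(db, "PRAGMA auto_vacuum = NONE;");

        /* 2. Switching into or out of an auto_vacuum mode only takes effect
         *    after a manual VACUUM, so run one on a mode switch. */
        if (is_auto(new_type) != is_auto(old_type))
                return exec_sql(db, "VACUUM;");

        /* 3. Incremental mode only reclaims free pages when asked. */
        if (new_type == VAC_INCR)
                return exec_sql(db, "PRAGMA incremental_vacuum;");

        /* 4. Otherwise (manual mode), fall back to a full VACUUM. FULL
         *    auto_vacuum already reclaims pages on every commit, and OFF
         *    means compaction is disabled, so neither needs more work. */
        if (new_type == VAC_MANUAL)
                return exec_sql(db, "VACUUM;");

        return 0;
}
```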
Now we want to measure the impact of using different compaction methods on Gluster. In particular, we focused on the impact of frequent compaction. We ran Gluster on a single computer with a 4-brick setup: 2 bricks for replicated cold storage on an HDD and 2 for replicated hot storage on a RAMDisk limited to 784MB each. We used smallfile, a testing tool for measuring file I/O, to stress the storage with the workloads described below. [5]
In each experiment, we used 10000 files. Each one was 10 bytes. We wanted to stress the metadata collection and use in Gluster which are integral to tiering.
We set the compaction frequency for both the hot and cold tiers to a minute each.
We set the incremental auto_vacuum method to remove all free pages when it runs. We wanted to measure the impact of full auto_vacuum removing all free pages on every database commit instead of every minute.
Table 1 shows the amount of space used by a run with each compaction method, normalized to a run with no compaction. Table 2 compares the runtime of each run to a run with no compaction.
We see that all the compaction methods regained very little space. The manual method recovered the most space since it is the only compaction method that removes all fragmentation. Since this experiment only creates files, it can only add metadata to the databases. Therefore, there are no free pages for the auto_vacuum methods to collect, which explains their smaller gains. All three runs with compaction also lost very little time, with manual compaction taking the longest at a 1% increase over a run with no compaction.
Table 1: Database space used, relative to a run with no compaction (file creation workload).

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
manual | −1.169% | −1.205% | −3.083% | −2.910% |
full | 0.241% | 0.241% | −0.357% | −0.179% |
incremental | 0.102% | 0.083% | −0.022% | 0.157% |
Table 2: Runtime, relative to a run with no compaction (file creation workload).

Compaction | Runtime change |
---|---|
manual | 1.044% |
full | 0.185% |
incremental | 0.524% |
Table 3 shows the amount of space used by a run with each compaction method, normalized to a run without compaction. Table 4 shows the same for runtime.
The runtime numbers are similar to those in the creation experiment. However, we see the runs with compaction regained a lot of space. As we delete files, we remove data from the database. This creates free pages and the fragmentation those pages cause. The compaction methods all handle this type of fragmentation well. We see the incremental auto_vacuum almost matches the full auto_vacuum on the hot bricks. However, it does not do as well on the cold bricks. Another call to the incremental auto_vacuum clears the remaining space, so this difference is due to timing alone.
Table 3: Database space used, relative to a run with no compaction (file deletion workload).

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
manual | −81.304% | −81.328% | −96.163% | −96.164% |
full | −99.944% | −99.944% | −99.865% | −99.865% |
incremental | −79.162% | −79.198% | −95.602% | −95.603% |
Table 4: Runtime, relative to a run with no compaction (file deletion workload).

Compaction | Runtime change |
---|---|
manual | 1.807% |
full | 0.622% |
incremental | 0.617% |
Table 5 shows the amount of space each run used. Table 6 shows the same for runtime.
Runs using the manual VACUUM or incremental auto_vacuum methods finished the experiment. On the other runs, smallfile reported that not enough of its requests were processed, which ended those experiments abruptly. We include the space used by the databases in the off run to show that manual and incremental did indeed reclaim space on the cold bricks. When removing a file, Gluster removes an entry from the database and creates a new one. The times show that manual is still the slowest, taking roughly 140 additional seconds over the incremental auto_vacuum run.
Table 5: Database space used per brick.

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
off * | 58.298 | 58.298 | 20.238 | 20.238 |
manual | 56.881 | 56.881 | 14.823 | 14.823 |
incremental | 56.271 | 56.693 | 18.227 | 18.199 |
Table 6: Total runtime.

Compaction | Time (s) |
---|---|
off * | — |
manual | 3571.385 |
incremental | 3433.134 |
We added options to GlusterFS to enable database compaction. Our results show that compacting every minute yields the most space savings after deletions at little extra runtime cost. However, these experiments used a relatively small number of files. An admin who deploys Gluster will likely store millions of files. Preliminary results with millions of files show that compacting this frequently does not scale at all, despite recovering a similar amount of space. Future work must determine the right compaction frequency.