The Gluster Blog

Gluster blog stories provide high-level spotlights on our users all over the world

Compacting SQLite Databases in GlusterFS

Gluster
2016-09-02

Tiering is a powerful feature in Gluster. It divides the available storage into two parts: the hot tier, populated by small, fast storage devices like SSDs or a RAMDisk, and the cold tier, populated by large, slow devices like mechanical HDDs. By placing the most recently accessed files in the hot tier, Gluster can quickly process client requests. The system needs to store a lot of metadata on the files clients use to determine what should stay in cold storage and what should move, or migrate, to hot storage. Gluster stores this metadata in a database on each brick. However, as clients continue to use Gluster, database operations slow down. This results in slow tier migration and poor overall performance.

We looked into shrinking the database file using SQLite’s built-in compaction commands: VACUUM, full auto_vacuum, and incremental auto_vacuum. These commands remove database fragmentation and can shrink the database file on disk.

The patch can be found at http://review.gluster.org/#/c/15031

Tiering

Gluster is a distributed file system. It allows someone to use a set of storage devices, or bricks, across the network as if they were one large device on their local system. Tiering is a feature in Gluster that improves performance. It divides the bricks into two sets, or tiers: hot and cold. The hot tier stores recently used files, while the cold tier stores all other files. The hot tier is usually populated by faster but smaller disks, like SSDs on the order of GBs. The cold tier consists of larger but slower disks, like mechanical HDDs on the order of TBs. By moving recently used files to the hot tier, a client can read and write those files at a much faster rate without replacing all the bricks with more expensive SSDs. The tier daemon is in charge of all of this. [6]

Migration and Metadata

The tiering daemon must decide which files to migrate, or move, from the cold tier to the hot tier. When the smaller hot tier runs out of space, the daemon must also decide which files in the hot tier must return to the cold tier. In particular, the daemon needs the access times for every file the client uses. The daemon stores this metadata in databases on each brick.

When the client accesses a file from some brick B, the daemon passes the I/O to the process on B. The I/O then passes through the ChangeTimeRecorder, or CTR. The CTR records all access times in the database on B. When the daemon wants to figure out which files to migrate, it asks the CTRs on each brick B to query the database for eligible files. In both cases, the CTR uses a database-agnostic library, libgfdb, to interact with the database.

The Problem

Gluster currently uses SQLite3 as the database. Joseph Elwin Fernandes goes into why they chose SQLite3 and how they optimized it for writes in his blog post. In short, he uses a Write-Ahead Log (WAL) to record writes quickly. The database slowly consumes those writes and updates itself over time. [7] However, this solves only half the equation for migration. Writes are fast, but they slow down as the database grows over time. The graph below shows the performance of write operations on the database. For every number of files N, we performed N INSERTs, UPDATEs, and DELETEs each. We report the time it took to complete each batch of operations.

[Figure: time to complete each batch of INSERT, UPDATE, and DELETE operations on an HDD as the number of files grows]

The database also takes up precious space in the more expensive hot tier, and its growth means the CTR’s operations take longer.

Is there a way we can keep the database, but improve its, and therefore migration’s, performance? To answer that question, we explored the idea of compacting the databases in Gluster. To understand what compaction is and what benefits it provides, we must go over what it fixes: fragmentation.

Database Fragmentation

SQLite divides its database file into 4KB blocks called pages. These pages are in one of the following states:

  • Used by a table in the database. There can be many tables in one database, but no two tables can own the same page at once.
  • Used by SQLite to store metadata.
  • Unused or free.

A SQLite database is fragmented when the data for a single table A is no longer contiguous. That is, either a free page or a page for another table B exists between two pages for A.

Consider Gluster’s database files, which have two tables: F for file accesses and H for hard links. The array below represents a database file. Each cell is a page used by the database. The database below is not fragmented since all pages for table F are together.

No fragmentation here.

Now say we delete enough entries from table F to free a page. The database is now fragmented by that page.

Free page caused fragmentation

If we insert data into table F now, the database is no longer fragmented. Why? SQLite reuses the free page left by the deletion instead of allocating new space at the end of the file.

No fragmentation here.

Now say we free the same page, but insert data into table H. Again, SQLite will reuse the space from the free page. We still have fragmentation.

Fragmentation due to table H

However, this opens up the possibility of the following scenario. Suppose we expand the database schema to have three tables: F for file accesses, H for hard links, and S for symbolic links. Next, suppose we have N pages. Furthermore, suppose the pages in the database file are laid out as below.

Bad fragmentation

Consider the following query

SELECT * FROM F

This query should return all rows from table F, and it still will. However, to get all the rows, we need to read in every page holding F's data, and those pages are scattered around the file. This is the worst-case read performance caused by fragmentation. If we could move the data for table F so that it is all clustered together, the reads would be sequential.
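The free pages behind this kind of fragmentation are easy to observe with SQLite's own pragmas. Below is a minimal sketch using Python's sqlite3 module; the file and table names are illustrative, not Gluster's actual schema. Deleting a contiguous range of rows empties whole pages, which land on SQLite's freelist without shrinking the file:

```python
import os
import sqlite3
import tempfile

# Throwaway database file; the path and table are illustrative.
path = os.path.join(tempfile.mkdtemp(), "frag_demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO f VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(2000)])
conn.commit()

# Deleting a contiguous range of rows empties whole pages. SQLite
# puts them on its freelist but does not shrink the file.
conn.execute("DELETE FROM f WHERE id >= 1000")
conn.commit()

free = conn.execute("PRAGMA freelist_count").fetchone()[0]
total = conn.execute("PRAGMA page_count").fetchone()[0]
print(f"{free} of {total} pages are free")
conn.close()
```

Those free pages stay in the file until something reuses or removes them, which is exactly what the VACUUM family of commands does.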

Compaction Methods in SQLite3

Users can remove fragmentation by compacting the database. This technique is not full defragmentation: as we will see below, the database may not end up in a perfectly defragmented state.

SQLite has a set of commands for this called VACUUMs. There are three types in SQLite3 at the time of writing: VACUUM, full auto_vacuum, and incremental auto_vacuum.

VACUUM

VACUUM will reorganize the database by inserting all the data into a new transient database. This places all used pages from the same tables next to each other and any free pages at the end. This eliminates all fragmentation in the database when called. [1]
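As a concrete illustration, here is a minimal sketch of a manual VACUUM using Python's sqlite3 module (file and table names are illustrative). After deleting half the rows the file keeps its old size; VACUUM rebuilds it and returns the space:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vacuum_demo.db")
# Autocommit mode: VACUUM cannot run inside an open transaction.
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO f VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(2000)])
conn.execute("DELETE FROM f WHERE id >= 1000")

before = os.path.getsize(path)
conn.execute("VACUUM")  # rebuild the file, dropping free pages
after = os.path.getsize(path)
print(f"{before} bytes -> {after} bytes")
conn.close()
```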

However, VACUUM is not a silver bullet.

During the reorganization, the database cannot process any other transactions, so no client can add new data to the database. Similarly, VACUUM fails if the database is in the middle of a transaction. [3]

The process can cause the database file to use at most twice the original database’s space, since the contents are copied into a transient database. The worst case occurs when free pages are not the cause of fragmentation, so the copy is as large as the original.

Someone must call VACUUM periodically to prevent fragmentation from building again.

Finally, VACUUM takes time even when there is no fragmentation present in the database.

The last two points raise an interesting question: when is the best time to call for a VACUUM? If we call it too early, we waste more time on compaction than we would lose by letting fragmentation build. If we call it too late, we have already suffered the penalties of fragmentation. The developers of SQLite provide a solution to this question in the form of the auto_vacuum.

auto_vacuum

The auto_vacuum is a compaction mechanism that uses metadata to ease the work of compaction. auto_vacuum comes in two flavors, full and incremental. Each flavor helps in different ways.

full auto_vacuum: A full auto_vacuum removes all free pages in the database and truncates the file after every commit. With this, no one has to invoke compaction regularly, and we do not lose time when there are no free pages to remove. [4]
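A full auto_vacuum can be sketched the same way (again with Python's sqlite3 and illustrative names); note the pragma must be set before the first table is created. With it enabled, the file shrinks at the commit of the DELETE, with no explicit VACUUM call:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "full_av_demo.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit
# Must be set on a fresh database, before the first table exists.
conn.execute("PRAGMA auto_vacuum = FULL")
conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO f VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(2000)])

before = os.path.getsize(path)
conn.execute("DELETE FROM f WHERE id >= 1000")  # shrinks on commit
after = os.path.getsize(path)
print(f"{before} bytes -> {after} bytes")
conn.close()
```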

incremental auto_vacuum: An incremental auto_vacuum is like a full auto_vacuum, but it removes N free pages from the file, where N is user-specified. If N is not specified, all free pages are removed. This causes the file to shrink on disk. Despite its name, this version only removes free pages when invoked with a specific pragma, incremental_vacuum(N). [4]
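The incremental flavor can be sketched similarly (Python's sqlite3, illustrative names). Deleted pages collect on the freelist and nothing shrinks until the incremental_vacuum pragma runs; with no argument it removes every free page:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "incr_av_demo.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit
conn.execute("PRAGMA auto_vacuum = INCREMENTAL")  # before first table
conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO f VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(2000)])
conn.execute("DELETE FROM f WHERE id >= 1000")

free_before = conn.execute("PRAGMA freelist_count").fetchone()[0]
# No argument: remove all free pages. fetchall() steps the pragma
# to completion so the work actually happens.
conn.execute("PRAGMA incremental_vacuum").fetchall()
free_after = conn.execute("PRAGMA freelist_count").fetchone()[0]
print(f"free pages: {free_before} -> {free_after}")
conn.close()
```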

Both flavors of auto_vacuum are not silver bullets either.

Neither one completely eliminates fragmentation. There is no guarantee that data from the same table will be next to each other after the operation completes. In fact, auto_vacuum can make fragmentation worse. If a free page exists between a page for table A and a page for table B, moving or deleting that page still leaves the database fragmented.

Gluster Changes

An admin can activate the compaction capability with

gluster volume set <volname> tier-compact <off|on>

Once active, the system will trigger compaction at regular intervals. The admin can change the frequency of compaction on the hot and cold tiers using the following commands

gluster volume set <volname> tier-hot-compact-frequency <int>
gluster volume set <volname> tier-cold-compact-frequency <int>

Recall that the tier daemon handles tier migration for a volume. When a compaction is triggered for a given tier, the daemon sends a compaction IPC to the CTR. The specification for this IPC follows:

IPC: GFDB_IPC_CTR_SET_COMPACT_PRAGMA
INPUT: gf_boolean_t compact_active: Is compaction currently running?
       gf_boolean_t compact_mode_switched: Did the user flip the compaction switch on/off?
OUTPUT: 0: Compaction succeeded
        Non-zero: Compaction failed

libgfdb now has an abstract method for triggering compaction which the IPC uses to tell the database to compact.

void compact_db (void *db_conn, int compact_type, int old_compact_type);

When gfdb_sqlite receives the call to compact, it must decide which compaction technique to call and when. Since compaction can be turned off and on, there are times when it must issue a manual VACUUM to make a pragma change take effect, even when a full auto_vacuum is set up. The steps to make that decision follow:

  1. Change the auto_vacuum pragma as necessary to vacuum_type.
  2. If the pragma changed from OFF to FULL or INCR, perform manual VACUUM. Similar if the pragma changed from FULL or INCR to OFF.
  3. Otherwise, if the auto_vacuum pragma is INCR, perform INCREMENTAL_VACUUM(N).
  4. Otherwise, if the database is using manual compaction, perform manual VACUUM.
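The decision steps above can be sketched in Python as follows. This is a hypothetical illustration of the logic, not the actual gfdb_sqlite C code; the type names and returned SQL strings are assumptions:

```python
# Hypothetical compaction types; Gluster's real enum lives in libgfdb.
OFF, FULL, INCR, MANUAL = "off", "full", "incr", "manual"
PRAGMA_NAME = {OFF: "NONE", FULL: "FULL", INCR: "INCREMENTAL"}

def choose_compaction(old_type, new_type):
    """Return the SQL statements to run for a compaction request."""
    steps = []
    if new_type in PRAGMA_NAME and new_type != old_type:
        # Step 1: change the auto_vacuum pragma as necessary.
        steps.append(f"PRAGMA auto_vacuum = {PRAGMA_NAME[new_type]}")
    auto = {FULL, INCR}
    if (old_type == OFF and new_type in auto) or \
       (old_type in auto and new_type == OFF):
        # Step 2: a pragma flip only takes effect after a manual VACUUM.
        steps.append("VACUUM")
    elif new_type == INCR:
        # Step 3: incremental mode frees pages only when asked.
        steps.append("PRAGMA incremental_vacuum")
    elif new_type == MANUAL:
        # Step 4: manual compaction is a plain VACUUM.
        steps.append("VACUUM")
    return steps

print(choose_compaction(OFF, FULL))
```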

Experimental Results

 

Experimental Setup

Now we want to measure the impact of using different compaction methods on Gluster. In particular, we focused on the impact of frequent compactions. We ran Gluster on a single computer with a 4 brick setup: 2 for replicated cold storage on the HDD and 2 for replicated hot storage on RAMDisk limited to 784MB each. We used smallfile, a testing tool for measuring file I/O, to stress the storage in the following ways: [5]

  • Create files
  • Create files, then delete them all.
  • Create files, then rename them.

In each experiment, we used 10000 files, each 10 bytes in size. We wanted to stress the metadata collection and use in Gluster, which are integral to tiering.

We set the compaction frequency for both the hot and cold tiers to a minute each.

We set the incremental auto_vacuum method to remove all free pages. We want to measure the impact of full auto_vacuum removing all free pages every database commit instead of every minute.

Creation Experiment

Table 1 shows the amount of space used by a run with each compaction method, normalized to a run with no compaction. Table 2 compares the runtime of each run to a run with no compaction.

We see that all the compaction methods regained very little space. The manual method recovered the most space since it is the only compaction method that removes all fragmentation. Since this experiment only creates files, it can only add metadata to the databases. Therefore, there are no free pages for the auto_vacuum methods to collect. This explains their smaller gains. All three runs with compaction also lost very little time, with manual compaction taking the longest with a 1% increase over a run with no compaction.

Compaction Brick 1 Brick 2 Brick 3 Brick 4
manual −1.169% −1.205% −3.083% −2.910%
full 0.241% 0.241% −0.357% −0.179%
incremental 0.102% 0.083% −0.022% 0.157%
Table 1: Space used when creating 100000 files with compaction active normalized to a run without compaction. Negative percentage means the run with a compaction method reclaimed space.

Compaction Time
manual 1.044%
full 0.185%
incremental 0.524%
Table 2: Time spent creating 100000 files with compaction active normalized to a run without compaction. Negative percentage means the run with a compaction method ran faster.

Deletion Experiment

Table 3 shows the amount of space used by a run with compaction, normalized to a run without compaction. Table 4 shows the same for runtime.

The runtime numbers are similar to those in the creation experiment. However, the runs with compaction regained a lot of space. As we delete files, we remove data from the database. This creates free pages and the fragmentation caused by those pages. All of the compaction methods handle this type of fragmentation well. We see the incremental auto_vacuum almost matches the full auto_vacuum on the hot bricks. However, it does not do as well on the cold bricks. Another call to the incremental auto_vacuum clears the remaining space, so this difference is due to timing alone.

Compaction Brick 1 Brick 2 Brick 3 Brick 4
manual −81.304% −81.328% −96.163% −96.164%
full −99.944% −99.944% −99.865% −99.865%
incremental −79.162% −79.198% −95.602% −95.603%
Table 3: Space used when creating and deleting 100000 files with compaction active normalized to a run without compaction. Negative percentage means the run with a compaction method reclaimed space.

Compaction Time
manual 1.807%
full 0.622%
incremental 0.617%
Table 4: Time spent creating and deleting 100000 files with compaction active normalized to a run without compaction. Negative percentage means the run with a compaction method ran faster.

Renaming Experiment

Table 5 shows the amount of space each run used. Table 6 shows the same for runtime.

Runs using the manual VACUUM or incremental auto_vacuum methods finished the experiment. On the other runs, smallfile reported that too few of its requests were processed, which ended those experiments abruptly. We include the space used by the databases in the off run to show that manual and incremental did indeed reclaim space on the cold bricks. When renaming a file, Gluster removes an entry from the database and creates a new one. The times show that manual is still the slowest, taking an additional 140 seconds over the incremental auto_vacuum run.

Compaction Brick 1 Brick 2 Brick 3 Brick 4
off * 58.298 58.298 20.238 20.238
manual 56.881 56.881 14.823 14.823
incremental 56.271 56.693 18.227 18.199
Table 5: MBs of space used when creating and renaming 100000 files with compaction active. A lower number means less space was used. We marked the off run with an asterisk since it did not finish the experiment.

Compaction Time (s)
off *
manual 3571.385
incremental 3433.134
Table 6: Time, in seconds, spent creating and renaming 100000 files with compaction active. A lower number means less time was spent. We marked the off run with an asterisk since it did not finish the experiment.

Conclusion

We added options to GlusterFS to enable database compaction. Our results found that if we compact every minute, we get the most space savings after deletions at little extra cost to runtime. However, the experiments use a relatively small number of files. If an admin decides to use Gluster, then they will be using millions of files. Preliminary results with millions of files show us that using compaction this frequently does not scale at all, despite recovering a similar amount of space. Future work must determine what the right frequency is.

References

