Tiering is a powerful feature in Gluster. It divides the available storage into two parts: the hot tier populated by small, fast storage devices like SSDs or a RAMDisk, and the cold tier populated by large, slow devices like mechanical HDDs. By placing the most recently accessed files in the hot tier, Gluster can quickly process requests for clients. The system needs to store a lot of metadata on the files clients use to determine what should stay in cold storage and what should move, or migrate, to hot storage. Gluster stores this in a database on each brick. However, as clients continue to use Gluster, database operations slow down. This results in slow tier migration and overall poor performance.
We looked into shrinking the database file using SQLite's built-in compaction commands: VACUUM, full auto_vacuum, and incremental auto_vacuum. These commands remove database fragmentation and can shrink the actual file.

The patch can be found at http://review.gluster.org/#/c/15031
Gluster is a distributed file system. It allows someone to use a set of storage devices, or bricks, across the network as if they were one large device on their local system. Tiering is a feature in Gluster to improve performance. It divides the bricks into two sets or tiers, hot and cold. The hot tier stores recently used files, while the cold tier stores all other files. The hot tier is usually populated by faster, but smaller disks like SSDs on the order of GBs. The cold tier consists of larger, but slower disks like mechanical HDDs on the order of TBs. By moving recently used files to the hot tier, a client can read and write to the files at a much faster rate than before without changing all the bricks into more expensive SSDs. The tier daemon is in charge of all of this. [6]
The tiering daemon must decide what files to migrate, or move, from the cold tier to the hot tier. When the smaller hot tier runs out of space, the daemon must also decide what files in the hot tier must return to the cold tier. In particular, the daemon needs the access times for every file the client uses. The daemon stores this metadata in databases on each brick.
When the client accesses a file from some brick B, the daemon passes the I/O to the process on B. The I/O then passes through the ChangeTimeRecorder, or CTR. The CTR records all access times in the database on B. When the daemon wants to figure out which files to migrate, it asks the CTR on each brick to query its database for eligible files. In both cases, the CTR uses a database-agnostic library, libgfdb, to interact with the database.
Gluster currently uses SQLite3 as the database. Joseph Elwin Fernandes explains why they chose SQLite3 and how they optimized it for writes in his blog post. In short, he configures SQLite3 to use a Write-Ahead Log (WAL) so writes are recorded quickly; the database consumes those writes and updates itself over time. [7] However, this only solves half the equation for migration. Writes are fast, but they slow down as the database grows in size over time. The graph below shows the database's write performance. For each number of files N, we performed N INSERTs, N UPDATEs, and N DELETEs. We report the time it took to complete each batch of operations.
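For reference, the measurement amounts to a timing loop like the sketch below. This is an illustrative stand-in, not the actual test harness: the schema, file name, and batch shape are assumptions. It simply runs N INSERTs, N UPDATEs, and N DELETEs against a WAL-mode SQLite database and times each batch.

```c
/* Illustrative timing loop: run N INSERTs, UPDATEs, and DELETEs against a
 * WAL-mode SQLite database and time each batch. Schema and file name are
 * made up for this sketch, not taken from Gluster. */
#include <sqlite3.h>
#include <stdio.h>
#include <time.h>

static double timed_batch(sqlite3 *db, const char *sql_fmt, int n)
{
        char    sql[256];
        clock_t start = clock();

        for (int i = 0; i < n; i++) {
                snprintf(sql, sizeof(sql), sql_fmt, i);
                sqlite3_exec(db, sql, NULL, NULL, NULL);
        }
        return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
        sqlite3 *db;
        int      n = 10000;

        sqlite3_open("bench.db", &db);
        sqlite3_exec(db, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS files "
                         "(id INTEGER PRIMARY KEY, atime INTEGER);",
                     NULL, NULL, NULL);

        printf("insert: %.2fs\n",
               timed_batch(db, "INSERT INTO files VALUES (%d, 0);", n));
        printf("update: %.2fs\n",
               timed_batch(db, "UPDATE files SET atime = 1 WHERE id = %d;", n));
        printf("delete: %.2fs\n",
               timed_batch(db, "DELETE FROM files WHERE id = %d;", n));

        sqlite3_close(db);
        return 0;
}
```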
The database also takes up precious space in the more expensive hot tier, and a larger database means the CTR's operations take longer.
Is there a way we can keep the database, but improve its, and therefore migration’s, performance? To answer that question, we explored the idea of compacting the databases in Gluster. To understand what compaction is and what benefits it provides, we must go over what it fixes: fragmentation.
SQLite divides its database file into 4KB blocks called pages. A page is in one of two states: in use, holding data for a table, or free, holding no data and available for reuse.
A SQLite database is fragmented when the data for a single table A is no longer contiguous. That is, either a free page or a page for another table B exists between two pages for A.
Consider Gluster's database files, which have two tables: F for file accesses and H for hard links. The array below represents a database file; each cell is a page used by the database. The database below is not fragmented since all pages for table F are together.
Now say we delete enough entries from table F to free a page. The database is now fragmented by that page.
If we insert data into table F now, the database is no longer fragmented. Why? SQLite will reuse the space freed by the deletion instead of using the completely empty space at the end of the file.
Now say we free the same page, but insert data into table H instead. Again, SQLite will reuse the space from the free page. We still have fragmentation: a page for table H now sits between pages for table F.
However, this opens up the possibility of the following scenario. Suppose we expand the database schema to have three tables: F for file accesses, H for hard links, and S for symbolic links. Next, suppose we have N pages, and the pages in the database file are laid out as below.
Consider the following query:

SELECT * from F

This query should return all rows from table F, and it still will. However, to get all the rows, we need to read in every page with F's data, and those pages are scattered around the file. This is the worst-case read performance caused by fragmentation. If we could move the data for table F around so that all of it was clustered together, then we would be reading sequentially.
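SQLite does not report how interleaved the tables are, but it does expose how many pages are sitting free inside the file. A small check like the one below (an illustration, not part of the Gluster patch; the file name is made up) reads the page_count and freelist_count pragmas to estimate how much space compaction could reclaim.

```c
/* Illustrative helper: report how many of a database's pages are on the
 * free list. Both pragmas are standard SQLite. */
#include <sqlite3.h>
#include <stdio.h>

static int pragma_int(sqlite3 *db, const char *sql)
{
        sqlite3_stmt *stmt = NULL;
        int           value = -1;

        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK &&
            sqlite3_step(stmt) == SQLITE_ROW)
                value = sqlite3_column_int(stmt, 0);
        sqlite3_finalize(stmt);
        return value;
}

int main(void)
{
        sqlite3 *db;

        sqlite3_open("brick.db", &db);
        int total = pragma_int(db, "PRAGMA page_count;");
        int freed = pragma_int(db, "PRAGMA freelist_count;");
        printf("%d of %d pages are free (%.1f%%)\n",
               freed, total, total > 0 ? 100.0 * freed / total : 0.0);
        sqlite3_close(db);
        return 0;
}
```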
Users can remove fragmentation by compacting the database. This technique is not full defragmentation; as we will see in Section 3.2, the database may not end up in a perfect state.
SQLite has a set of commands for this called VACUUMs. There are three types in SQLite3 at the time of writing: VACUUM, full auto_vacuum, and incremental auto_vacuum.
VACUUM will reorganize the database by inserting all the data into a new transient database. This places all used pages from the same tables next to each other and any free pages at the end. This eliminates all fragmentation in the database when called. [1]
However, VACUUM is not a silver bullet.

- During the reorganization, the database cannot process any other transactions, so no client can add new data to the database. Similarly, VACUUM fails if the database is in the middle of a transaction (see the sketch after this list). [3]
- The process can cause the database file to use up to twice the original database's space. This occurs when free pages are not the cause of fragmentation.
- Someone must call VACUUM periodically to prevent fragmentation from building up again.
- Finally, VACUUM takes time even when there is no fragmentation present in the database.
The last two points raise an interesting question: when is the best time to call VACUUM? If we call it too early, we waste more time on compaction than we would lose by letting fragmentation build. If we call it too late, we have already suffered the penalties of fragmentation. The developers of SQLite provide a solution to this question in the form of the auto_vacuum.
The auto_vacuum is a compaction mechanism that uses metadata to ease the work of compaction. auto_vacuum comes in two flavors, full and incremental. Each flavor helps in different ways.
- full auto_vacuum: A full auto_vacuum removes all free pages in the database and truncates the file after every commit. With this, no user has to invoke compaction regularly. We also do not lose time if there are no free pages at the end of the file. [4]
- incremental auto_vacuum: An incremental auto_vacuum is like a full auto_vacuum, but it removes N free pages from the file, where N is user-specified; if N is not specified, all free pages are removed. This causes the file to shrink in size on disk. Despite its name, this version only removes free pages when invoked with a specific pragma, incremental_vacuum(N). [4]
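The pragmas behind both flavors are small enough to show directly. The snippet below is a standalone illustration, not Gluster code: it enables incremental auto_vacuum and later reclaims free pages on demand; substituting FULL enables the other flavor. Note that enabling auto_vacuum on a database that already contains tables only takes effect after a manual VACUUM.

```c
/* Illustrative use of the auto_vacuum pragmas. On a database that already
 * contains tables, the mode change only takes effect after a manual VACUUM. */
#include <sqlite3.h>

int main(void)
{
        sqlite3 *db;

        sqlite3_open("demo.db", &db);

        /* Enable incremental auto_vacuum (use FULL for the other flavor). */
        sqlite3_exec(db, "PRAGMA auto_vacuum = INCREMENTAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "VACUUM;", NULL, NULL, NULL);

        /* ... inserts and deletes happen here, creating free pages ... */

        /* Reclaim up to 100 free pages now; omit the argument to reclaim all. */
        sqlite3_exec(db, "PRAGMA incremental_vacuum(100);", NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
}
```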
Neither flavor of auto_vacuum is a silver bullet either.
Neither one completely eliminates fragmentation. There is no guarantee that data from the same table will be next to each other after the operation completes. In fact, auto_vacuum can make fragmentation worse: if a free page exists between a page for table A and a page for table B, moving or deleting that free page still leaves the database fragmented.
An admin can activate the compaction capability with
gluster volume set <volname> tier-compact <off|on>
Once active, the system will trigger compaction at regular intervals. The admin can change the frequency of compaction on the hot and cold tiers using the following command-line options:
gluster volume set <volname> tier-hot-compact-frequency <int>
gluster volume set <volname> tier-cold-compact-frequency <int>
Recall that the tier daemon handles tier migration for a volume. When a compaction is triggered for a given tier, the daemon sends a compaction IPC to the CTR. The specification for this IPC follows:
IPC: GFDB_IPC_CTR_SET_COMPACT_PRAGMA
INPUT: gf_boolean_t compact_active: Is compaction currently running?
gf_boolean_t compact_mode_switched: Did the user flip the compaction switch on/off?
OUTPUT: 0: Compaction succeeded
Non-zero: Compaction failed
libgfdb now has an abstract method for triggering compaction which the IPC uses to tell the database to compact.
void compact_db (void *db_conn, int compact_type, int old_compact_type);
When gfdb_sqlite receives the call to compact, it must decide which compaction technique to call and when. Since we can turn compaction off and on, there are times when it must call for a manual VACUUM to initiate a pragma change, despite having set up a full auto_vacuum. The steps to make that decision follow; a sketch of this logic appears after the list:
1. Change the auto_vacuum pragma as necessary to match vacuum_type.
2. If the pragma changed from OFF to FULL or INCR, perform a manual VACUUM. Similarly if the pragma changed from FULL or INCR to OFF.
3. If the auto_vacuum pragma is INCR, perform INCREMENTAL_VACUUM(N).
4. Otherwise, perform a manual VACUUM.
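The following is a minimal sketch of that decision logic, assuming a plain sqlite3 connection. The enum, helper function, and error handling are illustrative assumptions, not the actual gfdb_sqlite code from the patch.

```c
/* Minimal sketch of the compaction decision logic described above. */
#include <sqlite3.h>
#include <stdio.h>

enum vacuum_type { VAC_OFF, VAC_MANUAL, VAC_FULL, VAC_INCR };

static int exec_sql(sqlite3 *db, const char *sql)
{
        char *err = NULL;
        int   ret = sqlite3_exec(db, sql, NULL, NULL, &err);

        if (ret != SQLITE_OK) {
                fprintf(stderr, "compaction step failed: %s\n", err);
                sqlite3_free(err);
        }
        return ret;
}

/* Returns 1 if the type uses the auto_vacuum pragma, 0 otherwise. */
static int is_auto(enum vacuum_type t)
{
        return t == VAC_FULL || t == VAC_INCR;
}

int compact_sqlite_db(sqlite3 *db, enum vacuum_type new_type,
                      enum vacuum_type old_type)
{
        /* 1. Change the auto_vacuum pragma to match the requested type. */
        if (new_type == VAC_FULL)
                exec_sql(db, "PRAGMA auto_vacuum = FULL;");
        else if (new_type == VAC_INCR)
                exec_sql(db, "PRAGMA auto_vacuum = INCREMENTAL;");
        else
                exec_sql(db, "PRAGMA auto_vacuum = NONE;");

        /* 2. Switching into or out of an auto_vacuum mode only takes effect
         *    after a manual VACUUM, so run one on a mode switch. */
        if (is_auto(new_type) != is_auto(old_type))
                return exec_sql(db, "VACUUM;");

        /* 3. Incremental mode only reclaims free pages when asked. */
        if (new_type == VAC_INCR)
                return exec_sql(db, "PRAGMA incremental_vacuum;");

        /* 4. Otherwise (manual mode), fall back to a full VACUUM. FULL
         *    auto_vacuum already reclaims pages on every commit, and OFF
         *    means compaction is disabled, so neither needs more work. */
        if (new_type == VAC_MANUAL)
                return exec_sql(db, "VACUUM;");

        return 0;
}
```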
Now we want to measure the impact of using different compaction methods on Gluster. In particular, we focused on the impact of frequent compaction. We ran Gluster on a single computer with a 4-brick setup: 2 bricks for replicated cold storage on an HDD and 2 for replicated hot storage on a RAMDisk limited to 784MB each. We used smallfile, a testing tool for measuring file I/O, to stress the storage with the workloads described below. [5]
In each experiment, we used 10000 files. Each one was 10 bytes. We wanted to stress the metadata collection and use in Gluster which are integral to tiering.
We set the compaction frequency for both the hot and cold tiers to a minute each.
We set the incremental auto_vacuum method to remove all free pages when it runs. We wanted to measure the impact of full auto_vacuum removing all free pages on every database commit instead of every minute.
Table 1 shows the amount of space used by a run with each compaction method, normalized to a run with no compaction. Table 2 compares the runtime of each run to a run with no compaction.
We see that all the compaction methods regained very little space. The manual method recovered the most space since it is the only compaction method that removes all fragmentation. Since this experiment only creates files, it can only add metadata to the databases. Therefore, there are no free pages for the auto_vacuum methods to collect, which explains their smaller gains. All three runs with compaction also lost very little time, with manual compaction taking the longest at a 1% increase over a run with no compaction.
Table 1: Database space used, relative to a run with no compaction (file creation workload).

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
manual | −1.169% | −1.205% | −3.083% | −2.910% |
full | 0.241% | 0.241% | −0.357% | −0.179% |
incremental | 0.102% | 0.083% | −0.022% | 0.157% |
Table 2: Runtime, relative to a run with no compaction (file creation workload).

Compaction | Runtime change |
---|---|
manual | 1.044% |
full | 0.185% |
incremental | 0.524% |
Table 3 shows the amount of space used by a run with each compaction method, normalized to a run without compaction. Table 4 shows the same for runtime.
The runtime numbers are similar to those in the creation experiment. However, we see the runs with compaction regained a lot of space. As we delete files, we remove data from the database. This creates free pages and the fragmentation those pages cause. The compaction methods all handle this type of fragmentation well. We see the incremental auto_vacuum almost matches the full auto_vacuum on the hot bricks. However, it does not do as well on the cold bricks. Another call to the incremental auto_vacuum clears the remaining space, so this difference is due to timing alone.
Table 3: Database space used, relative to a run with no compaction (file deletion workload).

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
manual | −81.304% | −81.328% | −96.163% | −96.164% |
full | −99.944% | −99.944% | −99.865% | −99.865% |
incremental | −79.162% | −79.198% | −95.602% | −95.603% |
Table 4: Runtime, relative to a run with no compaction (file deletion workload).

Compaction | Runtime change |
---|---|
manual | 1.807% |
full | 0.622% |
incremental | 0.617% |
Table 5 shows the amount of space each run used. Table 6 shows the same for runtime.
Runs using the manual VACUUM or incremental auto_vacuum methods finished the experiment. On the other runs, smallfile reported that not enough of its requests were processed, which ended those experiments abruptly. We include the space used by the databases in the off run to show that manual and incremental did indeed reclaim space on the cold bricks. When removing a file, Gluster removes an entry from the database and creates a new one. The times show that manual is still the slowest, taking roughly 140 additional seconds over the incremental auto_vacuum run.
Table 5: Database space used per brick.

Compaction | Brick 1 | Brick 2 | Brick 3 | Brick 4 |
---|---|---|---|---|
off * | 58.298 | 58.298 | 20.238 | 20.238 |
manual | 56.881 | 56.881 | 14.823 | 14.823 |
incremental | 56.271 | 56.693 | 18.227 | 18.199 |
Table 6: Total runtime.

Compaction | Time (s) |
---|---|
off * | — |
manual | 3571.385 |
incremental | 3433.134 |
We added options to GlusterFS to enable database compaction. Our results show that compacting every minute yields the most space savings after deletions at little extra runtime cost. However, these experiments used a relatively small number of files. An admin who deploys Gluster will likely store millions of files. Preliminary results with millions of files show that compacting this frequently does not scale at all, despite recovering a similar amount of space. Future work must determine the right compaction frequency.