In this blog post series, we will take a deep dive into Change Block Tracking (CBT): how it works, how it is affected by change rate and growth, and how this in turn affects image-based backup in Veeam Backup & Replication.
- Backup and Change Rate, Part 1: Change Rate Explained
- Backup and Change Rate, Part 2: Querying Change Rate in vSphere
Understanding the File System
When you start using Veeam Backup & Replication (VBR) for the first time, the biggest difference from traditional vendors is quite apparent: you do not have to install anything inside the virtual machine (VM) you are protecting. Instead of processing the data inside the VM, VBR processes it from the outside, which means VBR does not deal with individual files, but with complete hard disks.
On the hardware side of things, disks are mostly defined by bytes, or rather by groups of bytes called blocks. These blocks almost always have a fixed size, the ‘block size’. So imagine you have a block size of 4KB and a hard drive of 1GB: that means you have 1GB / 4KB = 262144 blocks, or 262144 groups each containing 4KB. You can think of a disk as a single-row matrix, where each cell is one block of 4KB.
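This arithmetic can be sketched in a few lines of Python (a toy model for illustration; real disks report their geometry through the operating system):

```python
BLOCK_SIZE = 4 * 1024      # 4KB blocks
DISK_SIZE = 1 * 1024**3    # 1GB disk

# A disk can be modeled as a flat sequence of fixed-size blocks.
block_count = DISK_SIZE // BLOCK_SIZE
print(block_count)         # 262144 blocks of 4KB each

# The byte offset of any block is simply its index times the block size.
def block_offset(index: int) -> int:
    return index * BLOCK_SIZE
```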
However, from a logical view, the file system defines its own block size. In essence, the block size (sometimes referred to as ‘cluster size’ or ‘allocation unit size’) is a collection of one or more sectors. The reason for this is scalability: the potential size of an NTFS volume is directly linked to the block size chosen. The bigger the block size, the bigger the volume can be. It might be easiest to think of the block size as a collection of bytes that is addressed as one unit by the file system. That also means that every file has its own blocks, which cannot be shared. If a file is too big to fit into a single block, it is split across multiple blocks. If it is too small, it is stored in one block, but the remaining space in that block is wasted.
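The wasted-space effect is easy to quantify (illustrative only; small files stored inside the MFT, discussed below, are an exception):

```python
import math

CLUSTER_SIZE = 4 * 1024  # NTFS default cluster (block) size

def on_disk_size(file_size: int) -> int:
    """Bytes actually allocated: files occupy whole clusters."""
    return math.ceil(file_size / CLUSTER_SIZE) * CLUSTER_SIZE

def slack(file_size: int) -> int:
    """Wasted bytes in the last, partially filled cluster."""
    return on_disk_size(file_size) - file_size

# A 10,000-byte file needs 3 clusters (12,288 bytes): 2,288 bytes are slack.
print(on_disk_size(10_000), slack(10_000))
```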
The file system is responsible for assigning files to these blocks. When you create a new file, special data is kept to describe these files, commonly known as ‘metadata’. They record where which file is stored and how long it is. For storing the real data, the file system will try to fill up the disk quite linearly, and for a single file it tries to assign a continuous stream of data, where blocks are stored next to each other, contiguously. This has a couple of advantages: First of all, the metadata can be kept small, just the start location of the file blocks and its length, for example 10MB (instead of referencing each and every block). Also, with mechanical hard drives, having the relevant bits of data stored very near to each other improves performance, because most of the time, once a process access a block of a file, this is will have a tendency to need to read the rest of the file as well.
As discussed, NTFS has a block size for the file system designated as “bytes per cluster” (cluster size). But the metadata we discussed has a block size too; NTFS calls this “bytes per filerecord segment” (filerecord size). These file records contain metadata and links to the actual blocks, and are stored in the Master File Table (MFT). The MFT is a hidden index file on the file system, and every file reference is a record in this file. By default, NTFS uses a cluster size of 4KB and a filerecord size of 1KB.
Using FSUtil to get File System information
In the fsutil output below you can see references to MFT and MFT2. Documentation on these values is mixed, but it appears that MFT (start lcn) is a pointer to where the filerecord containing all the metadata about the complete MFT is stored. Since this is arguably the most important filerecord, it is kept twice: once at the location referenced by MFT and again at MFT2. NTFS writes these pointers to the boot sector, so that at boot the kernel knows where the MFT filerecords are stored. Armed with that information, it can then figure out where the complete MFT is located.
If the block size is 4KB (cluster size), it means that every file will occupy at least 4KB, and referencing it will take at least one filerecord, which is 1KB (filerecord size). In reality, though, NTFS has a trick up its sleeve, which might seem confusing. If the data is small enough to fit inside a filerecord segment together with the metadata, NTFS will store the data inside the filerecord, with no data written outside of the MFT. So if you wish to test writing files to disk with specific allocation sizes, make sure your file data is bigger than the filerecord size; otherwise you might get the confusing result that the file does not take up even a single block on disk (as it is completely stored in the file record).
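A simplified model of this “resident file” behaviour makes the threshold concrete. The record-header overhead below is an assumption for illustration; the real cutoff depends on which attributes the record holds:

```python
FILERECORD_SIZE = 1024   # NTFS "bytes per filerecord segment"
RECORD_OVERHEAD = 360    # assumed metadata overhead inside the record (varies in practice)
CLUSTER_SIZE = 4 * 1024

def allocated_clusters(file_size: int) -> int:
    """Clusters consumed outside the MFT (0 if the data stays resident)."""
    if file_size <= FILERECORD_SIZE - RECORD_OVERHEAD:
        return 0                           # data fits inside the MFT record itself
    return -(-file_size // CLUSTER_SIZE)   # ceiling division

print(allocated_clusters(500))    # 0 -> resident in the MFT, no block on disk
print(allocated_clusters(2048))   # 1 -> one 4KB cluster on disk
```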
Finally, the fsutil output shows something else interesting: the ‘MFT zone start’. To make sure that the MFT does not fragment, the file system reserves contiguous space for it in advance. Online resources mention that by default 12.5% of the file system is reserved when NtfsMftZoneReservation = 1. However, during testing it appears that Windows 10 and Windows Server 2016 use NtfsMftZoneReservation = 0, which results in a smaller zone being reserved.
If we look at the ‘MFT zone start’ value and convert 0x2a9aa to decimal, we get 174506. Multiply that number by the bytes per cluster (4KB) and it tells you that the MFT starts around 680MB into the file system (in this example). Knowing that there are around 523519 (0x7fcff) clusters at 4KB, we know that the disk is around 2GB in size. Interestingly enough, you will see in the next sections that there are always some blocks changed in this area (around a third of the way into the disk). That leads to a very important conclusion: even if you only change the name of a file or update access attributes, something will be changed on disk. It might not be the data itself, but remember that the metadata needs to be backed up as well. So even if you add a file that is perfectly aligned, there will always be a couple of extra blocks that need to be updated. The good thing about the MFT is that it is pre-reserved while formatting the drive, so even with a heavily fragmented file system, the blocks inside the MFT should be quite contiguous.
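The conversion above is straightforward to verify:

```python
BYTES_PER_CLUSTER = 4 * 1024

mft_zone_start = 0x2A9AA   # cluster number from the fsutil output above
total_clusters = 0x7FCFF

# Convert cluster numbers to byte offsets and sizes.
mft_offset_mb = mft_zone_start * BYTES_PER_CLUSTER / 1024**2
disk_size_gb = total_clusters * BYTES_PER_CLUSTER / 1024**3

print(f"MFT zone starts ~{mft_offset_mb:.0f}MB into the volume")  # ~682MB
print(f"Volume size ~{disk_size_gb:.2f}GB")                       # ~2.00GB
```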
This probably did not sound like rocket science, but people tend to forget that this is how disks work. And as Veeam Backup & Replication backs up a virtual machine from the outside, it deals with blocks, not files.
If you want to know even more about NTFS, Microsoft has an excellent blog post about it (and yes, it gets even more complicated):
Understanding Change Block Tracking
Change Block Tracking (CBT) is a framework developed by VMware to allow backup applications to identify what data has changed between points in time. As the name suggests, it tracks which blocks change when a VM writes to a disk. Notice the word ‘disk’. This means that although you enable CBT on the VM level, changes are tracked per disk. You can see this when looking at the files on your datastore as well: for each VMDK, a new “-ctk.vmdk” (CTK) file will be created.
CBT itself is blissfully unaware of the file system block size. In fact, it does not try to understand it. That is a good thing, because it makes CBT independent of the chosen file system and operating system. Instead it chooses an arbitrary block size. It seems that the block size starts at 64KB and grows with the size of the disk<sup>1</sup>. Our own testing of CBT appears to confirm this.
So what happens when a VM writes to a disk? Well, VMware will identify which blocks have been affected and flag these blocks as changed, by recording this information in the CTK file. This has some impact on backup. Imagine someone changes 1KB of data: effectively, a complete block of 64KB is flagged as changed. In most cases that is fine. As discussed earlier, the file system tries to fill up the disk in a continuous stream, and we will demonstrate this later on.
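This flagging behaviour can be sketched as a small helper (a simplified model, assuming a fixed 64KB CBT block size):

```python
CBT_BLOCK = 64 * 1024   # assumed starting CBT block size

def flagged_blocks(write_offset: int, write_length: int) -> range:
    """CBT blocks touched by a write of write_length bytes at write_offset."""
    first = write_offset // CBT_BLOCK
    last = (write_offset + write_length - 1) // CBT_BLOCK
    return range(first, last + 1)

# A 1KB write flags one full 64KB block...
print(list(flagged_blocks(100 * 1024, 1024)))       # [1]
# ...and a 1KB write straddling a block boundary flags two.
print(list(flagged_blocks(64 * 1024 - 512, 1024)))  # [0, 1]
```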
Some people are afraid that using CBT for both backup and replication jobs on the same VM will yield corrupt backups and replicas. They assume Change Block Tracking only returns the blocks that have changed since the previous query. In reality, there is nothing to be afraid of. In the background, every CBT-enabled disk has a timestamp called a ‘changeId’. You can imagine this as a kind of integer that is incremented every time a change occurs (in reality, it seems the changeId is not constantly incremented, but only when enough changes occur). The first time you back up, you get a changeId, store it somewhere safe, and back up all the data on that disk. The next time you back up, you request the current changeId and store it again. Now you have two timestamps, and you can ask the VMware vSphere API which blocks have changed between the current changeId and the previous changeId.
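The changeId bookkeeping can be modelled with a toy tracker (purely illustrative; the real CBT implementation lives inside ESXi, and this model increments the id on every write, which real CBT does not):

```python
class ToyChangeTracker:
    """Toy model of per-disk CBT: remembers which blocks changed at each changeId."""
    def __init__(self):
        self.change_id = 1
        self.history = []          # list of (change_id, block_index)

    def write(self, block_index: int):
        self.change_id += 1        # simplification: real CBT bumps the id less often
        self.history.append((self.change_id, block_index))

    def changed_since(self, previous_change_id: int) -> set:
        """Blocks changed after previous_change_id, up to now."""
        return {b for cid, b in self.history if cid > previous_change_id}

disk = ToyChangeTracker()
baseline = disk.change_id          # stored safely by the backup job
disk.write(3); disk.write(9)
print(sorted(disk.changed_since(baseline)))   # [3, 9]
```

Two jobs can share this safely: each stores its own baseline changeId and asks for the delta relative to that baseline, never relative to “the last query”.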
You might visualize the CTK as a log file (though the vSphere API hides the exact implementation details). The first time you back up, you acquire a timestamp or changeId ‘2’ (as per the example above). You do not have a point of reference yet. You can actually still query the API with a single reference, but it will tell you all the changes since the beginning of time (or rather, the birth of your VM). Now as per the example, after your backup is complete, your replication job starts. Since there are enough changes, the API returns a changeId of ‘3’. Finally, your backup job restarts the next day. Again some changes have occurred, so the API now returns a changeId of ‘4’. Armed with two changeIds, you ask the API which blocks have changed between changeId ‘2’ and ‘4’. This returns blocks 3, 9, 10 and 11. The API consolidates adjacent blocks, so it actually returns block ‘3’ with a length of 64KB and block ‘9’ with a length of 192KB.
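The consolidation step can be reproduced in a few lines (a sketch of the behaviour, assuming a 64KB CBT block size; the API returns extents as byte offset/length pairs):

```python
CBT_BLOCK = 64 * 1024

def consolidate(changed_blocks):
    """Merge adjacent changed blocks into (offset, length) extents,
    the way the vSphere API returns consolidated changed areas."""
    extents = []
    for block in sorted(changed_blocks):
        offset = block * CBT_BLOCK
        if extents and extents[-1][0] + extents[-1][1] == offset:
            # Adjacent to the previous extent: extend it.
            extents[-1] = (extents[-1][0], extents[-1][1] + CBT_BLOCK)
        else:
            extents.append((offset, CBT_BLOCK))
    return extents

# Blocks 3, 9, 10, 11 -> one 64KB extent and one 192KB extent.
for off, length in consolidate([9, 10, 11, 3]):
    print(f"offset={off}, length={length // 1024}KB")
```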
To get a changeId, you first have to take a snapshot of the VM. Within the snapshot you can find the disk itself and the changeId at that specific point in time. This makes sense, because you can then be certain that the data you are backing up is consistent and matches your CBT query.
Veeam Backup & Replication
Because the CBT block size is not statically defined (it depends on the disk size), Veeam Backup & Replication uses its own, larger block size by default: it divides the disk into bigger blocks and translates the CBT results onto them.
You can actually configure this block size in the storage optimization settings as part of a backup or replication job. There are four settings and starting with VBR v9, these are the options:
– Local Target (16TB+) – 4096KB
– Local Target – 1024KB
– LAN Target – 512KB
– WAN Target – 256KB
The default is ‘local target’ and it gives the best balance between performance and disk savings in most cases. Do note that every change will flag a complete 1MB block in this scenario. Again, in normal scenarios it does not have an impact because the file system tries to fill up the blocks on disk contiguously. We will discuss some extreme corner cases and possible solutions later on.
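The translation from CBT extents to VBR blocks could be sketched like this (an illustration of the principle, not Veeam’s actual implementation):

```python
VBR_BLOCK_SIZES = {                      # storage optimization settings (v9+)
    "Local Target (16TB+)": 4096 * 1024,
    "Local Target": 1024 * 1024,
    "LAN Target": 512 * 1024,
    "WAN Target": 256 * 1024,
}

def vbr_blocks_for_extent(offset: int, length: int, vbr_block: int) -> range:
    """VBR blocks that a changed CBT extent falls into."""
    first = offset // vbr_block
    last = (offset + length - 1) // vbr_block
    return range(first, last + 1)

# A single changed 64KB CBT extent still forces a full 1MB VBR block to be processed.
blocks = vbr_blocks_for_extent(3 * 64 * 1024, 64 * 1024,
                               VBR_BLOCK_SIZES["Local Target"])
print(list(blocks))   # [0] -> 1MB handled for 64KB of actual change
```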
Visualizing Change Block Tracking
To show how Change Block Tracking works in combination with Veeam Backup & Replication, we developed a small utility called cbtquery. In the following example we will show what happens when you add and delete files. The results are from a secondary 2GB hard drive, and the configured block size in this example is 256KB (WAN target). The number of blocks is therefore 2GB / 256KB = 8192 blocks. You don’t have to take our word for it: you can count the blocks yourself.
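The kind of picture cbtquery produces can be approximated with a toy renderer (this is not the actual cbtquery tool, just a way to picture 8192 blocks as a row of characters):

```python
DISK_SIZE = 2 * 1024**3
VBR_BLOCK = 256 * 1024
BLOCK_COUNT = DISK_SIZE // VBR_BLOCK   # 8192 blocks, as stated above
WIDTH = 64                             # characters per rendered row

def render(changed: set, blocks: int = BLOCK_COUNT, width: int = WIDTH) -> str:
    """One character per group of blocks: '#' if any block in the group changed."""
    per_cell = blocks // width
    row = ""
    for cell in range(width):
        lo, hi = cell * per_cell, (cell + 1) * per_cell
        row += "#" if any(b in changed for b in range(lo, hi)) else "."
    return row

# A 100MB file written at the start of the disk changes the first 400 blocks
# (100MB / 256KB = 400), which shows up at the left edge of the map.
print(render(set(range(400))))
```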
Like we discussed earlier, when you format a volume, some metadata is created. This is mainly the MFT. When we look at our 2GB disk being formatted, even without adding files, changes already occur.
Now let’s add a 100MB file.
When we query CBT we get quite a predictable result.
The file system (NTFS) has put the data at the very beginning of the disk. You might also have noticed a couple of small changed blocks around a third of the way into the disk. One can never be 100% sure from a Change Block Tracking perspective, but as we saw earlier, this is the filerecord metadata referring to the actual file blocks on disk. Now let’s add a few more files.
Now we added four files that are around 25MB each.
Again the same result: the file system has filled up the disk with contiguous blocks. Now what happens when we delete some files? Notice that we left some of the newer 25MB files in place.
Well nothing really. In this example, we deleted the first big file and some other small files but the only thing that changed was probably the metadata referring to those blocks.
If we now add more small files, you will see that the file system tries to reuse the freed data blocks.
It literally overwrote the blocks at the beginning of the disk.
However, notice that those 25MB files are still there. That means that if we were to add a new, very big file, the file system would have two choices: either split the data over multiple regions, or start somewhere after those 25MB files. So let’s add a 250MB file.
In this case the file system chose to start after the existing files to get a nice contiguous region.
In fact if we check the VBR backup, we will see that it is very much in line with the file size itself.
Finally, as we saw, the file is around 250MB, but we need to back up slightly more. That is because we need to back up not only the file but also the metadata that has been created, and a small misalignment might add an extra block.
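The overhead follows directly from block granularity. The numbers below are illustrative assumptions; the exact overhead depends on alignment and on how much metadata actually changed:

```python
VBR_BLOCK = 256 * 1024        # WAN target block size used in this example
FILE_SIZE = 250 * 1024**2

# Data blocks: the file itself, rounded up to whole VBR blocks.
data_blocks = -(-FILE_SIZE // VBR_BLOCK)   # ceiling division -> 1000 blocks

# Assumed extras: one block of misalignment plus a couple of metadata (MFT) blocks.
misalignment_blocks = 1
metadata_blocks = 2

total = (data_blocks + misalignment_blocks + metadata_blocks) * VBR_BLOCK
print(f"{total / 1024**2:.2f}MB backed up for a 250MB file")  # 250.75MB
```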
Next we will cover how to manually query Change Block Tracking data via vSphere’s Managed Object Browser (MOB), and after that we will discuss how file system changes can have explosive effects on change rate.