ZFS Inode Generations

Quickly determining the ZFS transaction generation of an inode.

I recently had reason to write a tool to replay some of the ZFS snapshot history of a server. I wanted to merge two old volumes into three new volumes so that their snapshot retention schedules would better match their content. The bulk of that is a story for another time.

To be efficient, the tool needed to be able to track files as they changed across snapshots and "replay" those changes. An early idea was to match up inodes (and compare ctimes and the like), but as I started testing the tool I noticed that there were occasionally inode pairs that did not make sense: files in one snapshot had the same inode as directories in a later one.

Only then did it click that inodes are recycled, and matching inodes alone was not enough.

ZFS ships with a zdb command which lets you dive into the underlying structures. There is a ton of useful information in its output, but running it takes a very long time:

$ time zdb -dddd tank/test 2
Dataset tank/test [ZPL], ID 5274, cr_txg 14228119, 192K, 7 objects, rootbp DVA[0]=<0:144d818d1000:3000> DVA[1]=<0:3185543e2000:3000> [L0 DMU objset] fletcher4 uncompressed LE contiguous unique double size=800L/800P birth=14652201L/14652201P fill=7 cksum=b6cc140f9:c8e0ea52271:90488037fa4f8:4f85961ce1f2c11

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    512      0     512    512  100.00  ZFS plain file
                                               168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
    dnode maxblkid: 0
    uid     0
    gid     0
    atime   Thu Feb 21 11:14:16 2019
    mtime   Thu Feb 21 11:15:29 2019
    ctime   Thu Feb 21 11:15:29 2019
    crtime  Thu Feb 21 11:07:02 2019
    gen 14652061
    mode    100644
    size    2
    parent  34
    links   1
    pflags  40800000004


real    0m3.783s
user    0m1.412s
sys     0m0.540s

Note that it takes almost 4 seconds to run this command! That is untenable if I'm calling it many thousands of times to get a full picture of the files. (I get the feeling that it is acquiring a lock that is only available at transaction boundaries, but I'm not sure. Regardless, I need to keep the filesystems mounted while running this replay.)

I needed to get this information faster, so cue a few hours of digging through the ZFS on Linux source code.

I learned that one of the methods the ZFS commands use to communicate with the kernel module is ioctl calls. You open /dev/zfs, populate a zfs_cmd struct, and then call ioctl with the appropriate zfs_ioc enum.

Of particular interest to me was the ZFS_IOC_OBJ_TO_STATS request code, which is used by zfs diff to get a little more detail on inodes. This ends up calling zfs_obj_to_stats_impl, which populates the mode, gen, links, and ctime fields on the zfs_cmd struct. We have access to most of those via stat, except gen.
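
For comparison, here is a minimal sketch of collecting the fields that stat does expose. The `basic_stats` struct and `read_basic_stats` helper are hypothetical names made up for illustration; there is no generation field because plain POSIX stat does not report one.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical container for the fields stat(2) can supply. */
struct basic_stats {
    uint64_t ino;    /* inode number */
    uint64_t mode;   /* file type and permission bits */
    uint64_t links;  /* hard link count */
    int64_t  ctime;  /* last status change, seconds since the epoch */
};

/* Fill `out` from stat(2). Returns 0 on success, -1 on failure (errno is set). */
int read_basic_stats(const char *path, struct basic_stats *out) {
    struct stat st;
    if (stat(path, &st) != 0) {
        return -1;
    }
    out->ino   = (uint64_t)st.st_ino;
    out->mode  = (uint64_t)st.st_mode;
    out->links = (uint64_t)st.st_nlink;
    out->ctime = (int64_t)st.st_ctime;
    return 0;
}
```

Note that stat follows symlinks; a replay tool walking a snapshot would probably want lstat instead, which takes the same arguments.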

Taking a step back: ZFS writes changes to disk in "transactions" (transaction groups, or txgs, in ZFS terminology). By default a new one is committed roughly every 5 seconds, though the interval varies with load and other factors. Each transaction is numbered with an incrementing counter. The number of the transaction in which an inode is created is baked in as the "generation" of that inode. When you modify a file the contents of the inode will change, but the generation will remain the same.

Therefore, an (inode, generation) tuple will uniquely identify inodes, even when inodes themselves get recycled (as the generation will necessarily be higher for later inodes)!
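
As a rough sketch of how that tuple could be used when comparing entries from two snapshots (the `file_id` struct and both helper functions are hypothetical names, not part of ZFS or of my tool):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity key: inode number plus creation generation. */
struct file_id {
    uint64_t ino;
    uint64_t gen;
};

/* Two snapshot entries refer to the same file only if both fields match. */
bool same_file(struct file_id a, struct file_id b) {
    return a.ino == b.ino && a.gen == b.gen;
}

/* True when the inode number matches but the generation differs,
 * i.e. the number was recycled for a different file. */
bool inode_recycled(struct file_id earlier, struct file_id later) {
    return earlier.ino == later.ino && earlier.gen != later.gen;
}
```

With this, a file in one snapshot and a directory in a later one that share an inode number compare as different, because their generations differ.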

To that end I wrote a small C command which reads stdin for <snapshot> <inode> lines, and outputs the generation:

int main(int argc, char** argv) {

    int res = 0;
    char dataset[256];

    int fd = open("/dev/zfs", O_RDONLY);
    if (fd < 0) {  /* open() returns -1 on failure; 0 is a valid descriptor. */
        fprintf(stderr, "[zgen] ERROR opening /dev/zfs: %s\n", strerror(errno));
        return 1;
    }

    zfs_cmd_t zc = {"\0"};

    while (1) {

        zc.zc_obj = 0;
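        /* Bound the read so a long name cannot overflow dataset[256]. */
        res = scanf("%255s %lu", dataset, &zc.zc_obj);
        if (res == EOF) {
            break;  /* End of input: exit cleanly. */
        }
        if (res != 2 || !zc.zc_obj) {
            fprintf(stderr, "[zgen] ERROR while reading\n");
            return 2;
        }
        /* strncpy does not NUL-terminate on truncation, so do it explicitly. */
        strncpy(zc.zc_name, dataset, sizeof (zc.zc_name) - 1);
        zc.zc_name[sizeof (zc.zc_name) - 1] = '\0';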

        fprintf(stdout, "%s %lu ", dataset, zc.zc_obj);

        res = ioctl(fd, ZFS_IOC_OBJ_TO_STATS, &zc);

        if (res < 0) {
            fprintf(stdout, "ERROR %d %s\n", errno, strerror(errno));
        } else {
            fprintf(stdout, "%lu\n", zc.zc_stat.zs_gen);
        }
        fflush(stdout);

    }

    return 0;

}

I've left out all of the includes, and the gnarly copy-pasting of ZFS headers I needed to compile this tiny tool. Unfortunately I was not able to figure out how to build this with any headers installable via the OS's package manager, nor could I figure out the ZFS build system well enough to use the headers that come with the source. So... I copy-pasted all of the definitions I needed into my file. Gross.

Anyways, this is an awesome tool that can process many thousands of inodes a second, and really helped speed up my snapshot replays. Here is the basic usage:

./zgen
tank/test 2          # Input.
tank/test 2 14652061 # Output.
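
On the consuming side, each output line has to be split back into its parts. Here is a minimal sketch of that, assuming the two line shapes the tool prints above (a generation on success, or the word ERROR followed by details). The `parse_zgen_line` name and its return-code convention are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of zgen output.
 * Returns  1 for "<dataset> <inode> <generation>" (all outputs filled),
 *          0 for "<dataset> <inode> ERROR ..."    (gen left untouched),
 *         -1 if the line matches neither shape.
 * `dataset` must have room for at least 256 bytes. */
int parse_zgen_line(const char *line, char *dataset,
                    unsigned long *ino, unsigned long *gen) {
    char third[32];  /* either a generation number or the word ERROR */
    char extra;

    if (sscanf(line, "%255s %lu %31s", dataset, ino, third) != 3) {
        return -1;
    }
    if (strcmp(third, "ERROR") == 0) {
        return 0;
    }
    /* Require the third field to be a number with no trailing characters. */
    if (sscanf(third, "%lu%c", gen, &extra) != 1) {
        return -1;
    }
    return 1;
}
```

Giving the error lines their own return code lets the caller skip inodes that have disappeared without aborting the whole run.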

I wish I was able to get just a little bit more information (such as the addresses of indirect blocks that store the actual file content) as then I could immediately tell if a file's content has changed or not. I'm unsure about how to proceed on that, and I'm not sure if it is too niche for the maintainers to care about.

I'd love to hear from you how to compile this a little nicer, or how to get more information in a speedy manner.

Posted on February 21, 2019.