looking for a specific FUSE based filesystem

Dear lazy web,

A friend of mine asked me for help finding the bottleneck behind some performance problems he’s dealing with.
Basically, it looks like I/O operations are the most problematic part, and the main reason is thousands of read/write operations on a directory holding about 150k files.

The solution is easy, right? Spread those files among hashed subdirectories and we’re fine.
But it’s not that easy. There are about 200 bash/perl scripts operating on those files.
So it’s not as simple as `mv aaaaaaaaaaaaa.txt a/aa/aaaaaaaaaaaaa.txt` and so on.
Those scripts have been developed over the past 10 or more years… so yes… no one really knows how or why they work.

So I’m thinking about another approach which should be much easier and cheaper than rewriting hundreds of scripts.

Something like a proxy filesystem sitting between those scripts and the files.
We could still move the files into hashed subdirectories but present them to the scripts as if they were still in a single directory.

Does anyone know of such a filesystem? Or is it something that has to be written?

And another question… is it worth going this way? Is there any chance that some kind of FUSE proxy filesystem backed by really fast hashed subdirectories will be faster than a filesystem with one directory containing a lot of files?

Has anyone tried such an approach?
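
To make the idea more concrete, below is a minimal read-only sketch of the kind of proxy I have in mind, using the third-party Python fusepy package. The two-level `f/fo/` hashing scheme, the names and the paths are just assumptions for illustration, not a finished implementation:

```python
#!/usr/bin/env python3
# Sketch of a FUSE "flattening" proxy: files physically live in hashed
# subdirectories (first letter / first two letters of the name -- an assumed
# scheme), but the mountpoint presents them all in one flat directory.
# Read-only and simplified on purpose; needs the third-party fusepy package.
import errno
import os
import sys

from fuse import FUSE, FuseOSError, Operations


class FlatView(Operations):
    def __init__(self, root):
        self.root = root  # real directory holding the hashed subdirectories

    def _real(self, path):
        # "/foo.txt" -> "<root>/f/fo/foo.txt"; "/" -> "<root>"
        name = path.lstrip('/')
        if not name:
            return self.root
        return os.path.join(self.root, name[:1], name[:2], name)

    def getattr(self, path, fh=None):
        try:
            st = os.lstat(self._real(path))
        except OSError:
            raise FuseOSError(errno.ENOENT)
        return {key: getattr(st, key) for key in (
            'st_atime', 'st_ctime', 'st_gid', 'st_mode',
            'st_mtime', 'st_nlink', 'st_size', 'st_uid')}

    def readdir(self, path, fh):
        # Present every file from the hashed tree as if it lived in "/".
        yield '.'
        yield '..'
        for _dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                yield name

    def open(self, path, flags):
        return os.open(self._real(path), flags)

    def read(self, path, size, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, size)

    def release(self, path, fh):
        return os.close(fh)


if __name__ == '__main__':
    # usage: flatview.py <hashed-tree-root> <mountpoint>
    FUSE(FlatView(sys.argv[1]), sys.argv[2], foreground=True, ro=True)
```

Note that listing the flat directory in this sketch still walks the whole hashed tree, so a proxy like this would only pay off if the scripts mostly open files by name rather than globbing or running ls over the directory.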


10 thoughts on “looking for a specific FUSE based filesystem”

  1. noon

    Is it not also the FS API that makes large directories a pain? This won’t be solved by a “virtual” FS.

    1. fEnIo Post author

      Well, I’m thinking we can work around that large directory full of files with a FUSE-based proxy… but I really have no idea whether it can be faster.

      1. Bjartur Thorlacius

        Depending on how filesystem syscalls and functions are implemented on your system, the problem might be reading through all the filenames in the directory each time you’re opening a file.
        ext2/3 traverses the directory you’re searching until it finds the file you asked for. You can fix that by using a more modern filesystem, such as ext4(1) or btrfs. If you’re using ext2/3, you can upgrade to ext4 without reformatting. But if you’re sequentially processing _all_ of the entries in a directory, you’re better off sticking with ext3 and simplifying your scripts. Features intended for interactive use in manually organized hierarchies, most notably sorting, make your processes grow linearly in space, instead of remaining constant, and up to O(n^2) in execution time.
        You might be using globbing or ls (sans -f) in your scripts. Everywhere you currently use ls in pipelines, try ls -1f; most importantly, this disables sorting. If you’re using Bash, carefully replace * with $(ls -1f).(2) If you can, use dash instead of Bash.

        (1) http://ext2.sourceforge.net/2005-ols/paper-html/node3.html
        (2) Please set IFS=$'\n' in future scripts, so they can handle spaces in filenames.
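
        The same point, illustrated in Python rather than shell (the directory path and the per-file work are hypothetical, just to show the sorted-versus-streamed difference):

        ```python
        import os

        path = '/data/bigdir'  # hypothetical directory with ~150k files

        def process(name):
            print(name)  # stand-in for whatever the real scripts do per file

        # Like plain ls or a shell glob: every name is read into memory and
        # sorted before the first one can be handled -- O(n) space.
        for name in sorted(os.listdir(path)):
            process(name)

        # Like ls -1f: entries are streamed in directory order, one at a
        # time, so memory stays roughly constant regardless of directory size.
        with os.scandir(path) as entries:
            for entry in entries:
                process(entry.name)
        ```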

  2. Joachim Breitner

    I have heard that with modern file systems (e.g. ext4), having many files in one directory is no longer a problem. Of course you shouldn’t start globbing filenames or calling ls in the scripts; but then the FUSE approach won’t help you either.

  3. Iñigo

    Never tried to solve something like that with FUSE.

    Another approach could be to spread the I/O in the layers under the filesystem (LVM striping, RAID, or… spending on an SSD or an SSD RAID).

    If there are 10 years of development on top of that, the cost of two SSDs (and a controller if the original one does not support them) is minimal, and the performance boost (in a reliable way) could be bigger than with striping or software proxies, I think.

  4. Bo Bai

    If directory hashing can solve the problem, you are in luck. Just enable it on ext3 or use ext4. But that is unlikely to be the main problem.

    I would guess that many of the shell scripts are slow because they end up listing all file names and even calling stat on them. There are many ways this could happen, like doing an ls and grepping for a set of files.

    I would start by using ext4 and adding RAM to improve caching, and see how far that gets you. If you still need to improve performance, try tracing the scripts with strace -f -ttt -e trace=file. You will get timestamps for each system call that takes a filename. Look for scripts making many stat calls. Also look for scripts that have many failed open calls. Failed open calls can often be caused by search paths including too many directories.
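
    As a rough way to summarize such a trace afterwards (a sketch; the log filename is hypothetical and strace’s exact line format can vary between versions):

    ```python
    #!/usr/bin/env python3
    # Counts calls per syscall name, and how many of them failed, in a log
    # produced by `strace -f -ttt -e trace=file -o trace.log`.
    import collections
    import re
    import sys

    # Matches "<pid> <timestamp> <syscall>(" at the start of a line;
    # the pid is optional for traces taken without -f.
    LINE = re.compile(r'^(?:\d+\s+)?\d+\.\d+\s+(\w+)\(')

    calls = collections.Counter()
    failed = collections.Counter()

    with open(sys.argv[1] if len(sys.argv) > 1 else 'trace.log') as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue  # skips resumed/unfinished lines, signals, exits
            name = m.group(1)
            calls[name] += 1
            if ' = -1 ' in line:
                failed[name] += 1

    for name, count in calls.most_common():
        print(f'{name:10s} {count:8d} calls  {failed[name]:8d} failed')
    ```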
