VFS

VFS documentation

The VFS documentation is part of the larger Manchester Framework Documentation. That documentation makes some references to Bodington 3; these should read Manchester Framework.


What is the VFS and why does ASK need one?

The virtual file system (VFS) is a library (supplied as a jar file) that provides a file system with some interesting properties:

  • It is cross-platform. An administrator can copy a set of VFS files from Windows to UNIX to Mac and the VFS will supply an identical view of the files and their properties.
  • It provides fine-grained permissions. Security principals (users and/or roles) may be allowed or denied the ability to perform a variety of operations (see, read, write, delete, change metadata, change permissions, take ownership, ...) on the files. The permission system feels very much like Microsoft's NTFS, not least because that's what Peter modelled it on.
  • It provides an interface following the Observer pattern that allows client code to be run when operations are performed on files. This can be used for auditing, quota or notification purposes, for example.
  • It can associate arbitrary XML metadata with each file and directory.
  • It is implemented entirely in Java and we have control over the source code. Therefore, it can be extended with extra functions as required.
  • It provides a core 'common' filestore, with the ability for users to overlay their own version of that filestore without affecting the core. For example, a tutor on an HTML course could provide a sample HTML page and some images in a directory; each student could modify the HTML page and upload their version, overlaying the tutor's version. Each student could see either their overlaid version or the tutor's, but not another student's version.
  • It allows a more complete version of UNIX symbolic links (symlinks), where entire directory structures can overlay ('shadow') one another and any file is obtained from the 'front-most' readable copy. This can be used to provide 'template' directories that can be linked to and then partially customised, but where changes to the template instantly show through.
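
The 'front-most readable copy' rule in the last two points can be sketched as a walk down a stack of layers. This is an illustration only: the names ShadowResolver, Layer and SimpleLayer are hypothetical and are not taken from vfs.jar, and the permission check is reduced to a single owner per layer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of shadow resolution; none of these names come from vfs.jar. */
final class ShadowResolver {

    /** One directory tree in the overlay stack. */
    interface Layer {
        String name();
        boolean contains(String path);
        boolean readableBy(String user, String path);
    }

    /** Toy in-memory layer: a null owner means every user may read it. */
    static final class SimpleLayer implements Layer {
        private final String name;
        private final String owner;
        private final Set<String> paths;

        SimpleLayer(String name, String owner, String... paths) {
            this.name = name;
            this.owner = owner;
            this.paths = new HashSet<>(Arrays.asList(paths));
        }

        @Override public String name() { return name; }

        @Override public boolean contains(String path) { return paths.contains(path); }

        @Override public boolean readableBy(String user, String path) {
            return owner == null || owner.equals(user);
        }
    }

    /** Returns the front-most layer that holds 'path' and that 'user' may read. */
    static Optional<Layer> resolve(List<Layer> frontToBack, String user, String path) {
        for (Layer layer : frontToBack) {
            if (layer.contains(path) && layer.readableBy(user, path)) {
                return Optional.of(layer);
            }
        }
        return Optional.empty();
    }
}
```

With a stack of [alice's layer, bob's layer, tutor's layer], Alice resolves index.html to her own copy, Bob to his, and any other user falls through to the tutor's copy. A file that only the tutor supplied shows through for everyone, which is the 'template' behaviour described above.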

The per-user filestore and shadowing features may not be useful to ASK, but the other features are of relevance to a repository and are not otherwise available in a pure-Java form.

Limitations to the VFS

The VFS must run on a Linux system to avoid the Windows file-handle problem that occasionally causes deletes to fail. This could be avoided almost entirely if we were to move storage of the metadata from the .metadata file into eXist - see possible workpackage below.

  • (HN) What was the Tomcat mod you did? This is possibly a limitation if other people doing installations need to use a non-standard version of Tomcat.
  • (PC) Please distinguish 'the virtual file system' (VFS) from 'the Manchester PLE/VLE Framework'. The VFS is a standalone component (actually a jar) that provides a cross-platform filesystem with some interesting properties. The framework took that, plus a servlet container (Tomcat 5.0.28), and modified the servlet container to look for its webapps via the VFS rather than via the operating system's filestore and still to fulfil the Servlet Spec. Many mods were made to Tomcat in order to get it to do this, but those mods are not relevant to ASK because we are not expecting the servlet container to use the VFS to find its own files. Instead, we are expecting some servlets or Web services to make explicit calls to the VFS. This makes life much easier, as we're not expecting an unmodified servlet running in the VFS's servlet container to be able to obtain files from the VFS. It's a conventional integration problem instead, and we end up with Spec-compliant servlets or Web services that will run in any container.

Workpackages

The following testing or improvements are known or suspected as necessary:

Thread-safety tests

PC is already funded by B3 to provide an extra half day to finish up some complex multi-user and thread-safety tests on the VFS. This is not an ASK workpackage, but it is noted here for completeness.

Speed improvement 1: Cache permissions

At present, the VFS makes no attempt to cache the metadata that it associates with each file and directory, such as permissions and access times. This slows down access to the files, as the VFS must re-open the metadata file and re-read the contained XML for each user access.

There are three approaches to dealing with this:

  1. Ignore it and accept a slower system.
  2. Move the metadata into an XML database such as eXist, removing the parallel hierarchy of .metadata files on the disk. This would remove the bug that deletes can fail under Windows and would allow fast VFS-wide searches of e.g. permissions and change times, but it would not of itself speed up the VFS and may even make access slower, depending on the relative speed of queries to eXist vs file reads. It would also require very careful control of access to the eXist database, as the database would now contain the permissions of the stored files.
  3. Add an in-memory cache of (at least) permissions and change dates to the current VFS code. This is somewhat complex as the VFS is multi-threaded, and the cache would need to be kept coherent in the face of changes to the file metadata on any thread.
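
As a rough illustration of the third option, a read-through cache could sit in front of the metadata reads. This is a minimal sketch under assumptions: MetadataCache is a hypothetical name, the loader stands in for whatever code currently parses the .metadata XML, and coherence relies on every writer calling invalidate after it has finished writing.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical read-through cache for per-file metadata; not part of vfs.jar. */
final class MetadataCache<M> {
    private final ConcurrentHashMap<String, M> entries = new ConcurrentHashMap<>();
    private final Function<String, M> loader; // assumed: parses the .metadata XML for a path

    MetadataCache(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns cached metadata, loading it at most once per path until invalidated. */
    M get(String path) {
        return entries.computeIfAbsent(path, loader);
    }

    /** Must be called by any thread after it has written new metadata for 'path'. */
    void invalidate(String path) {
        entries.remove(path);
    }

    /** Drops everything, e.g. after files were changed outside the VFS. */
    void clear() {
        entries.clear();
    }
}
```

The hard part of the workpackage is not this class but finding every place in the VFS that changes metadata and making sure it invalidates the entry. A missed call would leave stale permissions in memory, which is a worse failure than slow access.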

MvH has suggested that PC spend 5 days seeing what can be done on the final option. This may or may not be the correct size for this workpackage; it is undoubtedly too small if we also want to move permissions into an XML database.

HN - possible overlap here with work already under way with eXist on the JAFER project. Need to get Adrian, Peter and Matthew Dovey in conversation.

Speed improvement 2: Hashed file lookup

We do not yet know how the repository will be used. In particular, we do not know whether there will be any hierarchical organisation that corresponds to a hierarchy of directory structures on disk. As a result, we may end up storing a very large number of files in one (conceptual or physical) directory. Operating system storage structures and APIs are getting better at dealing with tens of thousands of files per directory, but this can cause a significant slowdown on older filesystem types such as FAT32.

If (and only if) there is the potential for very many files to be stored in one conceptual directory, we may wish to investigate breaking up those files for storage on the OS filestore.
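
One way of breaking up the files, shown here only as a sketch, is to hash each file name and use the first two bytes of the digest as two levels of bucket directories. HashedLayout and physicalPath are hypothetical names, the choice of SHA-1 and of two levels is an assumption, and fileName is assumed to be a plain name without path separators.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical mapping from one large conceptual directory to hashed buckets on disk. */
final class HashedLayout {
    private HashedLayout() {}

    /** Maps a file name to root/xx/yy/fileName, where xx and yy are hex digest bytes. */
    static Path physicalPath(Path root, String fileName) {
        byte[] digest = sha1(fileName);
        String level1 = String.format("%02x", digest[0] & 0xff);
        String level2 = String.format("%02x", digest[1] & 0xff);
        return root.resolve(level1).resolve(level2).resolve(fileName);
    }

    private static byte[] sha1(String text) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must supply.
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }
}
```

Two levels of 256 buckets give 65,536 physical directories, so a million files in one conceptual directory would average about 15 per physical directory. The cost is that listing the conceptual directory means walking every bucket, so this only pays off if lookups by name dominate.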

Quotas

At present, the VFS applies no limits to the amount that can be stored by one user. As the VFS runs in a JVM, the operating system identifies it as a single user, so OS limits cannot be used.

We may wish to provide an API to set per-user storage limits. This may also interact with versioning (see below) - notably, should prior stored versions count against a user's quota? If so, a naive user who edits frequently may run out of quota unexpectedly. If not, a user may store an arbitrary amount on the system simply by uploading (say) their entire music collection as 5,000 successive MP3s with the same name.
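
The two policies can be compared with a small accounting sketch. QuotaTracker and its methods are hypothetical names, not an existing VFS API; in practice something like it might be driven from the VFS's observer interface, though that wiring is not shown and is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-user quota accounting; not an existing VFS API. */
final class QuotaTracker {
    private final Map<String, Long> limits = new HashMap<>();
    private final Map<String, Long> used = new HashMap<>();
    private final boolean chargePriorVersions;

    /** @param chargePriorVersions true if replaced versions keep counting against quota */
    QuotaTracker(boolean chargePriorVersions) {
        this.chargePriorVersions = chargePriorVersions;
    }

    synchronized void setLimit(String user, long maxBytes) {
        limits.put(user, maxBytes);
    }

    synchronized long usedBy(String user) {
        return used.getOrDefault(user, 0L);
    }

    /**
     * Records a store of newBytes that replaces an existing version of replacedBytes
     * (0 for a new file). Returns false, changing nothing, if the limit would be exceeded.
     */
    synchronized boolean tryStore(String user, long newBytes, long replacedBytes) {
        long released = chargePriorVersions ? 0L : replacedBytes;
        long after = usedBy(user) - released + newBytes;
        Long limit = limits.get(user);
        if (limit != null && after > limit) {
            return false;
        }
        used.put(user, after);
        return true;
    }
}
```

With a 10-byte limit, a user who stores a 6-byte file and then saves a 6-byte edit over it succeeds under the first policy (usage stays at 6) but is refused under the second (usage would reach 12). Users with no limit set are never refused in this sketch.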

Role hierarchies

The VFS is agnostic about user and role names and memberships; it does not itself maintain a list of roles or role memberships. At present, the code in vfs.jar assumes that all roles are distinct and that it has been passed in a transitively-closed list of the user's role memberships. For example, if Peter is a member of the 'Directors' role, and the 'Directors' role is a member of the 'Top Brass' role, then the VFS assumes that it will be passed credentials of the form <Peter, {Directors, Top Brass}>.

Firstly, we may (or may not) wish to revise this behaviour so that the VFS can be passed an initial list of role memberships and can expand them as necessary.
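
Expanding an initial list into its transitive closure is a small graph walk. The sketch below assumes the VFS could be handed a map from each role to the roles it is directly a member of; RoleExpander and the memberOf map are hypothetical, since the VFS does not currently hold this information.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical expansion of direct role memberships into a transitively-closed set. */
final class RoleExpander {
    private RoleExpander() {}

    /**
     * @param direct   roles the user is directly a member of
     * @param memberOf for each role, the roles it is itself directly a member of
     * @return every role the user holds, directly or through containment
     */
    static Set<String> expand(Set<String> direct, Map<String, Set<String>> memberOf) {
        Set<String> closed = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>(direct);
        while (!pending.isEmpty()) {
            String role = pending.pop();
            // add() returns false for roles already seen, so cycles terminate.
            if (closed.add(role)) {
                pending.addAll(memberOf.getOrDefault(role, Collections.emptySet()));
            }
        }
        return closed;
    }
}
```

Given Peter's direct membership {Directors} and a map saying Directors is a member of Top Brass, this returns {Directors, Top Brass}, which is the credential form the VFS expects today.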

Secondly (and note this is from memory - Mark, do you have more complete notes?), as part of a Group service rather than as part of the VFS, we may wish to organise roles hierarchically. I (Peter) have to confess that I can't remember why; on thinking this through, it seems that the existing containment relationships between roles should be sufficient.

Version control

Version control opens up a whole new can of worms, notably:

  • Whose model for version control do we follow? If the content management system uses WebDAV, we should probably try to follow Delta-V.
  • Do we allow branches?
  • Is the user metadata versioned along with the file?
  • Is the file metadata versioned along with the file? For example, if Alice could read the file in version 1, but cannot now read the file in version 2, can she still read version 1 of the file?

We could choose to replace the filestore with a version control system, such as Subversion or CVS. However, these are not written in Java. Open-source networked Java systems include SourceJammer (GPL/LGPL, so we may have license issues), Stellation (CPL) and Superversion. Of the file-based systems, CBE (Code Building Environment) appears to be the only one. All would require moderate changes to the VFS - there are relatively few places where the VFS actually obtains the bytes of the file itself, and (if the file metadata is not versioned) those are the main places we would need to amend. We'd also need to work on operations like directory creation.

An alternative is to roll our own. We already have per-file metadata; it would be relatively simple to extend that file metadata to include a list of revisions (again, the file metadata itself would not be versioned). Revisions could be stored as separate files, which provides simplicity and rapid access to their contents at the expense of disk space. However, disk is cheap.
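
To give a feel for the size of the roll-our-own option, the revision list might look something like the sketch below. Everything here is an assumption: RevisionList and Revision are hypothetical names, the '.rN' naming of stored revision files is invented for illustration, and persisting the list into the file's XML metadata is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical revision list that could be held in a file's (unversioned) metadata. */
final class RevisionList {

    /** One stored revision; its bytes live in a separate file named storedName. */
    static final class Revision {
        final int number;
        final String storedName;
        final String author;

        Revision(int number, String storedName, String author) {
            this.number = number;
            this.storedName = storedName;
            this.author = author;
        }
    }

    private final String fileName;
    private final List<Revision> revisions = new ArrayList<>();

    RevisionList(String fileName) {
        this.fileName = fileName;
    }

    /** Registers a new revision and returns where its bytes should be written. */
    synchronized Revision add(String author) {
        int number = revisions.size() + 1;
        Revision revision = new Revision(number, fileName + ".r" + number, author);
        revisions.add(revision);
        return revision;
    }

    /** Returns the newest revision, or null if nothing has been stored yet. */
    synchronized Revision latest() {
        return revisions.isEmpty() ? null : revisions.get(revisions.size() - 1);
    }

    /** Returns a read-only snapshot of all revisions, oldest first. */
    synchronized List<Revision> history() {
        return Collections.unmodifiableList(new ArrayList<>(revisions));
    }
}
```

Because each revision is a whole file, reading any version is a plain file read, which is the simplicity-for-disk-space trade described above. Branches would not fit this linear list and would need a different structure.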

Still another (heretical!) alternative is to ditch the VFS entirely and go with the repository in Jakarta Slide.

Turning trees into graphs

TODO: RESOLVEME: Depending on the assumptions in Collections, aggregations and resources, we may need to expend considerable effort in allowing the VFS or some combination of VFS + database to use a directed graph for collection and aggregation membership rather than a tree.

Transaction support

Jakarta Slide presupposes transaction support in its stores. If the VFS is to be turned into such a store, we should add transaction support (or risk some mayhem on rolling back a WebDAV transaction).