Posted: 09/05/2017
At the Joint Geospatial Standards Working Group (JGSWG) TMP meeting in May there was some discussion about the problem of inserting additional metadata into different file formats like Controlled Image Base (CIB) data. This metadata is increasingly seen as necessary to help the geo community identify, track, process, serve and manage the large amount of various geo data we have to deal with on a day-to-day basis.

CIB data actually consists of many files making a data set which has a specific folder structure and naming convention that must be adhered to in order for the data to work correctly with reading (or editing) software. Other common examples of data types consisting of a complex file structure and naming convention include Digital Terrain Elevation Data (DTED), Esri Shapefile and File Geodatabase for vector data.

Creating and adding metadata files into the folder structure of such data sets is unsupported by regular software and could cause incompatibility problems. Furthermore, the additional files may be overwritten or lost when the software copies or makes a change to the data set. Every file format is different and could pose a different set of problems when trying to find a way of adding additional metadata files to the data. (For these reasons I am sceptical of this approach.)

Rather than creating additional files and adding these, should we consider other methods to achieve the same end? For example, the additional information could be inserted into the file system itself, or the file system could be used to make it appear as though the information is attached to the file. One way to do this is to attach additional files to the data set in a transparent way, e.g. using Alternate Data Streams (ADS) in the NTFS file system. Or for Linux computers we might use the Filesystem in Userspace (FUSE) to achieve this independently from the underlying file system.
These methods would allow any amount of complex information to be added without affecting the editing software at all, but they are specific to the file system and may not be transferable or work well when copying or moving data between file systems.

Another approach is to keep the additional information completely separate. This can be achieved by using a unique identifier associated with each file in the data set and keeping all the information related to the management of the data elsewhere. Preferably this would be achieved by using an identifier that is already incorporated as part of the existing file format. Alternatively, the file system's own file identifier could be used (but this would fail as soon as the file is copied from one file system to another, because a new file identifier would be created).

The advantage of this approach is that the data set remains unaltered when the metadata changes. It also allows the data management function to be developed independently, which could then be less dependent on particular software and file formats. But it is not obvious whether each relevant file format even has a suitable unique identifier. In some cases one may have to be created in such a way that it does not break compatibility.

To achieve this I think there has to be a focus on three things:

1. standardising a way of inserting / extracting a unique identifier into every relevant file format in a compatible way,
2. identifying what is the appropriate metadata (the "management metadata") required at each stage of the management process e.g. to identify, track, process, serve... the data,
3. developing a mechanism for relating the relevant metadata to the relevant unique identifier(s) to enable the whole thing to work.

I don't know if anyone has any experience in this subject, but I would appreciate feedback on this topic from people who have experience working on these kinds of problems.

Robert Nowak
NATO Joint Warfare Centre
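
As a rough illustration of points 1 and 3 above, here is a minimal Python sketch of keeping management metadata in a separate store keyed by a per-file identifier. Everything named in it is an assumption made for illustration: `content_identifier`, `MetadataRegistry` and the table layout are hypothetical, and the SHA-256 content hash is only a stand-in for a real identifier embedded in the file format. A content hash survives copying between file systems but changes as soon as the data is edited, so it would not be a complete answer on its own.

```python
import hashlib
import json
import shutil
import sqlite3
import tempfile
from pathlib import Path


def content_identifier(path: Path) -> str:
    """Stand-in identifier: SHA-256 of the file's bytes.

    Assumption for illustration only; this is not an identifier
    defined by CIB, DTED or any other format.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class MetadataRegistry:
    """Hypothetical store that keeps management metadata outside the data set."""

    def __init__(self, database: str = ":memory:") -> None:
        self._db = sqlite3.connect(database)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS management_metadata ("
            "identifier TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
        )

    def put(self, identifier: str, metadata: dict) -> None:
        # Insert or overwrite the record; the data file itself is never touched.
        self._db.execute(
            "INSERT OR REPLACE INTO management_metadata VALUES (?, ?)",
            (identifier, json.dumps(metadata)),
        )
        self._db.commit()

    def get(self, identifier: str):
        # Returns the stored metadata, or None if the identifier is unknown.
        row = self._db.execute(
            "SELECT metadata FROM management_metadata WHERE identifier = ?",
            (identifier,),
        ).fetchone()
        return json.loads(row[0]) if row else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / "tile.bin"
        original.write_bytes(b"example raster bytes")
        copied = Path(folder) / "tile_copy.bin"
        shutil.copy(original, copied)

        registry = MetadataRegistry()
        registry.put(content_identifier(original), {"status": "validated"})

        # The copy resolves to the same record because the identifier
        # is derived from the content, not from the file system.
        print(registry.get(content_identifier(copied)))
```

The point of the design is that changing the metadata only touches the registry, so the data set's folder structure and naming convention stay exactly as the reading software expects.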