by Rayan Zachariassen
Caching to improve performance is prevalent at nearly every level of computing and networking: CPUs, disks, memory, web and DNS servers, databases, and just about everything else have built-in caching schemes. Application-specific caches are context-aware and cache in ways that best suit how the application operates. Databases cache tables, rows, and/or query results. Web servers cache hot pages and transactional results. Context awareness allows for more efficiency, thus increasing performance.
Faster and larger storage devices like PCIe Flash cards and SSDs have opened new, promising areas for caching. Thanks to flash, caching backend storage is not only possible – it’s practical. Local high-performance flash media caching often meets the challenges of storage I/O performance and latency.
There are both block and file-based approaches to caching. Each has strengths and weaknesses, and each is evolving as new developments make their mark on the industry.
Block-based caches run at a lower level in the operating system and work directly with block storage devices, while file-based caches operate in the file stack of the operating system at a higher level. Block-based caches operate beneath the file-system level, and cache data in the same format as it is stored on disk. Block-based caching can also be implemented in hardware and in that case is likely to be operating system-agnostic. Although much data is in the form of files, block-based systems can be useful for raw or unstructured data.
File-based caching is implemented in the file system stack, layered on top of the underlying block structure of the disk. File caching is tightly integrated with the operating system and – depending on the application – often meets or exceeds the raw performance of a block-based solution.
About Block-Based Caching
Block caching can be very fast because all of the caching happens in the lowest levels of the operating system kernel. All that is needed is to intercept the disk accesses and decide if they are going to be cached or not, or served from cache or backend storage. This interception can take place just before the device drivers in the kernel. While the caching system itself can be operating system-independent, device drivers would still be needed for each supported operating system. Operating systems that have fast block access drivers such as Linux and other Unix derivatives are able to drive good performance from a block caching system.
To define what data is to be managed by the cache, you target specific Logical Unit Numbers, or LUNs (effectively the disk volumes themselves), and any active blocks in those LUNs are cached. The advantage of this is relatively light management; in other words, you need only define the LUNs to be cached. Ironically, the disadvantage is also relatively light management, because no context is applied to the data: you cannot prioritize data except through a tedious process of centralizing or moving critical and application-specific data to specific volumes.
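The interception and LUN selection described above can be sketched in user space as a read-through cache. This is only an illustration under stated assumptions, not how any particular product works: a real block cache sits in the kernel, and the `BlockCache` class, the `backend_read` callable, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class BlockCache:
    """Read-through cache keyed by (lun, block number), with LRU eviction.

    `backend_read` is a hypothetical callable (lun, block) -> bytes that
    stands in for the slower backend storage.
    """

    def __init__(self, backend_read, capacity_blocks, cached_luns):
        self.backend_read = backend_read
        self.capacity = capacity_blocks
        self.cached_luns = set(cached_luns)  # only these LUNs are cached
        self.blocks = OrderedDict()          # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, lun, block):
        # Accesses to LUNs that were not selected bypass the cache entirely.
        if lun not in self.cached_luns:
            return self.backend_read(lun, block)

        key = (lun, block)
        if key in self.blocks:
            # Served from cache; mark the block as most recently used.
            self.blocks.move_to_end(key)
            self.hits += 1
            return self.blocks[key]

        # Miss: fetch from backend storage and populate the cache.
        self.misses += 1
        data = self.backend_read(lun, block)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data
```

Note that the only policy input is the set of LUNs. Nothing in the key says which file or application a block belongs to, which is the lack of context discussed above.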
Some solutions include a ‘read near’ mode (sometimes incorrectly referred to as ‘read around’ mode), which brings up sequentially located blocks believed to be needed in the near future. The benefit of read-near should initially outweigh any inefficiencies with bringing up potentially inactive or unneeded data, but as disks become fragmented, the system reads unrelated files and/or free space, bumps needed data out of the cache, and degrades utilization. However, there is ongoing research dedicated to improving predictive caching, including better algorithms. A Google search for the phrase ‘block cache schemes’ returns a wide range of hits from university dissertations and journal articles.
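As a rough illustration of read-near, the sketch below picks which blocks to fetch on a miss and measures how much of a prefetch was wasted. The function names, the `lookahead` parameter, and the waste metric are assumptions made for this example, not taken from any specific product.

```python
def read_near_blocks(block, lookahead, volume_blocks):
    """Blocks to fetch on a miss: the requested block plus up to
    `lookahead` sequentially following blocks, clipped to the volume."""
    if not 0 <= block < volume_blocks:
        raise ValueError("block outside volume")
    end = min(block + 1 + lookahead, volume_blocks)
    return list(range(block, end))


def wasted_fraction(prefetched, later_accessed):
    """Fraction of prefetched blocks that were never accessed afterwards."""
    if not prefetched:
        return 0.0
    unused = set(prefetched) - set(later_accessed)
    return len(unused) / len(set(prefetched))
```

On a fragmented disk the blocks that follow the requested one are less likely to belong to the same file, so the wasted fraction would be expected to rise while the prefetched blocks still occupy cache space.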
Block-based caching complements block-based devices such as direct-attached disk or SAN. However, it cannot be used directly with non-block devices such as NAS or file-based cloud storage, because remote file services are not accessed through a block device.
About File-Based Caching
File-based caching is exactly what it sounds like: caching that uses information about files and their contained data instead of the constituent blocks of a disk drive. The cache solution must integrate at a higher level in the operating system kernel and permits more context to be understood as well as more intelligence in management and customization.
File-based caching’s reputation initially suffered from the assumption that it involves caching files in their entirety. It is often wasteful to cache extremely large files, considering only a small part of those files might ultimately be accessed. In fact, file-based caching systems are able to cache only active portions of each file, and repopulate the file as parts are accessed. This resembles a block caching scheme, except that it also retains the advantage of the file context when making caching decisions.
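One way to picture caching only the active portions of a file is as a per-file list of cached byte ranges that grows as parts are accessed. The sketch below is a simplified assumption about the bookkeeping involved; `add_range` and `is_cached` are hypothetical helper names, and ranges are half-open `(start, end)` byte offsets.

```python
def add_range(ranges, start, end):
    """Add the half-open byte range [start, end) to a list of disjoint
    cached ranges, merging any ranges that overlap or touch it."""
    merged = []
    for s, e in ranges:
        if e < start or s > end:
            merged.append((s, e))  # entirely before or after: keep as-is
        else:
            # Overlapping or adjacent: absorb into the new range.
            start, end = min(s, start), max(e, end)
    merged.append((start, end))
    merged.sort()
    return merged


def is_cached(ranges, start, end):
    """True if [start, end) lies wholly inside one cached range."""
    return any(s <= start and end <= e for s, e in ranges)
```

A read that is not fully covered would go to backend storage, and the range it touched would then be added, so a large file is populated piece by piece rather than all at once.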
Intelligence in caching completely sets apart file-based caching from block caching. File-based caching can be configured to only cache files belonging to specific applications or needs. For example, file-based caching systems can adhere to rules such as: “If the file is less than one megabyte, cache the whole file, otherwise cache the amount being read, plus an additional megabyte up to the end of the file.”
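The example rule quoted above translates almost directly into code. This is a minimal sketch: the function name `range_to_cache` is hypothetical, one megabyte is assumed to mean 1,048,576 bytes, and the read is assumed to lie within the file.

```python
MIB = 1024 * 1024  # assumption: "one megabyte" means 2**20 bytes


def range_to_cache(file_size, read_offset, read_length):
    """Return the half-open byte range (start, end) to cache for a read.

    Files under one megabyte are cached whole. For larger files, cache
    the bytes being read plus one more megabyte, stopping at end of file.
    """
    if file_size < MIB:
        return (0, file_size)
    end = min(read_offset + read_length + MIB, file_size)
    return (read_offset, end)
```

For instance, a 4 KB read at offset 1 MiB of a 10 MiB file would cache the range from 1 MiB to 2 MiB plus 4 KB, while the same read near the end of the file is clipped at the file size.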
In addition, file-based caching can identify and act upon key parameters, including what user and/or process is accessing the file, which lends another level of control over cache operations – like “always cache finance department files at month-end.” By better accommodating high-value applications or business processes, this application and data context awareness can provide significantly better performance.
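A rule such as "always cache finance department files at month-end" might be expressed as a small policy function. Everything specific here is an assumption made for illustration: the function names, the idea that a file carries a department label, and the definition of month-end as the last three days of the calendar month.

```python
import calendar
from datetime import date


def is_month_end(day, window_days=3):
    """True if `day` falls within the last `window_days` days of its month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day > last_day - window_days


def should_always_cache(department, day):
    """Hypothetical policy: pin finance files in cache at month-end."""
    return department == "finance" and is_month_end(day)
```

A block cache could not apply a rule like this, because it sees only LUNs and block numbers, not who owns the data or which process is asking for it.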
File-based caching requires tighter integration with the operating system itself, and porting requires more effort than a block-based system. It also requires that the data be stored in a file system. Most block storage devices are used for file storage, so this is usually not an issue. However, there are cases where file systems are not used, such as some database deployments or equipment systems that store data on raw partitions. These would need to be moved to a file system to be supported by the cache.
The converse case is data access with no underlying block structure, for example data accessed over the network from a NAS or cloud storage solution. These types of storage systems present data as files, and are excellent candidates for a file-based caching system with little or no change.
In summary, block-based caching systems are fast but low-level, context-starved caching schemes. File-based systems can provide context-aware and application-specific caching to increase performance. After carefully evaluating both methods, we found that a file-based approach provided clear and necessary customer benefits. These include caching only active portions of files to address any file-based performance concerns, selectively optimizing the cache for specific applications and data, integrating with the Windows Server memory cache, and supporting non-block data sources such as NAS or cloud storage systems.
Rayan Zachariassen is the Chief Technology Officer at NEVEX (Toronto, Ontario, Canada). www.nevex.com

