The Internet has evolved over the past couple of decades from serving simple static HTML pages to delivering a rich collection of formats. The New Media consists of video, audio, images, scripts, and more, all combined to provide an engaging user experience. At the same time, the total amount of content and the aggregate bandwidth delivered to users continue to grow at a tremendous pace. Cache Storage at the Origin data center and at various points in the delivery network has become a critical element of the ecosystem, both to scale these services cost-effectively and to provide a good user experience. In this article, I examine some key aspects of optimizing the data storage architecture to meet the challenges and growing demand in this domain.
A key aspect of New Media content is that it has a dramatic and dynamic popularity curve. A small fraction is “hot content” that accounts for a significant portion of the delivery bandwidth. At the other end of the spectrum, the bulk of the content is a “long tail” that is accessed infrequently. The “mid-tail” content makes up the rest, both in terms of size and popularity. Cache Storage can exploit this trend in two important ways:
- Network hierarchy. Data storage close to the edge should serve the “hot content”. It needs to be optimized for high delivery rates and a relatively small cache capacity. On the other hand, data storage close to the origin must have a large capacity to serve part of the “long-tail” content. The “mid-tail” content can be served from regional caches in the network, which need a good blend of capacity and delivery rates. A good end-to-end solution needs flexibility in the storage organization and delivery bandwidth of the Cache nodes.
- Device hierarchy. Data storage on the cache nodes can be composed of devices that vary in capacity, read/write performance, cost, etc. This typically includes RAM, HDDs of various types (SATA, SAS, etc.) and solid-state devices (SSDs, PCI cards, etc.). In order to exploit the popularity curve of the content, the Cache must treat the storage as a seamless hierarchy. Cached objects should migrate among the devices to offer optimal delivery performance and capacity utilization.
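To make the device hierarchy concrete, here is a minimal sketch of popularity-driven placement. It is an illustration, not a description of any particular product: the `Tier` class, the `place_by_popularity` function, the tier names, and all sizes and hit counts are hypothetical, and the greedy policy is just one simple choice.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One storage tier on a cache node (hypothetical model)."""
    name: str
    capacity_bytes: int
    used_bytes: int = 0


def place_by_popularity(objects, tiers):
    """Greedy placement: hottest objects get first pick of the fastest tier.

    objects: iterable of (key, size_bytes, hit_count)
    tiers:   list of Tier, ordered fastest first
    Returns {key: tier_name}; objects that fit nowhere are left uncached.
    """
    placement = {}
    for key, size, _hits in sorted(objects, key=lambda o: o[2], reverse=True):
        for tier in tiers:
            if tier.used_bytes + size <= tier.capacity_bytes:
                tier.used_bytes += size
                placement[key] = tier.name
                break
    return placement


# Toy numbers, chosen only to show objects spilling down the hierarchy.
tiers = [Tier("ram", 100), Tier("ssd", 200), Tier("hdd", 1000)]
objects = [("a", 80, 1000), ("b", 150, 500), ("c", 50, 400), ("d", 600, 3)]
print(place_by_popularity(objects, tiers))
```

Re-running the placement as hit counts change is one simple way to model objects migrating between devices; a production cache would do this incrementally rather than from scratch.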
Storage bandwidth typically becomes the bottleneck for Caching solutions. This is especially the case for the New Media workload. Large video objects (ranging from megabytes to several gigabytes) and huge numbers of small objects such as images (up to hundreds of millions for large sites) demand high storage capacity. In addition to the read throughput, Live video feeds can also stress the storage write performance, especially since they need to be encoded to different formats and bit-rates. Device-specific optimizations are critical to maximize the performance potential of each device. Some examples of this are:
- HDD. General-purpose filesystems are not able to provide sufficient and predictable performance from HDDs. They suffer from poor performance due to small random IOs, a problem that becomes worse over time due to fragmentation. Furthermore, an application cannot control the layout of data to optimize the access performance. A SATA drive will only provide around 5 to 10 MB/sec when operated in this fashion. However, with a purpose-built file system designed around the specific characteristics of the workload, it is possible to consistently achieve around 50 MB/sec for the same device.
- Solid State Storage. The same performance issue applies to SSDs, although to a lesser extent due to their fundamental ability to provide fast random reads. Another equally significant issue is the lifetime of these devices. SSDs have limited erase/write cycles. Limiting the number of writes and aligning them correctly can vastly improve the life of the device.
- RAM. Caching solutions often rely on a general-purpose OS to cache file content in RAM. The use of small fixed-size pages results in high processing overhead and poor utilization of RAM. The cache management algorithms are often simplistic (e.g. LRU) and unable to exploit application knowledge. A purpose-built caching engine can address these issues.
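The gap between the two HDD figures above can be sanity-checked with a back-of-the-envelope model in which every IO pays one positioning delay (seek plus rotation) before transferring data at the sequential rate. The 10 ms positioning time, the 100 MB/sec sequential rate, and the two IO sizes are my assumptions for illustration, not measurements from any specific drive.

```python
def effective_throughput(io_bytes, positioning_s=0.010, sequential_bps=100e6):
    """Bytes/sec delivered when each IO costs one positioning delay plus transfer.

    positioning_s and sequential_bps are assumed, illustrative drive parameters.
    """
    transfer_s = io_bytes / sequential_bps
    return io_bytes / (positioning_s + transfer_s)


small = effective_throughput(64 * 1024)    # small random reads
large = effective_throughput(1024 * 1024)  # large, contiguous reads
print(f"64 KiB IOs: {small / 1e6:.1f} MB/sec")
print(f"1 MiB IOs:  {large / 1e6:.1f} MB/sec")
```

Under these assumptions the small-IO case lands at roughly 6 MB/sec and the large-IO case at roughly 50 MB/sec, which is why controlling data layout and IO size matters so much more than the raw speed of the drive.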
A Cache will need to use multiple device units to scale both capacity and throughput. A typical approach has been to use RAID, since it provides a single-device abstraction. However, this can result in a suboptimal solution for this application. Any form of striping further reduces the IO size seen by each device. Redundancy at the RAID level does not exploit any application knowledge about the importance of each object. A solution that virtualizes the total cache while treating each device as a separate cache unit can provide linear scaling across devices as well as better utilization. Furthermore, the failure of a single device does not impact the performance of the other devices, thereby providing better fault containment. The failed device can be replaced and put back into service with no service disruption.
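One way to treat each device as a separate cache unit while still presenting a single cache is to map every object to exactly one device with a hash. The sketch below uses rendezvous (highest-random-weight) hashing; this is my choice of technique for illustration and not necessarily what any given product does. The `pick_device` function, the device names, and the object keys are hypothetical.

```python
import hashlib


def pick_device(key, devices):
    """Rendezvous hashing: score every healthy device for this key, take the top."""
    def score(device):
        digest = hashlib.sha256(f"{device}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(devices, key=score)


devices = ["disk0", "disk1", "disk2", "disk3"]
keys = [f"video-{i}.mp4" for i in range(1000)]

before = {k: pick_device(k, devices) for k in keys}

# Simulate a failure of disk2: only the healthy devices remain candidates.
healthy = [d for d in devices if d != "disk2"]
after = {k: pick_device(k, healthy) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} objects remapped after losing disk2")
```

Because each object lives whole on one device, no IO is split by striping, and only the objects that were on the failed device are remapped; everything on the surviving devices stays where it was.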
In conclusion, New Media delivery poses new challenges for cache storage, from the Origin data center across the delivery network. We can continue to use existing solutions by throwing more hardware at the problem. However, improvements in the storage architecture can dramatically reduce cost and improve user experience. It is important to explore innovative solutions that incorporate such techniques.
Jaspal Kohli is the chief architect and co-founder at Ankeena Networks.

