By Duncan Greatwood
Everybody has seen the dreaded email notification from IT – your email box is over quota. As soon as the notification arrives, you know you need to interrupt your real work and spend time deleting and filing emails.
The reality is that many current email servers do not have enough storage space for their users’ growing needs. This limitation forces employees to manage their mailboxes more often, reduces productivity, and often results in emails being filed away in local systems that are not backed up regularly and that may not be subject to corporate compliance policies. Email is an ever-growing critical resource, and mailbox size should not be driven by product limitations. Osterman Research shows email use growing at 20 percent per year and message stores growing at more than 35 percent per year, driven by pictures, PDFs, PowerPoint presentations, and video clips. The reason most often cited for limited mailbox size is cost: because email storage is expensive, corporate IT must restrict employees’ mailboxes, often to between 250 MB and 500 MB.
As a result, the mailbox quota is the number-one complaint the IT department faces. Employees do not have time to manage their email and do not want to delete messages and attachments that are their primary record of business activity. User complaints are sharpened by the fact that public portals offer several gigabytes of email storage for free.
Today’s most commonly deployed corporate email servers store data using a database architecture that results in inefficient read and write operations – driving a need for expensive, high-performance disk systems whenever mailbox size grows beyond a minimal level. The difficulty of backing up and restoring these proprietary databases also gives IT a powerful incentive to limit mailbox size.
Breaking the Upward Storage-Cost Spiral
To break the upward storage-cost spiral, new and more open message storage capabilities are based on modern Linux file systems that overcome the inefficiencies of legacy databases, simplify backup and restore, and reduce system-administration complexity. File-based storage lets these email systems scale cost effectively, maximizing user productivity while minimizing IT costs.
To take full advantage of a file-system approach, an email server needs to be designed from the ground up for storage efficiency. For example, optimization can mean that the server writes data directly into the filing system using a one-file-per-message approach. This approach allows visibility into disk operations and careful optimization of system performance.
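The one-file-per-message idea can be illustrated with a minimal sketch. It is not any particular server's implementation: the directory layout (store/user/folder), the `.eml` extension, and the function names `deliver` and `list_folder` are assumptions made only for this example.

```python
from pathlib import Path


def deliver(store: Path, user: str, folder: str, msg_id: str, raw: bytes) -> Path:
    """Store one message as exactly one file under store/user/folder/.

    The layout and the ".eml" suffix are hypothetical choices for this sketch.
    """
    folder_dir = store / user / folder
    folder_dir.mkdir(parents=True, exist_ok=True)
    path = folder_dir / f"{msg_id}.eml"
    path.write_bytes(raw)  # the whole message is written in a single call
    return path


def list_folder(store: Path, user: str, folder: str) -> list:
    """List a folder's messages with a plain directory scan (no database)."""
    folder_dir = store / user / folder
    return sorted(p.name for p in folder_dir.iterdir() if p.is_file())
```

Because each message is an ordinary file, standard tools can inspect it: `list_folder(store, "alice", "inbox")` is just a sorted directory listing, and each disk operation maps to one visible file.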
Attaining Bottomless Mailboxes
At a technical level, attaining bottomless mailboxes means the server must be designed to minimize disk reads, disk writes, and especially disk-head seeks. There are a number of optimizations to consider:
- One-shot read and write – Message files are read in a single shot and written in a single shot, while streaming files are read and written in large blocks.
- Fragmentation avoidance – Message files are limited to a size that takes advantage of the filing-system’s fragmentation-avoidance strategies.
- File-length and file-name information encoding – Certain information is encoded in the low bits of the file length (by rounding file size appropriately) and in the filename to optimize the most common folder scans.
- Optimized content catalogs – Folder catalogs are optimized to reflect the most commonly retrieved content data, usually avoiding the need to read individual message files.
- Simplified indexes – Rather than using classic database-style indexes, which can be expensive to maintain and update, the email server should use indexes that look more like simple lists of items to speed retrieval and enable more indexes to be maintained in memory at a given time. With message data in any case present in message files on the disk, a lazy update strategy can also be adopted without loss of consistency, minimizing index writes.
- Request throttling and elevator filling – Allowing complex disk requests to run in parallel can often slow the system down, because the disk heads spend time seeking back and forth to service fragments of each request. Reducing the number of requests allowed to reach the disk controller corrects this, so that all the requests (even the ones held off) complete sooner. Conversely, in some situations many simple requests should be let through simultaneously, allowing the disk controller to service them with only short disk-head seeks between requests. With many requests outstanding, the disk head operates an elevator algorithm, sweeping from one side of the platter to the other and servicing all requests as it goes. Because there are many requests, the distance between requests serviced consecutively on the platter is smaller and disk-head seeks are correspondingly faster. With care, the server can switch between request throttling and request parallelizing, varying its behavior according to the nature of the load at any given time.
- Linking / single-instance behavior – Single-instance behavior is particularly important where the email store is concerned, since a great deal of storage space can be quickly consumed by storing multiple copies of voice-mail recordings, graphic files, presentations, video clips, and other large attachments. Rather than copying files, the email server should provide links to them wherever possible, including across multiple stores. A link generally saves at least one seek and one write compared to copying, and can save many writes depending on the size of the file. Of course, linking will also have a positive impact on throughput as well as on disk usage.
- Continuous consistency – To take full advantage of file system capabilities, the email server should write its data in such a way as to ensure that the store contents are always consistent. This hugely simplifies data management operations, including backup, off-site replication, and server restart following a power cut by avoiding the requirement that legacy databases be forced into a consistent state before access.
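Three of the optimizations above (one-shot writes with continuous consistency, information encoded in the file length, and linking) can be sketched together in a few lines. This is an illustration under stated assumptions, not a description of any product: the two-bit flag width, the newline padding scheme, the `.tmp` suffix, and all function names are hypothetical. Hard links also work only within a single file system, so linking across stores on different volumes would need another mechanism.

```python
import os
from pathlib import Path

FLAG_BITS = 2                      # hypothetical: two flag bits live in the length
FLAG_MASK = (1 << FLAG_BITS) - 1


def pad_for_flags(raw: bytes, flags: int) -> bytes:
    """Append newline padding so that len(result) & FLAG_MASK == flags."""
    pad = (flags - len(raw)) & FLAG_MASK
    return raw + b"\n" * pad


def write_message(path: Path, raw: bytes, flags: int = 0) -> None:
    """Write the whole message once, then rename it into place atomically.

    Readers see either no file or the complete file, never a partial one.
    A production store would also fsync the containing directory.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(pad_for_flags(raw, flags))
        f.flush()
        os.fsync(f.fileno())       # force the data to disk before the rename
    os.replace(tmp, path)          # atomic rename on POSIX file systems


def read_flags(path: Path) -> int:
    """Recover the flags from the file length alone, without opening the file."""
    return os.stat(path).st_size & FLAG_MASK


def link_message(src: Path, dst: Path) -> None:
    """Single-instance delivery: add a hard link instead of copying the data."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    os.link(src, dst)
```

A folder scan can then call `read_flags` on each entry using only metadata, and delivering one large attachment to many recipients costs one `link_message` call per recipient rather than one full copy.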
Modern Linux filing systems such as XFS and ext3 are capable of leveraging these optimizations and, in addition, are fast, flexible, reliable, and efficient. These file systems also support such features as journaling, to ensure hierarchy integrity following a power cut, clustering and replication (for instance via DRBD), and snapshots (for instance via LVM). Of course, many cost-effective commercially packaged storage systems can also provide similar features.
The key question, of course, is “what does file system optimization buy you when it comes to storing email?” As described above, leveraging a file-based email store offers significant performance improvements and cost savings – driving down the cost of building and owning the email infrastructure.
The Benefits of a File-based Email Store
If the goal of the IT department is to reduce costs, increase performance, and give each user a significantly larger or even bottomless email store, a file system fulfills that goal, making email server storage easy to manage and maintain. In addition to the already-described advantages, this approach addresses a variety of other critical issues of email storage:
- Backup operations – Backing up a file-based store is simple, live (no freeze or snapshot step is required), incremental, and detailed down to the message (file) level. This makes backing up the mail store as simple as backing up a file server. File-server backup also allows incremental backups (backing up just the messages that have changed since the previous day) with industry-standard backup tools. And because backup time is reduced, administrators can make mailboxes significantly larger.
- Restoration – Backup records let enterprises easily restore records that are accidentally lost or deleted, or that are required for compliance or other regulatory purposes. The file system’s “one file per message” architecture simplifies restoration because it has no database-synchronization issues. This allows a detailed restoration; IT can restore a single message by restoring a single file, a folder by restoring a folder, a user by restoring that user’s folder and subfolders, or the whole store by restoring the folder tree that contains all the users, without worrying about synchronizing the live database with the backup.
- Corruption – File-based storage eliminates the problem of database corruption because it has no intermediate database that can fragment or become corrupted. Each user has an individual folder within the store; each folder contains subfolders corresponding to the calendar, in-box, and other email functions. Each message in a subfolder is represented by a file. This “one file per message” approach means that any corruption that occurs from a disk malfunction is limited to a single file (single message) and will not spread to a point where it can crash the entire system, as can happen with legacy databases.
- Disaster recovery – Recovering from a disaster such as a disk crash is also faster and simpler with a file-system architecture, because it offers an easy way to build low-cost server clusters that are free of database-synchronization issues.
- Compaction – With no database, file-based systems avoid the need for database compaction. Database compaction can be very time consuming and also expensive when, as is common, it requires a doubling of disk storage to hold a duplicate copy of the database.
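The backup and restoration points above can be made concrete with a short sketch. It assumes a one-file-per-message store held in an ordinary directory tree; the function names and the modification-time test are illustrative choices, and in practice an industry-standard backup tool would do this job.

```python
import shutil
from pathlib import Path


def incremental_backup(store: Path, backup: Path, since: float) -> list:
    """Copy every file modified after `since` (seconds since the epoch).

    `backup` must lie outside `store`, or the scan would revisit its own copies.
    """
    copied = []
    for src in store.rglob("*"):
        if src.is_file() and src.stat().st_mtime > since:
            dst = backup / src.relative_to(store)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)   # copy2 also preserves timestamps
            copied.append(dst)
    return copied


def restore_message(backup: Path, store: Path, relative: Path) -> Path:
    """Restore a single message by copying a single file back into the store."""
    dst = store / relative
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup / relative, dst)
    return dst
```

Restoring a folder or a whole user is the same operation applied to a directory rather than a file, which is why no database-synchronization step appears anywhere in the sketch.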
All the advantages of file-based storage systems add up to a powerful new way for enterprises to provide bottomless email to their users and to bring the capabilities of their email system in line with the needs of their users. Users get all the storage they need, and enterprises gain a far easier, more cost-effective method to handle email data.
Duncan Greatwood is chief executive officer at PostPath, www.postpath.com. Mr. Greatwood graduated with a B.A. in Mathematics and an M.Sc. in Computer Science from Oxford University, and an M.B.A. from London Business School.