Online advertising is just one example of the Internet applications driving the growth of log data. In this case, tracking online advertising delivery and click-throughs helps companies gain extensive insight into users’ online behaviors, creating a slew of opportunities to tailor ads to a well-defined target market. However, it also creates significant challenges for IT, which must capture and manage an ever-expanding amount of log data. With the wealth of valuable data now available, infrastructure and application managers must adequately track and analyze this critical information to keep ads relevant and user experiences engaging.
Log data capacity is already a burden for today’s insufficient storage and file system technologies. Newer architectures that support the tracking and analysis of every single click by every single user are needed for organizations to better manage this information. A highly optimized distributed file serving infrastructure can help alleviate the woes experienced by social networking, photo sharing and ad serving companies and, in the long run, prevent them from drowning in their own log data.
Problems with Batch Log Upload
Local log capture on each individual Web server is a common practice. Large Web properties often have hundreds to thousands of Web servers, each of which uploads portions of its log data to a centralized storage repository at a set interval.

Figure 1: Problems with batch log upload
This process has several drawbacks, most notably:
- Excessive client coordination to avoid storage overload. Web servers must be manually configured not to upload log files at the same time, to avoid bottlenecks.
- Custom rsync scripts are required for fine-tuning scheduling and operations.
- Systems must be throttled to avoid I/O contention.
- Infrequent sync times by necessity, not preference. Having portions of logs spread between Web servers and storage adds inconvenience and potential for errors.
- Problems scaling. Traditional NAS servers can easily run out of resources with too many simultaneous client connections.
Most of these limitations come from a centralized storage system that cannot process log updates in parallel. Traditional network attached storage systems that are not distributed suffer from these kinds of bottlenecks.
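To make the coordination burden concrete, here is a minimal sketch of one way upload start times might be staggered across a fleet. The article does not show the scripts involved, so everything here is an assumption for illustration: the function names, the hostnames, the one-hour interval and the 60-second slots are all hypothetical. Each server derives a stable offset from a hash of its own hostname, so no two servers need to talk to each other, but the schedule still has to be tuned by hand as the fleet grows.

```python
import hashlib
from collections import Counter


def upload_offset_seconds(hostname: str, interval_seconds: int) -> int:
    """Return a stable offset, in seconds, within the upload interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Hash the hostname so the same server always lands on the same offset.
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


def uploads_per_slot(hostnames, interval_seconds: int, slot_seconds: int) -> Counter:
    """Count how many servers start uploading in each time slot."""
    return Counter(
        upload_offset_seconds(name, interval_seconds) // slot_seconds
        for name in hostnames
    )


# Hypothetical fleet: 500 Web servers, one upload per hour, 60-second slots.
hostnames = [f"web{n:03d}.example.com" for n in range(500)]
load = uploads_per_slot(hostnames, interval_seconds=3600, slot_seconds=60)
print("uploads starting in the busiest slot:", max(load.values()))
```

Even with offsets spread out this way, the busiest slot still sends several servers to the central store at once, which is why throttling and longer sync intervals tend to follow.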
Achieving Scalable and Efficient Log Capture
Unlike conventional storage and file systems, new distributed systems easily handle the workload demands of hundreds to thousands of Web servers. This delivers the following capabilities for logging:
- No coordination required: Distributed systems place individual log files across all the nodes, thus allowing extensive load sharing. Some systems can also manage atomic log appends so that client applications need not be involved in locking the log files to make sure appended records preserve their individual integrity.
- No need for custom rsync scripts: As the system can handle numerous simultaneous connections, customers can avoid customizing rsync scripts.
- No throttling required: Distributed systems expand horizontally to accommodate higher throughput, eliminating the need to throttle batch uploads.
- More frequent sync operations now feasible: With enough throughput and connection handling easily available, customers can sync more frequently.
- Scale seamlessly: Adding a new node takes a few minutes, allowing customers to scale easily and expand the amount of log data they can retain.

Figure 2: Solving batch uploads with distributed systems.
Direct Logging to Network Storage
Many Web and application developers avoid logging to network storage due to availability concerns. In some cases, lack of storage availability can cause logging processes and applications to hang.
Distributed systems address this concern by replicating data across nodes for availability. Write operations are distributed across nodes, and if a node fails for any reason, it is automatically substituted with a spare resource. The system remains available to accept log file operations throughout this process, ensuring that log operations complete without constraining the application.
Another concern for Web developers around logging to network storage is the need to manage locking when multiple clients are writing to the same file. This process can be more trouble than it is worth and often results in a brute force fallback option of saving individual log files and then batch uploading at set intervals.
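The locking burden described above can be sketched with a local, single-machine analogy. This is an illustration under stated assumptions, not the approach of any particular storage product: it uses POSIX advisory locks through Python's `fcntl.flock`, the function name `append_with_lock` is hypothetical, and lock behavior on network file systems varies by system and is not assumed here.

```python
import fcntl
import os


def append_with_lock(path: str, record: str) -> None:
    """Append one record while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as log_file:
        # Blocks until no other cooperating writer holds the lock.
        fcntl.flock(log_file.fileno(), fcntl.LOCK_EX)
        try:
            log_file.write(record.rstrip("\n") + "\n")
            log_file.flush()
            os.fsync(log_file.fileno())  # push the record to storage before unlocking
        finally:
            fcntl.flock(log_file.fileno(), fcntl.LOCK_UN)
```

Every writer has to follow the same lock-write-unlock protocol, and a writer that stalls while holding the lock stalls everyone behind it. Multiply that by hundreds of servers and the batch upload fallback starts to look attractive.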
Systems offering an atomic append mode enable easy updates to log files. This dramatically simplifies the process of writing to log files shared by hundreds to thousands of servers, because each appended record stays intact without client-side locking.
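For a feel of what append mode looks like from the application side, here is a minimal sketch using the closest local analogy, the POSIX `O_APPEND` flag. The article does not describe the API of any distributed system's append mode, so this is an assumption-laden stand-in: the function name `append_record` is hypothetical, and the positioning guarantee shown applies to local POSIX file systems, not necessarily to a given network file system.

```python
import os


def append_record(path: str, record: str) -> None:
    """Append one newline-terminated record with a single write call."""
    data = (record.rstrip("\n") + "\n").encode("utf-8")
    # With O_APPEND, the offset is moved to end-of-file as part of each write,
    # so the application does not seek or lock before writing.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, data)
        if written != len(data):
            raise OSError("short write; record may be incomplete")
    finally:
        os.close(fd)


# Usage: each Web server process would call this once per log record.
# append_record("/path/to/shared/access.log", "web001 GET /ad/42 200")
```

The design point is that the whole record goes out in one write call and the storage layer, not the client, decides where it lands. The path in the usage comment is a placeholder.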

Figure 3: Append mode creates a more efficient logging infrastructure.
Conclusion
Log file capture is difficult to manage with conventional storage and file systems that do not scale and cannot handle the simultaneous load that is part of batch log updates.
Distributed systems can handle simultaneous batch uploads with ease. Coupled with atomic append mode features, these systems make logging directly to network storage practical and efficient, and do not require the use of distributed locks to coordinate clients.
Together, these capabilities help Web companies take control of log data, allowing them to effectively capture and analyze vast amounts of information to successfully run their business.
Gary Orenstein is vice president, technical solutions at MaxiScale. In addition to being a regular contributor to GigaOM, Orenstein hosts the podcast, The Cloud Computing Show. Orenstein is the author of IP Storage Networking: Straight to the Core. He holds an MBA from the Wharton School at the University of Pennsylvania, and a BA from Dartmouth College.

