Convergence in "on-line" and archived email?

By Scott Dietzen on February 19, 2006 in Open Source

The email archiving market is growing explosively with the proliferation of retention and compliance policies, often motivated by the increased regulatory overhead of the Sarbanes-Oxley Act (SOX), the Health Insurance Portability and Accountability Act of 1996 (HIPAA), and so on.

While email archiving is frequently grouped with more general-purpose archiving and data warehousing solutions designed for files and databases, the underlying requirements are actually very different:

(1) “Single copy” storage – The aggregate storage requirements for email are huge (my own mailbox is approaching 2 GB, and I restarted from scratch when joining Zimbra in early 2005). Moreover, email (like instant messages (IMs), voicemails, and so on) is write-once, read-many. So it is best to store each message just once (modulo “implicit” redundancy for RAID, automatic backup, disaster recovery, etc.).
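To make the single-copy idea concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Zimbra or any other server stores mail: the class name SingleCopyStore, its methods, and the choice of a SHA-256 digest as the message key are all assumptions made for the example.

```python
import hashlib


class SingleCopyStore:
    """Hypothetical store that keeps one blob per distinct message."""

    def __init__(self):
        self.blobs = {}      # digest -> raw message bytes (written once)
        self.mailboxes = {}  # mailbox name -> list of digests (references only)

    def deliver(self, raw_message, recipients):
        # Assumption: the SHA-256 digest of the raw bytes identifies a message.
        digest = hashlib.sha256(raw_message).hexdigest()
        # Write the body only if it is not already stored.
        self.blobs.setdefault(digest, raw_message)
        # Each recipient's mailbox gets a reference, not another copy.
        for mailbox in recipients:
            self.mailboxes.setdefault(mailbox, []).append(digest)
        return digest

    def read(self, mailbox, position):
        # Reads resolve the reference back to the single stored body.
        return self.blobs[self.mailboxes[mailbox][position]]


store = SingleCopyStore()
msg = b"Subject: Q1 numbers\r\n\r\nSee attached."
store.deliver(msg, ["alice", "bob", "carol"])
print(len(store.blobs), "stored body for", len(store.mailboxes), "mailboxes")
```

Because messages are never modified after delivery, the references can be shared safely; redundancy for RAID or backup would sit below this layer.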

The data in file systems and databases, on the other hand, is frequently changed on disk, so it is crucial to take periodic snapshots and manage multiple versions. Moreover, for relational databases in particular, the on-line (or transactional) data of record is typically represented quite differently from the view required for query-only, decision support systems such as on-line analytical processing (OLAP)/business intelligence (BI). So relational data is often “archived” simply to convert it for use by different applications.

With email archiving, the problem is instead one of better managing this single copy by, for example, auto-aging messages from faster spindles to slower ones (via hierarchical storage management) and ensuring that they are retained as long as necessary, but no longer.
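The auto-aging and retention rule described above can be sketched as a small policy function. The thresholds (90 days on fast storage, seven years of retention), the tier names, and the function name placement are assumptions chosen for illustration; real values would come from an organization's own storage and compliance policy.

```python
from datetime import date, timedelta

# Assumed thresholds, for illustration only.
FAST_TIER_DAYS = 90         # keep recent mail on the faster spindles
RETENTION_DAYS = 7 * 365    # retain as long as necessary, but no longer


def placement(received, today):
    """Return where a message belongs today: 'fast', 'slow', or 'purge'."""
    age = (today - received).days
    if age > RETENTION_DAYS:
        return "purge"   # past the retention period
    if age > FAST_TIER_DAYS:
        return "slow"    # aged out to cheaper, slower storage
    return "fast"


today = date(2006, 2, 19)
for days_old in (10, 400, 3000):
    print(days_old, placement(today - timedelta(days=days_old), today))
```

A periodic job could apply such a function to message metadata and move or delete bodies accordingly, without ever creating a second copy of a message.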

(2) Unified query models & meta-data/indices – Search has become the tool for navigating large datasets, such as the Google view into the World Wide Web or the Spotlight view into your Mac desktop.

Once you get your hands on rich search capabilities for navigating your mailbox without the a priori overhead of folders, labels, etc., you cannot imagine going back to trying to do it by hand. Zimbra users can search their mailboxes based on virtually any syntactic property of their email, contacts, appointments, and so on: content, dates, domain (e.g., mail sent from or to stanford.edu), attachment contents, attachment type, objects in email (URLs, ticker symbols, phone numbers, employee IDs), etc.

In fact, such arbitrarily rich syntactic search should be available both to individual users (so that they can lay their fingers on the right email as quickly as possible) and to administrators (to enable the most efficient, accurate cross-mailbox search for discovery and compliance).
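One way to picture a single query model serving both audiences is a shared index in which the mailbox is just an optional filter. The sketch below is a toy inverted index written for this illustration; the class MessageIndex, the field names "domain" and "attachment_type", and the query shape are assumptions, not Zimbra's actual search syntax or index format.

```python
from collections import defaultdict


class MessageIndex:
    """Hypothetical index shared by user search and administrator discovery."""

    def __init__(self):
        # (field, value) -> set of (mailbox, message id)
        self.postings = defaultdict(set)

    def add(self, mailbox, message_id, fields):
        for field, values in fields.items():
            for value in values:
                self.postings[(field, value.lower())].add((mailbox, message_id))

    def search(self, terms, mailbox=None):
        """Return messages matching every (field, value) term.

        mailbox=None searches across all mailboxes (discovery);
        a mailbox name restricts the same query to one user.
        """
        hits = None
        for field, value in terms:
            matched = self.postings.get((field, value.lower()), set())
            hits = matched if hits is None else hits & matched
        hits = hits or set()
        if mailbox is not None:
            hits = {hit for hit in hits if hit[0] == mailbox}
        return sorted(hits)


index = MessageIndex()
index.add("alice", 1, {"domain": ["stanford.edu"], "attachment_type": ["pdf"]})
index.add("bob", 7, {"domain": ["stanford.edu"], "attachment_type": ["pdf"]})
index.add("bob", 8, {"domain": ["example.com"], "attachment_type": ["pdf"]})

query = [("domain", "stanford.edu"), ("attachment_type", "pdf")]
print(index.search(query, mailbox="alice"))  # one user's view
print(index.search(query))                   # cross-mailbox discovery
```

The point of the design is that the administrator's query is the user's query with the mailbox filter removed, so there is only one index and one query model to build and maintain.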

(3) Low overhead for compliance and cross-mailbox discovery – Many ad hoc decision support queries must be directed against data warehouses because the performance overhead on the on-line systems (which are generally not optimized for ad hoc queries) is prohibitive. For email systems, however, the access models for on-line use (users) and “off-line” use (administrators doing cross-mailbox search/compliance) are converging, as per (2).

Moreover, the aggregate workload of the “off-line” processing (cross-mailbox search, compliance-related discovery) is negligible relative to that of the normal on-line functioning of the email system.

Our claim: the combination of factors (1), (2), and (3) is going to (over time) drive convergence between today’s disparate on-line messaging server and email archiving solutions. There is simply insufficient justification for the increased total cost of ownership (TCO) of
(*) Multiple message stores;
(*) Multiple meta-data/index stores;
(*) Multiple query models;
(*) Multiple server configuration and administration tools;
(*) Multiple security models; and
(*) Multiple scalability and fault tolerance solutions.

Instead, we are convinced that future messaging servers are going to natively provide
(1) rich intra-mailbox search based on meta-data and indexing;
(2) rich cross-mailbox search & discovery via the same meta-data/indices;
(3) volume and hierarchical storage management (for auto-aging message bodies);
(4) real-time policy enforcement;
(5) write-once/read-many (WORM) hardware support;
and so on. In the meantime, of course, it is essential that solutions like Zimbra integrate easily with existing email archiving solutions, but we expect these technologies to be much more tightly integrated in the future.

Innovation and competition in reducing the TCO of email archiving will be among the most critical factors in picking the future winners in enterprise messaging and collaboration. We in the Zimbra Community believe we have a modest head start in driving this convergence, but we also believe the benefits are sufficiently compelling that the greater market will ultimately follow this path.
