Archiving content in Amazon S3 with the Storage service

Introduction

When new content arrives, we first store it locally and add a reference to it in the storage_* database tables. If the content is searchable, we also index an optimized copy in the fulltext_* database tables. These tables are the only part of Cerb5's schema that needs MySQL's MyISAM storage engine. The rest of your database can use the InnoDB storage engine, which doesn't support full-text indexes but provides other benefits such as transactions, point-in-time binary logs, and row-level locking.

This approach means that the database doesn't need to be polluted with large, immutable content. We've created a storage system that can archive and retrieve content from external repositories, such as Amazon S3, remote databases, and distributed filesystems.

Frequently Asked Questions

What is the consequence of archiving content to Amazon S3?

When large, immutable content becomes idle, the Storage service archives it for you in long-term storage. This behavior is configured in Setup->Storage->Content.

If archived content is requested, we retrieve it from Amazon S3 and cache it locally until it becomes idle again.
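
The archive-and-retrieve cycle described above can be sketched as a small in-memory model. This is only an illustration, not Cerb5's implementation: the class name, the method names, and the idle threshold are hypothetical, a plain dictionary stands in for Amazon S3, and keeping the remote copy after a retrieval is an assumption.

```python
class ArchivingStore:
    """Toy model of archive-on-idle and cache-on-read (all names hypothetical)."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self.local = {}   # key -> (content, last_access_time)
        self.remote = {}  # key -> content; stands in for Amazon S3

    def put(self, key, content, now):
        """New content is always stored locally first."""
        self.local[key] = (content, now)

    def archive_idle(self, now):
        """Move content that has been idle for idle_seconds to remote storage."""
        for key in list(self.local):
            content, last_access = self.local[key]
            if now - last_access >= self.idle_seconds:
                self.remote[key] = content
                del self.local[key]

    def get(self, key, now):
        """Return (content, source), where source is 'local' or 'remote'."""
        if key in self.local:
            content, _ = self.local[key]
            self.local[key] = (content, now)  # refresh the idle timer
            return content, "local"
        # Not cached: fetch from the archive and cache it locally again.
        # Assumption: the remote copy is left in place after retrieval.
        content = self.remote[key]
        self.local[key] = (content, now)
        return content, "remote"


if __name__ == "__main__":
    store = ArchivingStore(idle_seconds=100)
    store.put("message-1", b"hello", now=0)
    store.archive_idle(now=200)              # idle long enough, so it is archived
    print(store.get("message-1", now=201))   # first read comes from the archive
    print(store.get("message-1", now=202))   # second read is served locally
```

Passing the clock in as `now` keeps the sketch deterministic; a real service would read the system time and run the archiving pass on a schedule.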

What is the performance penalty of retrieving content from Amazon S3?

The major performance consideration is the latency between your server and Amazon S3. If you are hosting Cerb5 on an EC2 instance, the latency should be negligible. Conversely, if you're hosting Cerb5 on a residential DSL line, the latency may be more pronounced. Most commercial datacenters should provide adequate bandwidth. For example, from our SoftLayer servers (in both Seattle and Dallas) we routinely push over 20MB/sec to S3, and we pull content even faster.

The other consideration is the size of the content you're retrieving. A 50KB PNG image will download faster than a 30MB ZIP file. However, you may also notice that a 1MB ZIP file seems to download at the same speed as a 100KB PNG: both transfer in a fraction of a second, so nearly all of the delay comes from setting up the HTTP connection.

In some environments, opening the HTTP connection to Amazon S3 may take around a second, depending on your connection or on behind-the-scenes issues at Amazon. That time is independent of the time it takes to download the content. A second may not seem like a long time to wait, but if you're pulling 10 archived messages out of storage to review an old ticket, the delay will be noticeable. However, the delay only occurs on the first viewing. If you then send someone else a link to the ticket, the content will be served locally.
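
As a rough back-of-the-envelope model of the two factors above, the time to fetch one object can be estimated as the connection setup time plus the transfer time. The function names below are hypothetical. The one-second connection time and the 20MB/sec throughput are the figures quoted above, used only as placeholders. Fetching archived messages one at a time, each over its own connection, is an assumption made for the estimate.

```python
MB = 1_000_000  # decimal megabytes, for simplicity


def fetch_seconds(size_bytes, connect_seconds, bytes_per_second):
    """Estimated time for one object: connection setup plus transfer."""
    return connect_seconds + size_bytes / bytes_per_second


def first_view_seconds(sizes, connect_seconds, bytes_per_second):
    """Estimated delay on the first view of a ticket.

    Assumes each archived message is fetched sequentially over its own
    connection; later views are served from the local cache.
    """
    return sum(fetch_seconds(s, connect_seconds, bytes_per_second) for s in sizes)


if __name__ == "__main__":
    throughput = 20 * MB  # placeholder; measure your own server's rate
    connect = 1.0         # placeholder; the "around a second" case
    for label, size in [("100KB PNG", 100_000), ("1MB ZIP", 1 * MB), ("30MB ZIP", 30 * MB)]:
        print(label, round(fetch_seconds(size, connect, throughput), 3), "seconds")
    ten_messages = [20_000] * 10  # ten hypothetical 20KB messages
    print("first view:", round(first_view_seconds(ten_messages, connect, throughput), 2), "seconds")
```

With these placeholder numbers, the small files are dominated by the connection time, and only the 30MB file spends more time transferring than connecting.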

You should run benchmarks from your server (using something like jets3t) to make sure you're satisfied with the performance.
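
If you would rather script a quick measurement yourself, a minimal timing harness might look like the sketch below. It is not part of Cerb5 or jets3t: the `benchmark` and `fake_fetch` names are hypothetical, and `fake_fetch` is a stand-in that you would replace with a real download from your own bucket.

```python
import statistics
import time


def benchmark(fetch, runs=5):
    """Call fetch() several times; report median latency and throughput.

    fetch must return the downloaded content as bytes.
    """
    timings = []
    sizes = []
    for _ in range(runs):
        start = time.perf_counter()
        data = fetch()
        timings.append(time.perf_counter() - start)
        sizes.append(len(data))
    median_seconds = statistics.median(timings)
    if median_seconds > 0:
        bytes_per_second = statistics.median(sizes) / median_seconds
    else:
        bytes_per_second = float("inf")
    return {
        "runs": runs,
        "median_seconds": median_seconds,
        "bytes_per_second": bytes_per_second,
    }


def fake_fetch():
    """Stand-in for a real S3 download: sleeps briefly, returns 100KB."""
    time.sleep(0.01)
    return b"x" * 100_000


if __name__ == "__main__":
    # Replace fake_fetch with a function that downloads a test object from
    # your own bucket using whichever S3 client you prefer.
    print(benchmark(fake_fetch))
```

Run it against both a small object and a large one, since the small object mostly measures connection latency and the large one mostly measures bandwidth.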

