
Retroactively merge duplicate attachments

Introduction

The Cerb 6.6 release introduced duplicate attachment detection, but this feature only detects duplicates created from that point forward. For environments that upgraded from earlier versions, it is generally desirable to retroactively merge existing duplicate attachments in order to conserve storage space and reduce the amount of content that needs to be backed up.

Environments that frequently send the same outgoing files, like PDF forms or ebooks, will benefit the most from this process.

Instructions

Make a backup!

This process will modify your database. While these instructions have been tested in many environments, it is always wise to make a backup in case something goes wrong.
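
For example, if your Cerb database runs on MySQL, you could snapshot it from the command line before proceeding (the user and database names below are placeholders; substitute your own):

mysqldump -u cerb_user -p cerb_database > cerb_backup.sql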

Install the script

Download the attached cerb660_dupe_folding.txt script and copy it to your /cerb web directory.

Rename the file to cerb660_dupe_folding.php.
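
For example, from a shell on your web server (adjust the paths to match your installation):

cp cerb660_dupe_folding.txt /path/to/cerb/
cd /path/to/cerb
mv cerb660_dupe_folding.txt cerb660_dupe_folding.php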

Run the script

This script will discover the most frequently duplicated attachments in your database (by name and size), hash them with SHA-1 to verify that they are indeed identical, and then rewrite the attachment links so they all point at a single copy of the file. The orphaned duplicates will be removed automatically by Cerb's daily maintenance scheduled job.
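
For the curious, the following is a minimal sketch of that approach, not the bundled script itself. The table and column names (attachment, attachment_link, storage_key), the connection details, and the storage path are hypothetical placeholders; Cerb's actual schema and storage engines differ.

<?php
// Illustrative sketch only -- not the bundled script. Table names,
// column names, credentials, and the storage path are hypothetical.

$db = new PDO('mysql:host=localhost;dbname=cerb', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Optional batch size argument, clamped to 1..10000, defaulting to 25.
$batch_size = isset($argv[1]) ? max(1, min(10000, (int)$argv[1])) : 25;

// 1. Cheaply find the most frequently duplicated (name, size) groups.
$groups = $db->query(sprintf(
	"SELECT display_name, storage_size, COUNT(*) AS hits " .
	"FROM attachment " .
	"GROUP BY display_name, storage_size " .
	"HAVING hits > 1 " .
	"ORDER BY hits DESC " .
	"LIMIT %d",
	$batch_size
))->fetchAll(PDO::FETCH_ASSOC);

if (empty($groups)) {
	echo "No duplicate storage objects were found.\n";
	exit(0);
}

foreach ($groups as $group) {
	// 2. Hash each candidate's contents with SHA-1, since a matching
	//    name and size alone do not prove the files are identical.
	$stmt = $db->prepare(
		"SELECT id, storage_key FROM attachment " .
		"WHERE display_name = ? AND storage_size = ?"
	);
	$stmt->execute(array($group['display_name'], $group['storage_size']));

	$ids_by_hash = array();
	foreach ($stmt as $row) {
		$hash = sha1_file('/path/to/storage/' . $row['storage_key']);
		$ids_by_hash[$hash][] = (int)$row['id'];
	}

	// 3. For each set of verified-identical files, keep one attachment
	//    and repoint every link at it. The now-orphaned storage objects
	//    are purged later by the daily maintenance job.
	foreach ($ids_by_hash as $ids) {
		if (count($ids) < 2)
			continue;
		$keep_id = array_shift($ids);
		$db->exec(sprintf(
			"UPDATE attachment_link SET attachment_id = %d " .
			"WHERE attachment_id IN (%s)",
			$keep_id,
			implode(',', $ids)
		));
	}
}

Name and size are only used to find candidates cheaply; the SHA-1 comparison is what confirms two files are byte-identical before any links are rewritten.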

The script accepts an optional argument for the number of distinct duplicate objects to process at once (the "batch size"). If you omit the argument, the batch size defaults to 25, but you may need to run the script many times to discover all duplicates. The batch size must be between 1 and 10000.
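
For example, these two invocations behave identically, since an omitted argument defaults to a batch size of 25:

php cerb660_dupe_folding.php
php cerb660_dupe_folding.php 25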

The script is designed to be run multiple times in order to reduce the load on your server and database. We recommend starting with a low batch size (like 25) and increasing it with each successive run, repeating until no more duplicates are discovered. The reason to start small is that the first attachments to be merged will have the most duplicates (potentially thousands). After a few runs you'll be dealing with the "long tail" of files with fewer duplicates, where a large batch size like 10000 will finish the process faster (at the risk of increasing your server's load).

Execute the script on the command line using the php command:

php cerb660_dupe_folding.php 100

Repeat the above command (possibly with a larger batch size) until the script reports:

No duplicate storage objects were found.
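
Putting it all together, a typical session escalates the batch size across runs, for example:

php cerb660_dupe_folding.php 25
php cerb660_dupe_folding.php 100
php cerb660_dupe_folding.php 1000
php cerb660_dupe_folding.php 10000

Keep repeating the last command until the "No duplicate storage objects were found." message appears.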

Finishing up

Once you have merged all duplicate files, delete the cerb660_dupe_folding.php script from your web server.
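
For example, from a shell on the web server:

rm /path/to/cerb/cerb660_dupe_folding.php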

For efficiency, the attached script only hashes files that already have at least one duplicate. Because of this, at most one future duplicate can be created for an existing file that currently has none. This should only happen rarely, and you can run these steps again at a later date to merge those duplicates as well. Hashing millions of distinct files unnecessarily could bog down your server for over an hour, which is why the script focuses on the most duplicated files.

You can compare the before and after disk space usage from Setup -> Storage -> Content.

You won't notice a reduction in used filesystem space right away, because Cerb's daily scheduled maintenance job needs to run in order to delete storage objects that no longer have records pointing at them. You can force maintenance to run from Setup -> Configure -> Scheduler by clicking the run now link next to Maintenance.

