Infrastructure Issue Affecting Customer Sites
Incident Report for Pantheon Operations
Postmortem

At 0002 UTC July 2, Pantheon observed that the contents of some customer files had been replaced with the contents of other files, possibly files belonging to other Pantheon customers. The problem was traced to a code change we had made to address a bug in our filesystem persistence layer.

The platform uses a write-back cache for small files uploaded to a site, and the process responsible for writing files under 1MB from the cache to long-term storage contained a section of code that was not thread safe. The issue only manifested under very specific conditions, but when it did, it wrote a corrupted file that could include bytes of memory belonging to a different site or file.
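
To illustrate the class of bug (a minimal sketch, not our actual off-loader code; the names here are hypothetical): when a process stages file contents in shared mutable state before flushing them to storage, concurrent workers must serialize access to that state, or the bytes flushed for one file can contain fragments of another file's data.

    import threading

    class CacheOffloader:
        """Toy off-loader that drains small files from a write-back cache
        to long-term storage via a shared staging buffer."""

        def __init__(self, storage):
            self.storage = storage          # hypothetical long-term storage client
            self._staging = bytearray()     # shared, mutable state between workers
            self._lock = threading.Lock()   # serializes access to the staging buffer

        def offload(self, path, data):
            # Without the lock, two worker threads can interleave their use of the
            # staging buffer, and the bytes flushed for one file may include bytes
            # that belong to a different site or file.
            with self._lock:
                self._staging.clear()
                self._staging.extend(data)
                self.storage.put(path, bytes(self._staging))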

We reverted the change that enabled this bug to manifest and proceeded to aggressively delete the files that could have been corrupted in order to minimize the chance of exposure. For risk-management reasons we focused primarily on live sites. We have also since disabled access to backups generated during that window; accessing those backups will require going through our customer support.

We are currently running an exhaustive audit of all volumes across all environments to ensure we have covered every possible corrupted file, and we will delete any further corrupted files found. Any new findings will be disclosed directly to the relevant organizations via a support ticket. This audit is estimated to take 48-72 hours.

We have high confidence that no new corrupted files are being generated, but we are also implementing several changes: we are making the off-loader process thread safe, and we are investigating removing the write-back cache entirely as part of a filesystem rewrite. We will also look at adding a layer of content-verification logic to prevent the platform from serving such files in the future.
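
As a rough sketch of what that content-verification layer could look like (an illustration under assumptions, not our implementation; the storage and metadata objects and the choice of SHA-256 are hypothetical): record a checksum when a file is written, and refuse to serve a file whose stored bytes no longer match it.

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def write_file(storage, metadata, path: str, data: bytes) -> None:
        # Store the bytes and record their checksum alongside the file.
        storage.put(path, data)
        metadata.set(path, checksum(data))

    def serve_file(storage, metadata, path: str) -> bytes:
        # Refuse to serve content that no longer matches the recorded checksum.
        data = storage.get(path)
        expected = metadata.get(path)
        if expected is not None and checksum(data) != expected:
            raise IOError(f"checksum mismatch for {path}; refusing to serve")
        return data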

Posted Jul 07, 2020 - 17:12 PDT

Resolved
This incident has been resolved. We have fixed the regression and removed corrupted files.

Additional update:

The incident caused some files under 1MB, written either by the CMS or by a manual upload (e.g. SFTP), to be corrupted. Those files have been purged and their content is gone. The window of the incident began on the evening of June 29th Pacific Time (0300 UTC June 30th) and lasted until approximately midnight July 1st Pacific Time (0700 UTC July 2nd). As of right now, affected files still show up in directory listings but do not contain any content.

Two classes of files widely impacted were aggregated CSS/JS assets and image thumbnails. If you are experiencing issues related to these assets, you should immediately flush all caches or otherwise trigger the regeneration of those files. Now that the file persistence layer is stable, they should perform as usual after being regenerated.

We will be conducting and communicating a full audit of corrupted files for all affected customers, as well as removing all errant references from directory listings. This will take some time, but work is already underway.

If you need to find out if a file uploaded or written by the CMS was lost, you can review recently added files via your WordPress or Drupal admin interface, and see if they are still available. If not, you should re-upload them if possible. This is the only path to restoring lost content.
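
If you have SFTP access to a copy of your files directory, that review can also be scripted. The sketch below is a starting point under assumptions (the directory name is a placeholder, and modification times may not line up exactly with upload times): it lists files touched during the incident window that are now empty, since affected files still appear in listings but contain no content.

    import os
    from datetime import datetime, timezone

    # Incident window in UTC, from the timeline above.
    WINDOW_START = datetime(2020, 6, 30, 3, 0, tzinfo=timezone.utc)
    WINDOW_END = datetime(2020, 7, 2, 7, 0, tzinfo=timezone.utc)

    FILES_DIR = "files"  # placeholder: a local copy of your site's files directory

    for root, _dirs, names in os.walk(FILES_DIR):
        for name in names:
            path = os.path.join(root, name)
            info = os.stat(path)
            mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            # Affected files show up in listings but have no content.
            if info.st_size == 0 and WINDOW_START <= mtime <= WINDOW_END:
                print(path)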

Backups will not contain the missing files, but as a last resort, restoring from backup is a way to get a site back to a previous stable state. For safety, you should use a backup taken before 0300 UTC June 30th (8pm PT June 29th). Restoring from backup disrupts a site and causes some downtime. For sites with a small content footprint, a restore to the live environment can complete in a few minutes.
To minimize downtime, you can individually import the elements of a backup into another environment (e.g. test) by copying the URL of each backup element and pasting it into the other environment's "import" field. Once that workflow completes, clear your edge cache, test your changes, then use the content sync workflows to sync the database and files back over to live. Import via URL only works when the backup elements are under 500MB in size, but it minimizes the disruption to the live environment.
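
Because import via URL only works for elements under 500MB, it can save time to check an element's size before starting the workflow. A minimal sketch (the URL is a placeholder for the backup-element link copied from your dashboard, and it assumes the link answers a HEAD request):

    import urllib.request

    LIMIT = 500 * 1024 * 1024  # 500MB import-via-URL limit

    def element_size(url: str) -> int:
        # Ask for headers only, so the element is not downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            return int(response.headers.get("Content-Length", "0"))

    # url = "https://..."  # backup element link copied from the dashboard
    # print("importable via URL" if element_size(url) <= LIMIT else "too large for URL import")
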
You can open a support ticket or engage chat for further consultation if needed.
Posted Jul 02, 2020 - 14:52 PDT
Update
Starting at approximately 0300 UTC June 30, writes from the persistent cache to long-term storage became heavily delayed for a portion of our customers. This had no effect on the filesystem, but it caused affected files to not appear in backups. Starting at approximately 1900 UTC July 1, we deployed a change to address the delayed writes to long-term storage, which ultimately corrupted a portion of those files in long-term storage. We addressed the issue by auditing files in long-term storage against their validation hashes and deleting files that did not match.
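
In outline, the audit recomputes the hash of each file in long-term storage, compares it to the validation hash recorded when the file was written, and deletes any file that no longer matches. A simplified sketch (the storage and metadata interfaces and the hash algorithm are assumptions, not our actual tooling):

    import hashlib

    def audit_and_delete(storage, metadata, paths):
        # Delete any stored file whose bytes no longer match its validation hash.
        deleted = []
        for path in paths:
            recorded = metadata.get(path)  # validation hash saved at write time
            if recorded is None:
                continue                   # nothing to validate against
            data = storage.get(path)
            if hashlib.sha256(data).hexdigest() != recorded:
                storage.delete(path)
                deleted.append(path)
        return deleted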

We have fixed the regression and removed corrupted files.

Recommended customer remediation: Follow your normal restore-from-backup procedure, using a backup started before 0300 UTC June 30.

Remediation for customers that can't follow the recommendation: Find all references (e.g. in the database or HTML) to files under your ./files/ path. Read each file; if it is missing or empty, either remove the reference or re-upload the file. Do not rely on directory listings. The audit period should at least cover files created between 0300 UTC June 30 and 0700 UTC July 2.
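
One way to script that audit (a sketch under assumptions: references are pulled from a database export or HTML dump, files are read over HTTP from the live site, the hostname is a placeholder, and the path pattern is Drupal-style and should be adjusted for your CMS):

    import re
    import urllib.error
    import urllib.request

    SITE = "https://example.com"  # placeholder for your live site
    # Drupal-style public files path; adjust for your CMS (e.g. wp-content/uploads).
    FILES_RE = re.compile(r"""/sites/default/files/[^\s"'<>)]+""")

    def file_references(exported_text: str):
        # Collect candidate file references from a DB export or HTML dump.
        return sorted(set(FILES_RE.findall(exported_text)))

    def is_missing_or_empty(path: str) -> bool:
        # Read the file; treat a 404 or an empty body as lost content.
        try:
            with urllib.request.urlopen(SITE + path) as response:
                return len(response.read()) == 0
        except urllib.error.HTTPError:
            return True

    # for ref in file_references(open("db_export.sql").read()):
    #     if is_missing_or_empty(ref):
    #         print("re-upload or remove reference:", ref)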

We will carry out a full postmortem and update this page within 3 business days (EOD Wednesday).
Posted Jul 02, 2020 - 12:53 PDT
Update
Our engineering team has finished cache clearing and continues to investigate the incident.
Posted Jul 02, 2020 - 10:56 PDT
Update
Our engineering team is still working on clearing the cache for affected sites.
Posted Jul 02, 2020 - 09:53 PDT
Update
Our engineering team is still working on clearing the cache for affected sites.
Posted Jul 02, 2020 - 08:48 PDT
Update
Our engineering team is still working on clearing the cache for affected sites.
Posted Jul 02, 2020 - 07:47 PDT
Update
Our engineering team is still working on clearing the cache for affected sites.
Posted Jul 02, 2020 - 06:47 PDT
Update
Our engineering team is still working on clearing the cache for affected sites.
Posted Jul 02, 2020 - 05:47 PDT
Monitoring
A fix has been deployed and is being monitored. Our engineering team is currently working on clearing caches of affected customers.
Posted Jul 02, 2020 - 04:46 PDT
Update
No updates at this time.
Posted Jul 02, 2020 - 04:12 PDT
Update
No updates at this time.
Posted Jul 02, 2020 - 03:13 PDT
Update
No updates at this time.
Posted Jul 02, 2020 - 02:11 PDT
Update
No updates at this time.
Posted Jul 02, 2020 - 01:11 PDT
Update
No updates at this time.
Posted Jul 02, 2020 - 00:09 PDT
Update
No updates at this time.
Posted Jul 01, 2020 - 22:58 PDT
Update
No new information at this time.
Posted Jul 01, 2020 - 21:16 PDT
Identified
We believe we have stopped the processes that were corrupting the file data and we’re working to clean up the corrupted files.
Posted Jul 01, 2020 - 19:38 PDT
Investigating
We are currently investigating this issue.
Posted Jul 01, 2020 - 19:02 PDT
This incident affected: Customer Sites.