Drupal 9: Removing Base64 Encoded Files From Content

Occasionally, I have come across Drupal sites that have base64 encoded images embedded into content fields. Base64 encoding takes the binary data contained in a file and converts it into a string of printable characters. The original binary data can then be re-created from this string, and the format is understood by lots of different technologies (including web browsers).

Whilst this is technically possible, it massively balloons the size of the database and can often slow down page load times due to the database being slow to respond to the request. Instead of fetching a few kilobytes of data from the table the database is forced to fetch many megabytes of data, which can create a bottleneck for other requests.
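To make the size overhead concrete, here is a minimal sketch in plain PHP. The 300 random bytes are only a stand-in for a real file; the point is that every 3 bytes of binary become 4 characters of text, so the encoded copy is roughly a third larger than the original.

```php
<?php
// Stand-in for a binary file: 300 arbitrary bytes (an assumption for
// illustration, not real image data).
$binary = random_bytes(300);

// Encode the bytes as a string of printable characters.
$encoded = base64_encode($binary);

// Every 3 input bytes become 4 output characters.
$encodedLength = strlen($encoded);

// Decoding restores the original bytes exactly.
$decoded = base64_decode($encoded, TRUE);
$roundTripOk = ($decoded === $binary);

printf(
  "%d bytes in, %d characters out, round trip ok: %s\n",
  strlen($binary),
  $encodedLength,
  $roundTripOk ? 'yes' : 'no'
);
```

Scale that up to a multi-megabyte photograph in every body field and the database growth described above follows quickly.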

When you download a file from the web your browser can make a decision on whether to fetch that file a second time. By injecting files into the content you are forcing your users to download very large pages every time they want to request a page. It isn't possible for the browser to make that decision any more and that can lead to more slowdown for the user.

If you can't tell, I really dislike this method of image storage. Whilst it is technically possible, it creates more problems than it solves and even sites with a couple of thousand nodes can have databases of many gigabytes in size due to this issue. It can also put unnecessary strain on the database due to the increased time taken to return data.

Let's say that you embed an image into some copy on a Drupal site using the normal media or file embed features. You would see an image element that looks like this.

<img alt="A blank image" src="/sites/default/files/images/blank.png">

In certain situations it is possible to embed images directly into content. The image element would look something like this.

<img alt="A blank image" src="... and so on ...">

The main culprit of this base64 image encoding problem is when a user drags an image into the body field of a Drupal post. Instead of creating the file reference, CKEditor will instead embed the file directly into the content area. When the post is saved the data is saved to the database along with the encoded image.

I have also seen this happen in a couple of decoupled Drupal sites where users were given a CKEditor box to edit content, which sent base64 encoded images back to the Drupal site. The developers saved this content directly into the database, and the Drupal API became slow to respond as a result.

The Solution

Once the problem has been identified the first thing to do is create a mechanism to pull the base64 encoded images out of a block of content and create file entities. To do this I used the DOMDocument libraries built into PHP as regular expressions aren't the best thing in the world to trust with HTML.

Assuming we have some content we can generate a DOMDocument object using that content. As the content must be treated as UTF-8 we need to inform the DOMDocument object that this is what we are working with. Without this in place you will find that the resulting output has lots of encoding errors when the document is put back together again.

Not only do we need to set the DOMDocument object up as UTF-8, but as the content entered into a body field (or similar) is an incomplete document we also need to prefix the content with an XML declaration. This ensures that the block of content is treated in the correct way.

With all that in place we can finally call the getElementsByTagName() method to get all of the images in the DOMDocument object.

// Load the HTML from the content and find all of the img elements.
$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . $content, LIBXML_NOERROR);
$dom->encoding = 'UTF-8';
$dom->substituteEntities = TRUE;

$images = $dom->getElementsByTagName('img');

The $images variable now contains a DOMNodeList object that is essentially a collection of DOMNode objects. Using this we can loop through the list and pick out each image.
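As a quick way to try this outside of Drupal, here is a minimal standalone sketch. The $content string is a made-up body field value and the base64 payload is a short placeholder rather than real image data. I have also added LIBXML_NOWARNING alongside LIBXML_NOERROR, which is my own choice to keep the output quiet.

```php
<?php
// Hypothetical body field content: one normal image and one embedded image.
// The base64 payload is a short placeholder, not real image data.
$content = '<p>Intro</p>'
  . '<img alt="Normal" src="/sites/default/files/images/blank.png">'
  . '<img alt="Embedded" src="data:image/png;base64,AAAA">';

$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML(
  '<?xml version="1.0" encoding="UTF-8"?>' . $content,
  LIBXML_NOERROR | LIBXML_NOWARNING
);

$images = $dom->getElementsByTagName('img');

$sources = [];
foreach ($images as $image) {
  // Each item in the list is a DOMElement, so attribute methods are available.
  $sources[] = $image->getAttribute('src');
}

print_r($sources);
```

Both images come back in document order, which is why the next step has to check each src attribute before doing anything with it.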

As the image element may (or may not) contain a base64 encoded image, checking for this is the first step in the loop. We don't want to be doing anything with properly added images so we just skip these items in the list.

// Create internal count to allow multiple files to be named differently.
$count = 0;

foreach ($images as $image) {
  $count++;

  // Search for base64 encoded data within the img element.
  $verify = preg_match('/(data|image\/png):?([^;]*);base64,(.*)/', $image->getAttribute('src'), $match);
  if (!$verify) {
    // Skip if this isn't a base64 encoded file.
    continue;
  }

  // ... Process the image ... 
}
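If you want the check and the extraction in one place, something like the following sketch might do. The function name parseBase64DataUri() is hypothetical and not part of the code above, and the pattern is a stricter variant that I am assuming is acceptable: it only matches values that start with "data:", so ordinary URLs are always skipped.

```php
<?php
/**
 * Hypothetical helper: splits a base64 data URI into its parts.
 *
 * Returns NULL when the value is not a base64 data URI.
 */
function parseBase64DataUri(string $src): ?array {
  // Anchor at the start so that normal file paths never match.
  if (!preg_match('#^data:([^;,]*);base64,(.*)$#is', trim($src), $match)) {
    return NULL;
  }
  return [
    // The type the markup claims to be, which may be empty or wrong.
    'declared_mime' => strtolower($match[1]),
    // Long values are sometimes wrapped, so drop any whitespace.
    'payload' => preg_replace('/\s+/', '', $match[2]),
  ];
}

$normal = parseBase64DataUri('/sites/default/files/images/blank.png');
$embedded = parseBase64DataUri('data:image/png;base64,AAAA');
```

The declared type is only what the markup claims, so it is worth treating it as a hint and verifying the decoded bytes separately.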

Once we are sure we have a base64 encoded image we then need to extract the encoded information about the file and ensure that it is a file we can work with.

This block of code deals with that process and uses a couple of PHP functions to decode the file and find out what sort of file type it is. In this case, if the file isn't an image we skip it, logging the rejection first so that we have some form of record of the rejected file.

// Extract data for the file.
$dataRaw = explode(',', $image->getAttribute('src'));
$fileData = base64_decode($dataRaw[1]);

// Extract the mime type for the file so that we can save it in the right
// format.
$finfo = finfo_open();
$mimeType = finfo_buffer($finfo, $fileData, FILEINFO_MIME_TYPE);

// We need to make sure the encoding of the base64 data is actually an image
$verifyMime = preg_match('/image\/(png|jpg|jpeg|gif)/', $mimeType, $mime_match);

if (!$verifyMime) {
  // Skip this file since it's not an image.
  \Drupal::logger('mymodule.base64_file_manager_service')->info('File of type @mime not decoded when processing content', ['@mime' => $mimeType]);
  continue;
}
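The same decode and verify step can be tried in isolation with the sketch below. The decodeImagePayload() function and its allow-list are my own assumptions for illustration, it needs the fileinfo extension, and it uses strict decoding so that a payload containing invalid characters is rejected rather than silently mangled. The sample input is a 1x1 pixel GIF.

```php
<?php
/**
 * Hypothetical helper: decodes a base64 payload and returns the image bytes
 * with a file extension, or NULL if the data is not an allowed image.
 */
function decodeImagePayload(string $payload): ?array {
  // Strict mode returns FALSE on characters outside the base64 alphabet.
  $data = base64_decode($payload, TRUE);
  if ($data === FALSE || $data === '') {
    return NULL;
  }

  // Detect the type from the bytes themselves, not from the declared type.
  $finfo = new \finfo(FILEINFO_MIME_TYPE);
  $mimeType = $finfo->buffer($data);

  // Assumed allow-list, mirroring the image types accepted above.
  $extensions = [
    'image/png' => 'png',
    'image/jpeg' => 'jpg',
    'image/gif' => 'gif',
  ];
  if (!isset($extensions[$mimeType])) {
    return NULL;
  }

  return [
    'data' => $data,
    'mime' => $mimeType,
    'extension' => $extensions[$mimeType],
  ];
}

// A 1x1 pixel GIF, used here only as sample input.
$gif = 'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
$result = decodeImagePayload($gif);
$rejected = decodeImagePayload(base64_encode('just some text'));
```

Mapping the detected type to an extension through an array also avoids saving JPEG files with a mixture of "jpg" and "jpeg" extensions.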

If we have a file we want to process then the next step is to create a file from that data. Drupal has a writeData() method in the 'file.repository' service that takes in raw binary file data and will write this to a managed file. A managed file in Drupal is one that has a record in the database that points to the file, which is how most files in content should be managed.

// Here $directory and $id are assumed to be supplied by the calling code.
$fileName = $directory . '/' . $id . '-' . $count . '.' . $mime_match[1];

/** @var \Drupal\file\FileInterface $file */
$file = \Drupal::service('file.repository')->writeData($fileData, $fileName, \Drupal\Core\File\FileSystemInterface::EXISTS_REPLACE);

if (!$file) {
  throw new \Drupal\Core\File\Exception\FileNotExistsException('Could not create the file ' . $fileName);
}

The result of this operation is a new object that contains information about the file. Using this object we can then replace the attributes of the image with our new data. We also add in the 'data-entity-uuid' and 'data-entity-type' attributes to the image element so that the new file can be referenced by the element correctly.

// Update the img src and add needed attributes.
$image->setAttribute('src', $file->createFileUrl());
$image->setAttribute('data-entity-uuid', $file->uuid());
$image->setAttribute('data-entity-type', 'file');
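Here is a small standalone sketch of the same attribute update, showing that a change made through the node list shows up when the document is serialised. The URL and UUID are hypothetical values standing in for the results of $file->createFileUrl() and $file->uuid().

```php
<?php
$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML(
  '<p><img alt="x" src="data:image/png;base64,AAAA"></p>',
  LIBXML_NOERROR | LIBXML_NOWARNING
);

foreach ($dom->getElementsByTagName('img') as $image) {
  // Hypothetical values standing in for the managed file's URL and UUID.
  $image->setAttribute('src', '/sites/default/files/example-1.png');
  $image->setAttribute('data-entity-uuid', '00000000-0000-0000-0000-000000000000');
  $image->setAttribute('data-entity-type', 'file');
}

// The elements in the list belong to $dom, so the serialised document
// contains the new attributes.
$html = $dom->saveHTML();
```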

As all of the objects present here are passed by reference this also changes the image attributes for the element in the main DOMDocument. Once the loop finishes then we just need to reconstruct the document with the new data.

As it happens, recreating the document with the new data isn't quite as straightforward as just calling saveHTML(). This is because the original DOMDocument we created now contains a body tag and other elements needed for an HTML document to exist. What we actually need to do is create a block of HTML without all of that information in place since that isn't important to the content we are processing.

I had a few problems figuring this out, but what we essentially need to do is create another DOMDocument object and then feed the child elements of the body element into that object. Once that is complete we can then convert the second DOMDocument into an HTML string.

// The DOM document currently contains the doctype, head and body elements.
// To remove these we create a mock document and copy the children
// of the body element into it.
$mock = new \DOMDocument('1.0', 'UTF-8');
$mock->encoding = 'UTF-8';
$mock->substituteEntities = TRUE;

$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $mock->appendChild($mock->importNode($child, TRUE));
}
}

// Convert mock HTML document back into HTML.
$content = trim($mock->saveHTML());
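To see the difference between the two outputs, the sketch below wraps the same technique in a function and compares it with a plain saveHTML() call. The bodyInnerHtml() name is hypothetical, and the NULL check is an extra guard I have assumed is worth having for content that produces no body element.

```php
<?php
/**
 * Hypothetical helper: returns the markup inside the body element without
 * the doctype, html, head or body wrappers that loadHTML() adds.
 */
function bodyInnerHtml(\DOMDocument $dom): string {
  $body = $dom->getElementsByTagName('body')->item(0);
  if ($body === NULL) {
    // No body element was produced, so there is nothing to return.
    return '';
  }

  $mock = new \DOMDocument('1.0', 'UTF-8');
  foreach ($body->childNodes as $child) {
    // importNode() makes a deep copy of the node for the new document.
    $mock->appendChild($mock->importNode($child, TRUE));
  }
  return trim($mock->saveHTML());
}

$dom = new \DOMDocument('1.0', 'UTF-8');
$dom->loadHTML(
  '<?xml version="1.0" encoding="UTF-8"?><p>Hello <img alt="x" src="/a.png"></p>',
  LIBXML_NOERROR | LIBXML_NOWARNING
);

// Full document, including the wrappers added by the parser.
$full = $dom->saveHTML();
// Only the content that was originally passed in.
$fragment = bodyInnerHtml($dom);
```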

Once we save this content back into the database the base64 encoded file will have been replaced by a managed file.

Conclusion

Base64 encoding files can have some benefits in certain situations. Most of the time, however, you are just going to create performance problems.

I have collected together the above code into a Drupal service class as a GitHub gist that you can use to extract base64 encoded files from your content into managed Drupal files. The only thing to do here is plug this into your existing site as a normal Drupal service.

What you also need to write are mechanisms around this code to extract base64 encoded files from content blocks. This can be done in two ways.

  • Using a Drush command (or similar) run through the content and find files that are embedded in content like this. The service can then be used to convert those files and re-save the content.
  • Using a hook to intercept the saving of new content and automatically extracting any base64 encoded data into files using the service. This ensures that the problem doesn't start happening again once the initial fix has been performed.
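Whichever mechanism you pick, it can help to skip content that clearly has nothing to extract before building a DOMDocument for it. The following is a minimal sketch of such a pre-check; the function name and the sample rows are hypothetical, and a match only means the content is worth parsing, not that it definitely contains an embedded image.

```php
<?php
/**
 * Hypothetical pre-check: does this markup appear to contain base64 data?
 *
 * Cheap enough to run on every save or on every row of a batch.
 */
function mightContainBase64Image(string $content): bool {
  // Every base64 data URI contains this marker.
  return stripos($content, ';base64,') !== FALSE;
}

// Made-up rows keyed by a content ID, standing in for loaded field values.
$rows = [
  10 => '<p>Plain text only.</p>',
  11 => '<p><img alt="x" src="data:image/png;base64,AAAA"></p>',
  12 => '<p><img alt="y" src="/sites/default/files/images/blank.png"></p>',
];

// Keep only the IDs that need to go through the full extraction.
$toProcess = array_keys(array_filter($rows, 'mightContainBase64Image'));
```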

Whilst writing the above code isn't totally straightforward, I spent most of my time figuring out the best mechanism to extract the files and return usable content from the service. As a result I will leave the other components here as an exercise for the user.

I should note that as well as running this service through tens of thousands of pages of content, I also wrote a few unit tests around it. As such I'm pretty confident that it works as it was intended to.

I hope this post has been of some help to you. If you find that your site is slow due to this problem and need help running this code then please get in touch. If there is enough interest I will create a second article looking at performing these actions.

Comments

Thanks for taking the time to write this article. I haven't run into this issue yet and I was not even sure how would one even approach a problem like this. Definitely an eye opener.

@Chris Hill - Thanks. I'll look into it :)

@Bernardo Martinez - Thanks very much for reading!

Philip Norton