Remove the BOM (Byte Order Mark) from a String in PHP

This article discusses how to remove the BOM (Byte Order Mark) from a string in PHP.

Background

The Byte Order Mark (BOM) is the Unicode character U+FEFF. At the start of a UTF-16 or UTF-32 file or stream it signals the endianness (byte order). In UTF-8, where byte order is not an issue, it serves only as an optional signature that some editors add when saving a file. In PHP, when dealing with strings read from such files, you might find a BOM at the beginning of the string, which can interfere with further processing or display. In UTF-8 the BOM is encoded as the three bytes 0xEF 0xBB 0xBF. If you have a string with a BOM, such as "\xEF\xBB\xBFThis is a sample string" (written as a PHP double-quoted literal), you’ll want to remove the BOM to get "This is a sample string".
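To make the problem concrete, here is a minimal sketch showing that the BOM is invisible when printed but still part of the string. The variable names are illustrative only and the sample text is the one used in this article.

```php
<?php
// Illustrative sample: a UTF-8 BOM followed by visible text.
$withBom = "\xEF\xBB\xBFThis is a sample string";
$plain   = "This is a sample string";

// The BOM does not show up on screen, but it adds three bytes.
echo strlen($plain), "\n";                   // 23
echo strlen($withBom), "\n";                 // 26

// Inspect the first three bytes as hexadecimal.
echo bin2hex(substr($withBom, 0, 3)), "\n";  // efbbbf

// Comparisons fail because the byte sequences differ.
var_dump($withBom === $plain);               // bool(false)
```

This is why a value that looks correct on screen can still fail an equality check or break a parser that expects the text to start at the first byte.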

Solution: Using preg_replace()

To remove the BOM from a string, you can use preg_replace() with a regular expression that matches the BOM bytes at the start of the string.

Here is the complete example:

<?php
$originalString = "\xEF\xBB\xBFThis is a sample string";
// Regular expression that removes a UTF-8 BOM at the start of the string
$cleanString = preg_replace('/^\xEF\xBB\xBF/', '', $originalString);

// Display the result
echo $cleanString;
?>

Output

This is a sample string

In this code snippet, preg_replace('/^\xEF\xBB\xBF/', '', $originalString) finds the BOM (the bytes \xEF\xBB\xBF) and replaces it with an empty string. The ^ anchor restricts the match to the start of the string, so the BOM is removed if it exists and the string is returned unchanged otherwise.

Additional Considerations

  • Conditional Removal: It’s often a good idea to check if the BOM exists before trying to remove it. This can be done using substr() and comparing the beginning of the string with the BOM bytes.
  • Different Encodings: Be aware that BOMs differ between encodings (e.g., UTF-16 and UTF-32 have different BOMs). The above solution is specific to UTF-8. For other encodings, the BOM bytes will be different.
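The two considerations above can be sketched as follows. The function names removeUtf8Bom() and detectBom() are hypothetical helpers introduced for illustration, not built-in PHP functions. The BOM byte sequences for UTF-16 and UTF-32 are the standard encodings of U+FEFF.

```php
<?php
// Hypothetical helper: strips a leading UTF-8 BOM only if one is present.
function removeUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    // Compare the first three bytes with the BOM before removing anything.
    if (substr($text, 0, 3) === $bom) {
        return substr($text, 3);
    }
    return $text;
}

// Hypothetical helper: reports which encoding's BOM the string starts with,
// or null if none is found.
function detectBom(string $text): ?string
{
    // Longer sequences are listed first, because the UTF-32 LE BOM
    // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($text, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

echo removeUtf8Bom("\xEF\xBB\xBFThis is a sample string"), "\n";
echo detectBom("\xEF\xBB\xBFThis is a sample string"), "\n";
```

One caveat with this sketch: a UTF-16 LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE, so treat the result as a hint rather than proof of the encoding.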

Summary

Removing the BOM from a string in PHP is important for data processing and display, especially when working with files encoded in UTF-8. Using preg_replace() with a regular expression that specifically targets the BOM provides a reliable way to clean up your strings and ensure they’re free from this potentially troublesome character. Remember, though, to consider the encoding of your text data, as different encodings have different BOMs.
