Removing Non-UTF-8 Characters

While generating a PDF from a dynamically created HTML file, I found that the PDF generation failed because the HTML file contained non-UTF-8 characters.

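To find these characters, I used the strings command, which prints only runs of printable characters. The -n 8 switch sets the minimum run length to 8 characters, so the output leaves out the non-printable bytes (and any printable run shorter than 8 characters):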

cat original.html | strings -n 8 > nonUTF.html

I was then able to compare the two HTML files (for example with diff) to find out where the non-UTF-8 characters were appearing.
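
The strip-and-compare idea can also be sketched with iconv in place of strings. This is an alternative, not the command I used above. The function name show_invalid_utf8_lines is made up for illustration, and the sketch assumes an iconv that supports the -c switch (GNU libc and libiconv both do), which omits byte sequences it cannot convert. Because both the source and target encodings are UTF-8, only malformed bytes are dropped, and valid non-ASCII text is kept.

```shell
# Hypothetical helper: show the lines of a file that contain invalid UTF-8.
# iconv -c drops byte sequences that are not valid in the source encoding;
# converting UTF-8 to UTF-8 therefore removes only the malformed bytes.
# diff then compares the original file against the cleaned stream on stdin.
show_invalid_utf8_lines() {
    iconv -c -f UTF-8 -t UTF-8 "$1" | diff "$1" -
}
```

Running show_invalid_utf8_lines original.html prints a normal diff listing of every line that changed, with its line number, and prints nothing if the file is already valid UTF-8.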
