Grepping for binary data

March 31, 2009 | Technical

I was dealing with an interesting content-encoding issue yesterday for a customer’s website. They’re adamant that the problem started a few weeks ago after a routine database restoration, but we beg to differ. In any case, the customer’s site was displaying “funny characters” here and there, the classic symptom of an encoding failure. I’ve written about this before, as it relates to MySQL’s handling of character encoding, but it’s not MySQL’s problem alone.

In this case, the content coming from the database and CMS was proper UTF8, but there were dodgy characters leaking into the rendered page. I knew these would be coming from template files in the user’s account, but how to find them? I could find an instance here and there by searching for nearby strings, but I needed to nail all of them, and I don’t know every page on the customer’s site.

After playing with things a bit, and with the aid of iconv, I determined that one of the bad characters was the trademark symbol (™), and that it was encoded as Windows-1252. As we all know, Windows loves standards, so of course it makes sense that developers would be saving files with this encoding.
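A minimal sketch of that kind of check, not necessarily the exact commands I ran. It assumes an iconv that recognises the encoding name WINDOWS-1252 (glibc and libiconv both do) and uses od to show the resulting bytes. The variable name tm_utf8_hex is just for illustration.

```shell
# Byte 0x99 (octal 231) is the trademark sign in Windows-1252.
# Converting that single byte to UTF-8 should give the three-byte
# sequence e2 84 a2, which is U+2122 TRADE MARK SIGN.
tm_utf8_hex=$(printf '\231' | iconv -f WINDOWS-1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$tm_utf8_hex"
```

If the output is e284a2, the stray byte really was a Windows-1252 trademark sign rather than something else.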

But how to find the other files containing this non-UTF8 trademark character? I can’t easily paste the dodgy character into my UTF8 terminal, but I do know that the trademark character is represented in hex as 0x99. One idea I considered was using hexdump on all HTML and PHP files (did I mention this developer had put PHP in .html files?) and then grepping the textual output for ' 99 ', but this is very messy and takes no advantage of grep’s powerful capabilities.

Then our tech director suggested using echo – grep will happily search for arbitrary bytes, you just need to get them into the search pattern. A few keystrokes later we had a very nice solution.

Compare:

[ciel@phantomhive public_html]$ for i in *.html ; do hexdump -C "$i" | grep -q ' 99 ' && echo "$i" ; done
london.html
baker-st.html
monarch.html

With this:

[sebastian@phantomhive public_html]$ find . -name '*.html' -print0 | xargs -0 grep -l `echo -en '\x99'`
./london.html
./baker-st.html
./monarch.html
./dire/header.html
./includes/header.html

Fantastic! You’ll notice that the latter form also finds relevant files in subdirectories, something which the naive version simply doesn’t do.
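For what it’s worth, here is a more portable variation on the same idea, offered as a sketch rather than what we actually ran. It swaps echo -en for printf with an octal escape (which POSIX specifies), and forces the C locale so that grep compares raw bytes instead of trying to decode UTF-8. The function name find_byte and its argument order are made up for this example.

```shell
# Hypothetical helper: list files under a directory that contain a given byte.
#   $1 = directory to search
#   $2 = the byte, written in octal (231 is 0x99)
#   $3 = filename glob, e.g. '*.html'
find_byte() {
    # Build the one-byte search pattern; printf understands \NNN octal escapes.
    pattern=$(printf "\\$2")
    # LC_ALL=C makes grep treat the input as plain bytes.
    # -l prints only the names of files that contain a match.
    find "$1" -type f -name "$3" -exec env LC_ALL=C grep -l -e "$pattern" {} +
}
```

Running find_byte . 231 '*.html' from public_html should list the same kind of files as the find/xargs pipeline above, subdirectories included.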

Astute readers might also recognise that echo shouldn’t have worked there; the manpage for echo only mentions printing arbitrary bytes from octal notation. The example here works because we’re actually invoking the shell’s own builtin echo command, which accepts backslash-escaped hex bytes. If this isn’t an option or your shell’s builtin echo is deficient, you’ll have to convert those bytes to octal manually (you can use the ascii command’s character chart for this). Or do you?

Google to the rescue! I use Google as a calculator all the time, but serendipitously discovered that it’ll do simple base conversions as well.

Ask a simple question, get a simple answer
0x99 = 0o231
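The same conversion can also be done without leaving the terminal. This is a small sketch that relies on printf accepting a 0x-prefixed argument for the %o conversion, which bash and other POSIX-style shells do. The helper name hex_to_octal is invented here.

```shell
# Convert a hex value such as 0x99 to octal using the shell's printf.
hex_to_octal() {
    printf '%o\n' "$1"
}

hex_to_octal 0x99   # prints 231
```

The result can be dropped straight into an octal escape such as '\231' for an echo or printf that doesn’t understand hex.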