
A commit message from our repo

Mal Pinder, one of our Developers, shares a commit message made a few weeks ago in the code repository for our main Ruby on Rails application. As well as being an interesting story about debugging a strange problem with user input, it’s also an example of our policy of writing good commit messages.

The diff of the commit itself was a very small change – only a few lines of code – but we think writing in-depth commit messages helps us understand our code better. Commit messages that detail why code was introduced, how it works, and what alternatives were considered form a very useful reference for future developers (including our future selves) who want to make changes to it.

It’s been edited for formatting, and some links to our internal sites and Git history have been expanded in place so that you can understand the references, but otherwise it’s unchanged.

Scrub invalid UTF-8 bytes from comment text params

TL;DR Add a concern to all comment controllers that scrubs the 
comment text parameter, replacing any invalid bytes with a 
replacement character.

Some time ago we had a couple of errors reported in Honeybadger 
with the message “ArgumentError: invalid byte sequence in UTF-8”
when saving a comment [1].

UTF-8 represents characters using between 1 and 4 bytes, but not 
all combinations of bytes map onto valid characters. [2] These are 
called ‘invalid byte sequences’.
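
For example, you can see the variable byte widths directly in
Ruby:

>> "a".bytes
=> [97]
>> "é".bytes
=> [195, 169]
>> "\u{1F602}".bytes
=> [240, 159, 152, 130]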

In our stack, user input is passed through to Rack, which turns
it into a params hash, then the Rails controller uses those
params to create the comment. As Rack doesn’t care about the data
passed to it, the invalid string makes it all the way down to the
model layer. The first comment validation checks that the text
attribute is not blank; at this point Ruby checks whether the
string is valid UTF-8, and raises the error.
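
As a minimal illustration (this is not our validation code, just
the same kind of check), matching any regular expression against
a UTF-8 string containing those bytes raises exactly this error:

>> text = "some comment \xED\xA0\xBD"
=> "some comment \xED\xA0\xBD"
>> text =~ /\A\s*\z/
ArgumentError: invalid byte sequence in UTF-8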

Because Honeybadger can’t display the invalid bytes, we didn’t 
have enough information to solve the error right then, so we added 
some extra logging and waited for it to happen again. [3]

There were no errors for a while, but then a very popular course 
opened, and we suddenly got quite a few more, up to a grand total 
of 25. From looking at the additional logs we’d added, we could 
get some more information about all these errors:

- Most of them had user agents indicating browsers on older 
  versions of Android
- The invalid bytes always appeared after words or phrases, 
  usually at the end of the comment, and the text read fine 
  without them
- All but one had a flash alert reading "Your comment could not be
  created: Text must not contain unsupported characters"
- The invalid bytes were always “%ED%A0%BD”

The Flash Alert

All text in our database is stored in UTF-8, but for various 
reasons [4], we can’t save four-byte UTF-8 characters, such as 
emoji. Therefore we have a `UnicodeCharactersValidator` class, 
which makes sure any text entered into our database doesn’t 
contain them.

This validator gets run before a comment is saved, but *after* the
strings are coerced and checked as valid UTF-8. It puts that error
message on the model, which the controller then turns into a flash
alert, which is shown back to the user (along with their comment).

The presence of this flash message means that the users must have
first submitted a comment with valid UTF-8, but at least one
four-byte character. It’s on the re-submission of the comment
that this error is raised.

Those Invalid Bytes

“%ED%A0%BD” is the url-safe hex encoding of the UTF-16 high 
surrogate “D83D”.

>> ((0xED & 0xF) << 12) + ((0xA0 & 0x3F) << 6) + (0xBD & 0x3F)
=> 55357
>> _.to_s(16)
=> "d83d"

In UTF-16, values in the range [D800, DBFF] are ‘high surrogates’
which need to be followed by a low surrogate in the range [DC00,
DFFF] to represent a character.

The first code point that would begin D83D is thus encoded as D83D 
DC00, giving:

>> 0x10000 + ((0xD83D - 0xD800) << 10) 
     + ((0xDC00 - 0xDC00) & 0x3FF)
=> 128000
>> _.to_s 16
=> "1f400"
or, Unicode Character 'RAT' (U+1F400)[5]. This is the beginning of 
a long block of codepoints assigned to emojis, including most of 
the common ones you can find on your phone keyboard. All the emoji 
in this range begin D83D when encoded as surrogate pairs.

However, the request body containing %ED%A0%BD is a string encoded 
as `application/x-www-form-urlencoded`, which works by encoding a 
string as UTF-8 and turning any multi-byte sequences (and some 
characters that need escaping like & and =) into %XX strings. If 
you put multi-byte characters into a form and post it, you will 
see a UTF-8 encoding of those characters arrive at the server - 
for instance, `U+1F602` would properly be encoded as 
`%F0%9F%98%82`, not as `%ED%A0%BD%ED%B8%82`.
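
You can check both encodings from Ruby; the UTF-16 bytes are the
surrogate pair D83D DE02, and the UTF-8 bytes are the four-byte
sequence the form should be sending:

>> "\u{1F602}".bytes.map { |b| "%02X" % b }
=> ["F0", "9F", "98", "82"]
>> "\u{1F602}".encode(Encoding::UTF_16BE).bytes.map { |b| "%02X" % b }
=> ["D8", "3D", "DE", "02"]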

It is not valid to send values corresponding to surrogate pairs
as UTF-8, because they are never needed there. Surrogate pairs
exist so that UTF-16 can represent codepoints that don’t fit in
two bytes; UTF-8 encodes those codepoints directly as four-byte
sequences instead. So ED A0 BD is an invalid byte sequence in
UTF-8.
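
Ruby will confirm this - the surrogate bytes are invalid when
tagged as UTF-8, while the proper four-byte UTF-8 encoding of
U+1F602 is fine:

>> "\xED\xA0\xBD".valid_encoding?
=> false
>> "\xF0\x9F\x98\x82".valid_encoding?
=> true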

Most browsers won't allow unpaired surrogates to be entered, but 
you can generate them using Javascript, which represents all 
astral symbols (4-byte characters) as two characters:

> s = String.fromCharCode(0xD83D)
'�'
> s.length
1

> s = String.fromCharCode(0xD83D, 0xDE02)
'😂'
> s.length
2
In Chrome, inserting `String.fromCharCode(0xD83D)` into a form and
submitting it sends Unicode Character 'REPLACEMENT CHARACTER'
(U+FFFD)[6] in its place. But some user agents might retain the
'character' and send it as `application/x-www-form-urlencoded`.

What we think is going on

We know users try to enter emoji in comments (we get occasional
UserVoice messages about it), but most see the warning, delete the
emoji, and post their comment.

Given the positioning of the invalid bytes (between phrases, or at 
the end) and the flash message, we think this problem is also due 
to people trying to use emoji.

However, something about these user agents is breaking the emoji 
into two 'characters' and only sending us the first part, rather 
than either deleting both parts, or replacing the remaining 
invalid bytes with the replacement symbol. This could be due to 
any number of reasons; perhaps they have a misbehaving browser, 
javascript plugin, or keyboard app, or their keyboard doesn’t 
support 4-byte characters but they work around it by copy-pasting 
emoji in. (It could also have been truncated due to a proxy or 
network error, but since the `content-length` header is correct, 
this is unlikely).

Because this is basically a problem caused by a misbehaving client
sending us an invalid request, I found it impossible to replicate 
using an actual device. However it is possible to write controller 
tests that approximate the situation, since it’s only our model 
layer that checks the string content.
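
For example, something along these lines (the controller and
param names here are placeholders in the controller-spec style,
rather than our real specs):

require "rails_helper"

RSpec.describe CommentsController, type: :controller do
  describe "POST #create" do
    it "does not raise an encoding error for invalid UTF-8 text" do
      # the same three bytes we saw in the logs, tagged as UTF-8
      bad_text = "Great week, thanks \xED\xA0\xBD"

      expect {
        post :create, comment: { text: bad_text }
      }.not_to raise_error
    end
  end
end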

How to fix this

Now that we knew what the actual problem was, I had several
options for how to proceed.

Semantically, since it’s invalid data that’s being sent, the page 
could respond with an HTTP `400` or `406`. However this would show 
an error page to the user, and discard the comment they’d tried to 
make.

In general it’s best to persist user input wherever possible; 
comments can be very long and/or represent a lot of effort on the 
user’s part, and it’s a terrible and frustrating experience to 
press submit and watch everything you wrote vanish.

Given that the best-guess for what’s happening is that users are
trying to delete these emoji anyway, and there’s no way to ‘guess’
what the other half of the surrogate pair should have been, I
decided it would be better to replace the offending bytes with
the Unicode replacement character, as this is what browsers
should be sending us anyway.

This means that users can post their comment without errors, and 
there’s a visual hint to them that something’s gone awry - they’ll 
be able to edit their comment to remove the replacement character 
if they want.
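
Ruby has a tool for exactly this kind of replacement:
`String#scrub`, which swaps invalid byte sequences for U+FFFD by
default (shown here on a made-up comment):

>> bad = "Loving this course \xED\xA0\xBD"
=> "Loving this course \xED\xA0\xBD"
>> bad.valid_encoding?
=> false
>> bad.scrub.valid_encoding?
=> true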

*ahem*

There are four controllers that handle comment text: one each for
creating replies, step comments, and pre-course discussion
comments, and one for editing comments. This commit adds a 
concern, `ScrubCommentText`, for use in all of them.

This concern uses a before filter on the `create` and `update` 
actions, which checks for the presence of comment text params, and 
replaces any invalid bytes with the Unicode Replacement Character 
� [6].
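
As a simplified sketch of the shape (the param keys and the
filter wiring here are illustrative rather than exactly what the
real controllers use):

module ScrubCommentText
  extend ActiveSupport::Concern

  included do
    before_action :scrub_comment_text, only: [:create, :update]
  end

  private

  # Replace any invalid byte sequences in the comment text param
  # with the Unicode replacement character, so nothing invalid
  # reaches the model layer.
  def scrub_comment_text
    text = params[:comment] && params[:comment][:text]
    return if text.nil? || text.valid_encoding?

    params[:comment][:text] = text.scrub("\uFFFD")
  end
end
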
[1] This was originally a link to Honeybadger, which we use for tracking errors.
[3] This was originally a link to the commit that added the tracking. What we did was pass the content of `request.env['rack.input']` to Honeybadger as additional ‘context’ when an `ArgumentError` of the right type is raised in any controller, and also a base64 encoded version, in case Honeybadger did any scrubbing of the bytes we sent it.
[4] This was originally a link to the commit that added the Validator, in which the reasoning is fully explained. In short, we still use MySQL’s `utf8` character set, which doesn’t support 4-byte UTF-8 characters, and there are also issues with display and moderation of emoji.

Want to know more about how we use Git? Read our post on how we tell stories in our commits.
