UTF-8 is a wonderfully simple encoding format with some very nice properties, but the juggling required to convert it to UTF-16 or UTF-32 can be a little tricky and is fairly easy to do poorly. This is further compounded by the various error conditions you must watch for, such as overlong encodings, reserved ranges, surrogate markers, incomplete sequences, and so on.
Here are a couple of tricks you can employ to keep the conversion fast and robust.
Tail Length Lookup
Our first trick is to use a lookup table indexed by the initial byte. This lets you both a) tell whether the byte is a valid leading byte (80 to bf and fe to ff are never valid, and neither are f5 to fd if you stop at U+10FFFF and don't want to handle the old 5 and 6 byte sequences) and b) determine the number of trailing bytes expected in the sequence. We also need the length of the sequence to quickly check that enough bytes are left in the input, and again for the other upcoming tricks, so this one lookup results in multiple wins.
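To make that concrete, here is a minimal sketch of such a table in C. The names `tail_length`, `init_tail_length` and `utf8_sequence_length` are made up for this example, and it assumes you reject f5 and above (no 5 or 6 byte sequences). Note that c0 and c1 are still given a tail length of 1 here; they can only start overlong encodings, which this sketch leaves to a separate overlong check.

```c
#include <stddef.h>
#include <stdint.h>

/* Trailing byte count for each possible leading byte, or -1 if the
 * byte cannot start a sequence. Hypothetical name, for illustration. */
static int8_t tail_length[256];

/* Fill the table once at startup. */
static void init_tail_length(void)
{
    for (int b = 0; b < 256; ++b) {
        if (b <= 0x7f)      tail_length[b] = 0;   /* ASCII, no tail */
        else if (b <= 0xbf) tail_length[b] = -1;  /* continuation byte */
        else if (b <= 0xdf) tail_length[b] = 1;   /* 2 byte sequence */
        else if (b <= 0xef) tail_length[b] = 2;   /* 3 byte sequence */
        else if (b <= 0xf4) tail_length[b] = 3;   /* 4 byte sequence */
        else                tail_length[b] = -1;  /* f5 to ff rejected */
    }
}

/* Return the total length of the sequence starting at p, or 0 if the
 * leading byte is invalid or the input ends before the sequence does.
 * The tail bytes themselves are NOT validated here. */
static size_t utf8_sequence_length(const unsigned char *p,
                                   const unsigned char *end)
{
    if (p >= end)
        return 0;
    int tail = tail_length[*p];
    if (tail < 0)
        return 0;
    if ((size_t)(end - p) < (size_t)tail + 1)
        return 0;
    return (size_t)tail + 1;
}
```

One table read gives you both the validity check and the length, and the "enough bytes left" test becomes a single comparison instead of one check per trailing byte.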