Thursday, January 8, 2015

C# String and UTF-8 Encoding

A few days back I came across a very interesting fact about C# strings and what they actually contain in memory. Telling strings apart by their source encoding is not straightforward, because the Visual Studio debugger displays two strings identically even when their underlying characters differ. In normal scenarios this can be ignored, but it plays a vital role in programs that deal with strings from different sources. The input can come from a user, a file, a stream, etc., and it can be in any encoding. So if the program performs string operations based on an assumed encoding, that is where the problems start.

To understand what C# actually stores in a string, let us take an example.
Create a text file in Notepad and save it with UTF-8 encoding.
The code below opens this file and reads its content into a string. As you can see, the debugger shows the string exactly as it is written in the file.
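The original code was a screenshot, so here is a minimal sketch of what it might have looked like. It assumes the file contains the text "Test" (the post only says it has 4 characters), that it was saved with a UTF-8 byte order mark, and that the bytes are decoded with Encoding.UTF8.GetString; the class name, method names and file name are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomReadDemo
{
    // Writes a small UTF-8 file that starts with the BOM bytes EF BB BF,
    // followed by the assumed content "Test".
    public static string CreateSampleFile()
    {
        string path = Path.Combine(Path.GetTempPath(), "utf8-sample.txt"); // hypothetical file name
        byte[] bytes = { 0xEF, 0xBB, 0xBF, 0x54, 0x65, 0x73, 0x74 };
        File.WriteAllBytes(path, bytes);
        return path;
    }

    // Decodes the raw bytes directly. Encoding.UTF8.GetString does not
    // strip the BOM, so it survives as the first character of the string.
    public static string ReadRaw(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        string text = ReadRaw(CreateSampleFile());
        Console.WriteLine(text.Length); // 5
    }
}
```

Note that not every way of reading a file behaves like this. File.ReadAllText and StreamReader detect the BOM and drop it, so the extra character mostly shows up when bytes are decoded by hand.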

So there are 4 characters in the string. Let us look at the memory by adding the string to the Watch window and converting it to a char array. It shows 5 characters instead of 4.
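The same check can be done without the debugger. This is a small sketch that prints every character of the string as a code point; the sample content "\uFEFFTest" and the helper name Describe are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class CharDump
{
    // Formats each UTF-16 code unit of the string as U+XXXX.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.ToCharArray())
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string text = "\uFEFFTest"; // assumed content: an invisible first char plus "Test"
        Console.WriteLine(text.ToCharArray().Length); // 5
        Console.WriteLine(Describe(text)); // U+FEFF U+0054 U+0065 U+0073 U+0074
    }
}
```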

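Now the question is: what is the extra character? It is the byte order mark (BOM), U+FEFF. Notepad's UTF-8 option writes it at the start of the file as the three bytes EF BB BF, and it serves as a signature that tells readers the file is UTF-8. A .NET string itself is always UTF-16 in memory, so when the file's bytes are decoded without stripping the BOM, it ends up as an invisible first character of the string.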
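To see the marker at the byte level, here is a rough sketch that asks .NET for the UTF-8 preamble and checks whether a byte array starts with it. The helper name HasUtf8Bom and the sample bytes are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class BomInfo
{
    // Returns true when the buffer starts with the UTF-8 preamble (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (bytes.Length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (bytes[i] != bom[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF

        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x54, 0x65, 0x73, 0x74 }; // BOM + "Test"
        byte[] withoutBom = { 0x54, 0x65, 0x73, 0x74 };                // "Test" only
        Console.WriteLine(HasUtf8Bom(withBom));    // True
        Console.WriteLine(HasUtf8Bom(withoutBom)); // False
    }
}
```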

OK, so a string read from a UTF-8 file can carry an extra character in memory. So what? How does it matter to a developer? Well, it does when certain string operations are performed on it. Let's take String.StartsWith() as an example.
To the code above, add one more line that checks whether the string starts with "Te" using a case-insensitive comparison.

What will that StartsWith() check return? True or False?
I was expecting True. But the answer is False.
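The original line of code is not shown, so the sketch below assumes it used StringComparison.OrdinalIgnoreCase and that the string is "Test" with a leading BOM. The class and method names are made up.

```csharp
using System;

public static class StartsWithDemo
{
    // Case-insensitive ordinal check: compares characters by numeric value.
    public static bool StartsWithTeOrdinal(string s)
    {
        return s.StartsWith("Te", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string fromFile = "\uFEFFTest"; // assumed: content decoded with its BOM
        string plain = "Test";          // same visible text, no BOM

        Console.WriteLine(StartsWithTeOrdinal(fromFile)); // False
        Console.WriteLine(StartsWithTeOrdinal(plain));    // True
    }
}
```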

If you followed the memory representation above, I am sure you have figured out why the answer is False. With an ordinal comparison such as StringComparison.OrdinalIgnoreCase, .NET compares the strings character by character using their numeric values. Since the string has the extra U+FEFF character (the byte order mark) at the beginning, its first character is not "T", and the comparison fails.

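To make StartsWith() work correctly on a string that begins with a BOM, change the comparison to StringComparison.InvariantCultureIgnoreCase. A culture-sensitive comparison treats U+FEFF as an ignorable character and skips it. Alternatively, remove the BOM from the string before comparing.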
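Both options can be sketched as follows, again assuming the string is "Test" with a leading BOM. StripBom is a hypothetical helper built on TrimStart. No expected output is given for the culture-sensitive line, because its result depends on the runtime's globalization settings (for example, invariant globalization mode falls back to ordinal behaviour).

```csharp
using System;

public static class BomSafeStartsWith
{
    // Removes any leading U+FEFF characters so ordinal comparisons see the real text.
    public static string StripBom(string s)
    {
        return s.TrimStart('\uFEFF');
    }

    public static void Main()
    {
        string fromFile = "\uFEFFTest"; // assumed: content decoded with its BOM

        // Option 1: culture-sensitive comparison, which normally ignores U+FEFF.
        Console.WriteLine(fromFile.StartsWith("Te", StringComparison.InvariantCultureIgnoreCase));

        // Option 2: strip the BOM once, then any comparison type works.
        string cleaned = StripBom(fromFile);
        Console.WriteLine(cleaned.Length); // 4
        Console.WriteLine(cleaned.StartsWith("Te", StringComparison.OrdinalIgnoreCase)); // True
    }
}
```

Stripping the BOM right after reading is probably the safer choice, since later operations such as Length, IndexOf or equality checks then see the same 4 characters the debugger shows.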

Hope this was useful.
