Text Encoding: ASCII, UTF-8, URL & Base64

"Encoding" gets used for several different things, and mixing them up is behind a surprising share of garbled text, rejected form fields and broken URLs. This guide walks up the stack: from how a single English letter is stored, to how every script on Earth fits into bytes, to two encodings that exist for transport rather than language.

the foundation

ASCII: 128 characters, one byte each

In the beginning there was ASCII. It assigns a number to 128 characters — the uppercase and lowercase English letters, the digits, common punctuation, and some invisible control codes. A is 65, a is 97, 0 is 48. Because 128 values fit in 7 bits, each character takes a single byte with room to spare.

For English text that's the whole story, and it's why a plain English sentence has the same character count and byte count. The trouble starts the moment you need an é, a £, a 中 or a 😀 — none of which ASCII can represent.
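
As a small illustration of the points above, here is a Python sketch that looks up ASCII numbers and compares character and byte counts. The helper name is_ascii_encodable is made up for this example and is not part of any library.

```python
# ASCII assigns each character a number in the range 0-127.
for ch in "Aa0":
    print(ch, ord(ch))  # A 65, a 97, 0 48

sentence = "Plain English text"
ascii_bytes = sentence.encode("ascii")

# For pure ASCII text, character count and byte count are equal.
print(len(sentence), len(ascii_bytes))  # 18 18


def is_ascii_encodable(text):
    """Return True if every character in text exists in ASCII (hypothetical helper)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(is_ascii_encodable("cafe"))       # True
print(is_ascii_encodable("caf\u00e9"))  # False: U+00E9 is outside ASCII
```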

going global

Unicode and UTF-8: every character, variable bytes

Unicode solves the coverage problem by giving every character in every writing system a unique number called a code point. The code space has room for just over 1.1 million of them (U+0000 through U+10FFFF), of which roughly 150,000 are assigned to characters so far. But a code point is just a number; you still have to decide how to store it as bytes. That's what an encoding does, and UTF-8 is the one that won the web.

UTF-8 is clever in two ways. First, it's variable-width: a character takes one to four bytes depending on how large its code point is. Second, it's backward-compatible — the one-byte range is exactly ASCII, so every old ASCII file is already valid UTF-8.

Character	Code point	UTF-8 byte count
`A`	U+0041	1
`é`	U+00E9	2
`中`	U+4E2D	3
`😀`	U+1F600	4
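
The table can be reproduced with a few lines of Python. This is only a sketch: utf8_info is a name invented here, and the characters are written as escapes so the example does not depend on how the source file itself is saved.

```python
# The four sample characters from the table, written as escapes.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]


def utf8_info(ch):
    """Return (code point label, UTF-8 byte count, bytes as hex) for one character."""
    encoded = ch.encode("utf-8")
    return (f"U+{ord(ch):04X}", len(encoded), encoded.hex(" "))


for ch in samples:
    print(ch, *utf8_info(ch))
# A U+0041 1 41
# é U+00E9 2 c3 a9
# 中 U+4E2D 3 e4 b8 ad
# 😀 U+1F600 4 f0 9f 98 80
```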

This is exactly why a database column limited to 255 bytes can reject text that looks shorter than 255 characters, and why one emoji can push an SMS over a segment boundary: it forces the whole message out of the compact 7-bit alphabet into a 16-bit encoding, which cuts a single segment from 160 characters to 70. The unit that matters for storage and transport is the encoded size, not the character you see. You can watch the gap between the two on the character counter, which reports the live UTF-8 byte size next to the character count.
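
When a limit really is in bytes, one option is to check or trim against the encoded size rather than the character count. The following is a minimal sketch under that assumption; truncate_utf8 is a hypothetical helper, not a standard function.

```python
def truncate_utf8(text, max_bytes):
    """Trim text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave a partial character at the end;
    # errors="ignore" drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


emoji_text = "\U0001F600" * 64          # 64 characters, but 256 bytes
print(len(emoji_text))                   # 64
print(len(emoji_text.encode("utf-8")))   # 256

trimmed = truncate_utf8(emoji_text, 255)
print(len(trimmed))                      # 63
print(len(trimmed.encode("utf-8")))      # 252
```

Trimming silently is a design choice; rejecting the input with a clear message may suit a form field better.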

Gotcha: mojibake. If text was written as UTF-8 but read back as some other encoding (or vice versa), you get mojibake: café turning into cafÃ©. The bytes are intact; they're just being interpreted with the wrong decoder. The fix is to make the writer and reader agree on one encoding, ideally UTF-8, not to "find and replace" the broken characters.
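
The café example can be reproduced in a few lines. This sketch assumes the wrong decoder was Latin-1; other legacy encodings produce different garbage from the same bytes.

```python
original = "caf\u00e9"             # café
wire = original.encode("utf-8")    # b'caf\xc3\xa9': the writer used UTF-8

# The reader guesses wrong and decodes the same bytes as Latin-1.
garbled = wire.decode("latin-1")
print(garbled)                     # cafÃ©

# Nothing was lost: undoing the wrong decode recovers the original bytes,
# which can then be read with the decoder the writer intended.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)        # True
```

The round trip works here only because no bytes were dropped or replaced along the way, which is why fixing the decoder at the source is the more reliable route.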

encodings for transport

URL encoding: making text safe for a link

A URL may only contain a restricted set of ASCII characters. Anything outside that set, such as a space or a non-English letter, has to be written as a percent sign followed by its byte value in hexadecimal, one escape per byte. The same goes for a reserved character like an ampersand when it appears as data rather than as a delimiter. A space becomes %20; & becomes %26; 中, which is three bytes in UTF-8, becomes %E4%B8%AD.
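
To make the mechanism concrete, the sketch below percent-encodes by hand and compares the result with Python's urllib.parse.quote. The function percent_encode_all is invented for illustration and escapes every byte, which a real encoder would not do for letters and digits.

```python
from urllib.parse import quote


def percent_encode_all(text):
    """Write every UTF-8 byte of text as %XX (illustrative only)."""
    return "".join(f"%{b:02X}" for b in text.encode("utf-8"))


print(percent_encode_all(" "))       # %20
print(percent_encode_all("\u4e2d"))  # %E4%B8%AD

# The standard library does the same for unsafe characters
# and leaves letters and digits alone.
print(quote(" "))                    # %20
print(quote("\u4e2d"))               # %E4%B8%AD
print(quote("A"))                    # A
```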

The subtlety is how much to encode. Encoding a single value that you're dropping into a query string should escape structural characters like /, ? and & (in JavaScript, encodeURIComponent). Encoding a whole URL should leave those alone so the link still works (encodeURI). Use the wrong one on a value and an embedded & will split your query string in two. You can try both modes on the URL encoder.
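
The same distinction can be sketched in Python, where urllib.parse.quote takes a safe argument listing the characters to leave alone. The two wrappers below are rough analogues of the JavaScript functions, not exact equivalents, and example.com stands in for a real address.

```python
from urllib.parse import quote, parse_qs, urlsplit


def encode_component(value):
    """Escape everything structural, roughly like encodeURIComponent."""
    return quote(value, safe="")


def encode_whole_url(url):
    """Leave URL delimiters intact, roughly like encodeURI."""
    return quote(url, safe=":/?&=#")


value = "Tom & Jerry/1"
base = "https://example.com/search"

# Right: encode the value on its own, then assemble the URL.
right = base + "?q=" + encode_component(value) + "&page=2"
print(right)  # https://example.com/search?q=Tom%20%26%20Jerry%2F1&page=2

# Wrong: whole-URL mode keeps the embedded & as a delimiter.
wrong = encode_whole_url(base + "?q=" + value + "&page=2")
print(wrong)  # https://example.com/search?q=Tom%20&%20Jerry/1&page=2

print(parse_qs(urlsplit(right).query))  # {'q': ['Tom & Jerry/1'], 'page': ['2']}
print(parse_qs(urlsplit(wrong).query))  # {'q': ['Tom '], 'page': ['2']}
```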

Base64: binary as plain text

URL encoding handles awkward characters; Base64 handles awkward bytes. It maps arbitrary binary data onto 64 safe characters (A–Z, a–z, 0–9 and two symbols) so it can ride through channels that only expect text — a PNG embedded in CSS as a data URI, a file attached to an email, a blob tucked inside JSON.

Two things to remember. Base64 makes data about 33% larger, because it spends four output characters on every three input bytes. And it is not encryption: anyone can decode it instantly, so it hides nothing. There's a URL-safe variant (used by JWTs) that swaps the two symbols for - and _ and usually drops the = padding. The Base64 tool does both variants with full Unicode support.
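
Both points, and the URL-safe variant, can be checked with Python's base64 module. The input bytes here are chosen only because they happen to produce both of the special symbols.

```python
import base64

raw = bytes([0xFB, 0xFF])  # two bytes picked to yield '+' and '/'

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(standard)              # +/8=
print(url_safe)              # -_8=
print(url_safe.rstrip("="))  # -_8  (padding dropped, as JWTs do)

# Size overhead: every 3 input bytes become 4 output characters.
payload = bytes(300)
encoded = base64.b64encode(payload)
print(len(payload), len(encoded))  # 300 400

# Not encryption: decoding needs no key at all.
print(base64.b64decode(standard) == raw)  # True
```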

how the layers stack

A real example: a JWT

A JSON Web Token shows three of these ideas at once. Its payload starts life as a JSON object. That text is encoded to bytes as UTF-8, those bytes are Base64url-encoded into an ASCII string, and the result is safe to sit in an HTTP header or a URL without further escaping. Decode one on the JWT decoder and you're unwinding that stack: Base64url back to bytes, bytes back to UTF-8 text, text back to JSON.
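
Here is a sketch of that stack for a single payload segment. It covers only the encoding layers, assumes a made-up set of claims, and does no signing or signature verification, so it is not a JWT implementation. The helper names b64url_encode and b64url_decode are invented for this example.

```python
import base64
import json


def b64url_encode(data):
    """Base64url-encode bytes and drop the = padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment):
    """Restore the padding that was dropped, then decode."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


# Hypothetical claims, including one non-ASCII character.
claims = {"sub": "user-123", "name": "Zo\u00eb"}

# Up the stack: JSON text -> UTF-8 bytes -> Base64url ASCII.
json_text = json.dumps(claims, ensure_ascii=False)
segment = b64url_encode(json_text.encode("utf-8"))
print(segment.isascii())  # True

# Down the stack: Base64url -> bytes -> UTF-8 text -> JSON.
decoded = json.loads(b64url_decode(segment).decode("utf-8"))
print(decoded == claims)  # True
```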

Seen this way, the layers aren't competing — they're stacked. Unicode decides which character; UTF-8 decides how that character becomes bytes; Base64 and URL encoding decide how those bytes survive a channel that's picky about what it carries.

the bugs this prevents

What to watch for

Byte vs character limits. "Max 160 characters" and "max 160 bytes" are different rules the moment non-ASCII appears. Check the byte count when a limit really means bytes.
Double-encoding. Encoding an already-encoded string turns the % of each existing escape into %25, so a space that should be %20 ends up as %2520. Encode once, at the right layer.
"Length" surprises. In JavaScript a single emoji such as 😀 reports a length of 2, because the string is measured in UTF-16 code units, not characters. Emoji built from several code points report more. Worth knowing before you validate input by length.
Treating Base64 as secure. It isn't. If something must be secret, encrypt it (or hash it, if you only ever need to compare it and never read it back); Base64 only changes the alphabet.
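
Two of these pitfalls are easy to reproduce. The sketch below uses Python, where len counts code points, so the UTF-16 figure that JavaScript would report is computed explicitly by a helper named utf16_length, which is invented for this example.

```python
from urllib.parse import quote, unquote

# Double-encoding: the % from the first pass is itself escaped as %25.
once = quote("a b", safe="")
twice = quote(once, safe="")
print(once)            # a%20b
print(twice)           # a%2520b
print(unquote(twice))  # a%20b  (one decode no longer gets back to "a b")


def utf16_length(text):
    """Count UTF-16 code units, the unit JavaScript's length is measured in."""
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F600"
print(len(emoji))                  # 1  code point
print(utf16_length(emoji))         # 2  UTF-16 code units
print(len(emoji.encode("utf-8")))  # 4  UTF-8 bytes
```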

None of this is exotic — it's the plumbing under every form field, API and URL you touch. Once the layers are clear, "why is my text garbled / too long / broken in the URL?" usually answers itself.