Pioneers @ KerryR.net  


ASCII and HTML - How They Work Together

ASCII assigns a decimal number to each letter of the alphabet (both upper- and lower-case), as well as to the digits 0 to 9 and most punctuation signs. Each of these decimal numbers is then converted into binary for processing, and the end result is a standard code for information exchange.

Symbol  Decimal  Binary
A       65       01000001
B       66       01000010
C       67       01000011
D       68       01000100
E       69       01000101
F       70       01000110
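
The table can be reproduced in a few lines of Python. This is a minimal sketch rather than anything from the original page; `ascii_row` is a hypothetical helper name, and the binary column is padded to eight digits to match the table.

```python
def ascii_row(symbol):
    """Return (symbol, decimal code, 8-digit binary string) for one character."""
    code = ord(symbol)                         # decimal number assigned to the character
    return symbol, code, format(code, "08b")   # zero-padded to 8 binary digits

for letter in "ABCDEF":
    symbol, decimal, binary = ascii_row(letter)
    print(f"{symbol:<8}{decimal:<9}{binary}")
```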

Well, okay, nearly standard. There are always some slight differences between platforms (trying to read Unix ASCII files on your standard MS-DOS PC can cause some interesting hiccups), but these are generally restricted to control characters.
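
The best-known example is the end-of-line marker: Unix files end each line with a single line feed (decimal 10), while MS-DOS files use a carriage return (decimal 13) followed by a line feed. A rough sketch of the difference, using made-up sample text and a hypothetical helper name `to_dos`:

```python
unix_text = b"first line\nsecond line\n"    # sample data, LF only

def to_dos(data):
    """Convert LF line endings to CR+LF, leaving existing CR+LF pairs alone."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

dos_text = to_dos(unix_text)
print(list(unix_text[-2:]))   # last two byte values of the Unix version
print(list(dos_text[-2:]))    # last two byte values of the DOS version
```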

. . . . . . . . . . . . . . . . . . . .

The original form of ASCII, as defined by the American Standards Association in the early 1960s, represented 128 characters - more than enough to cover the English alphabet and numerals, but not the accented letters that many other languages need.

Later 8-bit versions upped the ante to 256 to include some slightly esoteric characters and produce a more international fluency.
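
The jump from 128 to 256 is the jump from seven bits to eight. As a sketch, Python's built-in `ascii` and `latin-1` codecs (used here as stand-ins for the original set and one of its extensions) show where the boundary falls; `fits` is a hypothetical helper name.

```python
print(2 ** 7, 2 ** 8)    # number of codes available with 7 and with 8 bits

def fits(char, encoding):
    """Return True if the character can be encoded in the given character set."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for char in ("A", "é"):
    print(char, ord(char), fits(char, "ascii"), fits(char, "latin-1"))
```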

Languages such as Arabic and Urdu cause a bit of a problem, though. Arabic has 28 letters and Urdu rather more (not counting various vowels and diacritical signs) but, depending on its place in the word, each character can have up to four different shapes.

And then there are those languages that use ideographs instead of the more familiar alphanumeric symbols. Modern Chinese, for example, includes more than 50,000 picture-signs.

One solution is to divide the ideographs themselves into a number of common signs, or ‘strokes’, and to chain these strokes together to form the symbol - much as we chain letters to form words. Another notion is to use more than the traditional one byte to encode the characters - two bytes, for instance, can cover upwards of 65,000 ideographs.
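
The arithmetic behind the two-byte idea is short enough to check directly. The sketch below uses Python's `utf-16-be` codec purely as an example of a two-byte encoding; the text above does not name a particular one.

```python
one_byte = 2 ** 8      # distinct values in a single byte
two_bytes = 2 ** 16    # distinct values in two bytes
print(one_byte, two_bytes)

ideograph = "中"                            # a common Chinese character
encoded = ideograph.encode("utf-16-be")     # two bytes, no byte-order mark
print(len(encoded), encoded.hex())
```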

. . . . . . . . . . . . . . . . . . . .

Hypertext Markup Language (HTML), on the other hand, uses a character set defined by the International Organization for Standardization (ISO) - ISO-Latin-1, or ISO 8859-1.

ISO-Latin-1 (Latin: meaning derived from the Roman alphabet, and 1: for the first part of the ISO 8859 series) helps a browser interpret characters it might otherwise see as markup tags, and lets it support extended ASCII characters without having to actually support higher-order ASCII. It also produces some characters that don't appear in ASCII at all.

This character set uses two different types of strings to represent characters: Numeric Entities (e.g. &#34;) that use strings of numbers, and Character Entities (e.g. &quot;) that use letters to represent the characters or symbols.

It can look a little weird, with characters like &lt; and &#60; representing <, but it's fairly simple really when you break it down:

  • The & lets your browser know that the following characters are code, rather than just another string of letters,
  • If you throw in a # after the &, the browser looks for the numerical equivalent of the symbol,
  • The ; finishes off the string and lets your browser know the code is complete.
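
The three rules above can be checked with Python's standard `html` module, which decodes entities in much the same way a browser does. A small sketch:

```python
import html

# Named and numeric forms of the same character decode to the same result.
for entity in ("&lt;", "&#60;", "&quot;", "&#34;"):
    print(entity, "->", html.unescape(entity))

# Going the other way: escape() turns markup characters into entities.
print(html.escape('<a href="x">'))
```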

This can be a bit tricky. For instance, if you need to use the character Ä, you'll have to key in &Auml;, otherwise your Ä will turn into a Ž. Just to be interesting, this can also be defined by &#196;.
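
Both spellings of Ä can be checked against Python's entity table. The second half of the sketch shows one plausible way an unescaped Ä could end up as Ž; it assumes the file was typed on an MS-DOS machine using code page 437 and then read as Windows-1252, which is a guess and not something the text above states.

```python
import html
from html.entities import name2codepoint

print(name2codepoint["Auml"])                               # code point behind &Auml;
print(html.unescape("&Auml;") == html.unescape("&#196;"))   # both give the same character

# Assumption: raw byte written under code page 437, read back as Windows-1252.
raw = "Ä".encode("cp437")
print(raw.hex(), raw.decode("cp1252"))
```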

And then there are a few glitches (isn't it always the way?). Some ASCII characters, when used in HTML, present a symbol that isn't included in the ISO-Latin-1 character set (of course, they don't show up as the actual ASCII character either - that'd be too easy). For instance, the ASCII character Ö sometimes presents as a different symbol in HTML - depending on the browser.

You can see a few here (there are bound to be more) and check out the ASCII and HTML conversion tables below:

Conversion Tables

Control Characters        ASCII  HTML
Alphabet                  ASCII  HTML
Punctuation and Numbers   ASCII  HTML
Extended Characters       ASCII  HTML