Conversion table format
My preferred table format is:
- For each character, a line containing
  - 0xnn or 0xnnnn: the character's code in the charset (hexadecimal),
  - whitespace,
  - 0xnnnn: the character's code in Unicode (hexadecimal),
  - whitespace (optional),
  - a comment, beginning with # and extending to the end of line.
- The character lines are sorted according to their first column.
- Some comment lines (beginning with #) at the beginning (optional).
This is the format in which most of the unicode.org tables come. It has the
advantage of being very easy to manipulate using grep and sed.
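As an illustration of how simple this format is to read, here is a minimal C sketch that parses one line. The function name parse_table_line is hypothetical, not taken from any existing tool. It assumes that every character line starts with two hexadecimal numbers and that anything after the second number is a comment.

```c
#include <stdio.h>

/* Hypothetical helper: parse one line of a conversion table.
   Returns 1 and stores both codes if the line is a character line.
   Returns 0 for comment lines, empty lines, and lines that do not
   contain two hexadecimal numbers. */
int parse_table_line(const char *line,
                     unsigned int *charset_code,
                     unsigned int *unicode_code)
{
    /* %x skips leading whitespace and accepts an optional "0x" prefix.
       Whatever follows the second number (such as a # comment) is
       simply not consumed. */
    return sscanf(line, "%x %x", charset_code, unicode_code) == 2;
}
```

A caller would typically read lines with fgets and pass each one to this function, skipping the lines for which it returns 0.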
My preferred table format for tables I generate myself is:
- For each character, a line containing
  - 0xnn or 0xnnnn: the character's code in the charset (hexadecimal),
  - a tab,
  - 0xnnnn: the character's code in Unicode (hexadecimal).
- The character lines are sorted according to their first column.
This is a special case of the above. It has the advantage of being very easy
to manipulate using grep, sed, sort, uniq, join, diff and small C programs
(scanf, printf). Plus, it is rather compact.
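The kind of small C program meant here might look like the following sketch. The function name copy_table is hypothetical. It assumes the input contains only character lines (no comments), reads them with fscanf, and writes them back with fprintf in the tab-separated form, using two hex digits for charset codes below 0x100 and four otherwise.

```c
#include <stdio.h>

/* Hypothetical helper: copy a table in the compact format from `in`
   to `out`, normalizing the layout of each line.
   Returns the number of character lines copied. */
long copy_table(FILE *in, FILE *out)
{
    unsigned int charset_code, unicode_code;
    long count = 0;

    /* %x skips whitespace (tabs and newlines included) and accepts
       the "0x" prefix, so one format string reads a whole line. */
    while (fscanf(in, "%x %x", &charset_code, &unicode_code) == 2) {
        if (charset_code < 0x100)
            fprintf(out, "0x%02X\t0x%04X\n", charset_code, unicode_code);
        else
            fprintf(out, "0x%04X\t0x%04X\n", charset_code, unicode_code);
        count++;
    }
    return count;
}
```

Calling copy_table(stdin, stdout) from main turns this into a filter that can be combined with sort, uniq, join and diff in a pipeline.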
Comparison of conversion tables
Bruno Haible <bruno@clisp.org>
Last modified: 31 December 2003.