Conversion table format
My preferred table format is:
- For each character, a line containing
  - 0xnn or 0xnnnn: the character's code in the charset (hexadecimal),
  - whitespace,
  - 0xnnnn: the character's code in Unicode (hexadecimal),
  - whitespace (optional),
  - a comment, beginning with # and extending to the end of line.
- The character lines are sorted according to their first column.
- Some comment lines (beginning with #) at the beginning (optional).
This is the format in which most of the unicode.org tables come. It has the
advantage of being very easy to manipulate using grep and sed.
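As an illustration of how little machinery this format needs, here is a minimal Python sketch of a reader for it. The function name parse_table and the sample lines are hypothetical, chosen for this example only. The sketch also assumes that a line with fewer than two fields (for instance an unmapped code with no Unicode column) should simply be skipped, which may not suit every table.

```python
def parse_table(lines):
    """Parse conversion-table lines into (charset_code, unicode_code) pairs."""
    pairs = []
    for line in lines:
        # Drop the trailing comment, if any, then the surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # comment-only or blank line
        fields = data.split()
        if len(fields) < 2:
            continue  # assumption: skip lines without a Unicode column
        # int() with base 16 accepts the leading "0x" prefix.
        pairs.append((int(fields[0], 16), int(fields[1], 16)))
    return pairs


# Illustrative sample lines, not copied from any particular table file.
sample = [
    "# Leading comment line",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "0xA4  0x20AC  # EURO SIGN",
]
print(parse_table(sample))  # [(65, 65), (164, 8364)]
```

Because the comment is stripped before the fields are split, the same reader also accepts the stricter tab-separated variant described next.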
My preferred table format for tables I generate myself is:
- For each character, a line containing
  - 0xnn or 0xnnnn: the character's code in the charset (hexadecimal),
  - a tab,
  - 0xnnnn: the character's code in Unicode (hexadecimal).
- The character lines are sorted according to their first column.
This is a special case of the above. It has the advantage of being very easy
to manipulate using grep, sed, sort, uniq, join, diff and small C programs
(scanf, printf). Plus, it is rather compact.
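The following Python sketch shows one way such a table could be written and read back. It stands in for the kind of small program mentioned above and is not the author's own tooling. The function names format_simple_table and read_simple_table are hypothetical, and the choice of uppercase hexadecimal digits, two digits for codes up to 0xFF and four digits otherwise, is an assumption made for this example.

```python
def format_simple_table(pairs):
    """Return table lines "0xNN<TAB>0xNNNN", sorted by the first column."""
    lines = []
    for charset_code, unicode_code in sorted(pairs):
        # Assumption: 2 hex digits for single-byte codes, 4 otherwise.
        width = 2 if charset_code <= 0xFF else 4
        lines.append(f"0x{charset_code:0{width}X}\t0x{unicode_code:04X}")
    return lines


def read_simple_table(lines):
    """Read lines produced by format_simple_table back into pairs."""
    pairs = []
    for line in lines:
        charset_field, unicode_field = line.rstrip("\n").split("\t")
        # int() with base 16 accepts the leading "0x" prefix.
        pairs.append((int(charset_field, 16), int(unicode_field, 16)))
    return pairs


# Illustrative sample mappings, deliberately given out of order.
mappings = [(0xA4, 0x20AC), (0x41, 0x0041), (0x8140, 0x3000)]
table = format_simple_table(mappings)
print("\n".join(table))
```

Since every line has exactly two tab-separated fields and no comments, reading a line back is a single split, and sorted output from two runs can be compared directly with diff.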
Comparison of conversion tables
Bruno Haible <bruno@clisp.org>
Last modified: 31 December 2003.