Babbletower - Text Encodings

Text Encodings

Computer files consist of bytes. When text is stored in or read from a file, the letters that make up the text therefore have to be written and read as bytes. For texts that use only the Latin alphabet, this is not very complicated: since one byte can have 256 different values, each letter is assigned a particular byte value. There are enough values to cover all letters, plus digits, punctuation marks, and a few special characters. For languages such as Chinese, Japanese, and Korean, things are more complicated: there are thousands of characters, which have to be expressed by sequences of two or more bytes. An encoding defines, for a particular language or set of characters, how each character is mapped to bytes.
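To make the byte counts concrete, here is a minimal Java sketch. The class name ByteCounts and the method encodedLength are made up for this illustration and are not part of Babbletower. It shows how many bytes the same character needs in different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: ByteCounts is a hypothetical name, not a Babbletower class.
class ByteCounts {

    /** Returns the number of bytes needed to store the text in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String latinLetter = "A";
        String umlaut = "\u00e4";        // a with diaeresis
        String hanCharacter = "\u4e2d";  // a common Chinese character

        // A plain Latin letter fits into one byte.
        System.out.println(encodedLength(latinLetter, StandardCharsets.US_ASCII));   // 1
        // An accented letter is one byte in ISO 8859-1 but two bytes in UTF-8.
        System.out.println(encodedLength(umlaut, StandardCharsets.ISO_8859_1));      // 1
        System.out.println(encodedLength(umlaut, StandardCharsets.UTF_8));           // 2
        // A Chinese character needs three bytes in UTF-8.
        System.out.println(encodedLength(hanCharacter, StandardCharsets.UTF_8));     // 3
    }
}
```

The same text therefore produces different byte sequences depending on the encoding, which is why a file has to be read with the encoding it was written in.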

Encodings are important in Babbletower in two ways:

Dictionary encodings
To allow Babbletower to read your dictionary files correctly, you need to state each dictionary's encoding in its setup. The documentation of electronic dictionaries often states in which encoding the data is delivered. If you cannot obtain this information, you will have to experiment a bit. Try opening the dictionary (or preferably just an excerpt of a few lines) in your web browser, then change the browser's encoding setting until the dictionary entries are displayed correctly. Below you will find a table with the names of encodings as used in Java. These are the names you have to use when setting the encoding for a dictionary. Note that these names are case sensitive.
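As an alternative to the browser, the same experiment can be sketched in Java. This is an illustration under assumptions, not a Babbletower feature: the class EncodingPreview, the method preview, and the list of candidate names are all hypothetical. The program decodes a short excerpt file with several candidate encodings so that you can see which result looks right.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative only: EncodingPreview is a hypothetical helper, not part of Babbletower.
class EncodingPreview {

    /**
     * Decodes the bytes with the named encoding.
     * Returns null if this Java runtime does not support the encoding.
     */
    static String preview(byte[] data, String encodingName) {
        if (!Charset.isSupported(encodingName)) {
            return null;
        }
        // Bytes that are invalid in the chosen encoding are shown as
        // replacement characters, which is a hint that the guess is wrong.
        return new String(data, Charset.forName(encodingName));
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a small excerpt of the dictionary (a few lines are enough).
        byte[] excerpt = Files.readAllBytes(Paths.get(args[0]));

        // Candidate names are an assumption; replace them with your own guesses.
        String[] candidates = { "UTF8", "ISO8859_1", "Cp1252", "SJIS" };
        for (String name : candidates) {
            String text = preview(excerpt, name);
            System.out.println("--- " + name + " ---");
            System.out.println(text == null ? "(not supported by this runtime)" : text);
        }
    }
}
```

Whether the output is readable depends on the console being able to display the characters, so the browser method may still be the more convenient check.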
Encodings supported by your Java runtime
You can only use a particular encoding if your Java runtime supports it. Java runtimes on desktops usually support all or most of the encodings listed below, but runtimes for PDAs in particular often support only a small number of encodings. To find out which encodings your runtime supports, start Babbletower and open the Import or Export file dialog in the Wordbox screen. At the bottom of this dialog is a drop-down list of all supported encodings.
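On a runtime that provides the java.nio.charset package (this is an assumption; small PDA runtimes may lack it), a similar check can be sketched outside Babbletower. The class SupportedEncodings and its methods are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: SupportedEncodings is a hypothetical helper, not part of Babbletower.
class SupportedEncodings {

    /** Returns true if this Java runtime can read and write the named encoding. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for names that contain characters not allowed in charset names.
            return false;
        }
    }

    /** Returns those of the given names that this runtime does not support. */
    static List<String> unsupported(String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            if (!isUsable(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Pass the encoding names you are interested in as arguments.
        System.out.println("Not supported: " + unsupported(args));

        // Lists the canonical names known to this runtime. These can differ in
        // spelling from the names in the table (for example UTF-8 for UTF8).
        for (String canonicalName : Charset.availableCharsets().keySet()) {
            System.out.println(canonicalName);
        }
    }
}
```

Running it with, for example, `java SupportedEncodings UTF8 SJIS Cp1252` prints the names that are missing, followed by everything the runtime offers.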
If the encoding of your dictionary data is not among them, you will have to convert the dictionaries into an encoding that is supported. To be on the safe side, converting your dictionaries into UTF8 is recommended. This encoding is not only guaranteed to be supported on all Java platforms, it can also encode all Unicode characters. Whatever language your dictionary files are in, converting them into UTF8 carries no danger of ending up with mangled text, as long as the original encoding is given correctly.
Babbletower comes with a tool for converting text files between different encodings. With this tool, you can convert your dictionary data on a Java runtime that supports the original encoding, e.g. the one on your desktop computer.
Usage:
java -classpath babbletower.jar Converter in_file in_encoding out_file out_encoding
Note that you need to recompute the index for a dictionary after you change its encoding.
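For readers curious about what such a conversion involves, the following is a minimal sketch of the general technique. It is not the source of Babbletower's Converter, and the class name RecodeSketch and the method recode are invented for this example. The file is read as characters using the input encoding and written back as bytes using the output encoding.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: RecodeSketch is a hypothetical class, not Babbletower's Converter.
class RecodeSketch {

    /** Copies a text file, decoding with inEncoding and encoding with outEncoding. */
    static void recode(Path inFile, String inEncoding, Path outFile, String outEncoding)
            throws IOException {
        try (Reader reader = Files.newBufferedReader(inFile, Charset.forName(inEncoding));
             Writer writer = Files.newBufferedWriter(outFile, Charset.forName(outEncoding))) {
            char[] buffer = new char[8192];
            int count;
            // Read decoded characters and write them re-encoded until the end of the file.
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: RecodeSketch in_file in_encoding out_file out_encoding");
            return;
        }
        recode(Paths.get(args[0]), args[1], Paths.get(args[2]), args[3]);
    }
}
```

The readers and writers created by Files report an IOException when the input bytes are not valid in the stated encoding or when a character cannot be represented in the output encoding, so a wrong guess tends to fail visibly instead of silently producing mangled text.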

Character Encoding	Description
ASCII	ASCII
ISO8859_1	ISO 8859-1
ISO8859_2	ISO 8859-2
ISO8859_3	ISO 8859-3
ISO8859_4	ISO 8859-4
ISO8859_5	ISO 8859-5
ISO8859_6	ISO 8859-6
ISO8859_7	ISO 8859-7
ISO8859_8	ISO 8859-8
ISO8859_9	ISO 8859-9
ISO8859_15_FDIS	ISO 8859-15 (Final Draft Information Standard, based on ISO8859-1)
Big5	Big5, Traditional Chinese
Cp037	USA, Canada(Bilingual, French), Netherlands, Portugal, Brazil, Australia
Cp1006	IBM AIX Pakistan (Urdu)
Cp1025	IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR)
Cp1026	IBM Latin-5, Turkey
Cp1046	IBM Open Edition US EBCDIC
Cp1097	IBM Iran(Farsi)/Persian
Cp1098	IBM Iran(Farsi)/Persian (PC)
Cp1112	IBM Latvia, Lithuania
Cp1122	IBM Estonia
Cp1123	IBM Ukraine
Cp1124	IBM AIX Ukraine
Cp1140	Cp037 with the euro
Cp1141	Cp273 with the euro
Cp1142	Cp277 with the euro
Cp1143	Cp278 with the euro
Cp1144	Cp280 with the euro
Cp1145	Cp284 with the euro
Cp1146	Cp285 with the euro
Cp1147	Cp297 with the euro
Cp1148	Cp500 with the euro
Cp1149	Cp871 with the euro
Cp1250	Windows Eastern European
Cp1251	Windows Cyrillic
Cp1252	Windows Latin-1
Cp1253	Windows Greek
Cp1254	Windows Turkish
Cp1255	Windows Hebrew
Cp1256	Windows Arabic
Cp1257	Windows Baltic
Cp1258	Windows Vietnamese
Cp1381	IBM OS/2, DOS People's Republic of China (PRC)
Cp1383	IBM AIX People's Republic of China (PRC)
Cp273	IBM Austria, Germany
Cp277	IBM Denmark, Norway
Cp278	IBM Finland, Sweden
Cp280	IBM Italy
Cp284	IBM Catalan/Spain, Spanish Latin America
Cp285	IBM United Kingdom, Ireland
Cp297	IBM France
Cp33722	IBM-eucJP - Japanese (superset of 5050)
Cp420	IBM Arabic
Cp424	IBM Hebrew
Cp437	MS-DOS United States, Australia, New Zealand, South Africa
Cp500	EBCDIC 500V1
Cp737	PC Greek
Cp775	PC Baltic
Cp838	IBM Thailand extended SBCS
Cp850	MS-DOS Latin-1
Cp852	MS-DOS Latin-2
Cp855	IBM Cyrillic
Cp857	IBM Turkish
Cp858	Cp850 with the euro
Cp860	MS-DOS Portuguese
Cp861	MS-DOS Icelandic
Cp862	PC Hebrew
Cp863	MS-DOS Canadian French
Cp864	PC Arabic
Cp865	MS-DOS Nordic
Cp866	MS-DOS Russian
Cp868	MS-DOS Pakistan
Cp869	IBM Modern Greek
Cp870	IBM Multilingual Latin-2
Cp871	IBM Iceland
Cp874	IBM Thai
Cp875	IBM Greek
Cp918	IBM Pakistan(Urdu)
Cp921	IBM Latvia, Lithuania (AIX, DOS)
Cp922	IBM Estonia (AIX, DOS)
Cp930	Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933	Korean Mixed with 1880 UDC, superset of 5029
Cp935	Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937	Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939	Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942	Japanese (OS/2) superset of 932
Cp948	OS/2 Chinese (Taiwan) superset of 938
Cp949	PC Korean
Cp950	PC Chinese (Hong Kong, Taiwan)
Cp964	AIX Chinese (Taiwan)
Cp970	AIX Korean
EUC_CN	GB2312, EUC encoding, Simplified Chinese
EUC_JP	JIS0201, 0208, 0212, EUC Encoding, Japanese
EUC_KR	KS C 5601, EUC Encoding, Korean
EUC_TW	CNS11643 (Plane 1-3), T. Chinese, EUC encoding
GBK	GBK, Simplified Chinese
ISO2022CN	ISO 2022 CN, Chinese
ISO2022CN_CNS	CNS 11643 in ISO-2022-CN form, T. Chinese
ISO2022CN_GB	GB 2312 in ISO-2022-CN form, S. Chinese
ISO2022JP	JIS0201, 0208, 0212, ISO2022 Encoding, Japanese
ISO2022KR	ISO 2022 KR, Korean
JIS0201	JIS 0201, Japanese
JIS0208	JIS 0208, Japanese
JIS0212	JIS 0212, Japanese
KOI8_R	KOI8-R, Russian
MS874	Windows Thai
MacArabic	Macintosh Arabic
MacCentralEurope	Macintosh Latin-2
MacCroatian	Macintosh Croatian
MacCyrillic	Macintosh Cyrillic
MacDingbat	Macintosh Dingbat
MacGreek	Macintosh Greek
MacHebrew	Macintosh Hebrew
MacIceland	Macintosh Iceland
MacRoman	Macintosh Roman
MacRomania	Macintosh Romania
MacSymbol	Macintosh Symbol
MacThai	Macintosh Thai
MacTurkish	Macintosh Turkish
MacUkraine	Macintosh Ukraine
SJIS	Shift-JIS, Japanese
UTF8	UTF-8