Dictionary Formats

Babbletower needs format definitions to correctly work with dictionary data. If your data is in any of the predefined formats listed below, there is no need to define your own formats. However, if you have dictionary data in a format that is not supported, you will have to define appropriate formats for those dictionaries. Format definitions can be placed in dictionary setup files and in Babbletower's optional properties file.

Defining Custom Formats

The format of a dictionary entry is logically divided into four fields: head entry, pronunciation/reading, translation, and remarks. To tell Babbletower which of these fields a line from a particular dictionary contains, and how the line is broken down into those fields, you need to define the formats of the dictionaries you're using.

A format definition looks as follows:

format.name = /fields/field separators/separator replacements/

fields: List the fields contained in a dictionary entry in the order in which they appear. The marks for the four fields are:

  • h - head entry
  • r - reading / pronunciation
  • t - translation
  • e - (explanatory) remark

field separators: List the separators that mark the beginnings of the fields defined in fields, starting with the separator for the second field. (The first field starts at the beginning of the dictionary entry.)

Note: If you want to specify the slash character / as a field separator, you need to write it as {slash}, since / is used here to separate the three definitions.

separator replacements: You may want to replace field separators to improve the appearance of dictionary entries. For example, if fields in a dictionary are separated by tab characters, you could replace them with spaces. Put the replacement characters for the separators defined in field separators here, in the same order, one character per separator. To keep a particular separator, put the @ mark in its place.
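The three parts of a definition can be taken apart mechanically. The following Python sketch is not Babbletower's own parser; `tokenize` and `parse_format` are hypothetical names, and the sketch assumes that every separator is a single character, written either literally, as the escape sequence \t, or as {slash}. It only checks the field marks and the number of separators.

```python
def tokenize(spec):
    # Split a separator or replacement list into single characters,
    # expanding the {slash} placeholder and the two-character \t escape.
    tokens = []
    i = 0
    while i < len(spec):
        if spec.startswith("{slash}", i):
            tokens.append("/")
            i += len("{slash}")
        elif spec.startswith("\\t", i):
            tokens.append("\t")
            i += 2
        else:
            tokens.append(spec[i])
            i += 1
    return tokens


def parse_format(definition):
    # Parse '/fields/field separators/separator replacements/' into three lists.
    if not (definition.startswith("/") and definition.endswith("/")):
        raise ValueError("definition must start and end with '/'")
    parts = definition[1:-1].split("/")
    if len(parts) != 3:
        raise ValueError("expected exactly three '/'-separated parts")
    fields = list(parts[0])
    separators = tokenize(parts[1])
    replacements = tokenize(parts[2])
    if any(mark not in "hrte" for mark in fields):
        raise ValueError("unknown field mark")
    # Assumption: one separator for every field after the first.
    if len(separators) != len(fields) - 1:
        raise ValueError("need one separator per field after the first")
    return fields, separators, replacements


# A raw string keeps the backslash in \t as written in a setup file.
print(parse_format(r"/het/{\t/@@/"))
# prints (['h', 'e', 't'], ['{', '\t'], ['@', '@'])
```

Because {slash} is expanded only after the definition has been split at its / characters, a slash used as a separator cannot be mistaken for a delimiter between the three parts.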

When defining the format for a dictionary, you need to inspect several entries to see which fields are contained, and how they are separated. Since this may all be a bit confusing, here is an example:

Example

The following three entries are taken from a German-English dictionary:

Freude {f}	enjoyment
Freudenfeuer {n}; Feuer im Freien	bonfire
Freunde {pl}; Bekannte {pl}	friends

An appropriate format definition would be:

format.gereng = /ht/\t/@/

This tells Babbletower that an entry from this dictionary contains a head entry and a translation, and that the translation is separated from the head entry by a tab character. (It is safer to write a tab using the escape sequence \t, as shown in the example.) The @ states that the separator should not be replaced.
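To make the effect of this definition concrete, here is a minimal Python sketch of how a line could be split once the definition has been read. It is an illustration only: `split_entry` is a hypothetical helper, not part of Babbletower, and it assumes that each separator is searched for from left to right and that the separator (or its replacement) stays at the start of the field it opens.

```python
def split_entry(line, fields, separators, replacements):
    # Return a dict that maps each field mark to its text.
    parts = []
    start = 0   # index where the current field's text begins
    lead = ""   # separator, or its replacement, that opens the current field
    for sep, repl in zip(separators, replacements):
        pos = line.find(sep, start)
        if pos < 0:
            break  # separator missing: the rest of the line stays in this field
        parts.append(lead + line[start:pos])
        lead = sep if repl == "@" else repl  # '@' keeps the separator
        start = pos + len(sep)
    parts.append(lead + line[start:])
    # Fields that were never reached are left empty.
    parts += [""] * (len(fields) - len(parts))
    return dict(zip(fields, parts))


entry = "Freude {f}\tenjoyment"

# format.gereng = /ht/\t/@/
print(split_entry(entry, ["h", "t"], ["\t"], ["@"]))
# prints {'h': 'Freude {f}', 't': '\tenjoyment'}

# Same fields, but with the tab replaced by a space.
print(split_entry(entry, ["h", "t"], ["\t"], [" "]))
# prints {'h': 'Freude {f}', 't': ' enjoyment'}
```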

To also extract grammar remarks, e.g. the {f} after Freude, indicating that this is a feminine noun, you could extend this format in the following way:

format.gereng = /het/{\t/@@/

This defines an entry as having three fields (the explanation field has been inserted) and states that this field starts with an opening curly brace {. However, looking at the second and third sample entries, we see that a grammar remark may be followed by another head entry, and that an entry may also contain more than one grammar remark. These entries would therefore be split incorrectly into:

head entry      explanation             translation
Freudenfeuer    {n}; Feuer im Freien    bonfire
Freunde         {pl}; Bekannte {pl}     friends
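When inspecting a larger dictionary, a quick check can reveal entries that such a definition would split badly. The Python sketch below is a hypothetical diagnostic, not a Babbletower feature: `split_het` imitates the split at the first { and the following tab, and `remark_is_clean` reports whether the resulting remark field consists of a single {...} tag and nothing else.

```python
import re

ENTRIES = [
    "Freude {f}\tenjoyment",
    "Freudenfeuer {n}; Feuer im Freien\tbonfire",
    "Freunde {pl}; Bekannte {pl}\tfriends",
]


def split_het(line):
    # Split at the first '{' and at the first tab after it.
    # str.index raises ValueError if either separator is missing.
    brace = line.index("{")
    tab = line.index("\t", brace)
    return line[:brace], line[brace:tab], line[tab:]


def remark_is_clean(remark):
    # True if the field holds exactly one {...} tag and nothing else.
    return re.fullmatch(r"\{[^{}]*\}", remark.strip()) is not None


for line in ENTRIES:
    head, remark, translation = split_het(line)
    status = "ok" if remark_is_clean(remark) else "needs a closer look"
    print(f"{head!r:16} {remark!r:24} {translation!r:14} {status}")
```

Of the three sample entries, only the first passes the check; the other two are flagged because their remark fields contain a second head entry or a second grammar remark.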

Predefined Formats

Listed below are a few predefined formats supported by Babbletower. Note that you cannot redefine these formats: custom format definitions with any of the predefined names will be silently ignored.

default: This is the default dictionary format. Whenever there is a problem with a user-defined format, Babbletower will attempt to use this format instead. The equivalent definition is:

format.default = /hret/\t\t\t/ /
 

edict: This is a format for the popular edict Japanese-English dictionary from the Monash Nihongo ftp Archive. Its equivalent definition would be:

format.edict = /hret/[{slash} /@ @/

However, this format comes with some additional on-the-fly transcoding for improved readability.
 

edict_pda: Same as edict, but with slightly different on-the-fly transcoding for nicer display on small screens.
 
kanjidic: Format for kanjidic, another popular dictionary from the Monash site. This format has no equivalent definition, because the specifics of this dictionary required a fully programmatic formatter.
 