Clue is used to store keywords that the Japanese Parser uses to separate input text into tokens and to determine business/personal classification. The following types are used for the Clue value.
Type | Item | Description |
---|---|---|
T | Business Type | Words to describe business type. Example: 株式会社,(有) |
N | Business Name | Parse as business name if this token is found at the beginning of the string (excluding business type). |
E | Business Name Suffix | Words such as 病院, 学校. |
D | Branch Name | It can be a branch name by itself. Example: 人事部,経理部 |
B | Branch Name Suffix | Usually this token is merged into the previous token and constitutes a branch name. Example: 支店, 営業所 |
C | Business Keyword | Words that can be part of business name or branch name. Example: データ, 建設 |
H | Honorific | Honorific words. Example: 様, 殿 |
P | Title (position) | Words for a title or position. Example: 代表取締役, 公認会計士 |
R | Region | Words for Region. Example: 東京, 長野 |
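
The B row says that a branch name suffix is usually merged into the previous token to form a branch name. A minimal sketch of that rule follows. It is an illustration only, not the parser's actual implementation: the function name `merge_branch_suffixes`, the `(text, type)` token representation, and the sample word 新宿 are all assumptions, and the sketch merges unconditionally even though the rule is only described as the usual case.

```python
from typing import List, Optional, Tuple

# Hypothetical token representation: (text, clue type).
# None marks a word that matched no clue entry.
Token = Tuple[str, Optional[str]]


def merge_branch_suffixes(tokens: List[Token]) -> List[Token]:
    """Fold each B (branch name suffix) token into the token before it."""
    merged: List[Token] = []
    for text, clue_type in tokens:
        if clue_type == "B" and merged:
            prev_text, _ = merged.pop()
            # The combined token is treated as a branch name (type D).
            merged.append((prev_text + text, "D"))
        else:
            # Any other token, or a suffix with nothing before it, is kept.
            merged.append((text, clue_type))
    return merged


# 新宿 is a made-up unknown word; 支店 is the B example from the table.
example = merge_branch_suffixes([("新宿", None), ("支店", "B")])
```

Here `example` holds a single token whose text is 新宿支店 and whose type is D.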
Format
Each entry consists of the following items:
'<Clue word entry in zenkaku>' att=clue type=<type> hankaku='<Clue word entry in hankaku>'
Example
'人事部' att=clue type=D hankaku='ジンジブ'
'(株)' att=clue type=T hankaku='(カブ)'
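
To make the entry format concrete, the following sketch reads one entry line into its three parts. It assumes the format shown under Format, with single-quoted values that contain no quote characters and a one-letter type from the table above. The names `ClueEntry`, `ENTRY_RE`, and `parse_clue_entry` are hypothetical and are not part of the product.

```python
import re
from typing import NamedTuple


class ClueEntry(NamedTuple):
    zenkaku: str    # clue word in zenkaku
    clue_type: str  # one-letter type such as T or D
    hankaku: str    # clue word in hankaku


# '<zenkaku>' att=clue type=<type> hankaku='<hankaku>'
ENTRY_RE = re.compile(
    r"^'(?P<zenkaku>[^']*)'\s+att=clue\s+"
    r"type=(?P<type>[TNEDBCHPR])\s+"
    r"hankaku='(?P<hankaku>[^']*)'$"
)


def parse_clue_entry(line: str) -> ClueEntry:
    """Split one clue entry line into its fields, or raise ValueError."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid clue entry: {line!r}")
    return ClueEntry(match["zenkaku"], match["type"], match["hankaku"])


entry = parse_clue_entry("'人事部' att=clue type=D hankaku='ジンジブ'")
```

A line that omits the `=` after `type`, or that has unbalanced quotes, is rejected with a `ValueError` rather than being partially parsed.
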
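Stage | Token 1 | Token 2 | Token 3 |
---|---|---|---|
Input data | (株)アグレックス人事部 | | |
After token separation (token type) | Business type (T) | Unknown word | Branch name (D) |
After token separation (token) | (株) | アグレックス | 人事部 |
Output data (attribute) | Business type | Business name | Branch name |
Output data (value) | (株) | アグレックス | 人事部 |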
In this case, "(株)" matches a business type (T type) and "人事部" matches a branch name keyword (D type), so the token type of each of these words is determined. In the final output, the unknown word "アグレックス" is recognized as the business name, and each word is written to the appropriate output attribute.
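
The example above can be sketched in code. This is a simplified illustration under stated assumptions, not the Japanese Parser's actual algorithm: it assumes a longest-match scan against the clue words, collects unmatched characters into one unknown word, and maps only the T and D types and unknown words to output attributes. The names `CLUES`, `separate_tokens`, and `assign_output` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory clue table holding only the two example entries.
CLUES: Dict[str, str] = {"(株)": "T", "人事部": "D"}

# Only the attributes used in the example are covered here.
OUTPUT_ATTRIBUTE = {"T": "Business type", "D": "Branch name"}


def separate_tokens(text: str, clues: Dict[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Split text into clue matches and unknown words (type None)."""
    # Try longer clue words first so the longest match wins.
    words = sorted((w for w in clues if w), key=len, reverse=True)
    tokens: List[Tuple[str, Optional[str]]] = []
    unknown = ""
    i = 0
    while i < len(text):
        hit = next((w for w in words if text.startswith(w, i)), None)
        if hit is None:
            # No clue word starts here: extend the current unknown word.
            unknown += text[i]
            i += 1
            continue
        if unknown:
            tokens.append((unknown, None))
            unknown = ""
        tokens.append((hit, clues[hit]))
        i += len(hit)
    if unknown:
        tokens.append((unknown, None))
    return tokens


def assign_output(tokens: List[Tuple[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Map each token to an output attribute; unknown words become the business name."""
    result = []
    for text, clue_type in tokens:
        if clue_type is None:
            attribute = "Business name"
        else:
            attribute = OUTPUT_ATTRIBUTE[clue_type]
        result.append((attribute, text))
    return result


tokens = separate_tokens("(株)アグレックス人事部", CLUES)
output = assign_output(tokens)
```

With the two clue entries above, `tokens` contains the T token, the unknown word, and the D token in input order, and `output` pairs them with Business type, Business name, and Branch name respectively.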