Kanjis to Romajis first letter [closed]









up vote
3
down vote

favorite












I'm fond of Unicode and what romanization attempts. My goal is to get a first latin char (for sorting purposes) of a non-latin character - so far I succeeded by transcribing Cyrillic, Greek, Hebrew, Katakana, Hiragana, Hangul (including all its syllables), Berber, Thai and Arabic letters by assigning the most appropriate starting letter to each case.



I also know that multiple systems for transliterations and transcribings (and romanization) exist - so far their differences are almost irrelevant for my needs. I'm not fond of Japanese itself - at most I might be able to recognize English terms written in Katakanas.



My problem is: how to assign Unicode code points U+4E00 thru U+9FFF by algorithm? For Hangul syllables this is quite easy: U+AC00 thru U+B097 => K (as all of them start with that); U+B098 thru B2E3 => N. I've looked at JS solutions like https://github.com/hexenq/kuroshiro/ and https://github.com/WaniKani/WanaKana/, but I only find the code for processing Hiraganas and Katakanas (which I already got), never Kanjis (although all of their demos succeed in processing them).



Is there a table or dictionary? If romanization of Kanjis is achieved thru first converting each Kanji into Katakanas, then how to achieve that?










share|improve this question













closed as off-topic by Dono, ajsmart, Blavius, broccoli forest, Flaw Nov 9 at 18:46


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking resources or advice about learning Japanese are off-topic here, but you may find our list of resources for learning Japanese helpful." – Dono, ajsmart, Blavius
If this question can be reworded to fit the rules in the help center, please edit the question.








  • 5




    This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
    – G-Cam
    Nov 8 at 13:31






  • 3




    I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
    – Eiríkr Útlendi
    Nov 8 at 14:07










  • The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
    – AmigoJack
    Nov 8 at 14:39






  • 3




    @AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
    – naruto
    Nov 8 at 14:58







  • 2




    This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
    – naruto
    Nov 8 at 16:01















up vote
3
down vote

favorite












I'm fond of Unicode and what romanization attempts. My goal is to get a first latin char (for sorting purposes) of a non-latin character - so far I succeeded by transcribing Cyrillic, Greek, Hebrew, Katakana, Hiragana, Hangul (including all its syllables), Berber, Thai and Arabic letters by assigning the most appropriate starting letter to each case.



I also know that multiple systems for transliterations and transcribings (and romanization) exist - so far their differences are almost irrelevant for my needs. I'm not fond of Japanese itself - at most I might be able to recognize English terms written in Katakanas.



My problem is: how to assign Unicode code points U+4E00 thru U+9FFF by algorithm? For Hangul syllables this is quite easy: U+AC00 thru U+B097 => K (as all of them start with that); U+B098 thru B2E3 => N. I've looked at JS solutions like https://github.com/hexenq/kuroshiro/ and https://github.com/WaniKani/WanaKana/, but I only find the code for processing Hiraganas and Katakanas (which I already got), never Kanjis (although all of their demos succeed in processing them).



Is there a table or dictionary? If romanization of Kanjis is achieved thru first converting each Kanji into Katakanas, then how to achieve that?










share|improve this question













closed as off-topic by Dono, ajsmart, Blavius, broccoli forest, Flaw Nov 9 at 18:46


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking resources or advice about learning Japanese are off-topic here, but you may find our list of resources for learning Japanese helpful." – Dono, ajsmart, Blavius
If this question can be reworded to fit the rules in the help center, please edit the question.








  • 5




    This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
    – G-Cam
    Nov 8 at 13:31






  • 3




    I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
    – Eiríkr Útlendi
    Nov 8 at 14:07










  • The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
    – AmigoJack
    Nov 8 at 14:39






  • 3




    @AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
    – naruto
    Nov 8 at 14:58







  • 2




    This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
    – naruto
    Nov 8 at 16:01













up vote
3
down vote

favorite









up vote
3
down vote

favorite











I'm fond of Unicode and what romanization attempts. My goal is to get a first latin char (for sorting purposes) of a non-latin character - so far I succeeded by transcribing Cyrillic, Greek, Hebrew, Katakana, Hiragana, Hangul (including all its syllables), Berber, Thai and Arabic letters by assigning the most appropriate starting letter to each case.



I also know that multiple systems for transliterations and transcribings (and romanization) exist - so far their differences are almost irrelevant for my needs. I'm not fond of Japanese itself - at most I might be able to recognize English terms written in Katakanas.



My problem is: how to assign Unicode code points U+4E00 thru U+9FFF by algorithm? For Hangul syllables this is quite easy: U+AC00 thru U+B097 => K (as all of them start with that); U+B098 thru B2E3 => N. I've looked at JS solutions like https://github.com/hexenq/kuroshiro/ and https://github.com/WaniKani/WanaKana/, but I only find the code for processing Hiraganas and Katakanas (which I already got), never Kanjis (although all of their demos succeed in processing them).



Is there a table or dictionary? If romanization of Kanjis is achieved thru first converting each Kanji into Katakanas, then how to achieve that?










share|improve this question













I'm fond of Unicode and what romanization attempts. My goal is to get a first latin char (for sorting purposes) of a non-latin character - so far I succeeded by transcribing Cyrillic, Greek, Hebrew, Katakana, Hiragana, Hangul (including all its syllables), Berber, Thai and Arabic letters by assigning the most appropriate starting letter to each case.



I also know that multiple systems for transliterations and transcribings (and romanization) exist - so far their differences are almost irrelevant for my needs. I'm not fond of Japanese itself - at most I might be able to recognize English terms written in Katakanas.



My problem is: how to assign Unicode code points U+4E00 thru U+9FFF by algorithm? For Hangul syllables this is quite easy: U+AC00 thru U+B097 => K (as all of them start with that); U+B098 thru B2E3 => N. I've looked at JS solutions like https://github.com/hexenq/kuroshiro/ and https://github.com/WaniKani/WanaKana/, but I only find the code for processing Hiraganas and Katakanas (which I already got), never Kanjis (although all of their demos succeed in processing them).



Is there a table or dictionary? If romanization of Kanjis is achieved thru first converting each Kanji into Katakanas, then how to achieve that?







kanji rōmaji unicode






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 8 at 11:28









AmigoJack

1183




1183




closed as off-topic by Dono, ajsmart, Blavius, broccoli forest, Flaw Nov 9 at 18:46


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking resources or advice about learning Japanese are off-topic here, but you may find our list of resources for learning Japanese helpful." – Dono, ajsmart, Blavius
If this question can be reworded to fit the rules in the help center, please edit the question.




closed as off-topic by Dono, ajsmart, Blavius, broccoli forest, Flaw Nov 9 at 18:46


This question appears to be off-topic. The users who voted to close gave this specific reason:


  • "Questions seeking resources or advice about learning Japanese are off-topic here, but you may find our list of resources for learning Japanese helpful." – Dono, ajsmart, Blavius
If this question can be reworded to fit the rules in the help center, please edit the question.







  • 5




    This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
    – G-Cam
    Nov 8 at 13:31






  • 3




    I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
    – Eiríkr Útlendi
    Nov 8 at 14:07










  • The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
    – AmigoJack
    Nov 8 at 14:39






  • 3




    @AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
    – naruto
    Nov 8 at 14:58







  • 2




    This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
    – naruto
    Nov 8 at 16:01













  • 5




    This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
    – G-Cam
    Nov 8 at 13:31






  • 3




    I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
    – Eiríkr Útlendi
    Nov 8 at 14:07










  • The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
    – AmigoJack
    Nov 8 at 14:39






  • 3




    @AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
    – naruto
    Nov 8 at 14:58







  • 2




    This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
    – naruto
    Nov 8 at 16:01








5




5




This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
– G-Cam
Nov 8 at 13:31




This question is probably off topic at this forum but for what it's worth, kanji in Japanese have multiple pronunciations and you will not be able to assign definitive pronunciation to most of them. The best you could do is find a downloadable kanji database and choose a pronunciation (it will probably be in hiragana) arbitrarily. Perhaps the database will also include some form of probability that you could use to your advantage.
– G-Cam
Nov 8 at 13:31




3




3




I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
– Eiríkr Útlendi
Nov 8 at 14:07




I'm curious, why? Granted, you can assign a Latin letter to each kanji, but as others have noted, that is mostly an arbitrary process. I'm left scratching my head as to what use this would be. I suppose as a general coding project, it might be a fun puzzle, but as to final utility, I'm baffled.
– Eiríkr Útlendi
Nov 8 at 14:07












The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
– AmigoJack
Nov 8 at 14:39




The goal is to have both "Takkyu" and "卓球" being listed under "T", instead of having all non-latin words/names being listed under "#", appealing to users who rather deal with latin letters. It may not make sense for Kanjis/CJK, but for i.e. Hangul and Cyrillic ("Yulia" and "Ю́лия" both under "Y") - most probably I'll realize it makes too little sense for all CJK idiographs.
– AmigoJack
Nov 8 at 14:39




3




3




@AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
– naruto
Nov 8 at 14:58





@AmigoJack Wait, so you're trying deal with words (卓球) rather than characters (卓)? Then character-based approach described in my answer will make almost no sense in Japanese. One kanji can have many readings, and Unihan_Readings.txt is almost useless to determine the reading of an individual word. For example 生命 is Seimei, 生地 is Kiji, 生卵 is Namatamago and 生霊 is Ikiryo. What you may need is a morphological analyzer introduced here.
– naruto
Nov 8 at 14:58





2




2




This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
– naruto
Nov 8 at 16:01





This is a must-read if you want to go any further. Perfect ronaminzation is incredibly difficult if you implement from scratch, but there are several free projects. And the dictionary you need is not large unless you have to handle very rare words. Have you tried this?
– naruto
Nov 8 at 16:01











1 Answer
1






active

oldest

votes

















up vote
10
down vote



accepted










EDIT: It turned out that OP did not know most kanji have multiple readings in Japanese. He is actually trying to get the reading of words (e.g., 生地 → K, 生卵 → N, 生命 → S, 生霊 → I). For this purpose, a word-based dictionary is needed. There are several open source morphological analyzers that can do the job and more (kuromoji, mecab, Kagome, etc). See also: Is it possible to algorithmically convert Japanese text to Romaji?




Original Answer



CJK unified ideographs are sorted based on radicals, not readings. This is because it's impossible to determine the reading of those characters in one way. The Unicode Consortium provides Unihan Database, which can display the representative readings of CJK ideographs written in simple Latin alphabet. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):



Readings of 日



The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H-or-K-or-N-or-J in Japanese and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. Especially in Japanese, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).






share|improve this answer


















  • 1




    This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
    – AmigoJack
    Nov 8 at 13:25










  • Oh I didn't know it's available for download!
    – naruto
    Nov 8 at 13:44










  • I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
    – Mikaeru
    Nov 8 at 14:39

















1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
10
down vote



accepted










EDIT: It turned out that OP did not know most kanji have multiple readings in Japanese. He is actually trying to get the reading of words (e.g., 生地 → K, 生卵 → N, 生命 → S, 生霊 → I). For this purpose, a word-based dictionary is needed. There are several open source morphological analyzers that can do the job and more (kuromoji, mecab, Kagome, etc). See also: Is it possible to algorithmically convert Japanese text to Romaji?




Original Answer



CJK unified ideographs are sorted based on radicals, not readings. This is because it's impossible to determine the reading of those characters in one way. The Unicode Consortium provides Unihan Database, which can display the representative readings of CJK ideographs written in simple Latin alphabet. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):



Readings of 日



The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H-or-K-or-N-or-J in Japanese and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. Especially in Japanese, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).






share|improve this answer


















  • 1




    This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
    – AmigoJack
    Nov 8 at 13:25










  • Oh I didn't know it's available for download!
    – naruto
    Nov 8 at 13:44










  • I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
    – Mikaeru
    Nov 8 at 14:39














up vote
10
down vote



accepted










EDIT: It turned out that OP did not know most kanji have multiple readings in Japanese. He is actually trying to get the reading of words (e.g., 生地 → K, 生卵 → N, 生命 → S, 生霊 → I). For this purpose, a word-based dictionary is needed. There are several open source morphological analyzers that can do the job and more (kuromoji, mecab, Kagome, etc). See also: Is it possible to algorithmically convert Japanese text to Romaji?




Original Answer



CJK unified ideographs are sorted based on radicals, not readings. This is because it's impossible to determine the reading of those characters in one way. The Unicode Consortium provides Unihan Database, which can display the representative readings of CJK ideographs written in simple Latin alphabet. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):



Readings of 日



The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H-or-K-or-N-or-J in Japanese and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. Especially in Japanese, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).






share|improve this answer


















  • 1




    This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
    – AmigoJack
    Nov 8 at 13:25










  • Oh I didn't know it's available for download!
    – naruto
    Nov 8 at 13:44










  • I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
    – Mikaeru
    Nov 8 at 14:39












up vote
10
down vote



accepted







up vote
10
down vote



accepted






EDIT: It turned out that OP did not know most kanji have multiple readings in Japanese. He is actually trying to get the reading of words (e.g., 生地 → K, 生卵 → N, 生命 → S, 生霊 → I). For this purpose, a word-based dictionary is needed. There are several open source morphological analyzers that can do the job and more (kuromoji, mecab, Kagome, etc). See also: Is it possible to algorithmically convert Japanese text to Romaji?




Original Answer



CJK unified ideographs are sorted based on radicals, not readings. This is because it's impossible to determine the reading of those characters in one way. The Unicode Consortium provides Unihan Database, which can display the representative readings of CJK ideographs written in simple Latin alphabet. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):



Readings of 日



The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H-or-K-or-N-or-J in Japanese and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. Especially in Japanese, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).






share|improve this answer














EDIT: It turned out that OP did not know most kanji have multiple readings in Japanese. He is actually trying to get the reading of words (e.g., 生地 → K, 生卵 → N, 生命 → S, 生霊 → I). For this purpose, a word-based dictionary is needed. There are several open source morphological analyzers that can do the job and more (kuromoji, mecab, Kagome, etc). See also: Is it possible to algorithmically convert Japanese text to Romaji?




Original Answer



CJK unified ideographs are sorted based on radicals, not readings. This is because it's impossible to determine the reading of those characters in one way. The Unicode Consortium provides Unihan Database, which can display the representative readings of CJK ideographs written in simple Latin alphabet. For example, here is the result for a very basic ideograph 日 (U+65E5; "day", "date", "sun", etc):



Readings of 日



The table says the "first roman letter of 日" is J in Cantonese, R in Mandarin, H-or-K-or-N-or-J in Japanese and I in Korean. To understand what's going on here, please keep in mind that the 'CJK Unified Ideographs' block has characters used in Chinese, Korean and Japanese jumbled together. Each character is read differently in different languages. Especially in Japanese, one character can have many readings depending on the context. To make matters worse, there are some characters whose readings are totally unknown. If you can accept all those limitations and still want the Latin readings anyway, go ahead and use the database according to your needs. If you're only interested in Japanese kanji, a reasonable method would be to pick the first letter of the kJapaneseOn field (or the first letter of kJapaneseKun if there is no kJapaneseOn).







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 9 at 2:58

























answered Nov 8 at 12:35









naruto

147k8140272




147k8140272







  • 1




    This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
    – AmigoJack
    Nov 8 at 13:25










  • Oh I didn't know it's available for download!
    – naruto
    Nov 8 at 13:44










  • I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
    – Mikaeru
    Nov 8 at 14:39












  • 1




    This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
    – AmigoJack
    Nov 8 at 13:25










  • Oh I didn't know it's available for download!
    – naruto
    Nov 8 at 13:44










  • I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
    – Mikaeru
    Nov 8 at 14:39







1




1




This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
– AmigoJack
Nov 8 at 13:25




This is a huge step forward to me - thru Unihan_Readings.txt in unicode.org/Public/UCD/latest/ucd/Unihan.zip I now have something to start automation with, despite being CJK (and more). Great!
– AmigoJack
Nov 8 at 13:25












Oh I didn't know it's available for download!
– naruto
Nov 8 at 13:44




Oh I didn't know it's available for download!
– naruto
Nov 8 at 13:44












I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
– Mikaeru
Nov 8 at 14:39




I recently released the Unihan database equivalent data in JSON format, either as a single file (unihan-data-json.zip) or as a set of files for each property (unihan-data.zip); they are available for download here. HTH...
– Mikaeru
Nov 8 at 14:39



Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

ữḛḳṊẴ ẋ,Ẩṙ,ỹḛẪẠứụỿṞṦ,Ṉẍừ,ứ Ị,Ḵ,ṏ ṇỪḎḰṰọửḊ ṾḨḮữẑỶṑỗḮṣṉẃ Ữẩụ,ṓ,ḹẕḪḫỞṿḭ ỒṱṨẁṋṜ ḅẈ ṉ ứṀḱṑỒḵ,ḏ,ḊḖỹẊ Ẻḷổ,ṥ ẔḲẪụḣể Ṱ ḭỏựẶ Ồ Ṩ,ẂḿṡḾồ ỗṗṡịṞẤḵṽẃ ṸḒẄẘ,ủẞẵṦṟầṓế

⃀⃉⃄⃅⃍,⃂₼₡₰⃉₡₿₢⃉₣⃄₯⃊₮₼₹₱₦₷⃄₪₼₶₳₫⃍₽ ₫₪₦⃆₠₥⃁₸₴₷⃊₹⃅⃈₰⃁₫ ⃎⃍₩₣₷ ₻₮⃊⃀⃄⃉₯,⃏⃊,₦⃅₪,₼⃀₾₧₷₾ ₻ ₸₡ ₾,₭⃈₴⃋,€⃁,₩ ₺⃌⃍⃁₱⃋⃋₨⃊⃁⃃₼,⃎,₱⃍₲₶₡ ⃍⃅₶₨₭,⃉₭₾₡₻⃀ ₼₹⃅₹,₻₭ ⃌