How to detect languages other than English (non-Latin scripts) in a text column in PostgreSQL? [duplicate]







I have a table with two columns: one is an id and the other is a text column. I want to keep only the rows whose text value is in English.



The languages I am talking about are the ones written in a non-Latin script, such as Arabic, Chinese, and Cyrillic.
This question was asked around 2012, and I was wondering whether there is a newer solution that does not require handling it in another programming language.







1 Answer



It is not an easy problem. There are several libraries for language detection out there (e.g. langdetect), but they don't work inside the database, so you'd have to select all the records out, process them in another language, and then delete the ones that fail the test. Furthermore, the accuracy is not great and decreases as the text gets shorter; if your texts are just a couple of words, the accuracy is pretty horrible.






One easier approach is to use the neural network exposed by the Google Translate API to determine the language. Alternatively, it depends on which character encoding you use (Unicode as UTF-8 or UTF-16); see w3schools.com/html/html_charset.asp or en.wikipedia.org/wiki/List_of_Unicode_characters. It can be confusing to determine which language a text belongs to. For example, the word "gracias" is written with the same characters as English text, whether in ASCII or UTF-8. For languages that are neither Latin nor Cyrillic, if you use Unicode, the characters can be mapped to a script easily.

– Mount Mani
Sep 6 '18 at 9:21






Actually, I do not care what languages they are; I just want to keep the English! I do have Chinese and Arabic in my data, which I want to remove.

– Raha1986
Sep 6 '18 at 9:26






If they’re all non-Latin, it’s easy.

– Amadan
Sep 6 '18 at 9:27






So you believe that for the non-Latin ones I still need to write code?

– Raha1986
Sep 6 '18 at 9:34






No, a simple regexp should suffice. It's the languages that use a subset of the English alphabet that are the worst to detect.

– Amadan
Sep 6 '18 at 17:59
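
To illustrate the "simple regexp" idea from the comments, here is a minimal sketch in Python. It is not taken from the thread: the names `NON_ASCII`, `looks_latin`, and the sample `rows` are made up for illustration, and it assumes that "non-Latin" can be approximated as "contains at least one non-ASCII character". The same idea should carry over to a regular-expression match inside PostgreSQL, but the exact pattern syntax there is not shown in the thread and would need checking.

```python
import re

# Assumption: a row counts as "non-Latin" if it contains any character
# outside the 7-bit ASCII range. This is an approximation, not a real
# language detector.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def looks_latin(text: str) -> bool:
    """Return True when the text contains only ASCII characters."""
    return NON_ASCII.search(text) is None


# Hypothetical sample data standing in for the (id, text) table.
rows = [
    (1, "Hello world"),
    (2, "مرحبا بالعالم"),
    (3, "你好，世界"),
    (4, "Привет, мир"),
    (5, "gracias"),
]

# Keep only the ids whose text passes the ASCII-only check.
kept = [row_id for row_id, text in rows if looks_latin(text)]
print(kept)  # [1, 5]
```

Row 5 survives even though it is Spanish, which is the hard case mentioned above: languages that use a subset of the English alphabet cannot be told apart by character range alone. The check also errs in the other direction, since English text containing accented letters or typographic quotes (for example "café") would be dropped.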
