How to detect non-English (non-Latin) languages in a text column in PostgreSQL? [duplicate]
I have a table with two columns: an id and a text column. I want to keep only the rows whose text value is in English.
The languages I am talking about are the ones that use a non-Latin alphabet, such as Arabic, Chinese, and languages written in Cyrillic.
This question was asked around 2012, and I was wondering whether there is a newer solution that does not require handling it in another programming language.
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
1 Answer
It is not an easy problem. There are several language-detection libraries out there (e.g. langdetect), but they don't work inside the database, so you'd have to select all the records out, process them in another language, and then delete the ones that fail the test. Furthermore, the accuracy is not great and decreases as the text gets shorter; if your texts are just a couple of words, the accuracy is pretty horrible.
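The select-out, test, then delete flow described above could look roughly like the sketch below. It is only an illustration: `ids_to_delete` and `toy_is_english` are hypothetical names, and the detector is a stand-in that treats pure-ASCII text as English, where a real language-detection library would be plugged in instead. Fetching the rows and issuing the delete through a database driver is left out.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, str]  # (id, text), mirroring the two-column table in the question


def ids_to_delete(rows: Iterable[Row], is_english: Callable[[str], bool]) -> List[int]:
    """Return the ids of rows whose text fails the supplied English test."""
    return [row_id for row_id, text in rows if not is_english(text)]


def toy_is_english(text: str) -> bool:
    # Hypothetical placeholder detector: pure-ASCII text counts as English.
    # A real detector (e.g. a language-detection library) would go here.
    return text.isascii()


# Sample rows standing in for the result of a SELECT id, text query.
rows = [(1, "hello world"), (2, "привет мир"), (3, "你好"), (4, "مرحبا")]
doomed = ids_to_delete(rows, toy_is_english)
print(doomed)
```

The resulting ids would then be handed to a delete statement. Keeping the detector as a parameter makes it easy to swap the placeholder for something stricter later.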
Actually, I do not care what languages they are; I just want to keep the English! I do have Chinese and Arabic in my data, which I want to remove.
– Raha1986
Sep 6 '18 at 9:26
If they’re all non-Latin, it’s easy.
– Amadan
Sep 6 '18 at 9:27
So you think that for the non-Latin ones I still need to write code?
– Raha1986
Sep 6 '18 at 9:34
No, a simple regexp should suffice. It's the languages that use a subset of the English alphabet that are the worst to detect.
– Amadan
Sep 6 '18 at 17:59
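As a rough illustration of the "simple regexp" idea, here is a minimal Python sketch. The three code-point ranges (the Cyrillic, Arabic, and CJK Unified Ideographs blocks) are an assumption chosen to match the scripts named in the question, not an exhaustive list, and `has_non_latin` is a hypothetical helper name.

```python
import re

# Assumed ranges for illustration only:
#   U+0400-U+04FF  Cyrillic
#   U+0600-U+06FF  Arabic
#   U+4E00-U+9FFF  CJK Unified Ideographs
NON_LATIN = re.compile(r"[\u0400-\u04FF\u0600-\u06FF\u4E00-\u9FFF]")


def has_non_latin(text: str) -> bool:
    """True if the text contains at least one character from the ranges above."""
    return NON_LATIN.search(text) is not None


rows = [
    (1, "hello world"),
    (2, "привет"),
    (3, "你好"),
    (4, "مرحبا"),
    (5, "gracias"),
]

# Keep only the rows with no character from the listed scripts.
kept = [(row_id, text) for row_id, text in rows if not has_non_latin(text)]
print(kept)
```

Note that "gracias" survives the filter: a script check like this cannot tell English apart from other languages written in the Latin alphabet, which is the hard case mentioned in the comment.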
One easier approach is to use the neural network exposed by the Google Translate API to determine the language. Otherwise, it depends on which character set you use (Unicode, UTF-8, or UTF-16); see w3schools.com/html/html_charset.asp or en.wikipedia.org/wiki/List_of_Unicode_characters. It can be confusing to determine which language a text belongs to. For example, the word "gracias" is represented by the same characters in ASCII and UTF-8. For non-Latin and non-Cyrillic languages, if you use Unicode, the text can be mapped easily.
– Mount Mani
Sep 6 '18 at 9:21