
Backend Identification of Incorrect Languages

Posted: Mon Jun 20, 2022 7:39 am
by kextyn
I download Thai subtitles for my family (I cannot read Thai myself). I have noticed that a lot of subtitles labeled as Thai are obviously not Thai, which is easy to tell because the Thai script is so distinctive. Most of the mislabeled ones appear to be in what I assume is Sinhalese, which uses an entirely different script.

I'm wondering if it is possible to run some analysis on the backend to identify issues like this and either re-tag, delete, or mark the subs to eliminate them from searches for the particular language. I'm sure it can be done for many languages that have unique characters.

For example, on my system I run a simple find paired with grep to search for files containing "ම" (a very common character in Sinhalese). Then I run a second grep to keep only the files that do not contain common Thai characters (e.g. "ร"). Of course, some files use other encodings and need to be converted before I can search them, but it's all trivial to automate.
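To make the idea concrete, here is a rough Python sketch of the same kind of check. It is not what I actually run (that is just find and grep), and it generalizes the single-character test by counting characters in the Thai (U+0E00 to U+0E7F) and Sinhala (U+0D80 to U+0DFF) Unicode blocks instead. The function names are made up for illustration, and the encoding fallback (UTF-8 first, then TIS-620) is only an assumption about what the files typically use.

```python
from collections import Counter

# Unicode block ranges for the two scripts discussed in this thread.
SCRIPT_RANGES = {
    "thai": (0x0E00, 0x0E7F),
    "sinhala": (0x0D80, 0x0DFF),
}


def decode_subtitle(raw: bytes) -> str:
    """Decode subtitle bytes, trying UTF-8 first and then TIS-620.

    The order of encodings is an assumption, not a general-purpose detector.
    """
    for encoding in ("utf-8-sig", "tis-620"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever decodes and replace the rest.
    return raw.decode("utf-8", errors="replace")


def dominant_script(text: str):
    """Return the name of the most frequent known script, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def looks_mislabeled(raw: bytes, expected: str = "thai") -> bool:
    """Flag a file only when a different known script clearly dominates."""
    script = dominant_script(decode_subtitle(raw))
    return script is not None and script != expected
```

Calling looks_mislabeled(open(path, "rb").read()) on each file tagged as Thai would give a list of candidates for review. Files with no Thai or Sinhala characters at all are deliberately not flagged here, since those could just be files in an encoding the sketch does not handle.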

I'm not going to assume it's easy to perform this on the entire site, but is it feasible?

Re: Backend Identification of Incorrect Languages

Posted: Mon Jun 20, 2022 9:21 am
by oss
Hi

Thanks for the feedback. Can you please point to 1-2 Thai subtitles that have this problem?
Of course it is possible to run automatic detection for those. I just wonder whether they are automated uploads or come from users, and which channel was used.

thanks

Re: Backend Identification of Incorrect Languages

Posted: Mon Jun 20, 2022 1:15 pm
by kextyn
Of course, here's one movie with some issues:

Incorrect:

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

Correct (TIS-620 encoding):

https://www.opensubtitles.org/en/subtit ... endgame-th

However, my spouse tells me this one seems to be machine translated. It's understandable, but not natural speech.

Re: Backend Identification of Incorrect Languages

Posted: Mon Jun 20, 2022 4:02 pm
by oss
Hi

Thanks, that is helpful. It seems all the wrong uploads are coming from MX Player. I will try to look into it once I have a bit of time.