Backend Identification of Incorrect Languages

kextyn · Mon Jun 20, 2022 7:39 am

I download Thai subtitles for my family (I cannot read Thai myself). I have noticed that a lot of subtitles labeled as Thai are obviously not Thai, which is easy to tell because Thai uses its own distinctive characters. The majority of these appear to be in what I assume is Sinhalese. The two languages use very different character sets.

I'm wondering if it is possible to run some analysis on the backend to identify issues like this and either re-tag, delete, or mark the subs to eliminate them from searches for the particular language. I'm sure it can be done for many languages that have unique characters.

For example, on my system I run a simple find paired with grep to search for "ම" (a very common character in Sinhalese). Then I do another grep for files that do not contain common Thai characters (e.g. "ร"). Of course some files are encoded in different formats that require conversion before I'm able to search, but it's all trivial to automate.
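The same check can be sketched in Python. This is only a rough illustration of the idea, not how the site's backend works: the function names, the list of encodings to try, and the "most frequent script wins" rule are all assumptions made for the example. The only fixed facts are the Unicode block ranges for Thai (U+0E00 to U+0E7F) and Sinhala (U+0D80 to U+0DFF).

```python
from collections import Counter

# Unicode block ranges for the two scripts discussed in this thread.
SCRIPT_RANGES = {
    "thai": (0x0E00, 0x0E7F),
    "sinhala": (0x0D80, 0x0DFF),
}


def decode_subtitle(raw: bytes) -> str:
    """Decode subtitle bytes to text.

    Assumption: files are UTF-16 with a BOM, UTF-8, or legacy Thai
    TIS-620. Anything else falls back to lossy UTF-8 decoding.
    """
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw.decode("utf-16")
    for encoding in ("utf-8-sig", "tis_620"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def dominant_script(text: str):
    """Return the known script with the most characters, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def is_mislabeled(raw: bytes, expected: str) -> bool:
    """True if the file clearly uses a known script other than `expected`."""
    found = dominant_script(decode_subtitle(raw))
    return found is not None and found != expected
```

Calling is_mislabeled(data, "thai") on the bytes of each file tagged as Thai would flag the Sinhala ones for review. Counting characters instead of testing for a single letter avoids false hits when a file only contains a stray character from the other script.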

I'm not going to assume it's easy to perform this on the entire site, but is it feasible?

Mon Jun 20, 2022 9:21 am

Hi

Thanks for the feedback. Can you please point me to 1-2 Thai subtitles that have this problem?
Of course it is possible to run automatic detection for those. I just wonder whether they are automated uploads or come from users, and which channel was used.

thanks

kextyn · Mon Jun 20, 2022 1:15 pm

Of course, here's one movie with some issues:

Incorrect:

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

Correct (TIS-620 encoding):

https://www.opensubtitles.org/en/subtit ... endgame-th

My spouse tells me this one seems to be machine translated, though. It's understandable, but not normal speech.

Mon Jun 20, 2022 4:02 pm

Hi

Thanks, that is helpful. It seems all the wrong uploads are coming from MX Player. I will try to look into it once I have a bit of time.
