kextyn
Posts: 2
Joined: Mon Jun 20, 2022 7:19 am

Backend Identification of Incorrect Languages

Mon Jun 20, 2022 7:39 am

I download Thai subtitles for my family (I cannot read Thai myself). I have noticed that a lot of subtitles labeled as Thai are obviously not Thai, which is easy to tell because Thai uses its own distinctive characters. The majority of these have been in what I assume is Sinhalese. The two languages use very different scripts.

I'm wondering if it is possible to run some analysis on the backend to identify issues like this and either re-tag, delete, or mark the subs to eliminate them from searches for the particular language. I'm sure it can be done for many languages that have unique characters.

For example, on my system I run a simple find paired with grep to search for "ම" (a very common character in Sinhalese). Then I do another grep for files that do not contain common Thai characters (e.g. "ร"). Of course some files are encoded in different formats that require conversion before I'm able to search, but it's all trivial to automate.
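
A minimal Python sketch of the same kind of check, for illustration only. It assumes files are either UTF-8 or TIS-620 and classifies by counting characters in the Thai (U+0E00 to U+0E7F) and Sinhala (U+0D80 to U+0DFF) Unicode blocks. The function names are hypothetical and not part of any site tooling.

```python
from collections import Counter

# Unicode block ranges (inclusive) for the two scripts discussed above.
SCRIPT_RANGES = {
    "thai": (0x0E00, 0x0E7F),
    "sinhala": (0x0D80, 0x0DFF),
}


def decode_subtitle(raw: bytes) -> str:
    """Try UTF-8 first, then the legacy Thai codec TIS-620 (assumed fallback)."""
    for encoding in ("utf-8-sig", "tis_620"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")


def count_scripts(text: str) -> Counter:
    """Count how many characters fall into each known script block."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def dominant_script(raw: bytes):
    """Return the most frequent known script, or None if none was found."""
    counts = count_scripts(decode_subtitle(raw))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

A file tagged as Thai whose `dominant_script` result is `"sinhala"` would be a candidate for re-tagging. Counting whole blocks rather than a single character such as "ම" or "ร" makes the check less dependent on any one letter appearing in the file.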

I'm not going to assume it's easy to perform this on the entire site, but is it feasible?

oss
Site Admin
Posts: 5879
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Backend Identification of Incorrect Languages

Mon Jun 20, 2022 9:21 am

Hi

Thanks for the feedback. Can you please point to 1-2 Thai subtitles that have this problem?
Of course it is possible to run automatic detection for those. I just wonder whether these came from automated uploads or from users, and which channel was used.

thanks

kextyn
Posts: 2
Joined: Mon Jun 20, 2022 7:19 am

Re: Backend Identification of Incorrect Languages

Mon Jun 20, 2022 1:15 pm

Of course, here's one movie with some issues:

Incorrect:

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

https://www.opensubtitles.org/en/subtit ... endgame-th

Correct (TIS-620 encoding):

https://www.opensubtitles.org/en/subtit ... endgame-th

My spouse tells me, though, that this one seems to be machine translated. It's understandable, but not normal speech.

oss
Site Admin
Posts: 5879
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Backend Identification of Incorrect Languages

Mon Jun 20, 2022 4:02 pm

Hi

Thanks, that is helpful. It seems all the wrong uploads are coming from MX Player. I will try to look into it once I have a bit of time.
