I download Thai subtitles for my family (I cannot read Thai myself). I have noticed that a lot of subtitles labeled as Thai are obviously not Thai, which is easy to spot because the Thai script is so distinctive. The majority of these have been in what I assume is Sinhalese. The two languages use completely different scripts.
I'm wondering if it is possible to run some analysis on the backend to identify mislabeled subtitles like these and either re-tag them, delete them, or flag them so they no longer show up in searches for that language. I'm sure this could be done for many languages that have their own script.
For example, on my system I run a simple find paired with grep to search for "ම" (a very common character in Sinhalese). Then I do another grep for files that do not contain common Thai characters (e.g. "ร"). Of course some files use a different character encoding and have to be converted before I'm able to search them, but it's all trivial to automate.
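To make the idea concrete, here is a rough Python sketch of the same check. Instead of looking for one specific character, it counts how many characters fall in the Unicode Thai block (U+0E00 to U+0E7F) versus the Sinhala block (U+0D80 to U+0DFF). The function names, the 0.5 threshold, and the cp874 fallback for legacy Thai encodings are all my own assumptions for illustration, not anything the site actually does.

```python
from collections import Counter

# Unicode block ranges (inclusive). More scripts could be added here.
SCRIPT_RANGES = {
    "thai": (0x0E00, 0x0E7F),
    "sinhala": (0x0D80, 0x0DFF),
}


def count_scripts(text):
    """Count characters belonging to each script in SCRIPT_RANGES."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def decode_subtitle(raw):
    """Decode raw bytes, trying UTF-8 first and then a legacy Thai code page.

    Assumption: non-UTF-8 files labeled Thai are most likely cp874.
    """
    for enc in ("utf-8-sig", "cp874"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")


def looks_mislabeled(raw, expected="thai", min_share=0.5):
    """Return True if the expected script is a minority of the known-script characters.

    min_share is a hypothetical threshold; it would need tuning on real data.
    """
    counts = count_scripts(decode_subtitle(raw))
    total = sum(counts.values())
    if total == 0:
        return False  # no characters from any listed script, so no evidence
    return counts[expected] / total < min_share
```

Calling `looks_mislabeled(open(path, "rb").read())` on each file tagged as Thai would give a list of candidates for review. Note that this only catches files written in a script listed in `SCRIPT_RANGES`; a file that is entirely Latin text would pass unnoticed unless more ranges are added.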
I'm not going to assume it's easy to perform this on the entire site, but is it feasible?