hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Subtitle language: country codes

Tue Jun 23, 2015 8:48 pm

Hi.

I was wondering what happens with the subtitle language. Could you set the country code?

In the locale internationalisation (en_GB) system there is a placeholder for country code (e.g. the "GB" part). Perhaps there are languages that don't have this problem but Spanish is among the most affected. The language differences between South America and Spain are too big to keep them hidden. But the language is just "Spanish". There is no country information. It would be nice to have "en_GB" separated from "en_US" and "es_ES" separated from "es_CO", "es_MX", etc.

SmallBrother
Site Admin
Posts: 3726
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Subtitle language: country codes

Wed Jun 24, 2015 7:43 pm

A few more languages also have different 'versions'. England/USA, Netherlands/Belgium, Germany/Switzerland, Portugal/Brazil and I am sure a bunch more. Differences vary from minor to completely different words and/or grammar. On OpenSubtitles.org only Portuguese is split (Brazil and Portugal).

I don't know how 'extreme' the differences are in Spanish, but the question is if things would get better by splitting subtitles into regional versions of that language. Theoretically it might solve the problem of those language differences, but realistically I am afraid it would (only) create more chaos - a lot more moderating work for admins, and for the downloader more to choose from. Besides, what would I do, if I don't find my desired subs in ES_CO, but there is one available in ES_MX?

A maybe similar situation is with Dutch, as spoken in the Netherlands versus the north of Belgium. The differences go beyond minor ones like the s and the z in English or "colour" versus "color". Some words are totally different and a typical NL word may sound 'funny' to a Belgian person and vice versa. The theoretical solution is to avoid these typical words and use 'standard language'; this is a guideline for Dutch subtitles, but it is very hard to know if the language I am speaking is standard or regionally colored.

And think about this: for Dutch, Amsterdam has some slang which is used only in Amsterdam and which someone in Groningen might not understand. Maybe the same goes for the language used in Mexico City versus the countryside of Mexico. Should we also have ES_MX_MC and ES_MX_CS?

So in general I think a better solution is to make subs using as much as possible 'standard language'. And for as far as that doesn't work, accept the differences, which may be 'funny', but comprehensible.

Just from curiosity... can you give examples of the differences within Spanish you are talking about?
Nowadays a VPN is a must for everyone. A VPN allows you safe surfing and protects you against spying governments and companies.
I advise AirVPN - from € 2,75 per month. Click the below banner for more info.


Image

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Subtitle language: country codes

Thu Jun 25, 2015 7:41 am

I agree with SmallBrother. On the other hand, if the differences are big enough (as with Portuguese vs Brazilian Portuguese), it is possible to split a language. But then - who will check all subs in that language to move them to another language?

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Thu Jun 25, 2015 9:21 am

One example from "28 days":

Se rompió un caño de agua
y el tren se demoró en Rye.

In Spain we would say:

Se ha roto una cañería…
y el tren se ha parado en Rye,

It's not that I can't understand it. It is, as you say, that it sounds funny or strange.

One tense difference: "rompió" -> "ha roto" and two vocabulary differences: "caño" -> "cañería", "demorar" -> "parar". It's not incorrect. It's only a matter of use. But you know it is not "Spanish from Spain". As you'll probably notice that I'm not English.

There are many vocabulary differences. One typical English example is "cupboard" - "closet" or "refrigerator" - "icebox". Sometimes you can even find the same word with different "standard" meaning. For example, the verb "parar" (to stop) has completely different meaning in Latin America. That's why they use "detener" or "demorar" instead. I don't know. But I can tell in 10 seconds that it's not es_ES.

If I get one of these "español latino" (from some American country) subtitles I can cope with the language differences. I can understand it. But it would be nice to know.

Yeah, es_ES_Madrid_MyStreet... it sounds to me a bit too much :-D You have to stop at some point. But es_ES or es_MX, es_AR... doesn't seem too much. I think it's different enough to be explicit. And I don't think the difference is less significant than that between pt_BR and pt_PT. If you want to be consistent it should be done for every language.
oss wrote: "But then - who will check all subs in that language to move it to another language ?"
Who rates subtitles or marks them as bad?
At first you could leave them in a generic container "es" or "es_*", without country code. If you look for "es_ES" and there isn't any you could fall back to generic "es". Yes, I know it means some extra work to implement it in searching. But I don't think it's too much.
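
A minimal sketch of that fallback in Python, just to show how little logic it needs. The in-memory index, the file names and the function name are made up for illustration; nothing here reflects how the site actually stores or searches subtitles.

```python
from typing import Dict, List

# Hypothetical index: language tag -> subtitle file names.
# "es" is the generic container for subtitles without a country code.
SUBTITLES: Dict[str, List[str]] = {
    "es": ["28_days.generic.srt"],
    "es_MX": ["28_days.mx.srt"],
}


def find_subtitles(tag: str, index: Dict[str, List[str]]) -> List[str]:
    """Return subtitles for the exact tag, else fall back to the bare language."""
    exact = index.get(tag, [])
    if exact:
        return exact
    # Strip the country part: "es_ES" -> "es".
    language = tag.split("_", 1)[0]
    return index.get(language, [])
```

With this data, a search for "es_MX" returns the Mexican file, while a search for "es_ES" finds nothing specific and falls back to the generic "es" entry.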

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Thu Jun 25, 2015 11:19 am

And this, I was thinking, would bring up another question.

There could be some subtitle that is clearly "español latino", that is, you know it is from some South American country. But it might not be so easy to classify it as es_CO (Colombia) or es_VE (Venezuela). The same goes for es_AR and es_CL (Argentina and Chile), or Argentina and Uruguay. The language differences could be less notable between those countries, so it could be hard to assign a subtitle to only one country.

Then I suppose you could leave it as generic Spanish or choose one country from the various possibilities.
Another option would be to assign not just one country to the file but a list. But that would complicate classifying and searching too much. I don't know. It would be something like hashes?

I think one country with wildcards in search, or several alternatives in search would suffice. And you could always leave out the country information.

Yes, life is tough. :-)

SmallBrother
Site Admin
Posts: 3726
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Subtitle language: country codes

Thu Jun 25, 2015 12:02 pm

Hector, from the examples you give, I think the differences are big enough to maybe do something about it. I don't know about the differences between pt-pt and pt-br, but it may be similar. Same for NL Dutch versus Flemish - mostly no problem to understand, but clearly made by a Dutch or Belgian person.

Oss has a very practical point - and you a very practical answer ;-) But you were saying "it would be nice to know" and that made me think a possible solution would be NOT to SPLIT languages, but to ADD an optional field with region info.

For a Spanish subtitle it could be
Language (mandatory, obviously): Spanish
Language region (optional): Standard(?) / Spain / South America / Mexico / Colombia / Peru / Bolivia / Etc.1 / Etc.2 / Etc.3

A search can be done for all Spanish subs, and optionally filtered by the region field.
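
As a rough Python sketch of that idea (the `Subtitle` class, its field names and the `search` function are invented for illustration, not taken from the site's code), the region stays optional on both the record and the search:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtitle:
    title: str
    language: str                  # mandatory, e.g. "Spanish"
    region: Optional[str] = None   # optional, e.g. "Spain"; None means unknown


def search(subs: List[Subtitle], language: str,
           region: Optional[str] = None) -> List[Subtitle]:
    """Return all subtitles in a language, optionally narrowed to one region."""
    hits = [s for s in subs if s.language == language]
    if region is not None:
        hits = [s for s in hits if s.region == region]
    return hits
```

Calling `search(subs, "Spanish")` returns every Spanish subtitle, including those with no region set; `search(subs, "Spanish", "Spain")` keeps only the ones explicitly tagged "Spain".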

And we would have to find out for which other languages we would need something similar.

What is oss's opinion? Good idea? Possible to implement technically?

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Thu Jun 25, 2015 12:35 pm

Yes, I can tell you the differences are big. At least Spain - Latin America. But I'm sure there are also some differences between Cuba and Argentina, for example. In some places they only differentiate "español latino" and "español de España" (Latin America as a whole versus Spain). But they could argue that MX is not the same as AR. I don't know about that.

My point is: the "Spanish" of many subtitles sounds very weird to me. It's okay to have these but I'd just like to know in advance.

My other point is that I could provide that information. I know my subtitles are es_ES (Spain) but now I can't store it anywhere.

SmallBrother wrote: "a possible solution would be NOT to SPLIT languages, but to ADD an optional field with region info."
Yes, in terms of database administration, it would mean adding one field that could be "language_region" or "country". Or it could be part of language field and use regular expressions to extract the country part. It's up to you. Perhaps having the language and country separated would be clearer and easier to implement. So it would be very easy to leave it blank or NULL to get the current behaviour.

I think ISO 3166-1 alpha-2 could be a good choice.

SmallBrother
Site Admin
Posts: 3726
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Subtitle language: country codes

Thu Jun 25, 2015 6:24 pm

hector wrote: "My point is: the "Spanish" of many subtitles sounds very weird to me. It's okay to have these but I'd just like to know in advance. My other point is that I could provide that information. I know my subtitles are es_ES (Spain) but now I can't store it anywhere."
So, do I understand correctly that my solution would do the job?
- when uploading, be able to store that info
- when downloading, know what you will get (if the info is known, of course)
hector wrote: "I think ISO 3166-1 alpha-2 could be a good choice."
The problem is that countries only will not cover all situations, or at least be confusing. For Spanish, I understand there is a clear distinction between Latin America and Spain, but (maybe) not clearly between all individual Latin American countries. If that is the case it doesn't make sense to define the Spanish as "Mexican" or "Bolivian" and to force that. Maybe "Latin America" would be enough. But that's not a country...

Another problem arises with, for example, Dutch, with Netherlands Dutch and Belgium Dutch. In Belgium there are TWO languages, Flemish and French. So you would have nl_nl and nl_be, but maybe also fr_fr and fr_be. Switzerland is even worse ;-) and has three official and actually even four languages... Hm, hard to explain... am I clear?

Anyway, so two-letter country codes would not be enough. The actual way of storing a code in the database and to implement it into the web site would be an Oss Challenge ;-)

But I like the idea.

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Fri Jun 26, 2015 1:32 am

SmallBrother wrote: "The problem is that countries only will not cover all situations, or at least be confusing. For Spanish, I understand there is a clear distinction between Latin America and Spain, but (maybe) not clearly between all individual Latin American countries. If that is the case it doesn't make sense to define the Spanish as "Mexican" or "Bolivian" and to force that. Maybe "Latin America" would be enough. But that's not a country."
Yes, that would be solved by storing a list of countries the file language "complies" with. But I think it would be too much. In this case you could set a group of countries. ISO 3166-1 could provide a solution because it has some code ranges left free for users to assign. If I understand it right, you can use these codes for your own needs as you like. One of these ranges is every code with an "X" as the first character, the range "XA" to "XZ". So you could still use ISO 3166-1 codes for every country. You could say:

XA -> "Latin America"
XB -> Arab countries (MA, DZ, SA...)
XC -> another fancy group

You just store the code and do the conversion in both directions when necessary.
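
A small Python sketch of that two-way conversion. The group table is only the example assignment from this post, and the helper names are invented for illustration; the only fact relied on is that the "XA" to "XZ" range is user-assigned in ISO 3166-1.

```python
from string import ascii_uppercase
from typing import Dict

# Example assignments from this post; they are not part of any standard.
GROUP_NAMES: Dict[str, str] = {
    "XA": "Latin America",
    "XB": "Arab countries",
}
# Reverse table for converting a group name back to its code.
GROUP_CODES: Dict[str, str] = {name: code for code, name in GROUP_NAMES.items()}


def is_user_assigned(code: str) -> bool:
    """True for codes in the user-assigned range XA..XZ."""
    return len(code) == 2 and code[0] == "X" and code[1] in ascii_uppercase


def describe(code: str) -> str:
    """Turn a stored code into display text; real country codes pass through."""
    if is_user_assigned(code):
        return GROUP_NAMES.get(code, "unassigned group")
    return code


def code_for(name: str) -> str:
    """Turn a group name chosen in a form back into the stored code."""
    return GROUP_CODES[name]
```

Ordinary country codes such as "ES" or "MX" are left untouched, so the same two-letter field can hold either a real country or one of the custom groups.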

As I said, I don't know how big the language differences are between Cuba and Argentina (both are part of Latin America), for example. This should be answered by people in those countries: whether they consider their language to be the same as or different from that of another Latin American country.

The problem of this approach is that the number of country groups is very limited. I don't think you need more than 10 groups. Then it would be O.K.

I don't quite understand your question. You have nl_NL and nl_BE? No problem, that's the purpose. You have language="Dutch" and country="NL" or country="BE" or country="AW" (Aruba, near the coast of Venezuela). And in the same way you can have language="Limburgish" and country="NL". I think every combination is possible, though some might not make sense. For example, es_NL does not make sense.

SmallBrother
Site Admin
Posts: 3726
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Subtitle language: country codes

Fri Jun 26, 2015 9:44 am

hector wrote: "I don't quite understand your question. You have nl_NL and nl_BE? No problem, that's the purpose. You have language="Dutch" and country="NL" or country="BE" or country="AW" (Aruba, near the coast of Venezuela)."
Then maybe I misunderstood you. As for actual database contents, nl_nl and nl_be would be sufficient. I was thinking of the human approach, users will have to fill out language and country (maybe better called "language region" or so). Not only country, because sometimes it's just a part of a country (like Belgium with French and Flemish), and sometimes it's a group of countries (like Spanish in (maybe) all of Latin America).

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Fri Jun 26, 2015 10:40 am

Think of it rather as "language variant". Perhaps the term "country" can be misleading. It is all about language. That is one reason to store it in the language field.

By the way, how do you store the language internally? Can we talk about OS internals or is it secret information? I hope not. I like the "open" part of the name :-)

The other day I was surprised to find a subtitle in "Catalonian" and "Basque". What set of languages can you use? Could I set language to Aragonese (ISO 639 code is "an") for example? It's not that I'm going to, I was just curious about that.

oss
Site Admin
Posts: 5887
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: Subtitle language: country codes

Sat Jun 27, 2015 1:31 pm

Nice ideas, folks. I can explain how the internals work. When I started working on OpenSubtitles, I decided that ISO 639 (3-character codes) would be enough, so we have full support for it (and today I know it is not really perfect). Look here: http://www-01.sil.org/iso639-3/codes.asp

Well, we also use ISO 639-2 codes (stored in the same table), and when something is missing, I just add a code. For example, for Portuguese (Brazil) I made pb | pob. That's how I solve it.

I can create more "languages" (Espanol Latin America) and create codes for it, but here comes more troubles:
1. useragents (programs using OS) update very very slowly their internals
2. searching would be not perfect (1 subtitle can be only in one language)

What you propose is to have an ISO 639 code plus an ISO 3166 code (custom extended). Let's say I could do it (it's a really big change for me, but I think it is possible). Then Portuguese and pt-br must be changed too (or at least they should be), and similarly Chinese... and we are back at problem 1: useragents adopt changes very, very slowly, and POB would not work, because it would be pt_br, not pb_br.

We need it so that if a user searches for:
pt - it returns all Portuguese subtitles
pt_pt - it returns Portugal Portuguese subs
pt_br - it returns Brazil Portuguese subs

Adding another language in my table is not a problem, but I am not sure, if it will have more cons or more pros, because as I said, one subtitle with this approach can have just one language.
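
The search behaviour listed above (pt, pt_pt, pt_br) can be sketched in Python as a prefix match on the parts of the tag. This is only an illustration under the assumption that tags are stored as "language" or "language_country"; the function name is made up and nothing here describes the real search code.

```python
def matches(query: str, tag: str) -> bool:
    """True if a subtitle tagged `tag` should be returned for `query`.

    A bare language query ("pt") matches every variant; a query with a
    country ("pt_br") matches only that variant.
    """
    query_parts = query.lower().split("_")
    tag_parts = tag.lower().split("_")
    # The query must be a prefix of the tag, compared part by part,
    # so "pt" matches "pt_br" but never a different language like "ptx".
    return tag_parts[:len(query_parts)] == query_parts
```

Note one design choice in this sketch: a query for "pt_br" does not return subtitles tagged only "pt". Whether generic subtitles should also show up in that case is a separate decision.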

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Sat Jun 27, 2015 7:36 pm

I have to say that this is not my idea. It comes from the locale system:
http://linux.die.net/man/7/locale
http://linux.die.net/man/3/setlocale

From the setlocale(3) man page:
A locale name is typically of the form language[_territory][.codeset][@modifier], where language is an ISO 639 language code, territory is an ISO 3166 country code, and codeset is a character set or encoding identifier like ISO-8859-1 or UTF-8.

It is something very familiar to Unix and Linux users. It is a means of adapting program behaviour to local preferences (messages, currency representation, etc.). So, it's nothing new.
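
To make the quoted format concrete, here is a minimal Python sketch that splits a locale name into its four parts. The regular expression is a simplification written for this example (it expects a 2 or 3 letter language and a 2 letter territory, so special names like "C" or "POSIX" are rejected); it is not the parser any real libc uses.

```python
import re
from typing import Dict, Optional

# language[_territory][.codeset][@modifier]
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale(name: str) -> Dict[str, Optional[str]]:
    """Split a locale name into its parts; missing parts are None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a language[_territory] style locale: {name!r}")
    return match.groupdict()
```

For subtitles only the first two parts matter: `parse_locale("es_ES.UTF-8")` gives language "es" and territory "ES", and `parse_locale("es")` gives language "es" with no territory.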

Your solution may work, in fact it's working right now, but I think it can cause trouble if you keep doing it. In a word: it's not scalable. You can easily create collisions with real reserved codes.

I think this particular problem with Spanish is general to Spanish and Latin American users. I think a lot of people would benefit from this change. But perhaps it should be generalised. For example it could also be useful for Dutch (nl_NL, nl_BE), German (de_DE, de_AT), English (en_US, en_GB, en_AU) and perhaps other cases I don't know. Just consider it.

As I said you could just extend your language field and keep the country information together or split it and make two fields for language. I think I'd choose the former and use pattern matching. In SQL: "eng%" or "spa%". I don't know how much this could affect query execution time. I've never worked with such big databases.

SmallBrother
Site Admin
Posts: 3726
Joined: Sun Mar 04, 2012 12:59 pm
Location: Somewhere on this globe

Re: Subtitle language: country codes

Sun Jun 28, 2015 10:55 am

hector wrote: "As I said you could just extend your language field and keep the country information together"
oss wrote: "but here comes more troubles: 2. searching would be not perfect (1 subtitle can be only in one language)"
oss wrote: "Adding another language in my table is not a problem, but I am not sure, if it will have more cons or more pros, because as I said, one subtitle with this approach can have just one language."
That's why I think we should not 'create' more languages and make things difficult - for the database/coding and also for the up- and downloading user.

I think the solution is to ADD a field "region" or so. Basically, this field will be completely separate and independent from language, the same as for example the field "Translator". I can search for all Dutch subtitles for the movie Blabla, but (optionally) only if translated by Translator X. Same with languages: I can search for all Spanish subtitles for the movie Blabla, but (optionally) only if region is "Spain".

I also think a user will typically just search for Spanish or Dutch or whatever subtitles, not region specific subs. Like hector said: "it would be nice to know", but it's not crucial. Only after and if I get a bunch of results, I might wanna filter more.

hector
Posts: 370
Joined: Wed Jan 01, 2014 12:27 pm
Location: Spain

Re: Subtitle language: country codes

Wed Jul 01, 2015 1:24 am

SmallBrother wrote: "That's why I think we should not 'create' more languages and make things difficult - for the database/coding and also for the up- and downloading user"
Why "difficult"? Work, yes. Testing, yes. I think the API is the worst part. But I don't think it's difficult. The same for users. Only one more field to fill in the search form. You could even make it appear only in case the user selects "Spanish" or "Portuguese". If you leave it empty, then you search for 'spa%' (in case you store it in the language field). If you choose to store it in a new field the SQL query remains the same, I think.
oss wrote: "searching would be not perfect (1 subtitle can be only in one language)"
??? You just search more generally. Instead of an exact match, you use patterns. In case you store in one field:
Instead of:

select * from subtitle where language = 'spa'

you would use:

select * from subtitle where language like 'spa%'

In case you create a new field it does not change.
I don't know about the API. Perhaps it is more difficult to extend it.
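
A small self-contained demo of the pattern search, using Python's built-in sqlite3 module. The table layout and the tags ("spa", "spa_ES", ...) are assumptions made for this example only. One detail worth knowing: in SQL LIKE the underscore is itself a wildcard for exactly one character, so a literal underscore has to be escaped.

```python
import sqlite3
from typing import List

# Hypothetical table: the variant is stored inside the language field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subtitle (title TEXT, language TEXT)")
conn.executemany(
    "INSERT INTO subtitle VALUES (?, ?)",
    [
        ("a.srt", "spa"),      # generic Spanish, no country
        ("b.srt", "spa_ES"),
        ("c.srt", "spa_MX"),
        ("d.srt", "eng_GB"),
    ],
)


def titles(pattern: str) -> List[str]:
    """Return titles whose language matches a LIKE pattern ('\\' escapes)."""
    rows = conn.execute(
        "SELECT title FROM subtitle "
        "WHERE language LIKE ? ESCAPE '\\' ORDER BY title",
        (pattern,),
    )
    return [row[0] for row in rows]


all_spanish = titles("spa%")        # generic plus every variant
only_variants = titles("spa\\_%")   # literal underscore: variants only
spain_only = titles("spa\\_ES")     # one specific variant
```

Here `all_spanish` contains a.srt, b.srt and c.srt, `only_variants` drops the generic a.srt, and `spain_only` contains just b.srt.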