eduo
Posts: 716
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology
Contact: ICQ Website Yahoo Messenger

Hash Algorithm in Plain English

Sat Mar 17, 2007 3:06 am

Hello

Can someone, *please*, explain the movie (and subtitle) hash algorithms in plain English?

I've checked the Wiki and, although there are code examples, they mostly don't work out of the box. The only one I've managed to compile properly is the C++ one, and the results I get are not even close to what SubDownloaderFrame.py is reporting.

So now I don't understand, in plain language, what the movie hash algorithm is supposed to be. I'm baffled.

I think it would be a good addition to the Wiki and the docs, personally.

I now have a command line tool that can't actually do anything because as much as I've moved things around none of the resulting hashes actually work.

FWIW, if anyone not in Windows wants to try and compile the C++ this is the code that finally did it for me (again, not working for me at the moment, the hashes are all wrong):

Code:

#include <inttypes.h> // The actual library that contains the proper data types
#include <iostream>
#include <fstream>
using namespace std;

int MAX(int x, int y)
{
    if((x) > (y))
        return x;
    else
        return y;
}

uint64_t compute_hash(ifstream& f)
{
    uint64_t hash, fsize;

    f.seekg(0, ios::end);
    fsize = f.tellg();
    f.seekg(0, ios::beg);

    hash = fsize;
    int tmp;
    int i;
    for(tmp = 0, i = 0; i < 65536/sizeof(tmp) && f.read((char*)&tmp, sizeof(tmp)); i++, hash += tmp);
    f.seekg(MAX(0, (uint64_t)fsize - 65536), ios::beg);
    for(tmp = 0, i = 0; i < 65536/sizeof(tmp) && f.read((char*)&tmp, sizeof(tmp)); i++, hash += tmp);
    return hash;
}

int main(int argc, char *argv[])
{
    ifstream f;
    uint64_t myhash;

    f.open("/Volumes/video/TV/Lost/Lost S03E01.avi", ios::in|ios::binary|ios::ate); // Replace as adequate, obviously
    if (!f.is_open()) {
        cerr << "Error opening file" << endl;
        return 1;
    }

    myhash = compute_hash(f);

    typedef char TCHAR;
    // Set a test var to ensure datatype is unsigned 64 int
    TCHAR *test = "1234567890123456789";
    printf("in main(): test = <%s>\n", test);
    printf("sizeof(uint64_t) = %d\n", sizeof(uint64_t));

    // Try all possible printf combinations we can think of
    printf("I64d: %I64d\n", myhash); // Borland BCC or MS VC++
    printf("Ld: %Ld\n", myhash); // Borland BCC
    printf("lld: %lld\n", myhash); // gcc
    printf("I64x: %016I64x(hex)\n", myhash); // Borland BCC or Microsoft VC++
    printf("Lx: %016Lx(hex)\n", myhash); // Borland BCC
    printf("llx: %016llx(hex)\n", myhash); // gcc

    f.close();
    return 0;
}
When I run the code above I get this:
in main(): test = <1234567890123456789>
sizeof(uint64_t) = 8
I64d: I64d
Ld: -1152560871
lld: 8279544385817
I64x: 000000000000000I64x(hex)
Lx: 00000000bb4d5119(hex)
llx: 00000787bb4d5119(hex)
Supposedly, at least the very last of these results should be the proper one in gcc. But no: what SubDownloader reports as the moviehash is 332c83338820e4f6.
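
On the printing side, C++ has a portable way to format a 64-bit value: the PRIx64 macro from <cinttypes>, which avoids guessing between %I64x, %Lx and %llx. The following is only a minimal sketch; format_hash is a made-up helper name, not something from SubDownloader or the wiki. It renders the value as 16 lowercase hex digits, the same shape as the 332c83338820e4f6 reported above.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a 64-bit hash as 16 zero-padded hex digits.
std::string format_hash(std::uint64_t hash) {
    char buf[17];  // 16 hex digits plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%016" PRIx64, hash);
    return std::string(buf);
}
```

Fed the decimal value from the "lld" line of the run above, format_hash(8279544385817) gives 00000787bb4d5119, which is the "llx" line. So the printing is consistent and the difference from SubDownloader has to come from the hash computation itself.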

I may be missing something non-obvious here. Could someone help me? English or Spanish is OK. I just need to know the actual algorithm for the movie hash in plain language. No code.

oss
Site Admin
Posts: 5882
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Sat Mar 17, 2007 9:22 pm

I can't help you much, but here is Gabest's mail (1 year old :)

Code:

size + 64bit chksum of the first and last 64k (even if they overlap because the file is smaller than 128k). The size of the file is also stored in the database besides the hash.

fh.hash = fh.size;
for(UINT64 tmp = 0, i = 0; i < 65536/sizeof(tmp) && f.Read(&tmp, sizeof(tmp)); fh.hash += tmp, i++);
f.Seek(max(0, (INT64)fh.size - 65536), CFile::begin);
for(UINT64 tmp = 0, i = 0; i < 65536/sizeof(tmp) && f.Read(&tmp, sizeof(tmp)); fh.hash += tmp, i++);
You can get the MPC source code and look for that routine. Also, if I may ask: are you developing some new software for us? :)
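
Read step by step, Gabest's description seems to say: start with the file size in bytes, add every 64-bit word of the first 64 KiB, then add every 64-bit word of the last 64 KiB, letting the 64-bit total wrap around on overflow. The C++ below is only a sketch of that reading, not the MPC source. The names movie_hash and sum_block are made up for illustration. Words are assumed to be little-endian (what the quoted Windows code would read on x86), and a trailing chunk shorter than 8 bytes is simply ignored, which the mail does not specify. One visible difference within this thread: the quoted snippet accumulates UINT64 values, while the C++ in the first post reads into an int, so the two add up words of different sizes.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sum up to 64 KiB of the file, starting at `offset`, as 64-bit words.
// Assumptions: little-endian words; a trailing partial word is ignored.
static std::uint64_t sum_block(std::ifstream& f, std::uint64_t offset) {
    constexpr std::size_t kBlock = 65536;
    std::vector<unsigned char> buf(kBlock);
    f.clear();  // reset EOF/fail bits left over from a previous short read
    f.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    f.read(reinterpret_cast<char*>(buf.data()),
           static_cast<std::streamsize>(kBlock));
    const std::size_t got = static_cast<std::size_t>(f.gcount());
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i + 8 <= got; i += 8) {
        std::uint64_t word = 0;
        for (int b = 7; b >= 0; --b) word = (word << 8) | buf[i + b];
        sum += word;  // unsigned arithmetic wraps modulo 2^64
    }
    return sum;
}

// Hypothetical entry point: hash = size + first 64 KiB + last 64 KiB.
// Returns false if the file cannot be opened.
bool movie_hash(const std::string& path, std::uint64_t& hash) {
    std::ifstream f(path, std::ios::binary);
    if (!f) return false;
    f.seekg(0, std::ios::end);
    const std::uint64_t size = static_cast<std::uint64_t>(f.tellg());
    hash = size;
    hash += sum_block(f, 0);
    hash += sum_block(f, size > 65536 ? size - 65536 : 0);
    return true;
}
```

For a file shorter than 64 KiB the second block starts at offset 0 again, so the same words are added twice; this is the "even if they overlap" case in the mail.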

eduo

Sat Mar 17, 2007 9:58 pm

os: I won't lie. After the last statement that all development will be Windows-only, I'm not sure whether I'd rather wait for an alternative service that is less hostile to other platforms.

Your idea is great, and it started wonderfully, in the contest and on the forja and SourceForge servers, and it looked like it would be a wonderful multi-platform tool.

Now it looks like it'll become a Windows-only tool and every other platform and user (some of whom have been here from the beginning) can go and screw themselves.

What I was asking for is "plain language". Not code. I've seen the wiki and I've seen this, but I was looking for a normal explanation of what the algorithm does. So far I haven't been able to match the hash from the snippets of code to the results I get from SubDownloader, so there's something I'm missing.

oss

Sun Mar 18, 2007 1:25 am

I explained it in plain English in the post before; just read it carefully, I think it is understandable. As for the hashes: I know that the hashes are good, so maybe you are doing something wrong. It is based on Gabest's Media Player Classic hash, and every version of the code should give the same results.

Personally I haven't tried it, but you can: take MPC and make the hash there, take SubDownloader and make the hash there, then compare them. The Delphi code works for sure, and the C/C++ should too.

About cross-platform: I replied to that in another post. Maybe it would be nice to have some code for Perl/PHP on the command line too. I know these languages, so I can code something, but I have no time now. Also, SubDownloader should work from the command line too.

Hm, maybe I will just make code snippets for PHP/Perl, so others can code the programs themselves.

eduo

Sun Mar 18, 2007 3:37 am

I'm sorry. I didn't notice the first part of the quote you provided. Only the code part. My mistake there. I'll try using that and come back afterwards.

oss

Sun Mar 18, 2007 1:05 pm

I am trying to make a Perl snippet to create this hash. Maybe I will do PHP later, so we will have the hashing algorithm in many languages. The C/C++ code should also be improved, for sure.

eduo

Sun Mar 18, 2007 8:42 pm

I thought all URLs of opensubtitles.org could give XML if /xml was appended at the end. It doesn't work everywhere, though.

If I try:
http://www.opensubtitles.org/en/search2 ... 20e4f6/xml

I am redirected to the xml version. But if I try a language-specific search:

http://www.opensubtitles.org/en/search2 ... 20e4f6/xml

I get redirected to the IMDB-ID page (which I'm not sure is a good thing, but it's a design decision).

I wonder if there is a way to put the moviehash and moviesize into some form of URL that shows the links to the zip files directly, instead of having to go through intermediate pages.

oss

Mon Mar 19, 2007 1:34 am

search2 is not what you are looking for. Also try simplexml ... opensubtitles is full of magic :)

http://www.opensubtitles.org/en/search/ ... /simplexml

Does this help?

eduo

Mon Mar 19, 2007 2:27 am

This is perfect. Is there a version of this page that also shows downloads, rating and date-uploaded?

I've sent a PM to bratao to see what he's doing and how much we can help each other. This is what I sent him (amended a little), in case it's useful for you as well:

Code:

Do you have any plans on the tool? What I wanted to do is the following:

CLI Side:

1.-Program to get the hash and filesize. I already have this in C++ and it's simple enough. Outputs filename, file size and hash on an input file.

2.-Program to fetch the list of subtitles available for a hash, filesize and language list. Outputs necessary data for a list to work afterwards. This can be done through xmlrpc or through an http request.

3.-Program to download the subtitles. Works in two modes:
a.-Takes a moviehash and a language list. Downloads either all subtitles for the hash for those languages or the highest-rated/most recent/most downloaded.
b.-Downloads specific subtitles (as output by step 2, which should provide a specific identifier for each anyway).
In both cases a rename would be done to conform to the original file and location. For different language files it would use language indicators, for multiple files for the same language it would use a numbering scheme, i.e.
movie.en.srt
movie.sp.1.srt
movie.sp.2.srt

4.-Program to upload the subtitles. Works in two modes:
a.-Uploads a subtitle based on subtitle name, imdb-id (generic subtitle). It would be a good thing if there was an official naming scheme, so as to use that. I'd include the imdb-id used by os right now plus the os for episodes in series optionally as part of the name for the subtitle to be uploaded.
b.-Uploads a subtitle based on moviehash, imdb-id (preferred). I think the website already checks a subhash and compares it to existing ones and if existing matches the movie to the subtitle through the hashes.

Of course, all of these behaviours could easily be in the same executable, depending on the switches used when invoking.

GUI Side:

Using the above CLI programs it would be easy to:

1.-Search for the subtitle or subtitles (multiple languages) for one or several movies (automatically downloading and renaming the subtitles so they work properly, as moviename.lang.srt, for example). The data is there, so it's just a matter of deciding what to do with it.
2.-Upload existing subtitles, with or without a matching movie. All subtitles should be matched to imdb before uploading (a requirement of opensubtitles at the moment) which can be done through a batch as well.
3.-Have a list of downloaded subtitles where they can be rated (opensubtitles.org REALLY needs some easy way to rate subtitles, and this could be it). You could see in a tab all the subtitles you've downloaded and from there be able to rate them from 1 to 10 and upload the rating to the website. This is a GOOD thing to have.

The GUI program could keep track of the files downloaded/uploaded. This way ratings could be set for downloaded subtitles and ratings could be seen for uploaded files, as well as comments.

The advantage of going this way is multiple:
-There can be GUI tools from other developers, using the same CLI tools.
-There can be Web tools that use the CLI tools.
-There could be plug-ins for VLC and MediaPlayer using the CLI tools.

I may be missing something, but I think this is a good path and the most flexible one. Especially because changes to the GUI would not affect the CLI and vice versa.
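
The renaming scheme described in the message above (movie.en.srt, movie.sp.1.srt, movie.sp.2.srt) could be sketched roughly as follows. This is an illustration only: subtitle_name is a hypothetical helper, and treating index 0 as "only one subtitle for this language, no number needed" is an assumption of the sketch, not something the plan spells out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build the subtitle file name next to the movie file.
// index == 0 means no number is appended (assumed convention).
std::string subtitle_name(const std::string& movie_path,
                          const std::string& lang, int index = 0) {
    const std::size_t slash = movie_path.find_last_of("/\\");
    const std::size_t dot = movie_path.find_last_of('.');
    std::string stem = movie_path;
    // Strip the extension only if the last dot belongs to the file name,
    // not to a directory earlier in the path.
    if (dot != std::string::npos &&
        (slash == std::string::npos || dot > slash)) {
        stem = movie_path.substr(0, dot);
    }
    std::string out = stem + "." + lang;
    if (index > 0) out += "." + std::to_string(index);
    return out + ".srt";
}
```

For example, subtitle_name("movie.avi", "en") yields movie.en.srt and subtitle_name("movie.avi", "sp", 2) yields movie.sp.2.srt, so the subtitle keeps the movie's location and base name.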
I just spent 3 hours fixing the subtitles for the movie Innerspace (which for some reason, I'm guessing OCR, had all "g" letters replaced by "r").

Remember, I speak Spanish. You can contact me in Spanish if you prefer via IM.

I can't dedicate much time to this, though, as you've probably noticed: it took me two weeks to come up with the basic C program that gives the hash :)

eduo

Sat Mar 31, 2007 12:42 am

bump.

Just a reminder for this question:
This is perfect. Is there a version of this page that also shows downloads, rating and date-uploaded?

eduo

Sat Mar 31, 2007 1:15 am

OS:

I tried downloading the sub from the test server using this method (I don't want to hit my maximum quota doing tests).

I tried this:

http://test.opensubtitles.org/en/search ... /simplexml

I don't get any results with that. I also tried sublanguageid-spa, sublanguageid-esp and sublanguageid-all, without results. I tried taking out the whole "sublanguageid" part, and no results either.

Does this search method work in the test server? I tried with the test file, if that isn't there I don't know what to do. Can I test with normal movie files?

Also, is there a simplesearch like this (moviehash and bytesize) that returns all languages?

Thanks in advance.

oss

Sat Mar 31, 2007 7:59 pm

hi,

If you want, I can give you better access to the normal server. The test server does not have many subtitles (it is on a completely different database/code/box), and with that URL you can't get results.

I suggest doing this:

1. upload some subtitles to the test server using your program
2. check whether they really exist on the web
3. try to search through XMLRPC
4. try to download them...

If something is not clear or doesn't work, just contact me and I will help.

eduo

Sat Mar 31, 2007 8:15 pm

os: I can't do that. I can't upload subtitles through SubDownloader. I thought this avi and its subtitles were in the test server. I'd guess that's what they're there for.

What's the point of having a test file and a test hash if they are not in the test server? :)

I can test against production, but there is a limit on the files I can download (I'd be downloading the same file, over and over again).

Can you confirm if this search method is available in the test server?

Also, can you confirm if there is any way of having this same search but showing ratings and upload-date?

eduo

Sun Jul 08, 2007 10:08 pm

Guys;

One question I didn't ask before and now I'm facing it.

How are hashes for subtitles calculated? They are too small for this routine to work with. I don't think this works for files below 65K.

How can I make a hash for a subtitle 28K big?

oss

Mon Jul 09, 2007 10:59 pm

eduo :)
you calculate the hash for the movie, not for the subtitles. And movies are usually bigger than 128 KB :)

It is a pity, but I can't find the link to the wiki for the hash source code.
