kirill
Posts: 4
Joined: Thu Sep 20, 2012 6:32 pm

General question to hash algorithms

Fri Sep 28, 2012 6:38 pm

Hi,

I noticed that sites similar to OpenSubtitles use different hashing algorithms.

For example, TheSubDB just computes an MD5 hash over the first and last 64 KB of the file. The file size is not used. Their Python example is misleading.

http://thesubdb.com/api/
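
For readers following along, here is a minimal sketch of the scheme described above: an MD5 digest over the first and last 64 KB, with the file size left out of the digest. The function name `subdb_style_hash` and the handling of files shorter than 64 KB are assumptions made for illustration, not taken from TheSubDB's own code.

```python
import hashlib
import os

READ_SIZE = 64 * 1024  # 64 KB taken from each end of the file


def subdb_style_hash(path):
    """Return the MD5 hex digest of the first and last 64 KB of a file.

    Hypothetical helper: the file size is only used to locate the tail,
    it is not mixed into the digest.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(READ_SIZE)
        # Assumption: for files shorter than 64 KB, start the tail at offset 0.
        f.seek(max(0, size - READ_SIZE))
        tail = f.read(READ_SIZE)
    return hashlib.md5(head + tail).hexdigest()
```

Calling `subdb_style_hash("movie.avi")` (a placeholder path) returns a 32-character hex string.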

Is there an advantage to the OpenSubtitles hashing, which does a CRC64 (?)?
Were there many hash collisions in the past, or why else the additional "moviebytesize"?
Moviebytesize is already added into the hash, so why is it needed for the search?

Do you think the SubDB hashing has advantages, or is it flawed in a way that will cause hash collisions in the future?

Sublight also does the hashing differently: http://www.sublight.si/Article/6/How-to ... -hash.aspx

oss
Site Admin
Posts: 5890
Joined: Sat Feb 25, 2006 11:26 pm
Contact: Website

Re: General question to hash algorithms

Tue Oct 16, 2012 7:01 pm

Hi Kirill,

Good question, too bad I didn't see it earlier. Anyway, here is the explanation. Read about the OpenSubtitles hash here: http://trac.opensubtitles.org/projects/ ... ourceCodes

First of all, I didn't develop this hash myself. It is based on Gabest's MPC, which already had a database of hashes back then, so I thought it would be a good idea to make it a standard rather than develop yet another hash, as others did and are still doing.

So far there have been no collisions. The search itself works perfectly on the hash alone, but we added the file size as a required parameter (why not, since everybody has this information).

As you can see, there are now many hashes floating around, and almost everybody uses their own. I didn't want this to happen (even OpenSubtitles could have come up with a new hash back then, but that's another story), but I don't control other programmers. AFAIK there is no thorough comparison between the older CRC64 and the younger MD5 to say which is "stronger". If I were implementing a hash today, it would use at least SHA1. One problem with CRC64 might be its implementation in some programming languages, for example the PHP API doesn't support it, but we now have source code in almost every language...

eduo
Posts: 716
Joined: Sat Feb 10, 2007 1:40 am
Location: Information Technology
Contact: ICQ Website Yahoo Messenger

Re: General question to hash algorithms

Sun Nov 04, 2012 5:36 pm

Quick note:

What makes OS's hash impervious to collisions is, precisely, the addition of the file size. It makes sure that files in formats which can change in the middle but not at the ends are nonetheless treated as different files. Adding it as a separate field ensures that the new hash never collides.
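
To illustrate the point about the middle of a file, here is a small sketch using in-memory byte strings instead of real video files. It assumes a digest taken only over the first and last 64 KB; the names `edge_digest`, `file_a` and `file_b` are made up for this example.

```python
import hashlib

EDGE = 64 * 1024  # only this many bytes from each end are hashed


def edge_digest(data: bytes) -> str:
    """MD5 over the first and last 64 KB only (hypothetical helper)."""
    return hashlib.md5(data[:EDGE] + data[-EDGE:]).hexdigest()


head = b"H" * EDGE
tail = b"T" * EDGE

# Two "files" with identical ends but a different middle and length.
file_a = head + b"a" * 1000 + tail
file_b = head + b"b" * 5000 + tail

# The edge-only digests are identical...
same_edges = edge_digest(file_a) == edge_digest(file_b)

# ...but a lookup key that also carries the size keeps them apart.
key_a = (edge_digest(file_a), len(file_a))
key_b = (edge_digest(file_b), len(file_b))
```

In this sketch `same_edges` is true while `key_a` and `key_b` differ. Two files of exactly the same size that differ only in the middle would still share a key, so the size narrows the problem rather than removing it.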

The standard Capiscuas came up with for OpenSubtitles is faster than MD5 hashing (the byte size is required anyway to locate the last 64 KB, so it's no extra processing) and less prone to collisions. The file size could be removed and collisions most likely still wouldn't happen, but since it's always an automated process, playing it safe is better.

Sublight at the time made their own hashing different from OpenSubtitles' because they were called out for ripping OS off. Sublight's hashing is meant to be a quick hashing mechanism, yet it is ridiculously slower (a lot more of the file must be read and special video recognition libraries must be used). Also, the video duration depends on the local library used, and seconds might get rounded off, or the local machine may lack the necessary codecs, whereas Gabest's hash can be calculated just by having access to the file data itself.

SubDB's hash is not bad, just slower. I have in the past considered adding SubDB support to SolEol and refrained from doing so for other reasons. Sublight's hash is mediocre and, to my eyes, a reflection of the style and principles of its creator.

The problem with any hashing, though, is that once it becomes too popular you end up with a mediocre program supporting it and "tainting" your database. In OpenSubtitles this happened when SubPlayer (or BSPlayer, I can't recall) added automatic upload of subtitles without any control or quality confirmation from the user. To my eyes that was the worst possible use of the API, and we've been paying for it ever since (not to mention that it is an action worthy of the "malware" description for any player that does it).
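
As a rough sketch of why the size costs nothing extra: the size is needed to seek to the last 64 KB, and it can also seed the running total. The code below follows my understanding of the publicly posted reference code, in which the hash is a 64-bit wrapping sum of little-endian 8-byte words rather than a true CRC. The function name and the rejection of files under 128 KB are assumptions for illustration.

```python
import os
import struct

CHUNK_SIZE = 64 * 1024           # bytes read from each end of the file
MASK_64 = 0xFFFFFFFFFFFFFFFF     # keep the running sum within 64 bits


def opensubtitles_style_hash(path):
    """Return (hex_hash, size) for a file (hypothetical helper).

    The total starts at the file size, then every little-endian 64-bit
    word of the first and last 64 KB is added, wrapping at 2**64.
    """
    size = os.path.getsize(path)
    if size < 2 * CHUNK_SIZE:
        # Assumption: files too small for two full chunks are rejected.
        raise ValueError("file too small to hash")

    total = size
    with open(path, "rb") as f:
        for offset in (0, size - CHUNK_SIZE):
            f.seek(offset)
            chunk = f.read(CHUNK_SIZE)
            for (word,) in struct.iter_unpack("<Q", chunk):
                total = (total + word) & MASK_64

    return "%016x" % total, size
```

Returning the size next to the hash mirrors the search described in this thread, where both values are sent together.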
http://eduo.info/
OpenSubtitles from your desktop: SolEol for Mac/Windows/Linux (http://eduo.info/soleol/)
My current episode processing work flow (http://forums.plexapp.com/index.php?showtopic=325&st=0&p=2480&#entry2480)

macofaco
Posts: 68
Joined: Mon Sep 22, 2008 8:31 pm
Contact: Website

Re: General question to hash algorithms

Mon Feb 25, 2013 12:20 am

Sublight basically has two versions of its hash algorithm: an old one and a new one :)

The new version is much, much faster to calculate and very easy to implement. It is described on page 6 of the document http://www.sublight.si/Documents/API/Su ... rvices.pdf (the document is almost two years old).

Sublight 3 can work with both versions, but the upcoming Sublight 4 will probably support only the newer one.
