November 1st, 2017, 06:39 PM   #1
Newbie

Joined: Nov 2017
From: Sydney, Australia

Posts: 3
Thanks: 0

log2(3/2)=0.584

Hi All, I'm learning a basic course on Big Data and one of the lectures touches on Vector Model, and in that it has this equation.

Term     Doc Frequency    IDF
"new"    2                log2(3/2)=0.584

I understand why 3/2 as it's 2 occurrences of the term "new" in 3 documents. But 3/2=1.5, so how does log2(1.5)=0.584? Could someone break it down for me?

Thanks.
MGS
 November 1st, 2017, 06:45 PM #2
 November 1st, 2017, 06:53 PM #3
Ok, thanks, but I probably should have said explain it to me, not break it down.

EDIT: Ah, wait, so 2 to the power of 0.584963 = 1.5, or thereabouts?
November 1st, 2017, 07:04 PM   #4
Senior Member

Joined: Aug 2012

Posts: 1,638
Thanks: 415

That's the base 2 logarithm of 1.5. In other words, what power do I have to raise 2 to, in order to get 1.5?

For example, what is the base 2 log of 32? It's 5, because 5 is the power that we have to raise 2 to in order to get 32.

Now $2^0 = 1$ and $2^1 = 2$ so we'd expect the base 2 log of 1.5 to be somewhere between 0 and 1. So .5-something is about right.

Quote:
 Originally Posted by MarkGsargent Ah, wait, so 2 to the power of 0.584963 = 1.5, or thereabouts?
Yes.
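
As a quick numerical check of the explanation above, here is a minimal Python sketch using only the standard `math` module. The variable name `idf_new` is just for illustration and does not come from the lecture.

```python
import math

# Base-2 logarithm of 1.5: the power 2 must be raised to in order to get 1.5.
idf_new = math.log2(3 / 2)
print(round(idf_new, 6))

# Raising 2 to that power recovers 1.5 (up to floating-point rounding).
print(2 ** idf_new)

# The whole-number example: 2**5 == 32, so the base 2 log of 32 is 5.
print(math.log2(32))
```

The first value rounds to 0.584963, which the lecture table shows cut off at three decimals as 0.584.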

November 1st, 2017, 07:56 PM   #5
Newbie

Joined: Nov 2017
From: Sydney, Australia

Posts: 3
Thanks: 0

Ok, cool thanks. Now I just need to understand more why we even use log2 for this scenario.
November 1st, 2017, 09:58 PM   #6
Senior Member

Joined: Aug 2012

Posts: 1,638
Thanks: 415

Quote:
 Originally Posted by MarkGsargent Ok, cool thanks. Now I just need to understand more why we even use log2 for this scenario.
That's an application detail that I'm not familiar with. I looked up idf vector and apparently this has something to do with data retrieval. I actually used to know something about this. If you have a text document, you can make a vector where the coordinates are words and the values are the number of times each word appears. Then to determine how similar two documents are, you can take the dot product of the (normalized) vectors. This gives you the cosine of the angle between the vectors. The closer the value is to 1, the more similar the documents are.
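
The normalized dot product described here is usually called cosine similarity. A minimal Python sketch of the idea, assuming plain whitespace tokenization and raw word counts; the helper names `term_vector` and `cosine_similarity` and the three example documents are made up for illustration and are not from the lecture.

```python
import math
from collections import Counter


def term_vector(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Dot product of two count vectors after scaling each to length 1."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical example documents.
doc1 = term_vector("new york times")
doc2 = term_vector("new york post")
doc3 = term_vector("los angeles times")

print(cosine_similarity(doc1, doc2))  # shares two of three words
print(cosine_similarity(doc1, doc3))  # shares one of three words
```

With these documents the first pair scores about 2/3 and the second about 1/3, and a document compared with itself scores 1.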

This seems vaguely related to your question. Perhaps you can explain more about the context of your problem. I don't know what the base 2 log is for.

ps -- I found this ... https://en.wikipedia.org/wiki/Tf%E2%80%93idf It's interesting but it doesn't say anything about base 2 logs. Is any of this relevant to your question?
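
On the question of the base: assuming the lecture defines IDF as the log of (number of documents / number of documents containing the term), as the table in the first post suggests, switching the base only multiplies every IDF value by the same constant, so the ordering of terms does not change. A rough Python sketch; the function name `idf` and the second term's counts are hypothetical.

```python
import math


def idf(num_docs, doc_freq, base=2):
    """Inverse document frequency: log of (documents / documents with the term)."""
    return math.log(num_docs / doc_freq, base)


# From the question: "new" appears in 2 of 3 documents.
print(round(idf(3, 2), 3))  # 0.585 (the lecture shows it cut off at 0.584)

# Hypothetical second term that appears in only 1 of 3 documents.
rare_base2 = idf(3, 1, base=2)
rare_base_e = idf(3, 1, base=math.e)

# Both ratios equal the same constant, 1 / ln(2), regardless of the term.
print(idf(3, 2, base=2) / idf(3, 2, base=math.e))
print(rare_base2 / rare_base_e)
```

So base 2 here looks like a convention rather than something the method depends on.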

November 2nd, 2017, 08:04 AM   #7
Senior Member

Joined: May 2016
From: USA

Posts: 825
Thanks: 335

This is a pure guess, but the durations of binary searches would be conveniently measured in powers of 2, and that might lead to using log base 2.

