Every time we search something like ‘Why isn’t 11 pronounced onety-one’ or ‘Why isn’t there an E grade’ on Google, the algorithm behind the search engine runs through its corpus of data to find the answers which are most suitable for the question to give the user. This blog will explore the Google Pagerank system, in a simplified form with keywords and equations to explain how the algorithm functions.
Pagerank is an original algorithm used only by Google created in 1998. Due to the vast amount of data it holds, the search engine faces space-time complexity. The space complexity is the Pagerank is an original algorithm used only by Google created in 1998. The purpose of PageRank is to find links to and from pages and to determine the authority, reliability, popularity and importance, of each page. To do so, the search engine performs querying and indexing. The query class performs operations in the user’s word and the Index class sorts data about the content of each page then score these pages.
To demonstrate how this algorithm works, an example of a corpus will be used to show what happens to the words which are being passed through. Usually, the corpus would contain a whole website or file but for the simplicity of the example, the corpus will only contain phrases.
These phrases include “Dogs, dogs, dogs”, “The running dogs”, “Adopt a dog” and “Cute video of a dog running”.
There are four main tasks the Query class has to perform
Using the example of the running dog, here is what each function does.
- Parsing: extracting text from XML format from the input to be processed
- Tokenising: removing punctuation and white space
- Removing stop words such as but not limited to ‘like’, ‘the’ or ‘that’
- Stemming words: reducing inflected words to their most simplified form
e.g: running → run
Similarly, the rest of the words would count as “ ‘Dog’, ‘Dog’, ‘Dog’”, “ ‘Run’, ‘Dog’ “, “ ‘Adopt’, ‘dog’ ”, “ ‘Cute’, ‘Video’, ‘Dog’, ‘Run’”.
The Indexing class does all the calculations to determine the importance of a specific word on a page. This number is then used to score how important this page is relative to the keyword searched by the user.
The first thing that the algorithm needs to find is the Term Frequency, how many times a word appears in the document and corpus. This is calculated with the equation
TF = x/y, where
x = specific times a word appears
y = specific number of times the most used word appears
Taking the word run as an example from the phrase ‘The Running Dog’, this would give us the TF value of ¼ or 0.25
After that, the algorithm will find the Inverse Document Frequency (IDF), how many times the word shows up in the document and the whole corpus with the equation
IDF = log(n/n(i))
n = total number of documents in a corpus
i = word in corpus
n(i) = number of documents containing word i in corpus
Taking the word ‘run’ as an example, this would give us the IDF value of log(4/2)) or log(2) which gives the IDF value of 0.301.
After getting those two results, the word relevance can be scored with the equation
Relevance = TF * IDF
This method is then repeated with the number of keywords there are on a page. Using the word ‘run’ from the phrase ‘The Running Dog’ from the example gives us a relevance score of 0.075
After calculating how important a page is relative to a keyword, its importance to each other, the page’s authority, is calculated. This process can be simplified in a set of rules:
- The more links the page has to other pages, the higher the score it receives
- If a more authoritative page, page P, links to page X, page X will gain the equivalent level of authority as page P
- If page P only links to a few pages, the scores of these pages with each other are higher
- If page P is closer to page X (fewer clicks to go from one to another), there is a higher PageRank score
With this in mind, we can calculate the PageRank score.
First of all, we have to calculate the weight of the page in question.
w(pm) = weight of page P gets from page M
n(pm) = number of unique page links between page P and page M
t = total number of pages in corpus
e = a given constant
If page m links to page p, this can be calculated with the equation
The weight which the page ‘The running Dog’ gets from ‘Dogs, Dogs, Dogs’ would be
Which gives us the weight of 0.268. The weight of the Page ‘The running Dog’ would be the same as the page ‘Dogs, Dogs, Dogs’ because the unique number of page links would still be 2.
If page m doesn’t link to page p, this can be calculated with the equation
The weight the page ‘The running Dog’ would get from the other page it is not linked to is 0.0375 from 0.15/4
Finally, the Pagerank score r(p) can be calculated with
This gives the page ‘The running Dog’ a score of 0.268*2 + 0.0375 = 0.6095
We can also calculate distance between ranks
rp = current rank
rp’ = previous rank
With all algorithms, there will be edge cases, cases where your code will break. Here are the solutions for when that happens
- If Page A has many links with page B, it will all be counted as one link
- If the Page links to nothing, it will be linked to everything to get the lowest score
- If the Page links to itself, the link will be ignored
Something we have to keep in mind as users of this algorithm is the time and space complexity. The space complexity is the amount of space required to solve a computational problem and the time complexity is the amount of time required to solve it. These values to measure the time and space the algorithm takes up can be represented with big O – a notation used to describe the time-space complexity. The big O value for PageRank’s space complexity is O(N), meaning that it takes up as much space as the number of pages in Google’s corpus. The time complexity’s value is O(k*N) where k represents the specific number of repeats of the code that needs to be run on page N. To better imagine the Big O complex, an example of a linear time task would be someone checking from a line of students to find all the male students. Using N to represent the number of students, they would have the complete the task of checking N times to finish with the array of students.
This is because having a system as large as the Google search engine will take a lot of energy and fuel. Computer Scientists hold a large responsibility to keep the algorithm as efficient as possible to not only save time for the user but also the energy that is used and the carbon footprint of the machines that are working through it.