SEO: Search Engine Optimization with High PageRank

[Image: Cartoon illustrating the basic principle of PageRank]

A PageRank results from a mathematical algorithm based on the webgraph, the graph created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page. Numerous academic papers concerning PageRank have been published since Page and Brin's original paper. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank. Other link-based ranking algorithms for Web pages include the HITS algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com), the IBM CLEVER project, and the TrustRank algorithm.

Algorithm

PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value. A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.

Simplified algorithm

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial PageRank of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75. Suppose instead that page B had a link to pages C and A, while page D had links to all three pages. Thus, upon the next iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A. In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by its number of outbound links L(). In the general case, the PageRank value for any page u can be expressed as:

PR(u) = PR(v1)/L(v1) + ... + PR(vn)/L(vn), where v1, ..., vn are the pages in Bu

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
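
The following short Python sketch runs one iteration of this simplified formula on the four-page example above. Since the text leaves it open, it assumes that page C links only to page A and that page A has no outbound links.

links = {                                     # page -> pages it links to
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
pr = {page: 0.25 for page in links}           # initial PageRank of 0.25 per page

def iterate(pr, links):
    new_pr = {page: 0.0 for page in pr}
    for v, targets in links.items():
        for u in targets:
            new_pr[u] += pr[v] / len(targets) # each target receives PR(v) / L(v)
    return new_pr

print(iterate(pr, links))  # A: 0.125 + 0.25 + 0.083 = 0.458, B: 0.083, C: 0.208, D: 0.0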

The Implementation of PageRank in the Google Search Engine

Regarding the implementation of PageRank, it is first of all important how PageRank is integrated into the general ranking of web pages by the Google search engine. The procedure has been described by Lawrence Page and Sergey Brin in several publications. Initially, the ranking of web pages by the Google search engine was determined by three factors:

1. Page-specific factors
2. Anchor text of inbound links
3. PageRank

Page-specific factors are, besides the body text, for instance the content of the title tag or the URL of the document. It is more than likely that since the publications of Page and Brin more factors have joined the ranking methods of the Google search engine, but this shall not be of interest here.
In order to provide search results, Google computes an IR score out of the page-specific factors and the anchor text of inbound links of a page, which is weighted by the position and accentuation of the search term within the document. This way the relevance of a document for a query is determined. The IR score is then combined with PageRank as an indicator for the general importance of the page. To combine the IR score with PageRank, the two values are multiplied. It is obvious that they cannot be added, since otherwise pages with a very high PageRank would rank high in search results even if the page is not related to the search query.
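
As a purely hypothetical sketch of this multiplicative combination (the actual weighting Google uses is not public), the final ranking score could look like this:

def ranking_score(ir_score, pagerank):
    # Multiplying rather than adding keeps an off-topic page with a huge PageRank
    # from outranking relevant pages: a near-zero IR score drags the product down.
    return ir_score * pagerank

print(ranking_score(0.9, 2.0))    # relevant page, moderate PageRank  -> 1.8
print(ranking_score(0.05, 20.0))  # irrelevant page, huge PageRank    -> 1.0
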
Especially for queries consisting of two or more search terms, there is a far bigger influence of the content related ranking criteria, whereas the impact of PageRank is mainly visible for unspecific single word queries. If webmasters target search phrases of two or more words it is possible for them to achieve better rankings than pages with high PageRank by means of classical search engine optimisation.
If pages are optimised for highly competitive search terms, it is essential for good rankings to have a high PageRank, even if a page is well optimised in terms of classical search engine optimisation. The reason for this is that the increase in IR score diminishes the more often the keyword occurs within the document or the anchor texts of inbound links, in order to avoid spam by extensive keyword repetition. Thereby, the potential of classical search engine optimisation is limited and PageRank becomes the decisive factor in highly competitive areas.

The PageRank Display of the Google Toolbar

PageRank became widely known by the PageRank display of the Google Toolbar. The Google Toolbar is a browser plug-in for Microsoft Internet Explorer which can be downloaded from the Google web site. The Google Toolbar provides some features for searching Google more comfortably.
The Google Toolbar displays PageRank on a scale from 0 to 10. First of all, the PageRank of the currently visited page can be estimated by the width of the green bar within the display. If the user hovers the mouse over the display, the Toolbar also shows the numerical PageRank value.

Caution: The PageRank display is one of the advanced features of the Google Toolbar. And if those advanced features are enabled, Google collects usage data. Additionally, the Toolbar is self-updating and the user is not informed about updates. So, Google has access to the user's hard drive.

If we take into account that PageRank can theoretically have a maximum value of up to dN+(1-d), where N is the total number of web pages and d is usually set to 0.85, PageRank has to be scaled for the display on the Google Toolbar. It is generally assumed that the scaling is not linear but logarithmic. At a damping factor of 0.85 and, therefore, a minimum PageRank of 0.15 and an assumed logarithmic base of 6, we get a scaling as follows:

Toolbar PageRank Real PageRank
0/10 0.15 - 0.9
1/10 0.9 - 5.4
2/10 5.4 - 32.4
3/10 32.4 - 194.4
4/10 194.4 - 1,166.4
5/10 1,166.4 - 6,998.4
6/10 6,998.4 - 41,990.4
7/10 41,990.4 - 251,942.4
8/10 251,942.4 - 1,511,654.4
9/10 1,511,654.4 - 9,069,926.4
10/10 9,069,926.4 - 0.85 × N + 0.15
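
Assuming a logarithmic base of 6 and a minimum PageRank of 0.15, the ranges of the table above can be reproduced with a few lines of Python:

base, min_pr = 6, 0.15
for toolbar in range(11):
    lower = min_pr * base ** toolbar
    upper = min_pr * base ** (toolbar + 1)   # for 10/10 the actual upper bound is dN + (1-d)
    print(f"{toolbar}/10: {lower:,.2f} - {upper:,.2f}")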

It is uncertain whether a logarithmic scaling in a strictly mathematical sense actually takes place. More likely, there is a manual scaling which follows a logarithmic scheme, so that Google has control over the number of pages within the single Toolbar PageRank ranges. The logarithmic base for this scheme should be between 6 and 7, which can, for instance, be rudimentarily deduced from the number of inbound links of pages with a high Toolbar PageRank from pages with a Toolbar PageRank higher than 4, which are shown by Google using the link command.

The PageRank Display of the Google Directory (directory.google.com)

The Google Directory is a dump of the Open Directory Project (dmoz.org), which shows the PageRank for listed documents scaled and by means of a green bar, similarly to the Google Toolbar display. In contrast to the Toolbar, the scale is from 1 to 7. The exact value is not displayed, but if one is not sure by looking at the bar, it can be determined from the divided bar or from the widths of the single graphics in the source code of the page.

By comparing the Toolbar PageRank of a document with its Directory PageRank, a more exact estimation of a page's PageRank can be deduced, if the page is listed with the ODP. This connection was first mentioned by Chris Raimondi (www.searchnerd.com/pagerank).

Especially for pages with a Toolbar PageRank of 5 or 6, one can estimate whether the page is at the upper or the lower end of its Toolbar range. It shall be noted that for the comparison the Toolbar PageRank of 0 was not taken into account. It can easily be verified that this is appropriate by looking at pages with a Toolbar PageRank of 3. However, it has to be considered that for such a verification, pages of the Google Directory or the ODP with a Toolbar PageRank of 4 or lower have to be chosen, since otherwise no pages linked from there with a Toolbar PageRank of 3 will be found.

The Effect of Inbound Links
It has already been shown that each additional inbound link for a web page always increases that page's PageRank. Taking a look at the PageRank algorithm, which is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

one may assume that an additional inbound link from page X increases the PageRank of page A by

d × PR(X) / C(X)

where PR(X) is the PageRank of page X and C(X) is the total number of its outbound links. But page A usually links to other pages itself. Thus, these pages get a PageRank benefit also. If these pages link back to page A, page A will have an even higher PageRank benefit from its additional inbound link.
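
A hypothetical two-page example (not taken from the text above) illustrates this: pages A and B link only to each other, and an external page X with PR(X) = 10 and a single outbound link points to A. At d = 0.85 the direct transfer would be d × PR(X) / C(X) = 8.5, yet A gains considerably more because B feeds PageRank back to it:

d = 0.85
pr = {"A": 1.0, "B": 1.0}
for _ in range(100):                                     # simple fixed-point iteration
    pr["A"] = (1 - d) + d * (pr["B"] / 1 + 10.0 / 1)     # A's inbound links: B and the external page X
    pr["B"] = (1 - d) + d * (pr["A"] / 1)                # B's only inbound link: A
print(pr)  # PR(A) converges to about 31.6 and PR(B) to about 27.0; without X both would be 1.0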

The PageRank-1 Rule

Users of the Google Toolbar often notice that pages with a certain Toolbar PageRank have an inbound link from a page with a Toolbar PageRank which is higher by one. Some take this observation to doubt the validity of the PageRank algorithm presented here for the actual ranking methods of the Google search engine. It shall be shown, however, that the PageRank-1 rule complies with the PageRank algorithm. Basically, the PageRank-1 rule proves the fundamental principle of PageRank. Web pages are important themselves if other important web pages link to them. It is not necessary for a page to have many inbound links to rank well. A single link from a high ranking page is sufficient.

To show the actual consistency of the PageRank-1 rule with the PageRank algorithm, several factors have to be taken into consideration. First of all, the Toolbar PageRank is a logarithmically scaled version of real PageRank values. If the PageRank value of one page is one higher than the PageRank value of another page in terms of Toolbar PageRank, then its real PageRank can be higher by a factor of at least the logarithmic base used for the scaling of Toolbar PageRank. If the logarithmic base for the scaling is 6 and the Toolbar PageRank of a linking page is 5, then the real PageRank of the page which receives the link can be at least 6 times smaller and that page will still get a Toolbar PageRank of 4.

However, the number of outbound links on the linking page counteracts the effect of the logarithmic base, because the PageRank propagation from one page to another is divided by the number of outbound links on the linking page. But it has already been shown that the PageRank benefit from a link is higher than the PageRank algorithm's term d(PR(Ti)/C(Ti)) suggests. The reason is that the PageRank benefit for one page is further distributed to other pages within the site. If those pages link back, as usually happens, the PageRank benefit for the page which initially received the link is accordingly higher. If we assume that at a high damping factor the logarithmic base for PageRank scaling is 6 and a page receives a PageRank benefit which is twice as high as the PageRank of the linking page divided by the number of its outbound links, the linking page could have up to 12 outbound links and the Toolbar PageRank of the page receiving the link would still be at most one lower than the Toolbar PageRank of the linking page.

A number of 12 outbound links admittedly seems relatively small. But normally, if a page has an external inbound link, this is not the only one for that page. Most likely other pages link to that page and propagate PageRank to it. And if there are examples where a page receives a single link from another page and the PageRanks of both pages comply with the PageRank-1 rule although the linking page has many outbound links, this is first of all an indication that the linking page's Toolbar PageRank is at the upper end of its range. The linking page could be a "high" 5 and the page receiving the link could be a "low" 4. In this way, the linking page could have up to 72 outbound links. This number rises accordingly if we assume a higher logarithmic base for the scaling of Toolbar PageRank.
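
The arithmetic behind these figures can be summed up in a few lines (a logarithmic base of 6 and a link benefit of twice PR(linking page)/C(linking page) are assumed, as argued above):

log_base = 6        # assumed base of the Toolbar scaling
benefit_factor = 2  # assumed: the effective benefit is twice PR(linking page) / C(linking page)
# The receiving page's real PageRank may be up to log_base times smaller while its
# Toolbar value drops by only one point:
max_links = log_base * benefit_factor                       # 12 outbound links
# If the linking page is a "high" value and the receiving page a "low" one, the real
# PageRank ratio can approach log_base squared:
max_links_across_ranges = log_base ** 2 * benefit_factor    # 72 outbound links
print(max_links, max_links_across_ranges)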

The Effect of Outbound Links
Since PageRank is based on the linking structure of the whole web, it is inescapable that if the inbound links of a page influence its PageRank, its outbound links must also have some impact. To illustrate the effects of outbound links, we take a look at a simple example.
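
Since that example does not appear in this copy of the text, here is a hypothetical one: two separate two-page sites, A ↔ B and C ↔ D, at d = 0.85, where initially every page has a PageRank of 1. If page A then adds an outbound link to page C, the first site loses PageRank and the second one gains it, while the accumulated PageRank of all four pages stays at 4:

d = 0.85
pr = {p: 1.0 for p in "ABCD"}
for _ in range(200):                  # after A -> C is added, A links to B and C, so C(A) = 2
    pr["A"] = (1 - d) + d * pr["B"]                    # inbound: B
    pr["B"] = (1 - d) + d * pr["A"] / 2                # inbound: A (half of A's PageRank)
    pr["C"] = (1 - d) + d * (pr["D"] + pr["A"] / 2)    # inbound: D and A
    pr["D"] = (1 - d) + d * pr["C"]                    # inbound: C
print(pr, sum(pr.values()))  # roughly A 0.43, B 0.33, C 1.66, D 1.57; the sum stays 4.0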

Dangling Links

An important aspect of outbound links is the lack of them on web pages. When a web page has no outbound links, its PageRank cannot be distributed to other pages. Lawrence Page and Sergey Brin characterise links to those pages as dangling links.

The effect of dangling links shall be illustrated by a small example website. We take a look at a site consisting of three pages A, B and C. In our example, the pages A and B link to each other. Additionally, page A links to page C. Page C itself has no outbound links to other pages. At a damping factor of 0.75, we get the following equations for the single pages' PageRank values:

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.375 PR(A)

Solving the equations gives us the following PageRank values:

PR(A) = 14/23
PR(B) = 11/23
PR(C) = 11/23

So, the accumulated PageRank of all three pages is 36/23, which is just over half the value that we could have expected if page C had a link to one of the other pages. According to Page and Brin, the number of dangling links in Google's index is fairly high. One reason for this is that many linked pages are not indexed by Google, for example because indexing is disallowed by a robots.txt file. Additionally, Google meanwhile indexes several file types and not only HTML. PDF or Word files do not really have outbound links and, hence, dangling links could have a major impact on PageRank.
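
These values can be checked with a short fixed-point iteration at d = 0.75:

d = 0.75
pr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(200):
    pr["A"] = (1 - d) + d * pr["B"]       # inbound: B (B's only outbound link)
    pr["B"] = (1 - d) + d * pr["A"] / 2   # inbound: A (A links to B and C)
    pr["C"] = (1 - d) + d * pr["A"] / 2   # inbound: A
print(pr, sum(pr.values()))  # A = 14/23 = 0.609, B = C = 11/23 = 0.478, sum = 36/23 = 1.565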

In order to protect PageRank from the negative effects of dangling links, pages without outbound links have to be removed from the database until the PageRank values are computed. According to Page and Brin, the number of outbound links on pages with dangling links is thereby normalised. As shown in our illustration, removing one page can cause new dangling links and, hence, removing pages has to be an iterative process. After the PageRank calculation is finished, PageRank can be assigned to the formerly removed pages based on the PageRank algorithm. For this, as many iterations are needed as were needed for removing the pages. Regarding our illustration, page C could be processed before page B. At that point, page B has no PageRank yet and, so, page C will not receive any either. Then, page B receives PageRank from page A and, during the second iteration, page C also gets its PageRank.

Regarding our example website for dangling links, removing page C from the database results in pages A and B each having a PageRank of 1. After the calculations, page C is assigned a PageRank of 0.25 + 0.375 PR(A) = 0.625. So, the accumulated PageRank does not equal the number of pages, but at least all pages which have outbound links are not harmed by the dangling links problem.
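
The removal and later re-insertion described above can be sketched for this example as follows (the general iterative removal of newly created dangling pages is omitted here, since removing page C creates none):

d = 0.75
# Step 1: remove the dangling page C and compute PageRank for the remaining loop A <-> B.
pr = {"A": 1.0, "B": 1.0}
for _ in range(200):
    pr["A"] = (1 - d) + d * pr["B"]   # with C removed, A's outbound links are normalised to the single link to B
    pr["B"] = (1 - d) + d * pr["A"]
print(pr)                             # both pages converge to 1.0
# Step 2: re-insert page C and assign its PageRank from the finished values,
# using A's original number of outbound links (two).
pr["C"] = (1 - d) + d * pr["A"] / 2
print(pr["C"])                        # 0.25 + 0.375 * 1 = 0.625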

By removing dangling links from the database, they do not have any negative effects on the PageRank of the rest of the web. Since PDF files are dangling links, links to PDF files do not diminish the PageRank of the linking page or site. So, PDF files can be a good means of search engine optimisation for Google.

The Effect of the Number of Pages
Since the accumulated PageRank of all pages of the web equals the total number of web pages, it follows directly that an additional web page increases the added up PageRank for all pages of the web by one. But far more interesting than the effect on the added up PageRank of the web is the impact of additional pages on the PageRank of actual websites.

To illustrate the effects of additional web pages, we take a look at a hierarchically structured web site consisting of three pages A, B and C, which are joined by an additional page D on the hierarchically lower level of the site. The site has no outbound links.

Link Exchanges for the purpose of Search Engine Optimisation

For the purpose of search engine optimisation, many webmasters exchange links with others to increase link popularity. As it has already been shown, adding links within closed systems of web pages has no effects on the accumulated PageRank of those pages. So, it is questionable if link exchanges have positive consequences in terms of PageRank at all.

To show the effects of link exchanges, we take a look at an example of two hierarchically structured websites consisting of pages A, B and C and D, E and F, respectively. Within the first site, page A links to pages B and C and those link back to page A. The second site is structured accordingly, so that the PageRank values for its pages do not have to be computed explicitly. At a damping factor d of 0.5, the equations for the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C))
PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 2)

Solving the equations gives us the following PageRank values for the first site:

PR(A) = 4/3
PR(B) = 5/6
PR(C) = 5/6

and accordingly for the second site

PR(D) = 4/3
PR(E) = 5/6
PR(F) = 5/6

Now, two pages of our example sites start a link exchange. Page A links to page D and vice versa. If we leave the general conditions of our example the same as above and, again, set the damping factor d to 0.5, the equations for the calculations of the single pages' PageRank values are given by

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D) / 3)
PR(B) = PR(C) = 0.5 + 0.5 (PR(A) / 3)
PR(D) = 0.5 + 0.5 (PR(E) + PR(F) + PR(A) / 3)
PR(E) = PR(F) = 0.5 + 0.5 (PR(D) / 3)

Solving these equations gives us the following PageRank values:

PR(A) = 3/2
PR(B) = 3/4
PR(C) = 3/4
PR(D) = 3/2
PR(E) = 3/4
PR(F) = 3/4

We see that the link exchange makes pages A and D benefit in terms of PageRank, while all other pages lose PageRank. Regarding search engine optimisation, this means that exactly the opposite effect takes place compared to interlinking hierarchically lower pages internally. A link exchange is thus advisable if one page (e.g. the root page of a site) shall be optimised for one important key phrase.
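
Both sets of values can be verified with a small iterative computation at d = 0.5. The link map below encodes the situation after the exchange; dropping the links A → D and D → A reproduces the values before the exchange:

d = 0.5
links = {                                  # page -> pages it links to (after the exchange)
    "A": ["B", "C", "D"], "B": ["A"], "C": ["A"],
    "D": ["E", "F", "A"], "E": ["D"], "F": ["D"],
}
pr = {p: 1.0 for p in links}
for _ in range(200):
    new_pr = {}
    for page in links:
        incoming = sum(pr[v] / len(t) for v, t in links.items() if page in t)
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr
print(pr)   # A and D converge to 1.5, all other pages to 0.75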

A basic premise for the positive effects of a link exchange is that both involved pages propagate a similar amount of PageRank to each other. If one of the involved pages has a significantly higher PageRank or fewer outbound links, it is likely that all of its site's pages lose PageRank. Here, an important influencing factor is the size of a site. The more pages a web site has, the more PageRank from an inbound link is distributed to other pages of the site, regardless of the number of outbound links on the page that is involved in the link exchange. This way, the page involved in a link exchange benefits less from the exchange itself and cannot propagate as much PageRank to the other page involved in the exchange. All the influencing factors should be weighed against each other before one trades links.

Finally, it shall be noted that it is possible for all pages of a site to benefit from a link exchange in terms of PageRank, without the other site taking part in the link exchange losing PageRank. This may occur when the page involved in the link exchange already has a certain number of external outbound links which do not link back to that site. In this case, less PageRank is lost through the already existing outbound links.

The Yahoo Bonus and its Impact on Search Engine Optimization
Many experts in search engine optimization assume that certain websites obtain a special PageRank evaluation from the Google search engine which requires manual intervention and does not derive from the PageRank algorithm directly. Mostly, the directories Yahoo and the Open Directory Project (dmoz.org) are considered to get this special treatment. In the context of search engine optimization, this assumption would have the consequence that an entry in the above-mentioned directories would have a big impact on a site's PageRank.

Additional Factors Influencing PageRank

It has been widely discussed whether additional criteria beyond the link structure of the web have been implemented in the PageRank algorithm since the scientific work on PageRank was published by Lawrence Page and Sergey Brin. Lawrence Page himself outlines the following potential influencing factors in his patent specifications for PageRank:

Visibility of a link
Position of a link within a document
Distance between web pages
Importance of a linking page
Up-to-dateness of a linking page
First of all, the implementation of additional criteria in PageRank would result in a better approximation of human usage regarding the Random Surfer Model. Considering the visibility of a link and its position within a document implies that a user does not click on links completely at random, but rather follows links which are highly and immediately visible, regardless of their anchor text. The other criteria would give Google more flexibility in determining to what extent an inbound link of a page should be considered important, compared to the methods which have been described so far.

Whether or not the above-mentioned factors are actually implemented in PageRank cannot be proven empirically and shall not be discussed here. It shall rather be illustrated in which way additional influencing factors can be implemented in the PageRank algorithm and which options the Google search engine thereby gets in terms of influencing PageRank values.

Modification of the PageRank Algorithm

To implement additional factors in PageRank, the original PageRank algorithm has to be modified once again. Since we have to assume that PageRank calculations are still based on numerous iterations, and for the purpose of short computation times, we have to keep the number of database queries during the iterations as small as possible. Therefore, the following modification of the PageRank algorithm shall be assumed:

PR(A) = (1-d) + d (PR(T1)×L(T1,A) + ... + PR(Tn)×L(Tn,A))

Here, L(Ti,A) represents the evaluation of a link which points from page Ti to page A. L(Ti,A) replaces the PageRank weighting of page Ti by the number of outbound links on page Ti, which was given by 1/C(Ti). L(Ti,A) may consist of several factors, each of them having to be determined only once and then being merged into one value before the iterative PageRank calculation begins. So, the number of database queries during the iterations stays the same, although, admittedly, a much larger database has to be queried at each step in comparison to the computation using the original algorithm, since now there is an evaluation of each link instead of an evaluation of pages (by the number of their outbound links).

Different Evaluation of Links within a Document
Two of the criteria for the evaluation of links mentioned by Lawrence Page in his PageRank patent specifications are the visibility of a link and its position within a document. Regarding the Random Surfer Model, those criteria reflect the probability of the random surfer clicking on a link on a specific web page. In the original PageRank algorithm, this probability is given by the term (1/C(Ti)), whereby the probability is equal for each link on one page.

Assigning different probabilities to each link on a page can, for instance, be realized as follows:

We take a look at a web consisting of three pages A, B and C, where each of these pages has outbound links to both of the other pages. Links are weighted by two evaluation criteria X and Y. X represents the visibility of a link. X equals 1 if a link is not particularly emphasized, and 2 if the link is, for instance, bold or italic. Y represents the position of a link within a document. Y equals 1 if the link is on the lower half of the page, and 3 if the link is on the upper half of the page.
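
One plausible way to turn the X and Y factors into the link evaluations L(Ti,A) of the modified algorithm is to normalise each link's weight X·Y by the total weight of all outbound links on the linking page, so that the weights take over the role of 1/C(Ti). The concrete X and Y values in the following sketch are invented for illustration, since the text does not assign them to particular links:

d = 0.5
# (X, Y) per link: X = 1 normal / 2 emphasised; Y = 1 lower half / 3 upper half of the page.
raw = {
    ("A", "B"): (2, 3), ("A", "C"): (1, 1),   # assumption: A emphasises its link to B near the top
    ("B", "A"): (1, 3), ("B", "C"): (1, 1),
    ("C", "A"): (1, 3), ("C", "B"): (1, 1),
}
weight = {link: x * y for link, (x, y) in raw.items()}
out_total = {}
for (src, dst), w in weight.items():
    out_total[src] = out_total.get(src, 0) + w
L = {(src, dst): w / out_total[src] for (src, dst), w in weight.items()}  # takes the role of 1/C(Ti)

pr = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(200):
    pr = {p: (1 - d) + d * sum(pr[s] * L[(s, t)] for (s, t) in L if t == p) for p in pr}
print(pr)  # roughly A 1.18, B 1.10, C 0.72: the emphasised link A -> B lifts PR(B) above PR(C)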

The Weighting of Links Based on Content Analyses
That it is possible to weight single links within the PageRank technique has been shown on the previous page. The thought behind weighting links based on content analyses is to avoid the corruption of PageRank. By weighting links this way, it is theoretically possible to diminish the influence of links between thematically unrelated pages which have been set for the sole purpose of boosting the PageRank of one page. Indeed, it is questionable whether it is possible to realize such weighting based on content analyses.

The fundamentals of content analyses are based on Gerard Salton's work in the 1960s and 1970s. In his vector space model of information retrieval, documents are modeled as vectors which are built upon terms and their weighting within the document. These term vectors allow comparisons between the content of documents by, for instance, calculating the cosine measure (the inner product) of the vectors. In its basic form, the vector space model has some weaknesses. For instance, the assumption that the occurrence of the same words in two documents is an indicator of their similarity is often criticized. However, numerous enhancements have been developed that solve most of the problems of the vector space model.
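
A minimal sketch of the cosine measure on term vectors, using plain term counts as the simplest possible weighting:

import math
from collections import Counter

def cosine(doc_a, doc_b):
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)                 # inner product of the two term vectors
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine("pagerank measures link popularity", "link popularity and pagerank"))  # 0.75
print(cosine("pagerank measures link popularity", "cooking pasta with tomatoes"))   # 0.0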

One person who excelled at publications based on Salton's vector space model is Krishna Bharat. This is interesting because Bharat is meanwhile a member of Google's staff and, particularly, because he is deemed to be the developer of "Google News" (news.google.com). Google News is a service that crawls news websites, evaluates articles and then provides them categorized and grouped by subject on the Google News website. According to Google, all these procedures are completely automated. Certainly, other criteria, like for example the time when an article is published, are taken into account, but if there is no manual intervention, the clustering of articles is most certainly only possible if the contents of the articles are actually compared to each other. The question is: how can this be realized?

In their publication on a term vector database, Raymie Stata, Krishna Bharat and Farzin Maghoul describe how the contents of web pages can be compared based on term vectors and, particularly, they describe how some of the problems with the vector space model can be solved. Firstly, not all terms in documents are suitable for content analysis. Very frequent terms provide only little discrimination across vectors and, so, the most frequent third of all terms is eliminated from the database. Infrequent terms, on the other hand, do not provide a good basis for measuring similarity. Such terms are, for example, misspellings. They appear only on few pages which are likely unrelated in terms of their theme, but because they are so infrequent, the term vectors of the pages appear to be closely related. Hence, the least frequent third of all terms is also eliminated from the database.

Even if only one third of all terms is included in the term vectors, this selection is still not very efficient. Stata, Bharat and Maghoul perform another filtering, so that each term vector is based on a maximum of 50 terms. But these are not the 50 most frequent terms on a page. They weight a term by dividing the number of times it appears on a page by the number of times it appears on all pages, and the 50 terms with the highest weight are included in the term vector of a page. This selection actually allows a real differentiation between the content of pages.
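
A simplified sketch of that selection, assuming we already know how often each term occurs in the whole collection (the collection frequencies below are made up):

from collections import Counter

def top_terms(page_text, collection_freq, k=50):
    counts = Counter(page_text.lower().split())
    # weight = occurrences on the page / occurrences on all pages
    weighted = {t: n / collection_freq[t] for t, n in counts.items() if t in collection_freq}
    return sorted(weighted, key=weighted.get, reverse=True)[:k]

collection_freq = {"pagerank": 120, "link": 800, "the": 100000, "toolbar": 60}  # hypothetical figures
print(top_terms("the pagerank toolbar shows the pagerank of the page pagerank link",
                collection_freq, k=3))   # ['pagerank', 'toolbar', 'link']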

The methods described above are standard for the vector space model. If, for example, the inner product of two term vectors is rather high, the contents of the corresponding pages tend to be similar. This may allow content comparisons in many areas, but it is doubtful whether it is a good basis for weighting links within the PageRank technique. Most of all, synonyms and terms that describe similar things cannot be identified. Indeed, there are algorithms for word stemming which work well for the English language, but in other languages word stemming is much more complicated. Different languages are a general problem. Unless, for instance, brand names or loan words are used, texts in different languages normally do not contain the same terms. And if they do, these terms normally have a completely different meaning, so that comparing content in different languages is not possible. However, Stata, Bharat and Maghoul provide an approach to resolving these problems.

Stata, Bharat and Maghoul present a concrete application for their term vector database by classifying pages thematically. Bharat has also published on this issue together with Monika Henzinger, presently Google's Research Director, and they called it "topic distillation". Topic distillation is based on calculating so-called topic vectors. Topic vectors are term vectors, but they do not only include the terms of one page but rather the terms of many pages which are on the same topic. So, in order to create topic vectors, they have to know a certain amount of web pages which are on several pre-defined topics. To achieve this, they resort to web directories.

For their application, Stata, Bharat and Maghoul crawled about 30,000 links within each of the then 12 main categories of Yahoo to create topic vectors which include about 10,000 terms each. Then, in order to identify the topic of any other web page, they matched the corresponding term vector with all the topic vectors which were created from the Yahoo crawl. The topic of a web page derived from the topic vector which matched the term vector of the web page best. That such a classification of web pages works can again be observed by means of Google News. Google News does not only merge articles into one news topic, but also arranges them into the categories World, U.S., Business, Sci/Tech, Sports, Entertainment and Health. As long as this categorization is not based on the structure of the website where the articles come from (which is unlikely), the actual topic of an article does in fact have to be computed.

At the time he published on term vectors, Krishna Bharat did not work on PageRank but rather on Kleinberg's algorithm, so he was more interested in filtering off-topic links than in weighting links. But from classifying pages to weighting links based on content comparisons, there is only a small step. Instead of matching the term vectors of two pages, it is much more efficient to match the topics of two pages. We can, for instance, create a "topic affinity vector" for each page based on the degree of affinity between the page's term vector and all the topic vectors. The better the topic affinity vectors of two pages match, the more likely they are on the same topic and the higher a link between them should be weighted.
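
A rough sketch of this idea, with tiny invented topic vectors standing in for the ones built from a directory crawl: each page gets a topic affinity vector of similarities to the topics, and the link weight is the similarity of the two affinity vectors:

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical topic vectors; in practice they would be built from thousands of directory-listed pages.
topics = {
    "sports":   Counter({"team": 3, "match": 2, "league": 2}),
    "business": Counter({"market": 3, "company": 2, "shares": 2}),
}

def topic_affinity(page_text):
    terms = Counter(page_text.lower().split())
    return {name: cosine(terms, vec) for name, vec in topics.items()}

def link_weight(source_text, target_text):
    # The better the topic affinity vectors match, the higher the link is weighted.
    return cosine(Counter(topic_affinity(source_text)), Counter(topic_affinity(target_text)))

print(link_weight("the league match ended and the team won",
                  "the team signed a new player for the league"))   # 1.0 - same topic
print(link_weight("the league match ended and the team won",
                  "the company shares fell on the market"))         # 0.0 - unrelated topics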

Using topic vectors has one big advantage over comparing term vectors directly: a topic vector can include terms in different languages by being based on, for instance, the links in different national Yahoo versions. Divergent site structures of the national versions can most certainly be adapted manually. Even better may be using the ODP, because the structure of the sub-categories of the "World" category is based on the main ODP structure. In this way, measuring topic similarities between pages in different languages can be realized, so that a really useful weighting of links based on text analyses appears to be possible.

Is there an Actual Implementation of Themes in PageRank?

That neither the approach of Haveliwala nor the approach of Richardson and Domingos is utilized by Google is obvious. One would notice it when using Google. However, a weighting of links based on text analyses would not be apparent immediately. It has been shown that it is theoretically possible. But it is doubtful that it is actually implemented.

We do not want to claim that we have shown the only way of weighting links on the basis of text analyses. Indeed, there are certainly dozens of others. However, the approach that we provided here is based on publications by important members of Google's staff and, thus, we want to base a critical evaluation on it.

As always when talking about PageRank, there is the question whether our approach is sufficiently scalable. On the one hand, it causes additional memory requirements. After all, Stata, Bharat and Maghoul describe the system architecture of a term vector database which is different from Google's inverse index, since it maps from page ids to terms and, so, can hardly be integrated into the existing architecture. At the current size of Google's index, the additional memory requirements should be several hundred GB to a few TB. However, this should not be so much of a problem since Google's index is most certainly several times bigger. In fact, the time requirements for building the database and for computing the weightings appear to be the critical part.

Building a term vector database should be approximately as time-consuming as building an inverse index. Of course, many processes can probably be used for building both, but if, for instance, the weighting of terms in the term vectors has to differ from the weighting of terms in the inverse index, the time requirements remain substantial. If we assume that, as in our approach, content analyses are based on computing the inner products of topic affinity vectors which have to be calculated by matching term vectors and topic vectors, this process should be approximately as time-consuming as computing PageRank. Moreover, we have to consider that the PageRank calculations themselves become more complicated by weighting links.

So, the additional time requirements are definitely not negligible. This is why we have to ask ourselves whether weighting links based on text analyses is useful at all. Links between thematically unrelated pages, which have been set for the sole purpose of boosting the PageRank of one page, may be annoying, but most certainly they are only a small fraction of all links. Additionally, the web itself is completely inhomogeneous. Google, Yahoo or the ODP do not owe their high PageRank solely to links from other search engines or directories. A huge part of the links on the web are simply not set for the purpose of showing visitors the way to more thematically related information. Indeed, the motivation for placing links is manifold. Moreover, the probably most popular websites are completely inhomogeneous in terms of theme. Think about portals like Yahoo or news websites which contain articles that cover almost any subject of life. A strong weighting of links as it has been described here could influence those websites' PageRanks significantly.

If the PageRank technique shall not become totally futile, a weighting of links can only take place to a small extent. This, of course, raises the question if the efforts it requires are justifiable. After all, there are certainly other ways to eliminate spam which often comes to the top of search results through thematically unrelated and probably bought links.

Google's PageRank 0 Penalty

By the end of 2001, the Google search engine introduced a new kind of penalty for websites that use questionable search engine optimization tactics: a PageRank of 0. In search engine optimization forums it is called PR0 and this term shall also be used here. Characteristic of PR0 is that all or at least a lot of pages of a website show a PageRank of 0 in the Google Toolbar, even if they do have high-quality inbound links. Those pages are not completely removed from the index, but they are always at the end of search results and, thus, they are hard to find.

A PageRank of 0 does not always mean a penalty. Sometimes, websites which seem to be penalized simply lack inbound links with a sufficiently high PageRank. But if pages of a website which have formerly been placed well in search results suddenly show the dreaded white PageRank bar, and if there have not been any substantial changes regarding the inbound links of that website, this - according to the prevailing opinion - certainly means a penalty by Google.

We can do nothing but speculate about the causes of PR0 because Google representatives rarely publish new information on Google's algorithms. But, nonetheless, we want to give a theoretical approach for the way PR0 may work, because of its serious effects on search engine optimization.

The Background of PR0
Spam has always been one of the biggest problems that search engines have had to deal with. When spam is detected by search engines, the usual procedure is the banishment of those pages, websites, domains or even IP addresses from the index. But removing websites manually from the index always means a large commitment of personnel. This causes costs and definitely runs contrary to Google's scalability goals. So, it appears to be necessary to filter spam automatically.

Filtering spam automatically carries the risk of penalizing innocent webmasters and, hence, the filters have to react rather cautiously to potential spam. But then, a lot of spam can pass the filters and some additional measures may be necessary. In order to filter spam effectively, it might be useful to take a look at links.

That Google uses link analysis in order to detect spam has been confirmed more or less clearly in WebmasterWorld's Google News Forum by a Google employee who posts as "GoogleGuy". Over and over again, he advises webmasters to avoid "linking to bad neighbourhoods". In the following, we want to specify what "linking to bad neighbourhoods" means and, more precisely, we want to discuss how an identification of spam can be realized by the analysis of link structures. In particular, it shall be shown how entire networks of spam pages, which may even be located on a lot of different domains, can be detected.

BadRank as the Opposite of PageRank

The theoretical approach for PR0 as it is presented here was initially brought up by Raph Levien (www.advogato.org/person/raph). We want to introduce a technique that - just like PageRank - analyzes link structures, but, that unlike PageRank does not determine the general importance of a web page but rather measures its negative characteristics. For the sake of simplicity this technique shall be called "BadRank".

BadRank is in principle based on "linking to bad neighbourhoods". If one page links to another page with a high BadRank, the first page gets a high BadRank itself through this link. The similarities to PageRank are obvious. The difference is that BadRank is not based on the evaluation of the inbound links of a web page but on its outbound links. In this sense, BadRank represents a reversal of PageRank. In a direct adaptation of the PageRank algorithm, BadRank would be given by the following formula:

BR(A) = E(A) (1-d) + d (BR(T1)/C(T1) + ... + BR(Tn)/C(Tn)) where

BR(A) is the BadRank of page A,
BR(Ti) is the BadRank of pages Ti which are outbound links of page A,
C(Ti) here is the number of inbound links of page Ti and
d is the again necessary damping factor.
In the previously discussed modifications of the PageRank algorithm, E(A) represented the special evaluation of certain web pages. Regarding the BadRank algorithm, this value reflects whether a page was detected by a spam filter or not. Without the value E(A), the BadRank algorithm would be useless, because it would be nothing but another analysis of link structures which would not take any further criteria into account.

By means of the BadRank algorithm, first of all, spam pages can be evaluated. A filter assigns a numeric value E(A) to them, which can, for example, be based on the degree of spamming or maybe, even better, on their PageRank. Thereby, again, the sum of all E(A) has to equal the total number of web pages. In the course of an iterative computation, BadRank is not only transferred to pages which link to spam pages. In fact, BadRank is able to identify regions of the web where spam tends to occur relatively often, just as PageRank identifies regions of the web which are of general importance.

Of course, BadRank and PageRank have significant differences, especially because of using outbound and inbound links, respectively. Our example shows a simple, hierarchically structured website that reflects common link structures pretty well. Each page links to every page which is on a higher hierarchical level and on its branch of the website's tree structure. Each page links to the pages which are arranged hierarchically directly below it and, additionally, pages on the same branch and the same hierarchical level link to each other.

The following table shows the distribution of inbound and outbound links for the hierarchical levels of such a site.

Level   Inbound links   Outbound links
0 6 2
1 4 4
2 2 3

As is to be expected, regarding inbound links, a hierarchical gradation from the index page downwards takes place. In contrast, we find the highest number of outbound links on the website's mid-level. We can see similar results when we add another level of pages to our website while the above-described linking rules stay the same.

Level   Inbound links   Outbound links
0 14 2
1 8 4
2 4 5
3 2 4

Again, there is a concentration of outbound links on the website's mid-level. But most of all, the outbound links are much more evenly distributed than the inbound links. If we assign a value of 100 to the index page's E(A) in our original example, while all other values E equal 1 and if the damping factor d is 0.85, we get the following BadRank values:

Page BadRank
A 22.39
B/C 17.39
D/E/F/G 12.21

First of all, we see that the BadRank distributes from the index page among all other pages of the website. The combination of PageRank and BadRank will be discussed in detail below but, no matter how the combination is realized, it is obvious that the two can neutralize each other very well. After all, we can assume that a page's PageRank also decreases the lower its hierarchy level is, so that a PR0 can easily be achieved for all pages.
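
The values above can be reproduced by iterating the BadRank formula over the example site's link structure (E(A) = 100 for the index page, E = 1 otherwise, d = 0.85, and C(Ti) taken as the number of inbound links of Ti):

d = 0.85
links = {                          # page -> pages it links to, per the linking rules described above
    "A": ["B", "C"],
    "B": ["A", "C", "D", "E"], "C": ["A", "B", "F", "G"],
    "D": ["A", "B", "E"], "E": ["A", "B", "D"],
    "F": ["A", "C", "G"], "G": ["A", "C", "F"],
}
inbound = {p: sum(p in targets for targets in links.values()) for p in links}
E = {p: 1.0 for p in links}
E["A"] = 100.0                     # the index page was caught by a spam filter
br = {p: 1.0 for p in links}
for _ in range(200):
    br = {p: E[p] * (1 - d) + d * sum(br[q] / inbound[q] for q in links[p]) for p in links}
print(br)   # A = 22.4, B = C = 17.4, D = E = F = G = 12.2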

If we now assume that the hierarchically inferior page G links to a page X with a constant BadRank BR(X)=10, whereby the link from page G is the only inbound link for page X, and if all values E for our example website equal 1, we get, at a damping factor d of 0.85, the following values:

Page BadRank
A 4.82
B 7.50
C 14.50
D 4.22
E 4.22
F 11.22
G 17.18

In this case, we see that the distribution of BadRank is less homogeneous than in the first scenario. Nonetheless, a distribution of BadRank among all pages of the website takes place. Indeed, the relatively low BadRank of the index page A is remarkable. It could be a problem to neutralize its PageRank, which should be higher compared to the rest of the pages. This effect is not really desirable, but it reflects the experiences of numerous webmasters. Quite often, we can see the phenomenon that all pages except for the index page of a website show a PR0 in the Google Toolbar, whereby the index page often has a Toolbar PageRank between 2 and 4. Therefore, we can probably assume that this special variant of PR0 is not caused by the detection of the according website by a spam filter, but rather that the site received a penalty for "linking to bad neighbourhoods". Indeed, it is also possible that this variant of PR0 occurs when only hierarchically inferior pages of a website get trapped in a spam filter.

The Combination of PageRank and BadRank to PR0

If we assume that BadRank exists in the form presented here, there is now the question in which way BadRank and PageRank can be combined, in order to penalize as many spammers as possible while at the same time penalizing as few innocent webmasters as possible.

Intuitively, implementing BadRank directly in the actual PageRank computations seems to make sense. For instance, it is possible to calculate BadRank first and then divide a page's PageRank by its BadRank each time in the course of the iterative calculation of PageRank. This would have the advantage that a page with a high BadRank could pass on just a little PageRank, or none at all, to the pages it links to. After all, one can argue that if one page links to a suspect page, all the other links on that page may also be suspect.

Indeed, such a direct connection between PageRank and BadRank is very risky. Most of all, the actual influence of BadRank on PageRank cannot be estimated in advance. It has to be considered that we would create a lot of pages which cannot pass on PageRank to the pages they link to. In fact, these pages are dangling links, and as has been discussed in the section on outbound links, it is absolutely necessary to avoid dangling links while computing PageRank.

So, it would be advisable to have separate iterative calculations for PageRank and BadRank. Combining them afterwards can, for instance, be based on simple arithmetical operations. In principle, a subtraction would have the desirable consequence that relatively small BadRank values can hardly have a large influence on relatively high PageRank values. But there would certainly be a problem achieving PR0 for a large number of pages by using subtraction. We would rather see a PageRank devaluation for many pages.

Achieving the effects that we know as PR0 seems easier to realize by dividing PageRank by BadRank. But this would imply that BadRank receives an extremely high importance. However, since the average BadRank equals 1, a big part of the BadRank values is smaller than 1 and, so, a normalization is necessary. Probably, normalizing and scaling BadRank to values between 0 and 1, so that "good" pages have values close to 1 and "bad" pages have values close to 0, and subsequently multiplying these values with PageRank would supply the best results.

A very effective and easy-to-realize alternative would probably be a simple stepped evaluation of PageRank and BadRank. It would be reasonable that if BadRank exceeds a certain value, it always leads to a PR0. The same could happen when the ratio of PageRank to BadRank is below a certain value. Additionally, it would make sense that if BadRank and/or the ratio of BadRank to PageRank is below a certain value, BadRank has no influence at all. Only if none of these cases occurs would an actual combination of PageRank and BadRank - for instance by dividing PageRank by BadRank - be necessary. In this way, all unwanted effects could be avoided.
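
A hypothetical sketch of such a stepped evaluation (all threshold values are invented for illustration):

def combined_rank(pagerank, badrank,
                  hard_badrank_limit=50.0,    # hypothetical threshold: very high BadRank -> PR0
                  hard_ratio_limit=0.1,       # PageRank / BadRank below this -> PR0
                  ignore_badrank_below=1.0,
                  ignore_ratio_above=10.0):
    if badrank > hard_badrank_limit or pagerank / badrank < hard_ratio_limit:
        return 0.0                            # the dreaded PR0
    if badrank < ignore_badrank_below or pagerank / badrank > ignore_ratio_above:
        return pagerank                       # BadRank has no influence at all
    return pagerank / badrank                 # otherwise actually combine the two values

print(combined_rank(pagerank=3.0, badrank=0.5))   # harmless page: keeps its PageRank of 3.0
print(combined_rank(pagerank=3.0, badrank=6.0))   # suspect page: 3.0 / 6.0 = 0.5
print(combined_rank(pagerank=3.0, badrank=80.0))  # heavily penalised page: PR0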

A Critical View on BadRank and PR0

How Google would realize the combination of PageRank and BadRank is of rather minor importance. Indeed, a separate computation and a subsequent combination of both has the consequence that it may not be possible to see the actual effect of a high BadRank by looking at the Toolbar. If a page has a high PageRank in the original sense, the influence of its BadRank can be negligible. But if another page links to it, this could have quite serious consequences.

An even bigger problem is the direct reversal of the PageRank algorithm as we have presented it here: just as an additional inbound link for one page can do nothing but increase this page's PageRank, an additional outbound link can only increase its BadRank. This is because of the addition of BadRank values in the BadRank formula. So, it does not matter how many "good" outbound links a page has - one link to a spam page can be enough to lead to a PR0.

Indeed, this problem may appear in exceptional cases only. By our direct reversion of the PageRank algorithm, the BadRank of a page is divided by its inbound links and single links to pages with high BadRank transfer only a part of that BadRank in each case. Google's Matt Cutts' remark on this issue is: "If someone accidentally does a link to a bad site, that may not hurt them, but if they do twenty, that's a problem." (searchenginewatch.com/sereport/02/11-searchking.html)

However, as long as all links are weighted uniformly within the BadRank computation, there is another problem. If two pages differ widely in PageRank and both have a link to the same page with a high BadRank, this may lead to the page with the higher PageRank suffering far less from the transferred BadRank than the page with the low PageRank. We have to hope that Google knows how to deal with such problems. Nevertheless it shall be noted that, regarding the procedure presented here, outbound links can do nothing but harm.

Of course, all statements regarding how PR0 works are pure speculation. But in principle, the analysis of link structures, similarly to the PageRank technique, should be the way Google deals with spam.

This page is taken from Wikipedia and Google Webmaster Tools.
