Web Scraping: Learn the Basics

Web scraping is a technique for fetching information from the internet and then extracting the parts you need. It is sometimes described as ‘beyond browser activity’: browsing without a browser. Once you understand how web scraping works and master the techniques for doing it, you will find a new sense of freedom in accessing information.

The most basic form of web scraping is copy and paste. You find useful information, select it, copy it, and paste it somewhere else. This basic technique doesn’t require any programming knowledge. I am sure everyone has done this kind of scraping ;).

But the most interesting part of web scraping is doing it programmatically. OK, let’s get started. This tutorial uses the Ruby programming language, so basic knowledge of Ruby is recommended but not required.

As an example, I will use a page from 4shared.com that I chose at random. The goal of this scraping exercise is to get the direct download link from the URL below.

http://www.4shared.com/file/87112304/1ea6ef90/FastStone_Image_Viewer.html

Web scraping works best when you analyze the target first. If you open the URL above, you will find a “Download Now” button, and that button points to another URL:

http://www.4shared.com/get/87112304/1ea6ef90/FastStone_Image_Viewer.html

Now you know where to start your web scraping process. Once you understand the 4shared URL scheme, you realize that instead of starting from the first URL you can simply replace /file/ with /get/ in the URL, which skips one scraping step. Remember to minimize unnecessary requests to the server.
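
The URL rewrite described above can be sketched in a few lines of Ruby. This is only a minimal sketch: the helper name get_page_url is made up for illustration, and it assumes the /file/ to /get/ pattern seen on this one page also holds for other 4shared URLs.

```ruby
# Hypothetical helper: turn a 4shared "file" page URL into its "get" page URL.
# Assumes the /file/ -> /get/ pattern observed on the example page.
def get_page_url(file_page_url)
  # sub replaces only the first occurrence, so a later "/file/" inside the
  # file name part of the path is left untouched.
  file_page_url.sub('/file/', '/get/')
end

file_url = 'http://www.4shared.com/file/87112304/1ea6ef90/FastStone_Image_Viewer.html'
get_url = get_page_url(file_url)
puts get_url
```

Because this is plain string manipulation, it costs no request to the server at all.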

Now the scraping process starts from the second URL. Here I am using irb for live testing, and I require the open-uri library and the nokogiri gem:

:~ % irb
>> require 'rubygems'
=> true
>> require 'open-uri'
=> true
>> require 'nokogiri'
=> true
>> uri = 'http://www.4shared.com/get/87112304/1ea6ef90/FastStone_Image_Viewer.html'
>> # Ruby 3.0+ needs URI.open here; a bare open(uri) no longer fetches URLs
>> no = Nokogiri::HTML(URI.open(uri))

Now you have the full page source in a Nokogiri object. The direct download link we are looking for is located in:

<div style="margin:30px 0;height:50px;line-height:2.5em; display: none;" id='divDLStart' >
    <a href='http://dc125.4shared.com/download/87112304/1ea6ef90/FastStone_Image_Viewer.exe?tsid=20090821-013258-a762821d'>Click here to download this file</a>
</div>

The direct download link is wrapped in a div element with the id “divDLStart”. We can reach it easily with a CSS selector, which is the reason I am using Nokogiri ;)

>> no.search("#divDLStart a").attr('href').value
=> "http://dc125.4shared.com/download/87112304/1ea6ef90/FastStone_Image_Viewer.exe?tsid=20090821-013258-a762821d"

Now you have the direct download link in hand, thanks to the simple CSS selector above. In words, the selector means: “select the ‘a’ element under the element with the id ‘divDLStart’, then take only its ‘href’ attribute”.

Web scraping is a programming task that requires you to understand the target you are working with. So dig deeper first, then write the code. It’s simple, elegant, powerful, and fun!





4 Comments to “Web Scraping: Learn the Basics”

  1. Thanks, tom. That's just the basics. I promise I will write more about web scraping in the near future ;)

  2. tom says:

    Great tutorial. will try this myself

  3. alifity says:

    @ Stret.Walker
    Well, that one isn't finished yet.. it's just a preview for now.. Be patient, bro :)

  4. Stret.Walker says:

    Why isn't sedh*t.com a direct download from 4shared? I still have to wait???? erRrrrrR

    -please translate my comment too while you're at it- :D ;)
