Thursday, 30 May 2013

Screen-scraping for the Web

If you are looking for a Web screen-scraper, Screensurfer is your answer. We prefer the term ScreenSurfing, because old-fashioned approaches to upgrading mainframe screens (screen scraping) are too difficult, demand too many programmatic details, and are simply too complicated. A traditional screen-scraper product uses EHLLAPI or other complicated APIs to deliver screen data and functionality to a client program.

Screensurfer is not such a 'screen scraper'; it provides programming that "lives inside the screen". Unlike screen scraping with EHLLAPI, Screensurfer embeds HTML templates right alongside screen-access HTML extension tags. Screen scraping will never be the same.

Other approaches, which still align themselves with the term 'screen scraping', are now attempting Web products, but they don't get it. They crawl onto the backs of products such as Microsoft IIS and then give the developer complicated ASP object calls for every screen element and operation; that's a lot of code! No wonder they call it a scraper. A screen upgrade to the Web deserves a better approach: screen surfing. Stop scraping; start surfing!

You might ask yourself, "How can I tell the difference between a screen-surfing product and a screen-scraping product?" One easy answer is to check the memory requirements and runtime performance. You can usually detect a traditional screen-scraping product that has been hurriedly ported to run in server mode if it: A) is a huge memory hog, B) runs very slowly, and C) is single-threaded in critical areas, meaning it runs even slower under many concurrent user requests.

Of course, we built Screensurfer from the ground up to avoid these problems and to optimize both development productivity and runtime performance around the screen-to-HTML paradigm.
Inventu Corporation

Inventu provides our customers with software and services to implement new applications that support business initiatives. Inventu concentrates on using, replacing, and renewing existing (legacy) applications with our own product, Screensurfer, and our partner's product, Flynet Viewer.
Screensurfer
Screensurfer is an all-in-one host screen-to-Web application server that runs on Windows servers such as Windows 2000 and Windows Server 2003. Screensurfer is self-contained; no other products or development tools are needed to implement a new web-to-host application.

With an HTML-based tag language (Surferscript) that is very similar to ColdFusion CFML, Screensurfer provides very high productivity when converting existing 3270, 5250, and VT100 screens into HTML pages.

Screensurfer's IDE, the Express Editor and Debugger, combines an integrated editor with a step-trace debugger, providing a solid development environment that any developer can appreciate.
Flynet Viewer
Flynet Viewer provides "everything you need" and "nothing you don't want" for screen interfacing by development teams working in Microsoft IDE environments such as Visual Studio .NET, Visual Basic, and Visual InterDev.


Source: http://www.inventu.com/screenscraper.html

Monday, 27 May 2013

What is screen scraping?

Screen scraping is the act of taking all the information that a person has posted on their Web site or social networking page and then using that information to break into the person's account or to commit some other fraud involving identity theft.

Social networking Web sites such as Facebook have grown exponentially in the past few years, and it's not uncommon for people to post personal pictures and reveal personal information about themselves. People often prefer Facebook to traditional blogs because information is usually available only to people they choose. However, if cybercriminals gain access to your Web site or social networking page, they can use screen scraping to steal your information and pose as you. For more information about this type of scam, see Scammers exploit Facebook friendships.

You can use strong passwords and learn techniques to avoid social engineering scams, but the best way to prevent the negative effects of screen scraping is to minimize the amount of information that you post online.

Here are a few tips:

· Do not post anything online that you would not want made public.
· Minimize details that identify you or your whereabouts.
· Keep your account numbers, user names, and passwords secret.

For more information, see the following articles:

· Protect your privacy on the Internet
· Your information on the Internet: what you need to know
· How to reduce the risk of online fraud
· Help protect yourself against phishing scams and identity theft


Source: http://blogs.msdn.com/b/securitytipstalk/archive/2010/04/07/what-is-screen-scraping.aspx

Friday, 24 May 2013

Screen Scraping with BeautifulSoup and lxml

I completely rewrote this chapter for the book's second edition to feature two powerful libraries that have appeared since the book first came out. I show how to screen-scrape a real-life web page using both BeautifulSoup and the powerful lxml library.

I chose this chapter for release because screen scraping is often the first network task that a novice Python programmer tackles. Because this material is oriented towards beginners, it explains the entire process: fetching web pages, understanding HTML, and querying for specific elements in the document.

Program listings are available for this chapter in both Python 2 and Python 3. Let me know if you have any questions!

Most web sites are designed first and foremost for human eyes. While well-designed sites offer formal APIs by which you can construct Google maps, upload Flickr photos, or browse YouTube videos, many sites offer nothing but HTML pages formatted for humans. If you need a program to be able to fetch their data, then you will need the ability to dive into densely formatted markup and retrieve the information you need, a process known affectionately as screen scraping.
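
As a taste of what that looks like in practice, here is a minimal sketch using the third-party requests and BeautifulSoup packages (my choice for illustration, not something the chapter has introduced yet); the URL and the CSS selector are placeholders you would adjust for the page you actually care about:

    # Fetch a page over HTTP and pull specific elements out of the markup.
    # "requests" and "beautifulsoup4" must be installed; the URL and the
    # "h2.title" selector are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("http://example.com/listing.html")
    response.raise_for_status()  # fail fast if the fetch went wrong

    soup = BeautifulSoup(response.text, "html.parser")
    for heading in soup.select("h2.title"):
        print(heading.get_text(strip=True))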

In the haste to grab information from a web page sitting open in the browser in front of you, it can be easy even for experienced programmers to forget to check whether an API is provided for the data they need. So take a few minutes to investigate the site in which you are interested and see whether it offers some more formal programming interface to its services. Even an RSS feed can sometimes be easier to parse than a list of items on a full web page.
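
When a feed is available, reading it can be as short as this sketch with the third-party feedparser package (an assumption of mine; the chapter does not prescribe it), with a placeholder feed URL:

    # Read an RSS/Atom feed instead of scraping the rendered page.
    import feedparser

    feed = feedparser.parse("http://example.com/feed.xml")  # placeholder
    for entry in feed.entries:
        print(entry.title, entry.link)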

Also be careful to check for a “terms of service” document on each site. YouTube, for example, offers an API and, in return, disallows programs from trying to parse their web pages. Sites usually do this for very important reasons related to performance and usage patterns, so I recommend always obeying the terms of service and simply going elsewhere for your data if they prove too restrictive.

Regardless of whether terms of service exist, always try to be polite when hitting public web sites. Cache pages or data that you will need for several minutes or hours, rather than hitting their site needlessly over and over again. When developing your screen-scraping algorithm, test against a copy of their web page that you save to disk, instead of doing an HTTP round-trip with every test. And always be aware that excessive use can result in your IP being temporarily or permanently blocked from a site if its owners are sensitive to automated sources of load.
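
One easy way to follow that advice while developing is to fetch the page once, cache it on disk, and parse the local copy on every test run; a minimal sketch, again assuming requests and BeautifulSoup, with a hypothetical cache file name:

    # Download the page only if we don't already have a cached copy,
    # then parse the copy on disk instead of re-fetching it each run.
    import os
    import requests
    from bs4 import BeautifulSoup

    CACHE = "page_cache.html"  # hypothetical cache file name

    if not os.path.exists(CACHE):
        html = requests.get("http://example.com/listing.html").text
        with open(CACHE, "w", encoding="utf-8") as f:
            f.write(html)

    with open(CACHE, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")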

Source: http://rhodesmill.org/brandon/chapters/screen-scraping/

Friday, 17 May 2013

Top Screen Scraper Software

What are some top screen scraper software packages you can recommend? Here are my criteria: it can be easily configured and can export in a variety of formats (CSV, SQL, text, etc.). I need to get data from webpages; the data I'm planning to get are prices, descriptions, and some images, and that data should then be exportable in different formats.

Do you guys know of software that does that?

Thanks!

There's a free version of Scrape.it Screen Scraper. You can define rules in a tree-like format, and the workflow makes it very easy to get a scraping job up and running fast.
Screen Scraping Software for Webpages

An excellent freeware application with a friendly graphical interface is DEiXTo:
http://deixto.com/
It offers a wealth of options for extracting webpage content and saving it as XML and tab-delimited files.

Other free and commercial software products are covered by KDnuggets, billed as the data mining community’s top resource:
http://www.kdnuggets.com/software/web-content-mining.html

DEiXTo (or ΔEiXTo) is a powerful web data extraction tool that is based on the W3C Document Object Model (DOM). It allows users to create highly accurate “extraction rules” (wrappers) that describe what pieces of data to scrape from a website. DEiXTo consists of three separate components:

    GUI DEiXTo, an MS Windows™ application implementing a friendly graphical user interface that is used to manage extraction rules (build, test, fine-tune, save and modify).
    Command Line Executor, a stand-alone, cross-platform utility that can massively apply an extraction rule on multiple target HTML pages and produce structured output in a wide variety of formats.
    DEiXToBot, a Perl module implementing a flexible and efficient sleepy Mechanize agent (essentially a browser emulator) capable of extracting data of interest using GUI DEiXTo-generated patterns. It contains best-of-breed Perl technology and allows extensive customization, facilitating tailor-made solutions.

DEiXTo can contend with a wide range of websites with high precision and recall. It provides the user with an arsenal of features aimed at the construction of well-engineered extraction rules. Wrappers built with GUI DEiXTo can be scheduled to run automatically, providing automated access to resources of interest and saving users a lot of time, energy, and repetitive effort.

What is more, DEiXTo has been working very well for quite some time with large and complex systems such as:

    openarchives.gr – Greek Digital Libraries Search Engine
    aggregator.libver.gr – Hellenic Aggregator for Europeana
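
If you are comfortable with a little scripting, you can also roll your own exporter for the prices-and-descriptions case in the question. Here is a minimal Python sketch using the requests and BeautifulSoup packages; the URL and every CSS class in it are hypothetical, so you would swap in the structure of the actual pages:

    # Scrape hypothetical product listings and export them to CSV.
    import csv
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/products").text  # placeholder
    soup = BeautifulSoup(html, "html.parser")

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["price", "description"])
        for item in soup.select("div.product"):        # placeholder class
            price = item.select_one("span.price")      # placeholder class
            desc = item.select_one("p.description")    # placeholder class
            if price and desc:
                writer.writerow([price.get_text(strip=True),
                                 desc.get_text(strip=True)])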

Source: http://www.eonlinegratis.com/2013/top-screen-scraper-software/

Monday, 6 May 2013

Screen Scraper

Screen scraping, or web data extraction, is a software technique that has been used for years to extract information from websites. A screen scraper application often simulates a human's exploration of the Web, either by implementing low-level HTTP (Hypertext Transfer Protocol) requests or by embedding a well-established web browser such as Mozilla or Internet Explorer for Windows. The process resembles web indexing, which uses a bot to index web content and has been adopted by most search engines today. Unlike typical web indexing, however, screen scraper software focuses on transforming web content that is still unstructured, most of it in HTML format.
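
As an illustration of the embedded-browser approach, here is a minimal sketch that drives a modern browser through the third-party Selenium package (my substitution; the article names Mozilla and Internet Explorer, not Selenium). It assumes a WebDriver binary such as chromedriver is installed:

    # Let a real browser render the page, JavaScript and all, then hand
    # the resulting HTML to whatever parsing code you like.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
    try:
        driver.get("http://example.com/listing.html")  # placeholder URL
        html = driver.page_source  # markup after scripts have run
    finally:
        driver.quit()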

That unstructured content is then transformed into a more structured format that can be analyzed and stored in a spreadsheet or a central local database. Screen scraping is also associated with web automation, in which human browsing is simulated by purpose-built software. Applications include weather data monitoring, web research, change detection across websites, price comparison, integration of data from the web, and web content mashups. At a simplified level, a screen scraper application involves two basic stages: data discovery and data extraction. Data discovery deals with navigating a particular website to locate and reach the page the researcher wants, while data extraction deals with actually pulling the data off the page.

Usually, when people hear about screen scraping, they tend to focus only on data extraction, although one of the most crucial stages is data discovery itself. With a good screen scraper application, data discovery may be as simple as supplying a URL: you visit a particular website and extract the relevant information you need. However, data discovery may also involve several complex tasks, such as logging into a secured site, working through a maze of supplemental pages just to obtain the required cookies, and paging through thousands of search results before finally tracking down the relevant data you are after.
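
To make that concrete, here is a sketch of the more complicated kind of data discovery in Python, using a requests session to log in, carry the resulting cookies, and walk a few pages of search results; every URL, form field, and page count here is a hypothetical placeholder:

    # Data discovery: authenticate, keep the cookies, page through results.
    import requests

    session = requests.Session()  # cookies persist across requests
    session.post("http://example.com/login",
                 data={"username": "alice", "password": "secret"})

    pages = []
    for page in range(1, 4):  # placeholder: first three result pages
        response = session.get("http://example.com/search",
                               params={"q": "widgets", "page": page})
        pages.append(response.text)  # hand off to the extraction stage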

The process can seem daunting, and without a good screen scraping application you have a slim chance of getting what you really want. Data extraction, by contrast, means you have already reached the page where the information you are after is located, and all that remains is to pull it out of the HTML. This step can be further simplified by a well-trusted screen scraper programmed to go after the data relevant to your search. Although finely tuned results may entail complex programming, there are web applications that work just fine out of the box. With the help of this type of service, you can simplify your hunt for information and make everything a whole lot easier.

Source: http://www.fetch.com/screen-scraper-articles/