Saturday, 29 June 2013

Data Mining - Techniques and Process of Data Mining

Data mining, as the name suggests, is the process of extracting useful information from a huge source of data. It is like isolating a single drop from the ocean: the drop is the piece of information essential to your business, and the ocean is the huge database you have built up.

Recognized in Business

Businesses have become more insightful by uncovering new patterns and trends in customer behavior through data mining techniques, or automated statistical analysis. Once the desired information is found in the huge database, it can be used for various applications. If you want to concentrate on the other functions of your business, you can take the help of professional data mining services available in the industry.

Data Collection

Data collection is the first step toward a constructive data mining program. Almost all businesses need to collect data. It is the process of finding the data essential for your business, then filtering and preparing it for a data mining or outsourcing process. Those who already have experience tracking customer data in a database management system have probably achieved this step.

Algorithm selection

You may select one or more data mining algorithms to solve your problem. Since you already have a database, you can experiment with several techniques. Your choice of algorithm depends on the problem you want to solve, the data you have collected, and the tools you possess.

Regression Technique

The oldest and most well-known statistical technique used for data mining is regression. Starting from a numerical dataset, it develops a mathematical formula that fits the data. You then apply that formula to new data to get a prediction of future behavior. Knowing how to use it is not enough, though; you also have to understand its limitations. This technique works best with continuous quantitative data such as age, speed, or weight. For categorical data such as gender, name, or color, where order is not significant, it is better to use another technique.
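To make the idea concrete, here is a minimal sketch in Python using scikit-learn's linear regression; the ages, weights, and heart-rate figures are made up purely for illustration.

# A minimal linear regression sketch (illustrative data, not from the article)
from sklearn.linear_model import LinearRegression

# Continuous quantitative inputs: [age, weight]; target: resting heart rate
X = [[25, 70], [40, 82], [55, 90], [68, 75]]
y = [62, 70, 76, 71]

model = LinearRegression()
model.fit(X, y)                      # derive the formula (coefficients) from existing data
print(model.predict([[33, 78]]))     # apply the formula to new data to predict future behavior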

Classification Technique

Another technique, classification analysis, is suitable for categorical data as well as a mix of categorical and numeric data. Compared with regression, classification can process a broader range of data and is therefore more popular, and its output is easy to interpret: you get a decision tree that reaches its conclusions through a series of binary decisions.
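As a rough illustration, a minimal decision-tree sketch in Python (again with made-up data) might look like this; scikit-learn's OrdinalEncoder is used only because its trees expect numeric inputs.

# A minimal decision-tree classification sketch (illustrative data)
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Categorical data: [gender, favourite colour]; target: responded to an offer?
X_raw = [["male", "red"], ["female", "blue"], ["female", "red"], ["male", "green"]]
y = ["no", "yes", "yes", "no"]

encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)           # turn categories into numbers the tree can split on

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))                   # the series of binary decisions mentioned above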


Source: http://ezinearticles.com/?Data-Mining---Techniques-and-Process-of-Data-Mining&id=5302867

Thursday, 27 June 2013

What Can Online Data Entry Clerks Do for You?


Information can make or break a business. That is basically the reason most businesspersons conduct market research before they make any moves. Furthermore, any business has to process tons of data as it operates. Traditionally, businesspersons would hire office-based data entry clerks to do data gathering or data management for them. Nonetheless, as technologies like the computer and the internet continue to flourish, employing people to do data entry jobs has become more cost-effective and convenient.

Nowadays, these clerks do not necessarily have to work in an office set-up. They can provide their services from home or through an offshore outsourcing company. This can be extremely advantageous for any businessperson, who need not spend on equipment, employee benefits, and other office expenses. Additionally, an employer can hire a full-time or part-time clerk depending on the work that he or she needs. Because of these advantages, more and more people in the business world have considered outsourcing these services.

Data entry clerks who work away from the office provide the same services that office-based clerks provide. Some of the most common services provided by offshore data entry clerks include keyboarding, data conversion, and data processing. Some data entry jobs involve web research, including data mining, data extraction, data collection, and data validation. Furthermore, they can work with information coming from different industries or business sectors such as education, healthcare, insurance, government, and publishing. It is almost safe to say that these clerks can provide whatever is needed when it comes to data gathering and management.

Typically, these jobs do not require a lot of qualifications. The most basic ones are familiarity with the English language, the computer, and the internet. Nevertheless, there are a few tasks that call for specific knowledge or training. One example is medical transcription.

Data entry clerks normally make use of the internet to get things done. All the transactions and communication between a clerk and an employer are done online. Thus, it is very important for the employer to clearly hand down his or her tasks for the day. This avoids unnecessary confusion, particularly if a certain task calls for special instructions. Moreover, although offshore data entry clerks require very minimal supervision, the employer still has the responsibility to follow up and oversee the work or output provided by his or her clerk.

Online data entry clerks can certainly provide the help that every company or business needs to function effectively. If you need one, there are numerous online assistant companies today that can provide the help you need. All you have to do is search for the right company that offers the best data entry services.

Online data entry is widespread nowadays, and many companies now deal with these kinds of providers. There are online companies that offer virtual assistant services with quality work at an affordable price.


Source: http://ezinearticles.com/?What-Can-Online-Data-Entry-Clerks-Do-for-You?&id=7176328

Tuesday, 25 June 2013

Data Mining Explained

Overview
Data mining is the crucial process of extracting implicit and possibly useful information from data. It uses analytical and visualization techniques to explore and present information in a format which is easily understandable by humans.

Data mining is widely used in a variety of profiling practices, such as fraud detection, marketing research, surveys and scientific discovery.

In this article I will briefly explain some of its fundamentals and its applications in the real world.

Herein I will not discuss related processes of any sorts, including Data Extraction and Data Structuring.

The Effort
Data mining has found applications in various fields such as financial services, health care and bio-informatics, business intelligence, social network data research, and many more.

Businesses use it to understand consumer behavior, analyze the buying patterns of clients, and expand their marketing efforts. Banks and financial institutions use it to detect credit card fraud by recognizing the patterns involved in fraudulent transactions.

The Knack
There is definitely a knack to data mining, as there is with any other field of web research activity. That is why it is referred to as a craft rather than a science; a craft is the skilled practice of an occupation.

One point I would like to make here is that data mining solutions offer an analytical perspective on the performance of a company based on historical data, but one still needs to account for unknown external events and deceitful activities. On the flip side, it is all the more important for regulatory bodies to forecast such activities in advance and take the necessary measures to prevent such events in the future.

In Closing
There are many important niches of web data research that this article has not covered. But I hope it provides you with a starting point to drill down further into this subject, if you want to do so!

Should you have any queries, please feel free to mail me. I would be pleased to answer each of your queries in detail.



Source: http://ezinearticles.com/?Data-Mining-Explained&id=4341782

Monday, 24 June 2013

Usefulness of Web Scraping Services

For any business or organization, surveys and market research play important roles in the strategic decision-making process. Data extraction and web scraping techniques are important tools for finding relevant data and information for your personal or business use. Many companies employ people to copy and paste data manually from web pages. This process is reliable but very costly, as it results in wasted time and effort: the data collected is small compared with the resources spent and the time taken to gather it.

Nowadays, various data mining companies have developed effective web scraping techniques that can crawl over thousands of websites and their pages to harvest particular information. The information extracted is then stored in a CSV file, database, XML file, or any other target in the required format. After the data has been collected and stored, the data mining process can be used to extract the hidden patterns and trends it contains. By understanding the correlations and patterns in the data, policies can be formulated, thereby aiding the decision-making process. The information can also be stored for future reference.
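A minimal sketch of such a scrape-and-store step, assuming Python with the requests and BeautifulSoup libraries, might look like the following; the URL and CSS selectors are placeholders, not a real site's markup.

# A minimal scraping sketch: fetch a page, pull out rows, save them to CSV.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")      # hypothetical page
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select(".product"):                          # assumed markup
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append([name, price])

with open("products.csv", "w", newline="") as f:
    csv.writer(f).writerows([["name", "price"]] + rows)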

The following are some of the common examples of data extraction process:

• Scraping a government portal to extract the names of citizens relevant for a given survey
• Scraping competitor websites for feature data and product pricing
• Using web scraping to download videos and images for stock photography site or for website design

Automated Data Collection
It is important to note that the web scraping process allows a company to monitor website data changes over a given time frame, collecting the data on a regular, routine basis. Automated data collection techniques are quite important, as they help companies discover customer and market trends. By determining market trends, it is possible to understand customer behavior and predict how the data is likely to change; a short sketch of such a collection loop follows the examples below.

The following are some of the examples of the automated data collection:

• Monitoring price information for particular stocks on an hourly basis
• Collecting mortgage rates from various financial institutions on a daily basis
• Checking weather reports on a regular basis, as required
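A minimal sketch of such an automated collection loop, assuming Python and a hypothetical price endpoint, could look like this.

# A minimal automated-collection sketch: poll a (hypothetical) endpoint on a fixed
# schedule and append each reading to a CSV file for later comparison.
import csv
import time
import requests

def collect_once():
    quote = requests.get("https://example.com/api/price?symbol=XYZ").json()   # placeholder API
    with open("prices.csv", "a", newline="") as f:
        csv.writer(f).writerow([time.strftime("%Y-%m-%d %H:%M"), quote["price"]])

while True:
    collect_once()
    time.sleep(60 * 60)        # hourly, as in the stock-price example above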

By using web scraping services it is possible to extract any data that is related to your business. The data can then be downloaded into a spreadsheet or a database to be analyzed and compared. Storing the data in a database, or in another required format, makes it easier to interpret, to understand the correlations, and to identify the hidden patterns.

Through web scraping it is possible to get quicker, more accurate results, saving a great deal of money and time. With data extraction services, it is possible to fetch information about pricing, mailing lists, databases, profile data, and competitor data on a consistent basis. With the emergence of professional data mining companies, outsourcing these services will greatly reduce your costs, and at the same time you are assured of high-quality service.



Source: http://ezinearticles.com/?Usefulness-of-Web-Scraping-Services&id=7181014

Friday, 21 June 2013

Basics of Web Data Mining and Challenges in Web Data Mining Process

Today the World Wide Web is flooded with billions of static and dynamic web pages created with programming languages such as HTML, PHP, and ASP. The web is a great source of information, offering a lush playground for data mining. Since the data stored on the web comes in various formats and is dynamic in nature, it is a significant challenge to search, process, and present the unstructured information available there.

The complexity of a web page far exceeds that of any conventional text document. Web pages on the internet lack uniformity and standardization, while traditional books and text documents are much more consistent. Further, search engines, with their limited capacity, cannot index all web pages, which makes data mining on the web extremely inefficient.

Moreover, the Internet is a highly dynamic knowledge resource and grows at a rapid pace. Sports, news, finance, and corporate sites update their content on an hourly or daily basis. Today the web reaches millions of users with different profiles, interests, and usage purposes. Each of them requires good information but may not know how to retrieve relevant data efficiently and with the least effort.

It is important to note that only a small section of the web possesses really useful information. There are three usual methods that a user adopts when accessing information stored on the internet:

• Random surfing, i.e. following the large number of hyperlinks available on each web page
• Query-based search on search engines, i.e. using Google or Yahoo to find relevant documents by entering specific keyword queries of interest in the search box
• Deep query searches, i.e. fetching results from searchable databases such as eBay.com's product search engine or Business.com's service directory

To use the web as an effective resource for knowledge discovery, researchers have developed efficient data mining techniques to extract relevant data easily, smoothly, and cost-effectively.


Source: http://ezinearticles.com/?Basics-of-Web-Data-Mining-and-Challenges-in-Web-Data-Mining-Process&id=4937441

Wednesday, 19 June 2013

Outsourcing Data Entry Services - Only an Experienced Player Could Enhance Your Business Value


Data is critical to every business today. But it meets the knowledge management needs of an organization only if it is processed and turned into useful information. The right information at the right time is what every business demands, and maintaining a comprehensive database has become an indispensable task for every organization. However, the effective processing of data, which is usually voluminous, demands a professional data entry service provider. Outsourcing data entry functions to professional service providers helps large companies expedite their day-to-day operations.

Advantages of partnering with an experienced service provider

A data entry BPO firm with significant expertise and experience can provide impeccable solutions to your data management requirements. A company well-equipped with security and sophisticated technology, along with experienced professionals with the capacity to handle bulk projects, can seamlessly provide the right solutions that best match your business needs.

Domain expertise carries weight when it comes to certain tasks like finance and accounting, litigation, or medical and health-related data processing. Only a firm with deep domain expertise and skilled professionals can provide expert services. Well-experienced companies know the nuances of these business needs and can offer superior services with accuracy rates of 99% to 99.99% and the best turnaround time (TAT).

Data security - Need for state-of-the-art services provider

Data privacy is one of the primary parameters companies consider while outsourcing their non-core functions. Outsourcing companies ensure data confidentiality by taking several security measures, such as splitting a task and distributing it across more than one data entry operator, thus preventing any of them from having access to the complete data, and storing both raw and processed data in secured or password-protected files and folders. Besides, these companies are committed to providing complete information security to their clients by entering into a legal agreement with them and non-disclosure agreements with their employees.

Choosing the right market for outsourcing

The market that you choose to outsource to is also very important, as it can impact your business in several ways. A market that has sustained balanced growth over a period of time despite changes in the global economic situation is a reliable market in the long run, and this is where you can find the best BPO firms. Companies in Asia, and India in particular, have led the data entry BPO industry for more than a decade now. Many small and large firms across the globe prefer outsourcing to Indian data entry firms, which have rendered superior end results for many years. Zeroing in on the right market and the right company helps clients enjoy quick services at cost-effective rates. In addition, the cut-throat competition prevailing in the industry allows big companies to select the outsourcing company that best meets their specifications. This way a large amount of overhead cost is eliminated, besides enabling companies to focus on their core business activities.

Apart from providing quality offshore data entry solutions, companies that have been in the industry for years are vying to deliver value to their clients through quick processing, reduced data security risk, and competitive rates, thereby enhancing business performance.

Source: http://ezinearticles.com/?Outsourcing-Data-Entry-Services---Only-an-Experienced-Player-Could-Enhance-Your-Business-Value&id=6746192

Monday, 17 June 2013

Data Mining Is Useful for Business Application and Market Research Services

Data mining is an important tool for modern business and market research, transforming raw data into an information advantage. Many companies in India offer complete solutions and services for this work; the extracted data provides companies with important information for analysis and research.

These services are used today primarily because trade associations, retail and financial firms, market research institutes, and government bodies all need a large amount of information for their market research. Such a service allows you to receive all types of information whenever needed, simply by filtering the extracted data down to the names and fields you require.

This service is of great importance because its applications help businesses understand consumer buying trends, analyze their industry, and act on the findings. Business applications that use these services include:
1) Research services
2) Consumption behavior
3) E-commerce
4) Direct marketing
5) Financial services
6) Customer relationship management, etc.

Benefits of Data mining services in Business

• Understand customer needs for better decisions
• Generate more business
• Target the relevant market
• Enjoy a risk-free outsourcing experience
• Provide data access to business analysts
• Help minimize risk and improve ROI
• Improve profitability by detecting unusual patterns in sales, claims, and transactions
• Greatly decrease direct marketing expenses

In short, understanding customer needs helps you generate more business in your target market, while a risk-free outsourcing experience and data access for business analysts help minimize risk and improve return on investment.

Using these services helps ensure that the data is relevant to business applications. Different types of mining, such as text mining, web mining, relational database mining, and graph, audio, and video mining, are all used in enterprise applications.


Source: http://ezinearticles.com/?Data-Mining-Is-Useful-for-Business-Application-and-Market-Research-Services&id=5123878

Friday, 14 June 2013

Data Recovery Equipment

Today, computers are an integral and indispensable part of the IT world, no matter what your line of work may be: finance, education, business consulting and investigation, IT security, or anything else. In fact, most people take them for granted, but you should never assume your computer will be failure-free.

The foremost use of a computer is data storage. Data is stored on a physical disk called a hard disk drive, which records information on magnetic platters. A drive can fail for a wide variety of reasons. Logical failures include a lost partition, a system that cannot be accessed, human mistakes (accidental reformatting or deletion), file corruption, power surges, and virus attacks. At worst, there are physical failures such as head crashes, platter scratches, and motor failures caused by overwriting, physical damage, natural disasters, and so on.

Sometimes a hard drive dies without any warning signs, but at other times there are clues that something is going wrong. Changes in performance or sudden blue screens are telltale signs that the hard drive may be on its way to failing. The most obvious and common signs are clicking, squealing, scraping, or grinding noises.

As computers become more involved in our daily lives, the danger of data loss also grows.

Most of us have already experienced data loss, and it can be frustrating and traumatic to discover that your critical data cannot be recovered. As a matter of fact, the logical failures I mentioned earlier can usually be handled by a data recovery software program, but physical failures cannot. Drives with even minor physical failures need special equipment to repair the hard drive itself or recover the data.

Why does data recovery software stop there? The ordinary user-level repeated-read access method used by imaging software brings a risk of damaging the disk and head further, making the lost data irretrievable. The software also skips bad sectors outright in order not to hang (freeze), yet it still hangs most of the time when the drive has a lot of bad sectors. Plus, there is no guarantee that as much data as possible will be extracted, even after days or weeks spent imaging a bad drive. That is why you should avoid it at all costs on a physically failing drive.
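For illustration only, here is a rough Python sketch of the naive user-level imaging approach criticised above; the device path is a placeholder, and running anything like this against a failing drive carries exactly the risks just described.

# A rough sketch of naive user-level imaging: read fixed-size blocks and pad
# unreadable ones with zeros. "/dev/sdb" is a placeholder device path and
# reading it requires appropriate permissions.
BLOCK = 512 * 1024

with open("/dev/sdb", "rb", buffering=0) as src, open("image.bin", "wb") as dst:
    offset = 0
    while True:
        src.seek(offset)
        try:
            chunk = src.read(BLOCK)
        except OSError:                     # unreadable (bad) region
            dst.write(b"\x00" * BLOCK)      # skip it and keep going
            offset += BLOCK
            continue
        if not chunk:                       # end of device
            break
        dst.write(chunk)
        offset += len(chunk)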

A unique piece of data recovery equipment known as Data Compass is widely used among experts and practitioners worldwide where traditional tools cannot reach. Data Compass physically reads the data of each sector byte by byte, both good and bad, and copies it to a good disk using its data extraction software and hardware. Its "Shadow Disk" technology allows Data Compass to avoid further damage to the drive as much as possible and ensures that data is not lost through repeated recovery attempts.

Technically speaking, it is hard to say exactly how much data can be recovered; it all depends. In most cases, data can be recovered as long as the parts of the hard drive are not severely damaged; otherwise you have to swap components such as the platters, heads, or spindle motor.

A current tool named "hard drive head/platter exchange professional", used for drive disassembly and head/platter exchange, will soon be replaced by the vendor. The change is being made for optimization reasons, and the new product is an improvement; in addition, the new platter exchanger allows users to work on hard drives with spacers between the platters.

If you know a good deal about data recovery and have a craving for this field, you can start your own business with the right equipment and grow into an expert. Of course, it is not easy to find a proper option among current data recovery equipment, given its sky-high prices in hard economic times. It is even worse when it comes to new versions of software for products you already own, as some vendors charge every time. In that case, a free-of-charge upgrade service is the way to go.



Source: http://ezinearticles.com/?Data-Recovery-Equipment&id=1947719

Thursday, 13 June 2013

Has It Been Done Before? Optimize Your Patent Search Using Patent Scraping Technology

Has it been done before? Optimize your Patent Search using Patent Scraping Technology.

Since the US patent office opened in 1790, inventors across the United States have been submitting all sorts of great products and half-baked ideas to its database. Nowadays, many individuals get ideas for great products only to have the patent office do a patent search and tell them that their ideas have already been patented by someone else! Herein lies a question: how do I perform a patent search to find out whether my invention has already been patented before I invest time and money into developing it?

The US patent office patent search database is available to anyone with internet access.

US Patent Search Homepage

Performing a patent search with the patent searching tools on the US Patent office webpage can prove to be a very time-consuming process. For example, searching the database for "dog" and "food" yields 5745 patent search results. The straightforward approach to investigating the results for your particular idea is to go through all 5745 of them one at a time looking for yours. Get some munchies and settle in, this could take a while! The patent search database sorts results by patent number instead of relevancy. This means that if your idea was recently patented, you will find it near the top, but if it wasn't, you could be searching for quite a while. Also, most patent search results have images associated with them, and downloading and displaying these images over the internet can be very time consuming depending on your internet connection and the availability of the patent search database servers.

Because patent searches take such a long time, many companies and organizations are looking for ways to improve the process. Some organizations and companies hire employees for the sole purpose of performing patent searches for them. Others contract the job out to small businesses that specialize in patent searches. The latest technology for performing patent searches is called patent scraping.

Patent scraping is the process of writing automated computer scripts that analyze a website and copy only the content you are interested in into easily accessible databases or spreadsheets on your computer. Because a computerized script performs the patent search, you don't need a separate employee to get the data; you can let the patent scraping run while you perform other important tasks! Patent scraping technology can also extract text content from images. By saving the images and textual content to your computer, you can then search them for content and relevancy very efficiently, saving lots of time that could be better spent actually inventing something!
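A minimal sketch of such a scraping script, assuming Python with requests and BeautifulSoup, might look like this; the URL pattern and markup selectors are invented placeholders, not the actual patent office site structure.

# A minimal patent-scraping sketch: walk a paginated results listing and record
# patent numbers and titles into a spreadsheet-friendly CSV file.
import csv
import requests
from bs4 import BeautifulSoup

results = []
for page in range(1, 4):                                            # first few result pages
    html = requests.get(f"https://example.com/patents?q=dog+food&page={page}").text
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select(".result"):                              # assumed markup
        number = row.select_one(".patent-number").get_text(strip=True)
        title = row.select_one(".title").get_text(strip=True)
        results.append([number, title])

with open("patent_results.csv", "w", newline="") as f:
    csv.writer(f).writerows([["number", "title"]] + results)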

To put a real-world face on this, consider the pharmaceutical industry. Many different companies are competing for the patent on the next big drug. It has become an indispensable tactic in the industry for one company to run patent searches on what the other companies are applying for, thus learning in which direction the other company's research and development team is heading. Using this information, the company can then choose either to pursue that direction heavily or to spin off in a different direction. It would quickly become very costly to maintain a team of researchers dedicated only to performing patent searches all day. Patent scraping technology is the means for figuring out what ideas and technologies are coming about before they make headline news. It is by utilizing patent scraping technology that large companies stay up to date on the latest trends in technology.

While some companies choose to hire their own programming team to write their patent scraping scripts, it is much more cost-effective to contract the job out to a qualified team of programmers dedicated to performing such services.


Source: http://ezinearticles.com/?Has-It-Been-Done-Before?-Optimize-Your-Patent-Search-Using-Patent-Scraping-Technology&id=171000

Tuesday, 11 June 2013

Internet Data Mining - How Does it Help Businesses?

The Internet has become an indispensable medium for people to conduct many different types of business and transactions. This has given rise to the use of different internet data mining tools and strategies, which help organizations better serve their main purpose on the internet platform and increase their customer base manifold.

Internet data mining encompasses the processes of collecting and summarizing data from various websites, webpage contents, or login-protected sources in order to identify patterns. With the help of internet data mining it becomes much easier to spot a potential competitor and to improve the customer support service on a website, making it more customer-oriented.

There are different types of internet data mining techniques, including content, usage, and structure mining. Content mining focuses on the subject matter present on a website, including video, audio, images, and text. Usage mining focuses on what users access, as reported through the server access logs; this data helps in creating an effective and efficient website structure. Structure mining focuses on how websites are connected to one another, which is effective for finding similarities between various websites.
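As a small illustration of usage mining, a Python sketch that tallies page requests from a server access log might look like this; the log file name and format are assumptions.

# A minimal usage-mining sketch: count which pages are requested most often,
# based on an access log in the common log format ("access.log" is an assumption).
import re
from collections import Counter

hits = Counter()
with open("access.log") as log:
    for line in log:
        match = re.search(r'"(?:GET|POST) (\S+)', line)   # pull the requested path
        if match:
            hits[match.group(1)] += 1

for path, count in hits.most_common(10):
    print(count, path)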

Also known as web data mining, with the aid of these tools and techniques one can predict the potential growth of a selective market for a specific product. Data gathering has never been so easy, and one can use a variety of tools to gather data in simpler ways. With the help of data mining tools, screen scraping, web harvesting, and web crawling have become very easy, and the requisite data can be readily put into a usable style and format. Gathering data from anywhere on the web has become as simple as saying 1-2-3. Internet data mining tools are therefore effective predictors of the future trends a business might take.



Source: http://ezinearticles.com/?Internet-Data-Mining---How-Does-it-Help-Businesses?&id=3860679

Friday, 7 June 2013

How to get rid of Screen Scrapers from your Website

While driving on a long trip this weekend, I had a bit of time to think. One topic that came to my mind was screen scraping, with a focus on APIs. It hit me: screen scraping is more of a problem with the content producer than it is with the “unauthorized scraping” application.

Screen scraping is the process of taking information that is rendered on the client and then transforming it in another process. Typically, the information obtained is later filtered, saved, or used in a calculation. Everyone has performed some [legitimate form] of screen scraping: when you print a web page, the content is reformatted to be printed. Many of the unauthorized forms of screen scraping have involved collecting information on current gambling games [poker, etc.], redirecting captchas, and collecting airline fare/availability information.

The scrapee's [the organization that the scraper is targeting] argument against the process is typically a claim that the tool puts an unusual demand on their service, one that does not come with the predictable probability of profit they are used to. Another argument is that the scraper provides an unfair advantage over other users of the service. In most cases, the scrapee fights back in legal or technical ways. A third argument is that the content is being misappropriated, or that some value is being gained by the scraper and defrauded from the scrapee.

The problem I have with fighting back against scrapers is that it never solves the problem that the scrapers try to fix. Let's take a few examples to go over my point: the KVS tool, TV schedules, and poker bots. The KVS tool uses [frequently updated] plugins to scrape airline sites to get accurate pricing and seat availability details. The tool is really good for people who want to get a fair bit of information on what fares are available and when. It does not provide any information that isn't provided elsewhere; it just makes many more queries than most people can do manually. Airlines fight against this because they make a lot of money on uninformed users. Their business model is to guarantee that their passengers are not buying up cheap seats. When an airline claims a "lowest price guarantee", that typically means they show the discount tickets for as long as possible, until they're gone.

Another case where web scraping has caused an issue is TV schedules. With the MythTV craze a few years ago, many open source users were using MythTV to record programs via their TV card. It's a great technology, however the schedule is not provided in the cable TV feed, at least not in an unencrypted manner. Users had to resort to scraping television sites for publicly available "copyrighted" schedules.

The poker bots are a bit of an ethical issue. This is something that differs from the real-world rules of the game: when playing poker outside of the internet, players do not have access to real-time statistical tools. Online poker providers aggressively fight against the bots, and it makes sense; bots can perform the calculations a lot faster than humans can.

Service providers try to block scrapers in a few different ways. The end of the Wikipedia article lists more; this is a shortened version. Web sites try to deny or misinform scrapers in a few manners: profiling the web request traffic (clients that have difficulty with cookies and do not load JavaScript or images are big warning signs), blocking the requesting provider, providing "invisible false data" (honeypot-like paths in the content), etc. Application-based services [poker bots] focus more on looking for processes that may influence the running executable, securing the internal message handling, and sometimes recording the session (as is also typically done in MMORPGs).
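A minimal sketch of the request-traffic profiling idea, in Python, might look like this; the log file and the threshold are assumptions for the example.

# A minimal sketch of request-traffic profiling: flag clients whose request count
# is far above normal ("access.log" and the cut-off are placeholders).
from collections import Counter

requests_per_ip = Counter()
with open("access.log") as log:
    for line in log:
        ip = line.split(" ", 1)[0]          # client address is the first field
        requests_per_ip[ip] += 1

THRESHOLD = 1000                            # an arbitrary cut-off for this sketch
suspects = [ip for ip, n in requests_per_ip.items() if n > THRESHOLD]
print("possible scrapers:", suspects)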

In the three cases, my point is not to argue why the service is justified in attempting to block them; my point is that the service providers are ignoring an untapped secondary market. Those service providers have refused to address the needs of this market, or maybe they just haven't seen the market as viable and are merely ignoring it.

If people wish to make poker bots, create a service that allows just the bots to compete against each other. The developers of these bots are [generally] interested in the technology, not so much in ripping off non-bot users.

For airlines, do not try to hide your data. Open up API keys for individual users. If an individual user is trying to abuse the data by reselling it or creating a Hipmunk/Kayak clone, revoke the key. Even if the individual user's service requests don't fit the profile, there are ways of catching this behavior; mapmakers solved this problem a long time ago by creating trap streets. Scrapers are typically used as a last resort, to do something that the current process makes very difficult to do.
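A minimal Python sketch of per-user API keys with revocation could look like this; the keys and in-memory storage are obviously placeholders for what would be a database in a real service.

# A minimal sketch of per-user API keys with revocation (in-memory example only).
issued_keys = {"key-123": "alice", "key-456": "bob"}
revoked_keys = {"key-456"}                     # e.g. a user caught reselling data

def authorize(api_key):
    if api_key in revoked_keys or api_key not in issued_keys:
        return None                            # reject the request
    return issued_keys[api_key]                # identify the calling user

print(authorize("key-123"))   # -> alice
print(authorize("key-456"))   # -> None (revoked)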

Warning, more ranting: with airline sites it's difficult to get a good impression of the cost differences of flying from different markets [like flying from Greensboro rather than Charlotte], or even of changing tickets, so purchasing from an airline without the aid of this kind of tool is difficult. Most customers want to book a single round-trip ticket, but some may have a complex itinerary that has them leaving Charlotte, stopping over in Texas, then going to San Francisco, and then returning to Texas and flying back to their original destination. That could be accomplished by purchasing separate round-trip tickets, but the rules of the tickets allow such combinations to exist on a single itinerary. Why not allow your users to take advantage of these rules [without the aid of a costly customer service representative]?

People who use scrapers do not represent the majority of the service's customers. In the television schedules example, they do not profit from the information, and the content they wished to retrieve wasn't even motivated by profit. Luckily, an organization stepped in and provided this information at a reasonable [$25/yr] cost. The organization is SchedulesDirect.

The silver lining to the battle over scrapers is that it can get interesting. The poker clients have prompted scraper developers to come up with clever solutions. The "Coding the Wheel" blog has an interesting article about this and how they inject DLLs into running applications, use OCR, and abuse Windows message handles [again, of another process]. Web scraping introduces interesting topics that deal with machine learning [to create profiles] and identifying usage patterns.

In conclusion, solve the issue that the screen scrapers attempt to solve, and if you have a situation like poker, prevent the behavior you wish to deny.


Source: http://theexceptioncatcher.com/blog/2012/07/how-to-get-rid-of-screen-scrapers-from-your-website/

Wednesday, 5 June 2013

Screen-Scraping Ethics

The internet can be thought of as the world's largest database. This is so because it is comprised of inter-connected databases, files, and computer systems. By simply typing in some keywords, one can access hundreds to millions of websites containing treasure troves of facts, statistics, and other forms of information on an endless array of topics. Because the internet is such a valuable resource, we should seek new and innovative ways to mine the data using ethical means.

You may have never heard of screen-scraping, web-fetching, or web-data extraction, but if you’ve ever surfed the internet, you’ve quite likely been a beneficiary of the method of retrieving information on the web described by these terms. They refer to the increasingly popular method of methodically retrieving information with specialized tools. Numerous programs utilize many computer languages for the purpose of mining data. Software often assists users in intercepting HTTP requests and responses by incorporating proxy servers. The software then displays the pages’ source code (HTML, JavaScript, etc.) for users to extract the desired information. In addition, such software can aid iteration through pages (sometimes thousands of them) all the while gleaning valuable data in various forms.

The goal of scraping websites is to access information, but the uses of that information can vary. Users may wish to store the information in their own databases or manipulate the data within a spreadsheet. Other users may utilize data extraction techniques as means of obtaining the most recent data possible, particularly when working with information subject to frequent changes. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.

Access to certain information may also provide users with strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses, such as restaurants or video-rental stores that know the locations of competitors can make better decisions about where to focus further growth. Companies that provide complementary (not to be confused with complimentary) products, like software, may wish to know the make, model, cost, and market share of hardware that are compatible with their software.

Another common, but controversial use of information taken from websites is reposting scraped data to other sites. Scrapers may wish to consolidate data from a myriad of websites and then create a new website containing all of the information in one convenient location. In some cases, the new site’s owner may benefit from ads placed on his or her site or from fees charged to access the site. Companies usually go to great lengths to disseminate information about their products or services. So, why would a website owner not wish to have his or her website’s information scraped?

Several reasons exist for why website owners may not wish to have their sites scraped by others (excluding search engines). Some people feel that data reposted to other sites is plagiarized, if not stolen. These individuals may feel that they made the effort to gather information and make it available on their websites only to have it copied to other sites. Are individuals justified in feeling that they have been taken advantage of, even if their websites are posted publicly?

Interpretation of what exactly "republish" means is widely disputed. One of the most authoritative explanations may be found in the 1991 Supreme Court case of Feist Publications v. Rural Telephone Service. This case involved Rural Telephone Service suing Feist Publications for copyright infringement when Feist copied telephone listings after Rural denied Feist's request to license the information. While information has never been copyrightable under U.S. law, a collection of information, defined mostly in terms of creative arrangement or original ideas, can be copyrighted. The Supreme Court's ruling in Feist Publications v. Rural Telephone Service stated that "information contained in Rural's phone directory was not copyrightable, and that therefore no infringement existed." Justice O'Connor focused on the need for information to have a "creative" element in order to be termed a "collection" (1). Similarly, information taken from publicly available websites should not be considered plagiarism or even theft if only the information (numbers, statistics, etc.) is reposted to new sites or used for other purposes.

Scraped websites also experience an increase in used bandwidth as a result of being scraped. Some scrapes take place once, but many scrapes must be performed over and over to achieve the desired results. In such cases, the servers that host the pages being scraped inevitably experience an increased load. Site owners may not wish to have the increased bandwidth, but more importantly, excessive page requests can cause a web server to function slowly or even fail. Rarely, however, do most scrapes cause such strain on a server on their own. Accessing a page through scraping is no different from visiting a page manually, except that scraping allows more pages to be visited over a shorter period. Additionally, scrapes can be adjusted to run more slowly, so as to minimize the strain on the server. Scraping is usually slowed when more than a few scraping sessions are being run against a single server at one time.

Interestingly, having one’s website scraped can have positive effects. Of course the recipient of the scraped data is pleased to have desired data, but owners of scraped sites may also benefit. Think of the case mentioned above in which home listings are scraped from a site. Whether the information is reposted or stored in a database for later querying to match homebuyer’s needs, the purpose of the original site is met—to get the home-listing information into the hands of potential buyers.

Individuals who scrape websites can do so while still following guidelines for ethical data extraction. Perhaps it would be helpful to review a list of tips for ethical scraping; one website I consulted gave the following suggestions, and a short sketch applying the first two appears after the list:

· Obey robots.txt.

· Don’t flood a site.

· Don’t republish, especially not anything that might be copyrighted.

· Abide by the site terms of service (2).
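A minimal "polite scraper" sketch in Python, applying the first two tips; the site, user-agent string, page list, and five-second delay are placeholders for the example.

# A minimal polite-scraping sketch: check robots.txt before fetching, and pause
# between requests so the site is not flooded.
import time
import urllib.robotparser
import requests

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")  # placeholder site
robots.read()

pages = ["https://example.com/page1", "https://example.com/page2"]
for url in pages:
    if not robots.can_fetch("my-scraper", url):       # obey robots.txt
        continue
    html = requests.get(url).text
    # ... extract whatever is needed from html ...
    time.sleep(5)                                      # don't flood the site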

Occasionally, individuals who scrape websites have paid for access to the material being scraped. Many job and résumé posting websites fall into this category: employers pay a monthly fee for an account that provides access to the résumés of potential new hires. Certainly, the fact that employers pay for the service entitles them to use whatever means are necessary to sort through and record the desired data. The only exception would be where the site's terms of service specifically prohibit scraping.

While republishing images, artwork, and other original content without permission is unethical and in many cases illegal, using scraped data for personal purposes is certainly within the limits of ethical behavior. Nevertheless, page scrapers should always avoid taking copyrighted materials. Use of bandwidth is no more deserved by any one person than another. Even making scraped data available to others online can be argued as ethical, especially when the scraped website is posted on public space and the data taken doesn’t include any creative content. After all, the purpose of hosting a website in the first place is to provide information.


Source: http://blog.screen-scraper.com/2008/04/21/screening-scraping-ethics/

Sunday, 2 June 2013

Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, where data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.
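To illustrate the distinction, here is a minimal Python sketch of the mining side: the data (a few made-up shopping baskets) is already collected, and the script only looks for a simple pattern, namely which pairs of items occur together most often.

# A minimal data-mining sketch: search already-collected data for frequent item pairs.
from itertools import combinations
from collections import Counter

baskets = [                     # illustrative transactions, already gathered
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "eggs", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))    # the most frequent item pairs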

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem: we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.


Source: http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813