Tuesday, 28 May 2013

Web scraping hits home

When McLean-based Cvent Inc. filed a $3 million copyright lawsuit against a West Coast competitor this spring, the software company didn’t just allege simple plagiarism. Cvent, which offers a database of venue profiles for corporate event planners, accused rival Eventbrite Inc. of quietly unleashing an automated program — a webbot or “bot,” for short — on Cvent.com to purloin thousands of pages of valuable content.

In its complaint filed May 10 in federal District Court in Alexandria, Cvent alleged the San Francisco company had taken information that cost more than $10 million to create and reproduced it on its own website — errors intact.

Cvent’s suit highlights a prime fear of companies whose stock in trade is a mass of publicly available data: Web scraping. The widespread but sometimes legally hazy practice — in which tailor-made programs mimic a human user to harvest content from the Web — runs the gamut from benign to malicious.

In some cases, scraping is used to help market researchers or create Web mashups that stitch together data in new and creative ways.

In others, it serves as a vehicle for corporate espionage and piracy. The demand for scraping has spawned a market for custom-built bot software, as well as for software to thwart those bots.

Scraping as a means of stealing Web content occurs “on a fairly regular basis,” said attorney Karl Means, who heads the intellectual property group at Potomac-based Shulman Rogers Gandal Pordy & Ecker PA.

His practice confronts scraping-related piracy about a half-dozen times a year, Means said. “As long as the information is out there, people are going to abscond with it. If you think about it, it’s essentially plagiarism.”

The problems associated with scraping are broader than intellectual property, however — a fact underscored in June by a high-profile and embarrassing security breach at AT&T Inc. A group of hackers exploited a security flaw in AT&T’s iPad 3G network to scrape 114,000 customers’ e-mail addresses, including addresses from the military, media, Congress and the White House.

The hacker, Goatse Security, which derives its name from an infamously obscene website, said it attacked AT&T as a “service to our nation,” to expose a gaping security hole.

Cvent’s lawsuit claims a bot that automatically copies website material had accessed cvent.com several times between August and October 2008.

Eventbrite, the lawsuit claims, took data from 1,613 of Cvent’s copyrighted venue profiles and earlier this year copied and redistributed the information in a “wholesale and indiscriminate” manner on eventbrite.com — including typographical errors, duplicated paragraphs and incorrect tax rates.

Neither party would comment on the suit, and Eventbrite has not yet filed a formal response. In lieu of a temporary restraining order, the parties have agreed not to download information from each other’s websites during the litigation.

Cvent claims it spent more than $10 million researching and building its venue database, dubbed the Cvent Supplier Network, which compiles information like meeting room capacity and amenities for each facility. The company calls the database “a key differentiator” that gives it a competitive advantage over rivals.

Cvent also said it spent $800,000 over the past three years to create and market a destination guide that enables planners to compare locations across cities.

Stealing another company’s competitive advantage is typically the motivation for engaging in Web scraping.

But Michael Schrenk, a Las Vegas- and Minneapolis-based bot designer, lecturer and author of “Webbots, Spiders & Screen Scrapers,” doesn’t see it as clear-cut pilfering, explaining that while scraping can be either legitimate or nefarious, it represents “probably the most exciting area of Web development.”

“Basically, you can make the Internet a lot more useful than what it is,” he said. “Instead of taking the Internet and using it the way it’s presented to you, you can actually [remake] it the way you want the Internet to look.”

Schrenk described his customers as “the people that tend to be a little more adventurous,” those in procurement and fraud detection fields, private investigators and even journalists.

Tech analysts declined to venture an estimate on the size of the market for the services that Schrenk offers, and none of the software companies interviewed for this article would specifically discuss their clients.

“In order to get that competitive advantage, in order to keep it, you got to be kind of quiet about what you’re doing,” said Schrenk, who recently addressed the national hackers convention, DEF CON 17. “No real numbers will ever be made [on the size of the market]. It’s absolutely impossible. But you have to assume there’s quite a bit of it going on.”

On the flip side, the market for combating scraping is “absolutely massive,” said David Crowder, CEO of Pramana Inc., an Internet security company. The Atlanta-based startup grew out of the Georgia Institute of Technology and specializes in products that detect and block bots and prevent scraping.

The company initially thought its products would be used primarily to stop fraudulent account creation and spam posts on websites. Instead, scraping has become its customers’ single largest concern.

The biggest target for scraping, Crowder said, is actually nonsensitive — but still valuable — content, like Facebook profile information and original news content.

Those companies are thus presented with a conundrum, he said. “They want that information as public as possible because it drives traffic to their site, but they want to protect it as much as possible because that’s their asset.”

And businesses that make their money on subscriptions see their services become less valuable when their data is scraped and re-created elsewhere on the Web.

That puts legitimate content providers in the position of “competing with their own data that they paid to create,” Crowder said. “It’s absolutely mind-boggling.”


Source: http://www.bizjournals.com/washington/stories/2010/07/12/focus1.html?page=all

Custom Web Scraping Services & Scrapers

Web scraping, or web content mining, is the process of harvesting useful data from HTML web pages on the Internet and restructuring it into pre-formatted containers such as CSV files, spreadsheets (Excel), XML or SQL databases (primarily MySQL), leaving the extracted data well organized, properly indexed and semantically accessible.
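The HTML-to-CSV workflow described above can be sketched with Python's standard library alone. The sample HTML and the column names are illustrative assumptions, not part of any particular scraping service:

```python
# A minimal HTML-to-CSV scraping sketch using only the standard library.
# The sample table and the "venue"/"capacity" column names are assumptions
# made for illustration.
import csv
import io
from html.parser import HTMLParser

SAMPLE_HTML = """
<table>
  <tr><td>Grand Hotel</td><td>250</td></tr>
  <tr><td>City Conference Center</td><td>800</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped into rows by <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

def html_table_to_csv(html):
    """Restructure an HTML table into CSV text with a header row."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["venue", "capacity"])  # assumed column names
    writer.writerows(scraper.rows)
    return out.getvalue()

print(html_table_to_csv(SAMPLE_HTML))
```

Real jobs would fetch the page over HTTP and handle messier markup, but the restructuring step is the same: parse, normalize, write to a structured container.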

Take a peek at our Scraping Portfolio or browse it by category in the Databases Index below.
Databases Index

    Language & Reference
    Travel, Entertainment & Sports
    Society & Humanity
    Goods & Commodity
    ISBN & Books
    Business & Directory
    Medical & Health
    Geographical & Locations
    Emails List


Many web screen scraping tools exist, but they are hard to learn and adapt. To fully satisfy your specific requirements and changing demands, you will need custom Web scraping services rather than ready-made Web scraper tools or off-the-shelf Web scraping software. In a custom scraping project you can scrape not only web pages but also other online materials such as PDFs, Flash, audio and even video. The results are highly structured and semantic.


Source: http://www.scrapingweb.com/

Saturday, 25 May 2013

Basics of Web Data Mining and Challenges in Web Data Mining Process

Today the World Wide Web is flooded with billions of static and dynamic web pages created with programming languages such as HTML, PHP and ASP. The Web is a great source of information, offering a lush playground for data mining. Since the data stored on the web comes in various formats and is dynamic in nature, searching, processing and presenting its unstructured information is a significant challenge.

The complexity of a Web page far exceeds that of any conventional text document. Web pages on the Internet lack uniformity and standardization, while traditional books and text documents are much more consistent. Further, search engines, with their limited capacity, cannot index every web page, which makes data mining extremely inefficient.

Moreover, the Internet is a highly dynamic knowledge resource that grows at a rapid pace. Sports, news, finance and corporate sites update their content on an hourly or daily basis. The Web now reaches millions of users with different profiles, interests and usage purposes. All of them need good information, but few know how to retrieve relevant data efficiently and with the least effort.

It is important to note that only a small section of the web contains really useful information. Users typically adopt one of three methods when accessing information stored on the Internet:

• Random surfing, i.e. following the large numbers of hyperlinks available on each web page.
• Query-based search on search engines, i.e. using Google or Yahoo to find relevant documents by entering specific keyword queries in the search box.
• Deep query searches, i.e. fetching results from searchable databases such as eBay.com's product search engine or Business.com's service directory.
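The second method is, at bottom, just an HTTP GET request with the keywords URL-encoded into a query string. A minimal sketch, assuming a hypothetical site-search endpoint and `q`/`page` parameter names chosen for illustration:

```python
# Build a query-based search URL with the keywords safely URL-encoded.
# "example.com/search" and the "q"/"page" parameter names are assumptions;
# real search engines each define their own endpoints and parameters.
from urllib.parse import urlencode

def build_search_url(base, keywords, page=1):
    """Return a GET URL for a keyword query against a site search endpoint."""
    query = urlencode({"q": " ".join(keywords), "page": page})
    return f"{base}?{query}"

url = build_search_url("https://example.com/search", ["web", "data mining"])
print(url)  # https://example.com/search?q=web+data+mining&page=1
```

Deep query searches work the same way, except the request is aimed at a site's own database-backed search form rather than a general-purpose engine.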

To make the web an effective resource for knowledge discovery, researchers have developed efficient data mining techniques to extract relevant data easily, smoothly and cost-effectively.

Should you have any queries regarding Web Data mining processes, please feel free to contact us at info@outsourcingwebresearch.com

Source: http://ezinearticles.com/?Basics-of-Web-Data-Mining-and-Challenges-in-Web-Data-Mining-Process&id=4937441

Friday, 17 May 2013

What You Need To Know - Web Data Extraction Services

What is Web data extraction, or scraping? This specialized software automatically obtains data from the Internet and places it into files for an end user. It performs a far more advanced function than a search engine, since it can work with the HTML code behind a page. These extraction tools speed up the scanning and pulling of information, making it convenient for the person or company using them to evaluate the accumulated data.

Harvesting techniques

There are three techniques employed by these extraction programs. The first, Web content harvesting, focuses on collecting preferred content such as HTML files, pictures or emails. The second, Web structure harvesting, takes advantage of the fact that web pages can reveal more than their visible content: links, for example, can indicate a page's popularity or give you a sense of the assortment of topics it discusses. The third, Web usage harvesting, provides insight into user behavior and assesses the efficacy of a website's framework.
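As a sketch of the second technique, the link structure of a page can be harvested in a few lines of standard-library Python; the sample HTML below is an illustrative assumption:

```python
# Web structure harvesting sketch: collect a page's outgoing links, which
# can later feed popularity or topic analysis. Standard library only; the
# sample page is an assumption made for illustration.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<p>See <a href="/pricing">pricing</a> and
<a href="https://example.org/blog">our blog</a>.</p>
"""

class LinkHarvester(HTMLParser):
    """Records the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

harvester = LinkHarvester()
harvester.feed(SAMPLE_PAGE)
print(harvester.links)  # ['/pricing', 'https://example.org/blog']
```

Run over many pages, the collected edges form the link graph that popularity measures are computed from.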

Possible Functions

These services are an important resource for businesses, particularly those that promote their goods and services online. By means of extraction tools, companies can pull together information on the competition, from prices to other vital data. For instance, by using Google Suggest scraping tools you can gather thousands of keyword ideas drawn from real user queries, which you can use in your next marketing blog post or to optimize your online marketing campaigns. With the help of these extraction tools, you can collect and assess data that helps you formulate marketing strategies with a high likelihood of success.
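A minimal sketch of the Google Suggest idea follows. The endpoint URL and the payload shape are assumptions based on the commonly documented unofficial Suggest API, which can change without notice, so a canned response stands in for the live HTTP call:

```python
# Sketch of harvesting keyword ideas from an autocomplete/suggest endpoint.
# Both the endpoint URL and the payload shape are assumptions (the Suggest
# API is unofficial), so a canned response replaces the live request here.
import json
from urllib.parse import urlencode

def suggest_url(term):
    """Build the request URL for a seed keyword."""
    base = "https://suggestqueries.google.com/complete/search"
    return f"{base}?{urlencode({'client': 'firefox', 'q': term})}"

def parse_suggestions(payload):
    """Assumed payload: JSON of the form ["seed", ["idea1", "idea2", ...]]."""
    data = json.loads(payload)
    return data[1]

# Canned response standing in for a live HTTP call:
sample = '["event venues", ["event venues near me", "event venues dc"]]'
print(parse_suggestions(sample))
```

Looping this over a list of seed terms (and over each suggestion in turn) is how a few seeds fan out into thousands of keyword ideas.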

There are plenty of providers of Web scraping services, but some do a better job than others. It would be unwise to select the first one that offers its services, and the firm with the least expensive package is not automatically the best choice either. Read up on a provider's reputation and obtain references prior to making any commitments.

Source: http://www.webllena.com/what-you-need-to-know-web-data-extraction-services/

Friday, 3 May 2013

Website Data Scraping Is Definitely the Better Choice

Have you ever heard of "data scraping"? Data scraping is the process of collecting useful data that has been placed in the public domain on the Internet and storing it in databases or spreadsheet applications for various uses. Data scraping technology is not new, and many a successful businessman has made his fortune by taking advantage of it.

Sometimes website owners are not at all happy about the automated harvesting of their data. Webmasters have learned to keep web scrapers off their websites using tools or methods that block certain IP addresses from retrieving site content, leaving the scraper with no option but to be blocked.

Thankfully there is a modern solution to this problem. Proxy data scraping technology solves the problem by using proxy IP addresses: every time your data extraction program executes a request against a website, the website thinks the request is coming from a different IP address. To the website owner, proxy data scraping just looks like a short period of increased traffic from around the world. They have very limited and tedious ways of blocking such a script, but more importantly, most of the time they simply do not know they are being scraped.

Now you may be asking yourself, "Where can I get proxy data scraping technology for my project?" There is a do-it-yourself solution, but unfortunately it is not at all easy. You can choose to rent proxy servers from a hosting provider, but that option is quite pricey; still, it is definitely better than the dangerous and unreliable (but free) alternative: public proxy servers.

There are literally thousands of free proxy servers located all over the world that are fairly simple to use, but the trick is finding them. Hundreds of sites list such servers, yet locating one that is working, open and supports the protocol you need can be a lesson in trial and error. More importantly, you do not know what activities are going on at the server or who runs it; sending requests or sensitive data through a public proxy is a bad idea.

A less risky proxy data scraping scenario is a rotating proxy connection that cycles through a large number of private IP addresses. Companies such as www.webscrapingexpert.com offer anonymous proxy solutions, but these often carry a fairly hefty setup fee to get you going.
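The rotating-proxy idea can be sketched with the standard library alone. The proxy addresses below are illustrative placeholders, not live servers, so the opener is built but no request is actually sent:

```python
# Rotating-proxy sketch: each request is routed through the next proxy in a
# round-robin cycle, so successive requests appear to come from different IPs.
# The proxy addresses are placeholders (TEST-NET range), not working servers.
import itertools
import urllib.request

PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
_rotation = itertools.cycle(PROXIES)

def opener_for_next_proxy():
    """Build a urllib opener routed through the next proxy in the cycle."""
    proxy = next(_rotation)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Four "requests" use three different exit addresses, then the cycle repeats:
used = [opener_for_next_proxy()[0] for _ in range(4)]
print(used)
```

A real deployment would call `opener.open(url)` with error handling, retire proxies that fail, and pace requests, but the round-robin core is the same.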

After performing a simple Google search, I quickly found a company that provides anonymous proxy servers for data scraping purposes.

The majority of websites today are written so that their text is easily accessible in the source code. However, some companies currently use the Adobe Portable Document Format, or PDF, whose reader software supports almost any operating system. There are many benefits to choosing PDF files: the format is ideal for business documents and even specification sheets. Of course there are also disadvantages. One is that the text in the file can end up stored like an image, and the problem with this often shows up when it comes to copying and pasting.

That is why some are starting to extract information from PDFs. The process is often called PDF scraping: scraping that gets you only the information contained in PDF files, in the same way as the scraping of web data.

Source: http://www.selfgrowth.com/articles/website-data-scraping-are-definitely-the-better-choice