Editor's note: Speaking to The New York Times in 1992, Mitch Kapor, founder of Lotus Development Corp. and co-founder of the Electronic Frontier Foundation, compared the internet to a library where the books have been tossed on the floor. A quarter of a century later, the task of culling useful information from the chaos of unstructured data is still no picnic. But custom and commercial offerings, bolstered by machine learning, now facilitate web data extraction, writes Moshe Kranc.
The World Wide Web is a vast source of information on any conceivable topic. Imagine you could access this information with the same ease as you access structured data in a database, using a SQL-like query language: "Search the WWW for all used car sales in Australia from 1990 to 2010, and calculate the total sales volume, grouped by year, make and model."
Armed with a web-scraping tool like that, you could automate many consumer and business reports that currently require massive manual effort and access the web as if it were one enormous database. For example:
- industry analysis reports, e.g., trends in the used car market in Australia;
- aggregated ratings for restaurants or movies, based on reviews and social media mentions;
- metadata about TV shows, including broadcast channel and time, program duration and rich media about the program; and
- a comparison-shopping catalog of all online retailers who sell any given product.
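To make the database analogy concrete, here is a minimal sketch of what that used-car query could look like once extracted data has landed in a relational store such as SQLite. The table name, columns and sample rows are all hypothetical; a real pipeline would populate the table from scraped pages:

```python
import sqlite3

# Hypothetical rows that a web-extraction pipeline might have produced.
rows = [
    ("2009", "Toyota", "Corolla", 1200),
    ("2009", "Toyota", "Corolla", 800),
    ("2010", "Holden", "Commodore", 950),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE used_car_sales (year TEXT, make TEXT, model TEXT, units INTEGER)"
)
conn.executemany("INSERT INTO used_car_sales VALUES (?, ?, ?, ?)", rows)

# The SQL-like query from the text: total sales volume, grouped by
# year, make and model.
results = conn.execute(
    "SELECT year, make, model, SUM(units) AS total "
    "FROM used_car_sales GROUP BY year, make, model ORDER BY year"
).fetchall()

for row in results:
    print(row)
```

The hard part, of course, is not the query at the end but filling that table in the first place, which is the subject of the rest of this article.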
Easing web data extraction
Why don't we have such a query language? After all, Google does a pretty good job of finding most of the relevant pages. The real problem is extracting structured data from the largely unstructured mess that is the web. Going back to our example of used car sales in Australia, some of the data exists in tables in PDF files, some is found in press releases, and still other data may be buried in footnotes within articles published by the Australian Bureau of Statistics. To truly turn the web into a database requires the ability to automatically identify and extract structured key-value data from a mass of unstructured text stored in a variety of formats.
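As a toy illustration of that extraction step, a rule-based extractor might pull key-value pairs out of free text with regular expressions. The press-release snippet and the pattern below are invented for illustration; real inputs would be scraped HTML or PDF text, and real systems need many such rules (or a learned model) per source:

```python
import re

# Invented press-release snippet standing in for scraped unstructured text.
text = (
    "Used car sales in Australia reached 42,500 units in 2009, "
    "up from 39,800 units in 2008."
)

# One hand-written rule: capture (units, year) pairs from phrases like
# "42,500 units in 2009".
pattern = re.compile(r"([\d,]+)\s+units\s+in\s+(\d{4})")

records = {}
for units, year in pattern.findall(text):
    records[year] = int(units.replace(",", ""))

print(records)
```

Hand-written rules like this are brittle, which is exactly why the machine learning approaches discussed below matter.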
Moshe Kranc, CTO, Ness Digital Engineering
Both custom solutions and commercial products can facilitate web data extraction with a minimum of effort. For example, import.io provides a point-and-click interface that enables the user to teach the system how to extract data for a given website, enhanced with machine learning to infer extraction patterns for new sites based on knowledge learned from other sites. My company, Ness Digital Engineering, has also worked with clients to develop custom solutions designed to address a particular set of needs.
Guidelines for web data extraction tools
Whichever solution is chosen to best fit the use case, there are some important guidelines to consider when implementing a solution:
- Web crawling must be scheduled very carefully, so as not to harm the responsiveness of the target site. Crawl too many pages in too few seconds, and you will be identified as a denial-of-service attacker and blocked from the crawled site forever.
- Sites have different information in different regions, so you may need to use proxies to see all versions of a given site. For some sites, even a proxy is not enough, and you may need to set up a machine physically located in the desired region.
- Save the raw page content after extracting data from it. That way, you can go back and determine why you extracted the value you extracted, even if the publisher has subsequently modified the page.
- Initially, the system needs to be trained by a human being, who points out content to be extracted from the webpage. Make this system as easy to use as possible, and assume users do not understand HTML or anything that goes on under the hood in a webpage.
- The system should learn from user actions over time, so that, when introduced to a new website, it recognizes textual patterns that are likely to contain useful data. It's this machine learning component that enables a well-trained web data extraction system to approach the goal of fully automated web extraction, even for webpages it has never seen before.
- When extracting data, you will inevitably find multiple sources for the same data item, e.g., two sources that describe the same TV show. The data must be deduplicated by finding matching records, which are then merged to form the master record for the given data item. This merge must handle conflicting data by giving precedence to more authoritative websites.
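The crawl-scheduling advice in the first guideline can be sketched as a simple per-host rate limiter. The delay value and host name below are arbitrary; a production crawler would also honor robots.txt, randomize its timing and back off on errors:

```python
import time

class PoliteScheduler:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.last_request = {}  # host -> monotonic timestamp of last fetch

    def wait_turn(self, host):
        """Block until it is polite to hit `host` again."""
        now = time.monotonic()
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_request[host] = time.monotonic()

scheduler = PoliteScheduler(min_delay_seconds=0.05)
start = time.monotonic()
for _ in range(3):
    scheduler.wait_turn("example.com")  # placeholder host, no real fetch here
elapsed = time.monotonic() - start
# Three requests to one host take at least two full delay intervals.
print(f"elapsed: {elapsed:.3f}s")
```

Keeping the limiter per-host means a crawl can still run many sites in parallel without hammering any single one of them.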
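The deduplication-and-merge step in the last guideline might look like the following sketch, where records are matched on a normalized title and field conflicts are resolved by a hand-maintained ranking of sources. All source names, fields and rankings here are illustrative:

```python
# Hypothetical extracted records describing the same TV show, from two sources.
records = [
    {"source": "blog.example", "title": "The Nightly Show ",
     "channel": "Ch 9", "duration_min": 30},
    {"source": "broadcaster.example", "title": "the nightly show",
     "channel": "Channel 9", "duration_min": None},
]

# Illustrative precedence: a lower rank wins a field conflict.
SOURCE_RANK = {"broadcaster.example": 0, "blog.example": 1}

def match_key(record):
    """Normalize the title so near-duplicate records collide."""
    return " ".join(record["title"].lower().split())

def merge(group):
    """Fold matching records into one master record, best source first."""
    master = {}
    for rec in sorted(group, key=lambda r: SOURCE_RANK[r["source"]]):
        for field, value in rec.items():
            if field == "source":
                continue
            # Keep the first non-empty value, i.e. the most authoritative one.
            if master.get(field) in (None, "") and value not in (None, ""):
                master[field] = value
    return master

groups = {}
for rec in records:
    groups.setdefault(match_key(rec), []).append(rec)

masters = [merge(group) for group in groups.values()]
print(masters)
```

Note how the authoritative broadcaster wins the channel-name conflict, while the blog still fills in the duration the broadcaster lacked; a good merge keeps the best of every source rather than discarding whole records.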
For more reading
In 1992, Mitch Kapor was featured in this New York Times article, "Technology; The Network of All Networks" by Robert E. Calem. The article provides an interesting look at the early days of the commercial internet. Check it out.
Whether you use a commercial product or a custom system, it is possible today to extract information from the web that drives your business. It's not quite as easy as fetching data from a relational database, but thanks to machine learning, it is getting easier all the time.