Web Page Extractor
Alexiei Dingli, Shirley Cini
University of Malta

ABSTRACT The goal of the Web Page Extractor system is to automate a substantial amount of the work involved in investigating cyber-crime web pages. Currently, the case reports generated by the cyber-crime unit contain limited information. The artifact is developed in Java and is platform and browser independent. Given a URL, the system is able to identify personal details, addresses, locations and organizations. A case report showing this information together with the WHOIS data of the domain is generated. The report is initially displayed in a web page, but the user has the option to convert it into a PDF. In addition, an option is available for generating a graph of relations concerning the persons recognized in the content of the web page being investigated. The graph gives users an insight into the degree of participation these persons have with respect to the web page content. The graph feature uses a search engine to find additional information about the persons selected by the user in online sources. These results are examined by the information extractor, and any entities matching those extracted from the original URL are retained. Moreover, if the user selects more than one person, the graph feature captures data shared by the whole group of persons to reveal any relationships that might exist. The system is also capable of parsing the HTML source code of the web page specified by the user with the aim of downloading any video links, images, external links and cookies into a folder on the hard drive.

Categories and Subject Descriptors [Information Extraction]: extraction tools

General Terms Algorithms, Design, Languages

Keywords Information extractor, relations graph, cyber-crime

1. INTRODUCTION The Web Page Extractor project was partially proposed by the Malta Police Force Cyber-crime Unit to aid them in identifying illegal web pages on the web. In the context of this project, illegal refers to web pages containing material associated with child abuse, drugs and other related content. The web page analysis process is currently tackled manually and is very time consuming and error-prone. A police officer specialized in cyber-crime is required to find information about the assignees of the domain name and retain a screen shot of the results. Web pages are examined visually in order to detect any valuable pictures. Useful information is currently stored on disk, so there is no search facility to find related cases. Finally, a report is manually written to describe the information discovered in the web page and whether it is considered an illegal web page.

These tasks can be accomplished more accurately and efficiently by developing a software solution which automatically discovers the assignees of the web page and extracts personal information, addresses, locations and organizations from it. This information is stored in a database, and a report containing a summary of what has been discovered is generated automatically. The police are only required to read the report and approve it. To enhance the functionality of this project, an extra feature was introduced whose task is to construct a graph which demonstrates the degree of contribution a person has made to the overall content of the web page.

2. BACKGROUND Information on the World Wide Web is rapidly increasing, so constructing a system capable of extracting information without human intervention is a necessity. The evolution of web mining has led researchers to focus on three main aspects: content, structure and usage extraction. Although the system described in this paper can be used to extract any web site, cyber-crime is its main application. Cyber-crime [1] [2] is related to common criminal activities such as pornography, fraud and drugs; however, it is transmitted instantaneously and accessed by masses of people. Cyber-crimes are growing rapidly due to the Internet, file sharing, intranets, extranets and device interchange. As mentioned in [3], electronic data can be considered raw, structured or semi-structured. Raw data includes videos and pictures. Although the structure of some data may not be as stringent as that found in databases, it is still regarded as structured data. Semi-structured data refers to scenarios where the structure is embedded within the data and hence needs to be extracted. Research in natural language processing aims to develop software applications capable of analyzing, interpreting and producing the natural languages understood by people. A survey of Web data extraction tools [4] identified diverse approaches to developing wrapper tools, divided into the following taxonomy:

2.1 Languages for wrapper development These languages were intended to replace general purpose languages such as Java, which were being used for the construction of wrapper systems. Languages designed to address this problem include Minerva [5] and TSIMMIS [6]. The Minerva language is based on a declarative grammar methodology and adopts certain characteristics from procedural programming languages. Minerva's grammar is expressed in EBNF notation and also comprises a language for examining and restructuring documents. TSIMMIS is capable of handling semi-structured data. It comprises a set of wrappers that can be configured using a ready-made specification file. The specification file is written manually by the user and contains a set of instructions, each representing an extraction rule of the form [variables, source, pattern].

2.2 HTML based wrappers The HTML toolkit approach includes XWRAP [7], which provides a manageable interface and a library to assist users in the development of wrappers. During the preprocessing stage XWRAP analyzes the document to remove useless tags and syntax errors and then generates a parse tree. At each step XWRAP guides users in choosing the appropriate components from its library, ending up with a ready-made Java-coded wrapper. RoadRunner [8] is a fully automatic tool in the HTML-based category. Its main functionality lies in comparing the structure of sample pages pertaining to the same topic (page class) in order to infer a schema for the data in these pages. A grammar capable of discovering attribute instances is derived from the schema. Hiremath and Algur [9] proposed a tool with the aim of identifying the relevant data in a web page by mining data items from it using list-based and visual-clue techniques. The implementation of this system is divided into two parts. The first phase uses visual clues for ascertaining and extracting the data regions of a web page; this algorithm is named VASP. The second phase uses the output of VASP to discover the data records and extract the data items from them; this algorithm is named EDIP.

2.3 Wrappers based on linguistic features SRV [10] is a top-down relational algorithm that aims to learn a set of extraction rules from a set of training examples given as input. The basis of this tool is a group of "token oriented features". The learning process involves identifying and generalizing the training elements residing in the examples. WHISK [11] is another data extraction tool which derives a series of extraction rules from the training examples supplied to the system. Initially the set of extraction rules is empty and it is populated in every iteration. These rules are formed from regular expression patterns which locate relevant phrases and their corresponding delimiters. Unlike SRV, WHISK is capable of extracting more than one record from a given document. GATE [12] incorporates a library and a collection of reusable language engineering modules which together can accomplish language processing tasks such as semantic tagging; users can thus start working on new applications, and resources which have been constructed can be reused. GATE provides a series of processing resources which collectively compose ANNIE and are suitable for supporting the development of natural language processing applications. Essentially, ANNIE is an information extraction system which can be operated on its own, i.e. without being combined with other components. The processing resources comprising ANNIE use the GATE API for communication purposes and consist of: a tokeniser, which partitions text into smaller parts (tokens); a sentence splitter, which consists of a set of transducers whose objective is to divide the input text into sentences; a tagger, which uses the text output by the sentence splitter and produces part-of-speech tags conceived as annotations on terms, which can be used by the grammar; a gazetteer, which contains a collection of entity names, for instance country names, and frequently used titles such as PLC; and a semantic tagger, which comprises a group of patterns and annotations written in the JAPE language. Patterns can be created either by the user, by defining some string in the text, or by using the annotations built by the preceding components; these are used to find matching terms in the text, and annotations are generated as an outcome. An orthomatcher identifies entity relationships in order to achieve co-referencing or entity tracing, and a co-referencer analyses the text to determine identity relationships amongst entities.
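Since ANNIE is the pipeline later adopted in this work, a brief sketch of how it is typically embedded from Java may be useful. This is a minimal illustration rather than code from the paper: the saved-application path follows the layout of classic GATE releases, and the URL is purely illustrative.

import gate.*;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.net.URL;

public class AnnieSketch {
    public static void main(String[] args) throws Exception {
        Gate.init();  // one-off GATE initialisation

        // Load the ready-made ANNIE pipeline shipped with GATE
        SerialAnalyserController annie = (SerialAnalyserController)
            PersistenceManager.loadObjectFromFile(
                new File(Gate.getPluginsHome(), "ANNIE/ANNIE_with_defaults.gapp"));

        // Wrap the page under investigation in a corpus and run the pipeline
        Corpus corpus = Factory.newCorpus("case-corpus");
        Document doc = Factory.newDocument(new URL("http://www.example.com/"));
        corpus.add(doc);
        annie.setCorpus(corpus);
        annie.execute();

        // Entity annotations produced by ANNIE (Person, Location, Organization, Address)
        AnnotationSet persons = doc.getAnnotations().get("Person");
        for (Annotation a : persons) {
            String name = Utils.stringFor(doc, a);
            Object gender = a.getFeatures().get("gender");
            System.out.println(name + " (" + gender + ")");
        }
    }
}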

2.4 Inductive wrappers Extraction tools can also depend on inductive methods. One of these tools is WIEN [13]; it is fed a set of pages which are expected to have a pre-defined structure and markings on the data of interest. After applying certain induction heuristics, WIEN outputs a wrapper consistent with these markings. STALKER [14] is a tool which builds upon the methodologies expressed in WIEN and SoftMealy. Moreover, it is capable of solving problems where hierarchical data extraction is a necessity. STALKER is given two inputs: a collection of training examples defined by a series of tokens which determine the boundaries of the data of interest, and some detail about the structure of a page. STALKER aims to produce extraction rules that are consistent with most of the given examples.

2.5 Modeling based wrappers NoDoSE [15] is a modeling-based tool designed to support pages that contain semi-structured data. It is a semi-automatic interactive tool which defines the structure within web pages. A document is segmented in stages: at each stage the user starts by using a complex structure to construct an object and then switches to simpler structures to divide it into further objects. When the user has repeated this process for a certain amount of time, the tool learns to segment the objects by itself. DEByE [16, 17] is given a collection of object examples acquired from a sample web page and returns a set of extraction patterns useful for extracting objects from pages similar to the sample page. This tool provides a graphical interface that lets users construct nested tables by collecting fragments of data from a sample page. Such tables serve as examples of the objects which need to be recognized in pages, and DEByE produces object extraction patterns based on them.

2.6 Ontology based wrappers An ontology-based wrapper identifies constants in the web page and builds objects using these constants. An ontology was developed at Brigham Young University (BYU) by the Data Extraction Group. The aim of the ontology is to describe the data of interest together with relationships, keywords and lexical appearance. The ontology entails a function that automatically extracts phrases in text that contain the data of interest [4].

2.7 Identification of undesirable pages on the web A lot of effort is required to identify immoral offences, abolishment of the rights and freedom of man, and murders, as it is not clear whether an undesirable page includes an abuse of the law or whether it is just a representation of the domain without any atrocious content. A project offering a solution to this problem was developed by gathering a set of pornographic documents and a set of decent documents and then performing training. Suffix trees were used to acquire sequences from pages, which results in a more precise classification. The project described in Henry's paper [18] focuses on filtering a URL requested by a client computer by checking a list of prohibited URLs; the problem tackled in that project falls in the category of exact string matching. Weiming Hu et al. [19] attempted to address the problem of recognizing pornographic content by dividing pornographic texts into three categories: over-blocking, which includes text associated with the subject of pornography such as sex education; misspelled text, which contains pornographic text in which terms are misspelled on purpose to inhibit recognition; and word lists, which are sets of rational terms related to pornography. The content representation of a web page is examined and it is categorized under one of three different page classes: continuous text, discrete text and images.

3. AIMS AND OBJECTIVES
3.1 Aims
• Important details such as email addresses, personal details and identification of trading tools are extracted from a web page and stored in the database. Information regarding the registered users or assignees of the web page is acquired.
• Additional information about persons pinpointed in the web page during extraction is gathered from external sources on the web. The material acquired is passed through the extraction system and the results are compared to the data extracted from the original web page in order to identify any identical entities. These entities indicate the level of participation a person has in the material of the web page.
• The additional information gathered from external sources is also used to identify any similarities between the persons under examination.
• Users are able to download any images, video links, external links and cookies from the web page.

3.2 Objectives
• An extraction system will be employed to extract important details from the web page supplied by the user. An online database maintaining WHOIS data will be queried to identify any registered users or assignees of the web page.
• A search engine will handle external research about specific persons.
• Entities shared between the persons selected by the user, as well as entities pertaining to individual persons, will be displayed in a graph in order to help the user visualize the relations a person has to the original content and to other persons.
• An HTML parser will analyze the HTML source code to extract the required data.

4. DESIGN AND IMPLEMENTATION A web application was chosen for this project because the environment in which the application is intended to operate requires multi-user access, and a web application avoids the installation and configuration of individual computers. The Web Page Extractor system was developed entirely on the Java Platform, Enterprise Edition (Java EE), as it offers an effective set of APIs which reduces development time. The Java Persistence API, Enterprise JavaBeans, JavaServer Pages and Java servlets are used extensively in the system. The Web Page Extractor system follows the Model View Controller (MVC) [20] design pattern. It is divided into two main folders: src, which contains all the Java classes, and web content, where the JSP files are held. The src folder consists of two fundamental packages, com.logic and com.presentation. Java beans and other classes that deal with business logic are maintained in com.logic, each exposing a well-defined interface, while com.presentation comprises the servlets (controllers) and data models. MVC sustains loose coupling between the model and the view, providing an environment in which the application can be easily coded and maintained.

ANNIE is the extraction tool employed throughout this project. GATE offers a high degree of automation, as it can measure the performance of language processing modules on its own. The integration of modules does not involve any overhead thanks to the use of open standards such as Java and XML, and ANNIE can be customized according to the user's needs. Additionally, the development environment provided by GATE minimizes the overall time required to develop a system, as it guides users throughout the entire progression of a language engineering system and supports debugging techniques useful when constructing components from scratch. Moreover, the set of annotations can easily be expanded by adding new entities [21]. It was decided that the search engine driving the relations graph feature would be Bing, due to its unlimited API calls and diversity of protocols. Jsoup is used to parse the HTML source code of a web page for the purpose of collecting information about images, external links and video links. Moreover, JGraph is the API used to visualize the graph, as it offers complete documentation and provides concrete examples of implementing graphs in server-side components.
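As an illustration of the MVC layering described above, a controller servlet in com.presentation might delegate to a business bean in com.logic and forward the result to a JSP. This is a minimal sketch; the class, bean, parameter and JSP names are hypothetical and are not taken from the paper.

package com.presentation;

import javax.servlet.ServletException;
import javax.servlet.http.*;
import java.io.IOException;
import com.logic.CaseReportBean;   // hypothetical business-logic bean

// Controller in the MVC sense: receives the form submission, calls the
// business logic and forwards the resulting model to a JSP view.
public class GenerateReportServlet extends HttpServlet {

    private final CaseReportBean reportBean = new CaseReportBean();

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String url = req.getParameter("caseUrl");

        // Model: WHOIS data plus the entities extracted by ANNIE
        Object report = reportBean.buildReport(url);
        req.setAttribute("report", report);

        // View: a JSP page renders the case report
        req.getRequestDispatcher("/report.jsp").forward(req, resp);
    }
}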

Figure 1: System Features

Figure 1 illustrates the features provided by the system. A registered user is able to generate a case report, search for a particular case and download any information related to a case. Filters are employed to ensure that a user is logged into the system before exercising these features. The generate report feature (see Figure 2) prompts the user to enter case-related details, which include the URL of the web page under investigation. The report generation process starts by checking whether the URL specified in the case details is supported by a WHOIS server, as the WHOIS database used in this project is limited to COM, NET and EDU domains and registrars. If the URL turns out to be unsupported, a note informs the user in the final case report and the process carries on. However, if the domain is supported, a socket connection is opened on port 43 to the host "whois.internic.net" to acquire information related to the domain together with its registrars and name servers. This information is stored in the database for later use. Thereafter, ANNIE is employed to extract specific data from the content of the given URL. All initializations dealing with GATE and ANNIE are performed during login to enhance the performance of the generate report feature. As a result, ANNIE returns organization names such as BOV; person names and their gender, such as Joe Vella – male; locations and their type, such as Italy – Country or Rome – City; and addresses together with their type, such as http://www.google.com – URL or 109 Corporation Rd, Grimsby, Lincolnshire, England – Complete. When the case report is displayed, the user is offered two links, one to generate a PDF file of the case report and another to create a graph of relations. As soon as the link stating "Create a graph of relations" is clicked, the user is shown the list of persons' names that were extracted by ANNIE while producing the case report. The user selects up to three persons to generate a relations graph. The graph consists of n + 1 distinct sub-graphs, where n represents the number of persons selected by the user and the additional sub-graph designates data shared amongst these n persons. The Bing search engine is employed to locate supplementary information from the web about each person selected by the user. Every URL returned by Bing is transformed into a document, inserted into a GATE corpus and processed by ANNIE. As a result, ANNIE yields data about organizations, persons, locations and addresses. Unlike the generate report feature, the results of ANNIE are not stored in the database; they are only compared against the original data to identify matching entities. The term "original data" refers to the data extracted from the URL supplied by the user as part of the case details. The sub-graphs associated with individual persons contain nodes which exhibit the data extracted from the Bing results that coincides with the original data. The system also caters for recognizing repetitive data in diverse materials: an entity which is detected in more than three web pages is given a red edge in the resulting graph. Such graphs guide the user to detect who the main participants contributing to the original material are. The links for graph and PDF creation are also presented together with the results returned from a case search. Users are offered a feature for downloading images, videos and external links hidden in the HTML structure of the web site, as well as the cookies residing on the client's browser.
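The registrar lookup described above is a plain-text exchange over TCP port 43; the following is a minimal sketch of that step. The host name and port come from the paper, while the class and method names are illustrative.

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class WhoisClient {

    // Queries the InterNIC WHOIS service (COM, NET and EDU registrations)
    // and returns the raw response text.
    public static String lookup(String domain) throws IOException {
        try (Socket socket = new Socket("whois.internic.net", 43);
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII));
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII))) {

            out.print(domain + "\r\n");        // WHOIS protocol: query line terminated by CRLF
            out.flush();

            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();        // registrar, registrant and name server details
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lookup("example.com"));
    }
}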

Figure 2: The Generate Report and Generate Graph features

Any information downloaded by the system is stored in a folder on disk, where the default path is C:\\. However, the user can modify the destination folder by typing in the desired path name; if the system detects that the given path name is invalid, the user is shown an error note. The Jsoup parser is employed to parse the HTML code of the underlying web page. Image tags are examined to obtain the attribute designating the source URL of the image; the image is acquired from the web via its source URL, transformed into a buffered image and stored as a JPG in a sub-folder named "images". Moreover, anchor tags are extracted and stored in an HTML file named "External Links", whereas tags that embed video content are retained in an HTML file named "Video Links". By selecting a link residing in these files the user is directed to the web page containing the related content. Cookies are acquired with the support of the CookieManager Java API [22].
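A minimal sketch of this parsing step using the Jsoup API is shown below. The selector strings and output file names follow the description above but are the obvious choices rather than code quoted from the paper.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.awt.image.BufferedImage;
import java.io.*;
import java.net.URL;
import javax.imageio.ImageIO;

public class PageDownloader {
    public static void download(String pageUrl, File targetDir) throws IOException {
        Document doc = Jsoup.connect(pageUrl).get();

        // Images: resolve each src attribute to an absolute URL, fetch and save as JPG
        File imageDir = new File(targetDir, "images");
        imageDir.mkdirs();
        int i = 0;
        for (Element img : doc.select("img[src]")) {
            BufferedImage image = ImageIO.read(new URL(img.absUrl("src")));
            if (image != null) {
                ImageIO.write(image, "jpg", new File(imageDir, "image" + (i++) + ".jpg"));
            }
        }

        // External links: collect every absolute href into an HTML file
        try (PrintWriter links = new PrintWriter(new File(targetDir, "External Links.html"))) {
            for (Element a : doc.select("a[href]")) {
                links.println("<a href=\"" + a.absUrl("href") + "\">" + a.text() + "</a><br>");
            }
        }
    }
}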

Administrators are ordinary users with superior permissions who are responsible for registering, deleting and searching users, amending users' passwords and viewing administrative activities. The history feature consists of records about which administrators used the system, the kind of activity performed and on whom the action was performed. As with the user features, administrators are also required to be logged in before attempting to carry out a task.

5. RESULTS AND EVALUATION All entities extracted by ANNIE are filtered to present the user with a cleaner case report. Levenshtein distance is employed to eliminate entities that are at least 80% similar to some other entity of the same type. Moreover, the feature map of each entity is examined to collect additional information. Person names which do not contain a gender attribute or do not contain a title such as "Mr" are ignored. Moreover, locations, addresses and organizations with a missing type attribute are not taken into account. A collection of 20 web pages was processed by the information extractor and the results were analyzed visually to assess the precision of valid entities after filtering. Table 1 shows the proportion of valid entities for each annotation type as percentage values.

Table 1. Results after filtering the entities

Persons     Locations     Addresses     Organizations
77.5%       99.0%         96.5%         82.4%
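The 80% similarity filter described above can be sketched as follows. The normalisation of edit distance into a percentage, and the choice of which duplicate is kept, are assumptions about how the threshold is applied; the method names are illustrative.

import java.util.*;

public class EntityFilter {

    // Classic dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity as a percentage of the longer string's length.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 100.0 : 100.0 * (maxLen - levenshtein(a, b)) / maxLen;
    }

    // Keeps only the first of any pair of same-type entities that are >= 80% similar.
    static List<String> filter(List<String> entities) {
        List<String> kept = new ArrayList<>();
        for (String candidate : entities) {
            boolean duplicate = false;
            for (String existing : kept) {
                if (similarity(candidate.toLowerCase(), existing.toLowerCase()) >= 80.0) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) kept.add(candidate);
        }
        return kept;
    }
}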

Given that the overall precision of the entities processed by the system is 88.9%, the end user is presented with data that can be easily comprehended. Although entities of type Person have the lowest proportion of valid entities (77.5%), this is considered quite a reasonable value when taking into account the ambiguity of terms related to person names. Entities of the location and address categories are the most accurate, carrying precision values above 95%. Unlike person names, country names and their cities are standard and finite, so the gazetteer can be supplied with complete location details. The majority of address entities appear in web pages using the same structure; for instance, a URL always begins with a transfer protocol such as http or ftp, and an IP address consists of groups of integers separated by dots. Therefore, the JAPE rules integrated within ANNIE can identify such entities easily. In the case of organization entities, identification is somewhat more cumbersome due to name ambiguity. Although the system detects a number of entities as being organizations, the filtering process eliminates the useless ones and retains those which are related to well-known organizations such as Vodafone and University. The final precision value of the entities related to organizations is 82.4%. A sample of reports generated by the cyber-crime members was compared to the reports generated automatically by the system. A typical cyber-crime report consists of partial references to the WHOIS data acquired and of screen shots of the web page under consideration and of the results returned by the WHOIS server. The automated system is not only capable of undertaking the manual task but also provides sophisticated features for collecting personal details, addresses, locations and organization information contained in the web page. These personal details can be used to create a graph of relations for the purpose of designating the collaboration of particular persons in the web page. Manual reports are generated by typing a Word document to illustrate the findings, whereas the automated system only requires the user to enter the case details in an HTML form, as the entire analysis process is performed automatically. The data belonging to the final report is stored in a database to supply the user with a comprehensive searching facility. When the contents of the final report are displayed, the user can easily transform them into a PDF file by simply selecting the option named "Generate a PDF"; the purpose of the PDF is to make printing and transmission of the data convenient. Moreover, the traditional process requires cyber-crime members to visually analyze each image present in the web page and store it on disk by right clicking and pressing the "Save Image As" option. This problem has been addressed by implementing an HTML parser in the Web Page Extractor system for identifying images, external links and video links. By default, results are placed in a folder on drive C, which consists of two HTML files for the external and video links, a sub-folder for the images, a text file for cookies and a copy of the web page. After the demo was performed, Superintendent Paul Caruana clearly stated that the system fulfilled all the requirements: "The tool automates a number of processes that normally require manual intervention from members of our team. We believe that Ms. Cini has managed to address all the project prerequisites."

Moreover, a survey was conducted to determine the degree of relevance of the entities extracted from the results returned by the search engine during the construction of the relations graph. The questions were based on a case report containing personal information pertaining to the surveyed person, and 20 surveys were distributed to people in professional careers. In question 1, surveyed people were asked to look at the entities listed under their name and give a rating from 1 to 5 to indicate their degree of relation to these entities. The results convey that more than half of the entities pertaining to individual people are relevant, as no low ratings (i.e. 1 or 2) were given. In question 2, surveyed people were asked to check whether any entities listed under their name were marked with an asterisk (*); an asterisk indicates that the entity has been detected in more than 3 web pages. Nearly half of the reports generated included entities which were detected in more than 3 web pages; such entities are supposed to indicate a deep relationship between the person and these entities. In question 3, people were asked to give a rating indicating the depth of their relationship with these entities. About three quarters of the people replied that they are highly related to these entities; only one person gave a rating of 3, whereas another two declared that 4 was the adequate rating. While extracting search results pertaining to the people chosen, similar entities were retained and shown to the user under a heading named "Similarities"; the relevance of such entities was assessed in questions 4 and 5. Question 4 asked the reader to look at the persons' names selected prior to the generate graph feature and state whether he or she is familiar with them or not. 80% of the people surveyed agreed that they know the people with whom they were associated. Although 20% claimed not to be acquainted with the people selected by the user, this does not mean that the similarities information is false, but rather that these people have no direct relationship; for instance, they might be working within the same organization but never had the opportunity to collaborate. In fact, a true example of this situation is demonstrated in the survey of Mr. James Pawney, who declared that he does not know the people being compared with him, but is familiar with other people shown in the section named personal information. In question 5 people were asked to choose a category (yes, no, partially) to declare whether the entities listed in the similarities section convey information which divulges an association between the group of people selected. The majority replied that not all entities give evidence of a true relationship; 31% agree that the similarities entirely expose common facts, whereas a small percentage argued that this information is incorrect. As stated before, people might be associated via some form of organization but have never cooperated with each other.

6. CONCLUSION AND FUTURE WORK The requirements stated during the planning phase of the project have been entirely accomplished. The Web Page Extractor system has improved the way a cyber-crime investigation is undertaken. The automated system comprises tasks which were not handled by the conventional process, including analysis of the web page content for the purpose of identifying personal details, addresses, locations and organization information; generation of a graph of relations from the list of persons recognized in the given URL; and HTML source parsing for detecting external links, video links and images. Users are provided with a search facility, as all generated reports are stored in a database. All extracted entities are filtered in order to present the user with valuable data. Feedback acquired from the surveyed people shows that the relations graph feature provides the user with important data. Surveyed people proposed several improvements with regard to the similarities graph: in their opinion, similar entities should contain supplementary information describing the relationship. Although the goals of the system were entirely fulfilled, there is always room for improvement. Extraction can be enhanced by providing a feature which enables the user to insert new entries into the gazetteer in order to enlarge the set of entities used for subsequent annotations. Moreover, video links are not being captured in a robust manner. Since a video can be embedded using a number of diverse tags, it is difficult to settle on a particular tag name and extract the information within that tag. A typical approach would analyze the entire set of tags that are expected to comprise video files and check whether the attributes within each tag contain any video material. The graph feature can be enhanced from several perspectives; for instance, the time stamp of the web page from which an entity is extracted can be retained to track the relationship between the person and the entity by gathering information from diverse web pages. The WHOIS feature needs to be extended to handle every existing domain name, not only specific ones.


7. ACKNOWLEDGEMENTS I am grateful to Dr. Alexiei Dingli for kindly offering to supervise this thesis; his professional cooperation meant a lot to me. I am indebted to Superintendent Paul Caruana and the entire cyber-crime team, who supplied descriptive requirements for the project and graciously provided me with material for testing purposes.

8. REFERENCES
[1] Broadhurst, R. (2006). Developments in the global law enforcement of cyber-crime. Policing: An International Journal of Police Strategies and Management.
[2] Benjamin, R., Gladman, B., Bs Oue, B., & Randell, B. (n.d.). Protecting IT Systems from Cyber Crime.
[3] Abiteboul, S. (1997). Querying Semi-Structured Data.
[4] Laender, A. H., Ribeiro-Neto, B. A., da Silva, A. S., & Teixeira, J. S. (June 2002). A Brief Survey of Web Data Extraction Tools. SIGMOD Record, 84-93.
[5] Crescenzi, V., & Mecca, G. (1998). Grammars have exceptions. Information Systems.
[6] Hammer, J., McHugh, J., & Garcia-Molina, H. (1997). Semistructured data: The TSIMMIS experience. In Proceedings of the First East-European Workshop on Advances in Databases and Information Systems, ADBIS '97 (pp. 1-8).
[7] Liu, L. P. (2000). XWRAP: An XML-enabled wrapper construction system for Web information sources.
[8] Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: Towards automatic data extraction from large Web sites. In Proceedings of the 26th International Conference on Very Large Data Bases. Rome, Italy.
[9] Hiremath, P. S., & Algur, S. P. (2009). Extraction of Data from Web Pages: A Vision Based Approach. International Journal of Computer and Information Science and Engineering.
[10] Freitag, D. (1998). Machine learning for information extraction in informal domains.
[11] Califf, M. E., & Mooney, R. J. (1999). Relational learning of pattern-match rules for information extraction.
[12] Newman, D., Chemudugunta, C., & Smyth, P. (n.d.). Statistical Entity-Topic Models.
[13] Kushmerick, N. (2000). Wrapper induction: Efficiency and expressiveness. Artificial Intelligence.
[14] Muslea, I., Minton, S., & Knoblock, C. A. (2001). Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 93-114.
[15] Adelberg, B. (1998). NoDoSE - A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. In SIGMOD Record (pp. 283-294). Seattle, WA.
[16] Laender, A. H., Ribeiro-Neto, B., & da Silva, A. S. (2002). DEByE - Data Extraction By Example.
[17] Ribeiro-Neto, B., Laender, A. H., & da Silva, A. S. (1999). Extracting Semi-Structured Data Through Examples. In Proceedings of the International Conference on Knowledge Management (pp. 94-101). Kansas City, MO.
[18] Ma, H. (February 2008). Fast blocking of undesirable web pages on client PC by discriminating URL using neural networks. Expert Systems with Applications, 1533-1540.
[19] Hu, W., Wu, O., Chen, Z., Fu, Z., & Maybank, S. (June 2007). Recognition of Pornographic Web Pages by Classifying Texts and Images.
[20] JCorporate - http://www.jcorporate.com
[21] Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: an Architecture for Development of Robust HLT Applications. In Recent Advances in Language Processing (pp. 168-175).
[22] Java SE 6 Documentation - http://docs.oracle.com/javase/6/docs/
