Extracting tasks in design process records

June 14, 2017 | Author: Koichi Hori | Category: Design process, Text Analysis, Stream Processing, Documents

Description

2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE)

Extracting Tasks in Design Process Records

Katsuaki Tanaka
Center of Information and Communication Technology, Hitotsubashi University
Email: [email protected]

Koichi Hori
School of Engineering, The University of Tokyo

Abstract-We extracted the design process as a series of tasks for objects from design process records. First, we extracted topics and their transitions from documents. The system divided documents into smaller text fragments and grouped these into clusters. We regarded a cluster as a topic. The system iterated clustering as topic extraction at certain intervals based on the creation times of documents and generated a graph of topics and their transitions. Then the system chose clusters related to a focused object and searched the text fragments for words co-occurring with the object. It represented the clusters, labeled with these words, and their transitions.

Index Terms-Information Sharing System, Topic Detection, Document Stream Processing, Design Support System

I. INTRODUCTION

Recently, systems have become larger and more complicated, and many people are engaged in developing any one system. This makes it difficult for people to comprehend the overall system and sometimes causes serious accidents because information and knowledge are not shared among people. Many design support systems were developed to resolve this problem. These systems try to collect information using a model of the design target and its functions [1], [2]. Such a model would be a good basis for checking the design or supporting the next design process. But it requires designers to input information in order to construct the model. This changes the design method and incurs costs. Another approach to design support, which would be more efficient, is to acquire information from documents accumulated along with design activities, such as specifications and meeting minutes.

There are many studies on managing, retrieving, structuring and visualizing information to utilize a large set of documents [3]. These systems are effective in information retrieval and filtering. For example, full-text search and ranking are fundamental technologies for Web search engines. Automatic text categorization is another example of effective technology. These studies show that computers are good at storing and managing a lot of information. Such work could support a designer by giving structure to information and representing it. Therefore, how to organize accumulated documents is important in a design support system that aims at knowledge reuse and management.

We introduced chronological order to organize documents. In this paper, we propose a system that extracts topics and their transitions as a topic transition structure in chronological order and represents the design process as a series of tasks for an object based on the topic transition structure.

II. RELATED WORKS

One major method for organizing documents is clustering. There have been many studies on clustering methods, including hierarchical clustering [4], k-means clustering and fuzzy clustering. These clustering methods do not take time information into account, and therefore they cannot organize information related to the design process.

In the Topic Detection and Tracking (TDT) research [5], several tasks were set for handling news articles, such as topic detection and topic tracking. Several techniques for time series documents are being studied. For example, Yang and her colleagues applied the k-nearest-neighbor classification method to topic tracking [6]. Franz et al. [7] used a model to handle time information in news articles and reported that time information was useful in topic tracking. Hasegawa et al. proposed T-scroll [8], which extracts topics and their transitions from newspaper articles and represents them as a graph. These studies gave us ideas for developing our topic transition extraction system.

Chance Discovery [9], [10] is a method for structuring information and supporting decision making by KeyGraph [11]. A user acquires knowledge from the structure, and then takes actions to reorganize it and make a change. The information structure used in Chance Discovery represents relations between items of data, but it does not consider chronological order or causal relations between items. The information structure represents the structure at a given time. But in this method, information can be reorganized at any time in interaction with the user. In other words, the structure of information changes during the Chance Discovery process, and these changes represent the design process.

Hori [12] proposed a method called the Knowledge Nebula Crystallizer; it stores text units and organizes them to build new knowledge. Akaishi [13] proposed a topic decomposition and recomposition method based on the relationships between words in a document to show that a document contains a variety of knowledge descriptions. These studies gave us ideas for developing our system, which extracts design tasks for an object from topic transitions.

978-1-4577-0687-5/11/$26.00 ©2011 IEEE


Fig. 1. Topics and topic transitions, panels (a) and (b)

Fig. 2. Extracting design tasks for an object

III. TOPIC TRANSITIONS AND TASKS FOR AN OBJECT

A. Focusing to Topic Transitions in Chronological Order

When design work is performed by a group of people, the group holds meetings to share their tasks and records them in documents such as minutes. These documents are records of the design process from the group members' viewpoints. In this paper, we focus on them and try to discover clues to the knowledge applied to the design object.

There are various ways to find knowledge in a set of documents. The most heuristic method is for a person to read and understand the documents. A search engine is an effective help in this case. An automated method is text mining for a specific purpose. However, most methods handle documents as a set of descriptions of a whole creative process, without a chronological viewpoint. For example, as Fig. 1(a) shows, a method stores and uses whole documents, but does not consider the causal relationships between documents over time. It is like looking at a photograph of a tree that represents the outside appearance: the photograph does not show the growth rings inside the tree.

In the design of an artificial object, designers act on the object. Their intentions and methods of action are derived from their knowledge. Designers see changes as the results of their own tasks, and their knowledge encourages them to think of new changes. This procedure is iterated in the design process. In this way, the design process is composed as time passes. We can observe it in design process records as a series of design tasks. For example, the process of designing a radio communicator for a satellite consists of choosing a base model or a reference model, investigating whether it satisfies the requirements, examining it on a real machine, and then deciding whether to modify it or to consider another one. The knowledge applied to a design object lies behind its design process.

Chronological order can reorganize the information described in documents of the design process. That is, an object or an information structure at a given time does not exist of itself, but exists as an accumulation of changes to itself. By handling topic transitions, a user can grasp how the structure of the object at a given time has come to be organized. This is important information: a task that makes a new change to the object without knowing why the object takes its current form could undermine the intention of another designer or of the designer's own past self. By taking notice of the time at which documents were generated, a user can understand how topics changed in these documents and guess the intention behind the changes (see Fig. 1(b)). As in our example of a photograph of a tree, considering how time passes means looking at many photographs showing how the tree grew over time. From these considerations, we thought that a chronology of tasks in the information structure would be useful for knowledge reuse and management.

B. Design Tasks for an Object

When design work is performed by a group of people, each member of the group has his or her own view of the design target, and a designer ordinarily focuses on an object that is part of the target. Therefore a record of design work contains many different arguments running concurrently, and the extracted topic transitions consist of different kinds of discussions, like entangled strings. As discussed before, the knowledge applied to the design of an object lies behind its design process as a series of tasks. It is therefore desirable to untangle the extracted topic transitions so that they fit the object.

To reorganize topic transitions, we focus on an object that is a part of the design target: the system selects topics that mention the object and represents the transitions of these topics. Fig. 2 shows an example of selecting topics related to an 'Object' and representing the topic transitions about it. The tasks included in these transitions represent the design process.

Based on these ideas, we propose a system that consists of two parts. The first extracts topics and their transitions from documents, while the second represents the series of tasks for an object included in the topic transitions.

IV. EXTRACTING TOPIC TRANSITIONS

In this section we discuss the first system, which extracts topics and their transitions from documents that increase in number with time, such as meeting minutes. We introduced clustering to extract topics from documents. Clustering transforms data into a number of clusters, which are groups of data, without human supervision. It is often used to classify documents. We did not treat a cluster of documents as a topic, because a document often includes many topics and a cluster of documents would indicate a compound of topics. Instead, the system transformed each document into smaller text fragments and grouped these into clusters of fragments. We then regarded a cluster of text fragments as a topic.

The system iterated topic extraction not at each new document arrival but at certain intervals, because in the creative design process there are periods in which no changes are made and no documents are written. To capture such periods, we set checkpoints at selected time intervals in chronological order, made for each checkpoint a document group that also includes the older documents, and extracted topics from each group.

An old topic may be forgotten as time passes. We treated as new those clusters that contained new fragments, i.e., fragments made from documents newly added to the current document group. Clusters that consisted only of old fragments were treated as old topics. To forget old topics, the system reduced the weight of fragments in old topics when clustering them.

The next step was to detect the relationships between clusters from neighboring document groups. The system calculated the similarity between clusters based on the number of text fragments they had in common. In this way, the structure of clusters and their relationships in chronological order was obtained. We called this the topic transition structure.

To visualize the structure, we used the TouchGraph Link Browser¹. Fig. 3 shows an example of a visualized topic transition structure. A node is a cluster that indicates a topic. In Fig. 3, two of the feature words of the cluster were used as the label of the related node. The numerical values in a label represent the number of the checkpoint that the cluster belongs to and the number of text fragments that the cluster contains. The direction of the arrow of a link shows its chronological direction. We describe the details of the procedure for extracting the topic transition structure in the following.

A. Making Document Groups

For a document $d_i$, let $c(d_i)$ denote its creation time. Suppose a set $D$ of documents is given, and let $E(D)$ denote the creation time of the earliest document in $D$, i.e., $E(D) = \min_{d_i \in D} c(d_i)$. Similarly, let $L(D)$ denote the creation time of the latest document in $D$. Then we set a checkpoint at every time interval $T$ such that:

$$T = \frac{L(D) - E(D)}{N}.$$

Based on the time interval $T$, we make $N$ document groups $D_1, D_2, \ldots, D_N$, where:

$$D_n = \{\, d_i \in D \mid c(d_i) \le E(D) + nT \,\}.$$

Note that $D_1 \subseteq D_2 \subseteq \cdots \subseteq D_N = D$. We used $N = 50$.

B. Decomposing Documents into Text Fragments

Because a document often includes more than one topic, the system should extract them. It decomposes each document into smaller text fragments and groups these into clusters of fragments. We then regard a cluster of text fragments as a topic. Windowing is performed with a window size of $W$ and an overlap to decompose a document $d_i$ into text fragments $f_j$. There are studies on text segmentation methods [5], [14]. However, the method used for text segmentation is not important, because text fragments are organized into clusters in the next step, and the unit of information in the following discussion is the cluster. We therefore use windowing and cluster the resulting text fragments.

¹ http://touchgraph.sourceforge.net/

C. Clustering Text Fragments

At every checkpoint, clustering is applied to the corresponding document group with a predefined cluster number $K$. In brief, $D_n$ is decomposed into $K$ clusters $C_{n,k}$ ($k = 1, 2, \ldots, K$).

A cluster indicates a topic. If the upper bound on the number of clusters is fixed and a new cluster composed of new elements far from the existing clusters is added, the existing clusters must be reorganized to reduce their number. As a result, clusters that are close to each other are collected into one cluster. Thus, with the goal that similar items are organized into the same clusters as time passes, and that newly added topics form new clusters, we fix the number of clusters.

We used probabilistic latent semantic indexing (pLSI) [15] for clustering. pLSI is a method for reducing the dimensions of high-dimensional data such as documents. pLSI is based on the assumption that latent classes $z_k$ ($k = 1, 2, \ldots, K$) generate the relations between text fragments $s_i$ ($i = 1, 2, \ldots, I$) and words $w_j$ ($j = 1, 2, \ldots, J$). Clustering of $s_i$ is performed by treating each $z_k$ as a cluster, and the cluster $c_k$ for each $s_i$ is determined from the latent classes $z_k$. Every $D_n$ is divided into clusters by this method, and $C_{n,k}$ is defined as the cluster corresponding to $z_k$ of $D_n$.

In this phase, we used the morphological analysis system MeCab² to get the words $w_j$ from a text fragment $f_i$. The word class of each word was also obtained.

² http://mecab.sourceforge.net/

D. Reducing the Weight of Old Topics

A topic that appears many times in minutes would be regarded as important by designers. On the other hand, a topic with decreasing appearances over time would be considered no longer important in the design process. A topic that is not mentioned loses its importance as time passes. Therefore, to reduce its influence, we reduce the weight of a text fragment included in a cluster if the cluster has not been mentioned at a checkpoint. After making clusters from document group $D_n$, the text fragments included in a cluster $C_{n,i}$ to which no new text fragment has been added have their weight scaled by a factor $R$ ($R \le 1.0$) in the next document group $D_{n+1}$.
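As a minimal sketch of the grouping in Section IV-A, the following Python code builds cumulative document groups from creation times. The `(doc_id, created_at)` layout, the function name, and the membership rule $c(d_i) \le E(D) + nT$ are illustrative assumptions, not the authors' implementation.

```python
from datetime import datetime


def make_document_groups(docs, n_groups):
    """Return cumulative document groups D_1..D_N as lists of document ids.

    docs: list of (doc_id, created_at) pairs (hypothetical layout).
    """
    times = [created for _, created in docs]
    earliest, latest = min(times), max(times)
    interval = (latest - earliest) / n_groups  # T = (L(D) - E(D)) / N
    groups = []
    for n in range(1, n_groups + 1):
        # Pin the last checkpoint to L(D) so rounding cannot drop a document.
        checkpoint = latest if n == n_groups else earliest + n * interval
        # Assumed rule: D_n holds every document created up to checkpoint n.
        groups.append([doc_id for doc_id, created in docs if created <= checkpoint])
    return groups


docs = [
    ("m1", datetime(2000, 1, 10)),
    ("m2", datetime(2000, 1, 20)),
    ("m3", datetime(2000, 3, 1)),
    ("m4", datetime(2000, 4, 19)),
]
groups = make_document_groups(docs, n_groups=4)
```

Because each group is cumulative, every group is contained in the next one and the last group equals the whole set, which matches the nesting noted in Section IV-A.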


Fig. 3. Graph of topic transition structure

E. Calculating Similarity between Clusters

To measure the similarity between clusters, the function $sim(C_{n,i}, C_{m,j})$ is defined as follows:

$$sim(C_{n,i}, C_{m,j}) = \frac{|C_{n,i} \cap C_{m,j}|}{|C_{n,i}|} \qquad (1)$$

where $n$ and $m$ are checkpoint numbers such that $1 \le n < m \le N$ and $C_{n,i}$ denotes the $i$th cluster at checkpoint $n$. The following Jaccard similarity measure is usually used for the similarity of clusters:

$$Jaccard(C_{n,i}, C_{m,j}) = \frac{|C_{n,i} \cap C_{m,j}|}{|C_{n,i} \cup C_{m,j}|} \qquad (2)$$

However, when $C_{n,i} \subseteq C_{n+k,j}$ ($k \ge 1$), it means that the cluster corresponding to $C_{n,i}$ has been merged with other clusters to make a broader topic, $C_{n+k,j}$. Therefore, we use (1) as our similarity measure.

The similarities between clusters at adjoining checkpoints are calculated using expression (1), and clusters with similarities of 0.3 or more are linked. In this way, the topic transition structure of $D$ is extracted. Fig. 3 is an example visualization of an extracted topic transition structure.

V. EXTRACTING TASKS FOR AN OBJECT

The topic transition structure is extracted by an unsupervised method from many documents written by many writers from their own viewpoints. It therefore includes many design processes that overlap each other. The aim of the second system is to represent the series of tasks for an object included in the topic transition structure. The object is specified by a user of the system as a word, in contrast with the topic transitions, which are extracted automatically by a computer.

A topic in the topic transition structure is a cluster generated with pLSI, so the occurrence probabilities of words in each cluster have already been calculated. First, a user specifies an object as a combination of a word and a threshold of occurrence probability. The system chooses clusters in which the occurrence probability of the word is greater than the threshold, treating them as topics that mention tasks for the object. Second, it searches the text fragments in each cluster for words of particular noun classes that co-occur with the object and treats them as tasks. Then it generates a graph. The nodes of the graph are the selected clusters, and its links are the links in the topic transition structure. The label of each node consists of the words found as tasks. The system visualizes the graph along the passage of time using Graphviz³. Fig. 4 shows a graph visualizing a series of tasks for an object. In this way, tasks for an object are represented along the passage of time, and different kinds of tasks in the same period are visualized separately. We describe the details of the procedure for extracting tasks for an object from the topic transition structure in the following.

A. Selecting Clusters

The system selects clusters that mention an object. The object is specified by a user as a word $w_j$ and a threshold $Th_j$ that determines whether a cluster is relevant to it. The occurrence probability $p(w_j, C_{n,k})$ of a word $w_j$ in a cluster $C_{n,k}$ is

$$p(w_j, C_{n,k}) = p(C_{n,k})\, p(w_j \mid C_{n,k}),$$

where $p(C_{n,k})$ and $p(w_j \mid C_{n,k})$ have been calculated for clustering in IV-C. Then the system finds the group of clusters $DC$ that satisfies the following in each $D_n$, with $Th_j$ as the threshold:

$$DC = \{\, C_{n,k} \mid p(w_j, C_{n,k}) \ge Th_j,\ n = 1 \ldots N,\ k = 1 \ldots K \,\}$$

³ http://www.graphviz.org/

Fig. 4. Graph visualizing series of tasks for an object
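The selection rule of Section V-A can be sketched in a few lines of Python. The dictionary layout and the function name are hypothetical, and the joint probability is taken as $p(C_{n,k})\,p(w_j \mid C_{n,k})$, which is an assumption based on the two quantities the text says are available from clustering.

```python
def select_clusters(cluster_prior, word_given_cluster, word, threshold):
    """Return the ids of clusters whose joint probability with `word` reaches the threshold.

    cluster_prior: dict mapping (n, k) -> p(C_{n,k})
    word_given_cluster: dict mapping (n, k) -> {word: p(word | C_{n,k})}
    """
    selected = []
    for cluster_id, prior in cluster_prior.items():
        # Assumed form of the joint probability: p(w, C) = p(C) * p(w | C).
        joint = prior * word_given_cluster[cluster_id].get(word, 0.0)
        if joint >= threshold:
            selected.append(cluster_id)
    return selected


cluster_prior = {(1, 1): 0.5, (1, 2): 0.5, (2, 1): 0.25}
word_given_cluster = {
    (1, 1): {"radio": 0.5},
    (1, 2): {"antenna": 0.5},
    (2, 1): {"radio": 0.125},
}
selected = select_clusters(cluster_prior, word_given_cluster, "radio", threshold=0.1)
```

In this toy input only cluster `(1, 1)` passes: its joint probability is 0.25, while `(2, 1)` reaches only 0.03125 and `(1, 2)` never mentions the word.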

B. Finding Tasks

Next, the system tries to find tasks for the object. A task for an object is likely to appear in a document near the word that represents the object, so text fragments that include the word $w_j$ specified by the user are likely to include tasks for it. In this paper, we focus on documents written in Japanese. In Japanese, a task is often described by a noun that can be turned into a verb by appending "do"; this is the reverse of English, where "-ation" turns a verb into a noun. Since each word in the text fragments has been labeled with a word class by the morphological analyzer, as mentioned in IV-C, the system treats a word of this word class that co-occurs with $w_j$ in a text fragment as a task. The system then stores these words as tasks for each cluster in $DC$, together with the time stamps of the text fragments in which they appear.

C. Representing Tasks for an Object

Then the system represents the series of tasks for the object as a graph. The clusters $DC$ related to the object have been selected from the acquired topic transition structure, and the words that denote tasks in each cluster have been chosen. The system generates a graph from $DC$ and these words. It represents each cluster in $DC$ as a node. The label of a node consists of the words that have been chosen as tasks for the object in the cluster. Links between clusters are copied from the topic transition structure. The system visualizes the graph along the passage of time using Graphviz. So that concurrently performed tasks are easy to grasp, the system arranges clusters that belong to the same checkpoint vertically. As mentioned before, Fig. 4 is a part of such a graph.

VI. OBSERVATION

We applied our system to the design and operation of a super-small satellite called CubeSat. CubeSat is a worldwide educational satellite development program. Several teams of university students or labs have joined this program. They design, develop and operate their own satellites. We used our system for a team at the University of Tokyo⁴. In the design process for CubeSat, various kinds of documents such as meeting minutes and e-mails are generated. These documents contain important information such as design rationales, problem identification and solutions. However, designers must solve many problems in the design process, and it is hard for them to store documents in a well-organized manner.

We applied the proposed topic transition extraction method to the CubeSat project's 398 minutes from January 2000 to December 2002. All were written in Japanese. The average length of the minutes is 2,703 bytes. All minutes include the date and the content of discussions.

In this experiment, we used a window size of $W = 800$ characters to decompose the documents into text fragments. Fig. 5 shows the distribution of $(p(w, z_1), p(w, z_2))$ when $W = 400, 800, 1600$. When $W = 400$ or $W = 1600$, the occurrence probabilities of $w$ in $z_1$ and $z_2$ are almost the same. $p(w_j, z_k)$ indicates how different each $z_k$ can be, and thus whether clusters can hold different kinds of fragments. We therefore set the window size to $W = 800$ for the documents of the CubeSat project.

⁴ http://www.space.t.u-tokyo.ac.jp/cubesat/
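The windowing of Section IV-B can be sketched as follows, as an illustration only. The function name is hypothetical, and the overlap value is a free parameter here because the source does not state the one the authors used.

```python
def split_into_fragments(text, window=800, overlap=0):
    """Slide a fixed-size character window over `text` and return the fragments."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap  # distance between the starts of adjacent windows
    fragments = []
    for start in range(0, len(text), step):
        fragments.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reached the end of the document
    return fragments


fragments = split_into_fragments("abcdefghij", window=4, overlap=2)
```

With a ten-character toy string, a window of 4 and an overlap of 2, the call yields four fragments, each sharing two characters with its neighbor. A document shorter than the window becomes a single fragment.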

We examined the number of clusters $K$ at 50, 100 and 200 for each document group $D_n$. The number of clusters affects how well the system can represent concurrent tasks from documents, as in Fig. 4. If the same kind of task is represented as different nodes, a user can still understand that they are the same because their labels are similar. On the other hand, if different kinds of tasks are represented in one node, it is difficult to distinguish them. For the CubeSat documents, $K = 50$ seemed a sufficient number of clusters to distinguish concurrently performed tasks, but we used $K = 100$. The result seemed a little noisy when $K = 200$.
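For reference, the pLSI model of [15] is commonly written in the symmetric form below, using the symbols of Section IV-C. The first two lines are the standard mixture decomposition and the posterior obtained from Bayes' rule. The last line is a common hard-assignment rule that is assumed here for illustration, since the source does not preserve the authors' own assignment equation.

```latex
\begin{align}
P(s_i, w_j) &= \sum_{k=1}^{K} P(z_k)\, P(s_i \mid z_k)\, P(w_j \mid z_k), \\
P(z_k \mid s_i) &= \frac{P(z_k)\, P(s_i \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'})\, P(s_i \mid z_{k'})}, \\
c(s_i) &= \arg\max_{k}\; P(z_k \mid s_i) \quad \text{(assumed hard assignment)}.
\end{align}
```

Under this reading, $K$ fixes the number of latent classes and therefore the number of topics available at each checkpoint.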

We examined $R = 0.3, 0.5$ for $K = 50, 100, 200$, and the extraction seemed to fit when $(K, R) = (50, 0.3), (100, 0.5), (200, 0.5)$. The number of new documents at each checkpoint is independent of the number of clusters. A larger $K$ therefore reduces the proportion of clusters that receive new text fragments at each checkpoint, and a smaller $K$ increases it. Therefore a smaller $R$ for reducing the weight of old clusters is desirable for a smaller $K$, and vice versa.
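The role of $R$ can be made concrete with a small sketch of the weight reduction in Section IV-D. The data layout and the function name are hypothetical, and the sketch assumes that each fragment belongs to exactly one cluster and that the reduction is a multiplication by $R$.

```python
def decay_stale_fragments(weights, clusters, new_fragments, rate):
    """Return fragment weights for the next document group.

    weights: dict mapping fragment id -> current weight
    clusters: dict mapping cluster id -> set of fragment ids (one cluster per fragment)
    new_fragments: set of fragment ids added at the current checkpoint
    rate: the factor R, with 0 < R <= 1.0
    """
    updated = dict(weights)
    for members in clusters.values():
        # A cluster that received no new fragment is treated as an old topic.
        if members.isdisjoint(new_fragments):
            for fragment_id in members:
                updated[fragment_id] = weights[fragment_id] * rate
    return updated


weights = {"f1": 1.0, "f2": 1.0, "f3": 1.0}
clusters = {"c1": {"f1", "f2"}, "c2": {"f3"}}
updated = decay_stale_fragments(weights, clusters, new_fragments={"f3"}, rate=0.5)
```

Here cluster `c1` received nothing new, so its fragments drop to half weight, while `c2` keeps full weight. Applying the function at successive checkpoints makes an unmentioned topic fade geometrically.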

Fig. 3 shows a part of the extracted topic transition structure with $(K, R) = (100, 0.5)$. According to a member of the CubeSat project, the result is approximately correct.
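The links in such a structure follow from expression (1) and the 0.3 threshold of Section IV-E. The sketch below shows one way to compute them, with clusters represented as sets of fragment ids. The function names and the toy data are illustrative assumptions.

```python
def sim(older, newer):
    """Expression (1): share of the older cluster's fragments found in the newer one."""
    if not older:
        return 0.0
    return len(older & newer) / len(older)


def jaccard(a, b):
    """Expression (2), included only for comparison."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_clusters(older_clusters, newer_clusters, threshold=0.3):
    """Return (older_id, newer_id, similarity) for every pair at or above the threshold."""
    links = []
    for older_id, older in older_clusters.items():
        for newer_id, newer in newer_clusters.items():
            score = sim(older, newer)
            if score >= threshold:
                links.append((older_id, newer_id, score))
    return links


checkpoint_1 = {"a": {"f1", "f2"}, "b": {"f3", "f4", "f5", "f6"}}
checkpoint_2 = {"x": {"f1", "f2", "f3", "f7"}, "y": {"f4", "f5", "f6", "f8"}}
links = link_clusters(checkpoint_1, checkpoint_2)
```

Cluster `a` is fully absorbed into the broader cluster `x`, so expression (1) gives 1.0 where the Jaccard measure gives only 0.5. This is the merging case that motivates the choice of (1).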

Fig. 4 shows a part of the tasks for the object "DJ" extracted by the system. "DJ" was the name of a radio examined in an earlier phase of the CubeSat project. We could find tasks such as the communication sub-system using a DJ series radio, accounting for its purchase, examining the function of the transmitter, etc. We could grasp them more easily than from chronologically ordered text fragments because tasks were grouped into clusters and co-occurring tasks were represented in parallel.

Fig. 5. $W$ and $(p(w, z_1), p(w, z_2))$

VII. CONCLUSION

The structure of information is important to a knowledge reuse system. We focused on the design process and used the chronological order of design records to organize information. We proposed a system that extracts concurrent tasks in chronological order from design process records. The minutes of the CubeSat project, a satellite development project, were used as data.

We set checkpoints in the minutes according to chronological order and made clusters of the text fragments derived from the documents at these checkpoints. The weights of text fragments in clusters that were not referenced were reduced, so old topics were gradually forgotten. Similarities between clusters were calculated and the system visualized the topic transition graph. In this way, topics and their transitions were extracted.

Next, we represented the series of tasks for an object based on a keyword selected by a user. The system searches for clusters related to the keyword, finds the tasks described in these clusters, generates a graph of them and displays it as the tasks for the object.

In general, the structure of a problem means the structure used to obtain the final solution of the problem. However, in the process of creative work, especially in complex artifact design, solutions cannot be obtained at once. Various trial and error processes are required, in which many adjustments must be made between sub-problems. Therefore, during the problem-solving process, the problem structure of the final solution may not be as useful as a structure that corresponds to the current phase of the solution. Topic transition extraction and reorganization in chronological order identifies changes in problem structure over time and thus could make an effective contribution to problem-solving processes. On the other hand, when a user chooses reorganization keywords whose concepts do not change during the problem-solving process, such as power sources or antennas in our case, unchanging topics are collected and topic transition reorganization is not effective.

REFERENCES

[1] Y. Nomaguchi, Y. Shimomura, and T. Tomiyama, "A design knowledge management system based on a model of synthesis," Transactions of the Japanese Society for Artificial Intelligence, vol. 20, no. 1, pp. 11-24, 2005.
[2] M. Takeuchi, Y. Koji, Y. Kitamura, Y. Hayashi, M. Ikeda, and R. Mizoguchi, "Development of a system for capturing design rational with integration of dynamic knowledge management," Transactions of the Japanese Society for Artificial Intelligence, vol. 22, no. 3, pp. 263-275, 2007.
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[4] S. Sharma, Applied Multivariate Techniques. John Wiley & Sons, 1996.
[5] J. Allan, Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002.
[6] Y. Yang, T. Ault, T. Pierce, and C. W. Lattimer, "Improving text categorization methods for event tracking," in Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 65-72.
[7] M. Franz and J. S. McCarley, "Unsupervised and supervised clustering for topic tracking," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 310-317.
[8] M. Hasegawa and Y. Ishikawa, "T-scroll: A trend visualization system based on clustering of a time-series of documents," Information Processing Society of Japan, vol. 48(SIG_20(TOD_36)), pp. 61-78, 2007.
[9] Y. Ohsawa and Y. Nara, "Modeling the process of chance discovery by chance discovery on double helix," in Proc. of AAAI Fall Symposium on Chance Discovery, 2002, pp. 33-40.
[10] Y. Nara and Y. Ohsawa, "Exploring collaboration topics from documented foresight of experts," in Proc. of 8th International Conference, Knowledge-Based Intelligent Information and Engineering Systems, vol. 2, 2004, pp. 823-830.
[11] Y. Ohsawa, N. E. Benson, and M. Yachida, "Keygraph: Automatic indexing by co-occurrence graph based on building construction metaphor," in Proc. of Advanced Digital Library Conference, 1998, pp. 12-18.
[12] K. Hori, "Do knowledge assets really exist in the world and can we access such knowledge? -knowledge evolves through a cycle of knowledge liquidization and crystallization-," Lecture Notes in Computer Science, vol. 3359, pp. 1-13, 2005.
[13] M. Akaishi, "A dynamic decomposition/recomposition framework for documents based on narrative structure model," Transactions of the Japanese Society for Artificial Intelligence, vol. 21, pp. 428-438, 2006.
[14] M. Hearst, "Texttiling: Segmenting text into multi-paragraph subtopic passages," Computational Linguistics, vol. 23, pp. 33-64, 1997.
[15] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50-57.
