Assessing multiview framework (MF) comprehensibility and efficiency: A replicated experiment




Information and Software Technology 48 (2006) 313–322 www.elsevier.com/locate/infsof

Pasquale Ardimento, Maria Teresa Baldassarre, Danilo Caivano, Giuseppe Visaggio *

Dipartimento di Informatica, Università di Bari, Via Orabona 4, 70126 Bari, Italy

Received 26 August 2005; accepted 27 September 2005; available online 20 December 2005. doi:10.1016/j.infsof.2005.09.010

* Corresponding author. Tel.: +39 80 5442300; fax: +39 80 5442536. E-mail addresses: [email protected] (P. Ardimento), [email protected] (M.T. Baldassarre), [email protected] (D. Caivano), [email protected] (G. Visaggio).

Abstract

Goal oriented quality models have become an important means for assessing and improving software quality. In previous papers, the authors have proposed an approach called the multiview framework for guiding quality managers in designing and managing a goal oriented quality model. This approach has been validated through a controlled experiment carried out with university students. In this paper, the authors discuss a replication of the controlled experiment, carried out with 28 university graduates attending a master degree course in an Italian university. Although the research hypotheses are the same, the context differs: in the replication, the experimental subjects were more representative of practitioners, because their master degree course required project work with industrial partners. Using a cross-over experimental design, we found that subjects using the multiview framework made significantly fewer errors (p<0.05, effect size = 1.08) and took significantly less time (p<0.05, effect size = 1.82) to review the status of a project than when they used a standard GQM approach. This result is consistent with the results of our original experiment. © 2006 Elsevier B.V. All rights reserved.

Keywords: Empirical studies; Replications; Software process improvement; Goal question metrics

1. Introduction

The importance of goal oriented quality models as a means for assessing and improving software quality is well known. They are particularly advantageous because they take into account business targets, project needs and context characterization. Among the various models, the goal question metrics (GQM) approach [5] is the most commonly used and has been successfully adopted in many industrial contexts [4,6,7,13,17]. Nevertheless, in the authors' experience [1,2] and in reports of the use of GQM in industrial contexts [14,18,21-23], there are a number of limitations to this approach. In particular, quality models in an industrial setting tend to be very large and include numerous goals and metrics; this makes interpretation very complex, because many metric values must be controlled and analyzed for each goal, and it also leads to situations where there are many dependencies among goals. This complexity makes it difficult to identify appropriate improvement actions.


To address these issues, the authors have proposed the multiview framework (MF) [1]. The framework is based on GQM and guides quality managers, through a set of well formalized steps, in defining, tailoring, and managing a large goal-oriented quality model. We have performed two previous evaluations of the MF. The first evaluation was a post mortem analysis [25]: we constructed a quality model for a completed industrial project using the MF approach and compared it with the quality model prepared by the project staff using the conventional GQM methodology. We confirmed that the structural characteristics of the MF-based quality model were better than those of the original GQM-based quality model [1]. The second evaluation was a formal experiment using undergraduate subjects, who used either the quality model defined through the multiview framework approach (S-GQM) or the existing GQM-based quality model (NS-GQM), together with the actual project data available at two points in the project lifecycle, to assess whether planned project goals had been achieved. We measured the time (effort) it took to determine whether goals had been met and the number of errors made in assessing the status of goals. Students using the MF-based quality model took less time and made fewer errors than students using the original GQM-based quality model. Details are reported in [2,3].



This paper is a further validation of the approach. The two research hypotheses of the original study, also investigated in this replication, aimed at answering the following research questions:

(1) Does the multiview framework (MF) lead to a quality model that requires less effort for interpretation than one defined in the conventional GQM based format?
(2) Does the multiview framework (MF) lead to a quality model that is less error prone than one defined in the conventional GQM based format?

This paper is a strict replication of our previous formal experiment, using the same materials, the same experimental setting, the same experimental design, and the same experimenters. The main difference is that the student subjects in this experiment were graduates taking an MSc degree who, as part of their MSc course, had undertaken project work with industrial partners. Thus, the subjects in this experiment were more representative of practitioners (at least novice practitioners) than the original ones. The rest of the paper is organized as follows: the next section discusses the motivation that led to the validation and replication of the controlled experiment; Section 3 presents the experiment in detail; Sections 4 and 5 discuss the results and compare them with those of the first experiment; we present our conclusions in Section 6.

2. Discussion points

Quality is defined as the set of desired characteristics that a software process or product must have in order to satisfy its requirements. How it is measured inevitably depends on the context and on the viewpoint from which measurement is being carried out. Software engineers use goal oriented quality models, such as goal question metrics (GQM) [5], for measuring software quality because they adapt to business and project characteristics better than other methods and can be combined and integrated with process and organization maturity models such as CMM and ISO [26]. Although these characteristics of the GQM approach have encouraged its use in industrial contexts [4,7,13,16,23], the approach has some limitations. For example, there is often a large gap between what is discussed in the literature [9] and what is verified in practice. More precisely, the literature proposes a threshold of a 4-5 goal quality model; it suggests that goals should not address more than one purpose, quality focus or viewpoint, and that the measurement plan should be gradually extended. Unfortunately, in real projects the dimensions of the quality model are not decided only by the quality manager; they also depend on project characteristics and business needs that are determinant in defining goals and therefore measurements. Also, a large quality model makes interpretation more complex, introduces dependencies among goals, and requires more effort to manage the entire measurement plan, which may include measurements related to process, product, project management and cost/benefit aspects as well as quality [1].

GQM, as reported in the literature [5], is not enough to manage all these issues. Furthermore, although the literature reports various modifications and extensions to the first definition of GQM [9-11,17,20-22], to the authors' knowledge few works provide empirical evidence to validate the proposed improvements [10,15,19]. In most cases validation consists of applying the findings in industrial applications and describing the results obtained. The multiview framework (MF) approach proposed by the authors [1] addresses the weaknesses of the conventional GQM methodology, such as the dimensions, complexity and dependencies between goals of a measurement plan. The MF approach was validated theoretically [1] and empirically [2,3]. In this paper, we provide further evidence of the value of the MF approach by replicating our empirical evaluation. In particular, we used a 2×2 cross-over design [30] to assess the differences between the effects of the two treatments, S-GQM (the MF approach) and NS-GQM (the standard GQM method), in terms of efficiency and comprehensibility.

3. Empirical study

The replication described in this paper aims to validate the results obtained in the first experiment [2]. As pointed out in [31], 'an exact replication cannot exist', so it is more correct to refer to partial replications that attempt to minimize the variation between the original experiment and the replication. Furthermore, it is important to examine the same hypotheses or extended/specialized versions of them. If the two investigations come to the same conclusions, they can be considered supportive and mutually validating. This replication can be classified as what is known in the literature as a 'strict replication' [8], in that it does not vary any of the research hypotheses and it reuses the instrumentation of the original experiment. This type of replication is important for increasing confidence in the validity [12] of experimental results, in that it confirms that the results from the original experiment are repeatable and have been appropriately documented by the original experimenters. The replication is described following [24].

3.1. Experiment definition

The replication aimed at validating that the results of the original experiment are repeatable, i.e. whether a quality model resulting from the application of MF is more efficient and comprehensible than a quality model obtained by using the conventional GQM paradigm. To this end, the two different quality models (S-GQM and NS-GQM) mentioned in Section 2, also used for the post mortem validation [1], were used in this experiment. Since this is a replication with the intent of validating previously obtained results, the research goal has been defined as follows:


RG: Replicate a previous experiment [2]
for the purpose of assessing the repeatability of the experiment
with respect to comparing MF-based quality models with GQM-based quality models
from the point of view of experimental rigour
in the context of an internal replication varying only the subject group.

3.2. Experiment planning

3.2.1. Hypotheses formulation

To ensure the experiment is a proper replication of the original experiment, the same hypotheses have been adopted:

H0(EFF): there is no difference between the effort required to interpret S-GQM plans and the effort required to interpret NS-GQM plans.
H1(EFF): there is a difference between the effort required to interpret S-GQM plans and the effort required to interpret NS-GQM plans.
H0(COMP): there is no difference in the error proneness of interpretation between S-GQM plans and NS-GQM plans.
H1(COMP): there is a difference in the error proneness of interpretation between S-GQM plans and NS-GQM plans.

3.2.2. Variables selection

The dependent variables were: effort, i.e. the work time in man hours spent by each subject to interpret all the goals of the quality model (S-GQM or NS-GQM) according to the dataset of measurement values provided; and error proneness, i.e. the number of incorrect conclusions made in interpreting the goals of the quality model (S-GQM or NS-GQM), measured on an ordinal scale. The independent variables were those that could be controlled and changed during the execution and that had some effect on the dependent variables. They represent the treatments: NS-GQM, i.e. the quality model representation that had been produced and used during the industrial project according to the GQM paradigm, as reported in the literature; and S-GQM, i.e. the same quality model represented after applying MF.

3.2.3. Selection of subjects

The experimental subjects were graduates attending a 1 year master course on software technologies, held at the University of Sannio in Benevento, Italy. They all had a graduate degree in an engineering discipline. Within the master course program, lessons on software quality were scheduled. The authors of this paper taught the lessons and planned the replication of the experiment. Students received the same training as the subjects of the original study and were trained as professionals in defining goal oriented quality models and in analyzing and interpreting measurement results. Also, the course program required each student to undertake project work with industrial partners. The graduates were therefore involved in real projects and worked with practitioners. In this sense, they can be considered a representative sample of practitioners (or at least novice practitioners).


Table 1. Experimental design
Group / Run    RUN1            RUN2
GROUP A        S-GQM / MT1     NS-GQM / MT2
GROUP B        NS-GQM / MT1    S-GQM / MT2

The sample was made up of a total of 28 subjects, all students of the course, who were randomly assigned to two equal groups of 14 persons, identified as 'Group A' and 'Group B'.

3.2.4. Experiment design

The experiment design was one factor (GQM quality model) with two treatments (NS-GQM and S-GQM); more details on the quality models are given in [1,2]. We used a two-treatment, two-period (2×2) cross-over design [30]. More precisely, the experiment was organized in two experimental periods (RUN_1 and RUN_2), and subjects were randomly assigned to one of two groups (Group_A and Group_B). Each group received the treatments in a different sequence: S-GQM followed by NS-GQM for Group_A (Sequence 1), and NS-GQM followed by S-GQM for Group_B (Sequence 2). In this way each subject is measured twice, once with each treatment, i.e. each subject acts as his/her own control. This type of design was the most appropriate given the small sample size (14 subjects for each group). A graphical representation of the experimental design is given in Table 1, and a sketch of the assignment scheme is given after the instrumentation list below.

The experiment was designed to emulate monitoring a software project at two points in the project lifecycle. We obtained all the measurement data at two distinct stages in the project lifecycle (time points T1 and T2); T1 and T2 correspond to two data collection points that took place in the original industrial project. During the first phase of the experiment, subjects used one of the two versions of the quality plan (S-GQM or NS-GQM) to assess the status of the project (with respect to goal achievement) given the data available at time T1 (MT1). During the second phase, each subject used the alternative quality plan to assess the status of the project given the measurement data available at time T2 (MT2).

3.2.5. Instrumentation

The instrumentation used for carrying out the experiment consisted of objects, guidelines and measurement instruments. The instrumentation provided to the subjects during each phase of the experiment was:

• A data form with the subject's name, group, and start and end times.
• One version of the quality plan.
• A measurement plan [27], i.e. the list of metric values available at T1 (in RUN_1) or T2 (in RUN_2).
• A set of decision tables that help to determine whether a quantifiable goal has been achieved.
• An interpretation form to allow the subject to report his/her evaluation of the goals.
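As anticipated in Section 3.2.4, the following is a minimal sketch (in Python) of the kind of random assignment used to form the two sequence groups of the cross-over design; the subject identifiers and the fixed seed are arbitrary placeholders, not the authors' actual procedure.

    import random

    # Sketch only: 28 subjects are split at random into two equal groups,
    # and each group receives the two treatments in a different order.
    subjects = [f"S{i:02d}" for i in range(1, 29)]   # hypothetical identifiers
    random.seed(42)                                  # fixed only for a reproducible illustration
    random.shuffle(subjects)

    group_a, group_b = subjects[:14], subjects[14:]
    schedule = {
        "RUN_1 (MT1)": {"Group_A": "S-GQM", "Group_B": "NS-GQM"},
        "RUN_2 (MT2)": {"Group_A": "NS-GQM", "Group_B": "S-GQM"},
    }
    print("Group_A:", group_a)
    print("Group_B:", group_b)
    for run, treatments in schedule.items():
        print(run, treatments)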



[Fig. 1. Extract of collected measures at time point T1 (MT1): a three-column table (Metric, Description, Measurement) listing metrics such as NAttCritTe (number of critical activities given in outsourcing), NAttSched (number of scheduled activities), NAttStrum, NCadNoDati, NCaduteDif, NCamb, NCambDif, NCambParOK, NCdTCamb, NCdTEx, NCdTPrev and NDifDopCamb, together with the measurement value collected at T1 where available.]

As reported previously, the two versions of the quality plans were constructed during the initial evaluation of the MF method [1]. The metrics collected at times T1 and T2 were specified in two lists, one for each time period. Fig. 1 shows an extract of the list for T1; the metrics (named in column 1) that do not have a value in column 3 were those that were not available at time T1. A similar list was constructed for time T2.

The decision tables [28,29] are used to support goal interpretation. Each goal in the quality model has an associated decision table. Fig. 2 shows an extract of a decision table. It is a tabular representation that can be indexed by metric values to determine whether or not a goal has been achieved and to recommend remedial actions if a goal has not been achieved. A decision table is composed of four parts:

A. A condition stub used for interpreting the goal.
B. Condition entries that correspond to metric baseline values.
C. An action stub which defines the set of improvement actions that must be applied if the collected measures do not fulfil the fixed baseline values.
D. Action entries which are the result of the combination of condition entries and possible actions.

The guidelines for the experiment were provided in the data form. In particular, subjects were directed to consider the measures in the relevant measurement list, use the decision tables related to the specific quality model, evaluate the status of the goals and report their interpretation on the interpretation form.
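Returning to the decision tables described above, the following minimal sketch (in Python) shows how such a table could be encoded and indexed by its condition entries; the metrics, baseline values and improvement actions are invented for illustration and do not reproduce the tables used in the experiment.

    # Condition stub: the questions asked about the collected measures.
    # Condition entries: the True/False combinations against the baselines.
    # Action stub / entries: the conclusion or improvement action per combination.
    DECISION_TABLE = {
        # (defect_density <= baseline, test_coverage >= baseline): outcome
        (True, True): "Goal achieved - no action required",
        (True, False): "Goal partially achieved - extend the test suite",
        (False, True): "Goal partially achieved - inspect defect-prone modules",
        (False, False): "Goal not achieved - apply both improvement actions",
    }

    def interpret_goal(defect_density, test_coverage,
                       dd_baseline=0.5, cov_baseline=0.8):
        """Index the decision table with the condition entries for one goal."""
        conditions = (defect_density <= dd_baseline, test_coverage >= cov_baseline)
        return DECISION_TABLE[conditions]

    print(interpret_goal(defect_density=0.7, test_coverage=0.9))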

3.3. Experiment operation

Before executing the experiment, all the instrumentation was prepared, as reported in Section 3.2. Since they were all students on an IT MSc course, the subjects were trained in the same manner and therefore had similar experience and knowledge. Training consisted of explaining the conventional GQM approach and the multiview framework; students were also trained to understand and use decision tables for interpretation. Before the actual experiment, we carried out a trial simulation: we reproduced the experimental environment and asked the subjects to perform a project monitoring task using a sample quality model and instrumentation similar to the ones used in the experiment. During the simulation, the students completed a questionnaire. This allowed us to get feedback from the students and clarify any misunderstandings before the experiment. The actual experiment was organized in two experimental runs carried out on two days. For each run, the assignment was to analyze the measures of a selected subset of metrics (MT1 or MT2) related to one of the treatments (S-GQM or NS-GQM), use the decision tables, and interpret each goal.

[Fig. 2. Extract of a decision table for goal interpretation.]


The results were reported on the interpretation form. Since each measurement activity during real project monitoring involves only a subset of processes or products, not all goals of the quality model could be interpreted; in such cases, the goals were marked as 'not interpretable' on the interpretation form. Effort was recorded on the data sheet by the experimenters. In particular, we recorded the start time when the instrumentation was distributed and the experimentation started, and the end time when the students handed in the interpretation form.

4. Data analysis

In accordance with the experimental design, a cross-over analysis was carried out after investigating the distribution of the collected data. The analysis refers to both sequences across both experimental runs, as represented in Fig. 3. We analyzed the differences between the effort and error proneness obtained from each group when using S-GQM and NS-GQM. For the hypothesis testing, the alpha value was fixed at 5%. The dependent variables, aimed at assessing the effort and error proneness of S-GQM compared to NS-GQM as stated in the research hypotheses, were tested to investigate whether the differences in their values were statistically significant. The results of the data analysis are discussed in the following paragraphs.

Before presenting the details of the data analysis, we present some general information about the input data of the experiment: the quality models and the datasets of measurement values. Bear in mind that the quality models are two different representations obtained by adopting the conventional GQM approach (NS-GQM) and the MF (S-GQM); for this reason the total number of metrics is the same. Table 2 summarizes the information related to both quality models used in the experiment. As can be seen, the number of metrics with associated measurement values is higher at the second time point (MT2). This is because at that point of the process more metrics could be collected. Thus, the experiment instrumentation and execution are similar to what a quality team does during measurement interpretation and goal evaluation in real industrial contexts. Table 3 reports the number of metrics with measurement values for both MT1 and MT2, and the number of goals interpretable for S-GQM and NS-GQM according to the datasets of measurement values.

[Fig. 3. Data analysis schema: Sequence 1 applies S-GQM in RUN 1 and NS-GQM in RUN 2; Sequence 2 applies NS-GQM in RUN 1 and S-GQM in RUN 2.]


Table 2. Summary of quality models
                                                     NS-GQM    S-GQM
Total nr. of goals                                   8         11
Total nr. of metrics                                 168       168
Average nr. of metrics per goal                      32.75     20.18
Nr. of observed metrics (reported in MT1 and MT2)    80        80

Table 3. Dataset of measurement values and interpretable goals according to MT1 and MT2
                                          RUN1 (MT1)           RUN2 (MT2)
                                          NS-GQM    S-GQM      NS-GQM    S-GQM
Nr. of metrics with measurement values    56        56         62        62
Nr. of metrics with no value associated   24        24         18        18
Nr. of goals interpretable                7         7          7         8
Nr. of goals non-interpretable            1         4          1         3

4.1. Normality tests

In Ref. [30] the author suggests two methods for analyzing cross-over designs: if the interaction effect due to the order in which the subjects use the treatments is zero or negligible, a standard cross-over analysis can be used. The t-test is the best test if the differences are approximately normal; otherwise a permutation test can be used. If the interaction term is not negligible, Senn [30] argues that the only appropriate analysis is a between-subject analysis of the data in the first phase of the experiment. So, before carrying out the analysis we performed a normality test on the collected data. Since the sample is small (14 subjects for each group, for a total of 28 data points), we carried out a Shapiro-Wilk test [32]. This test for normality, developed by Shapiro and Wilk (1965), has been found to be the most powerful test in most situations. It is the ratio of two estimates of the variance of a normal distribution based on a random sample of n observations. The test statistic W is roughly a measure of the straightness of the normal quantile-quantile plot; hence, the closer W is to one, the more normal the sample is. The test hypotheses are formulated as follows:

H0: the data are normally distributed
H1: the data are not normally distributed

The normality test has been carried out for both dependent variables, effort and error proneness. The results are reported in the next two paragraphs.
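To make the test procedure concrete, the following is a minimal sketch (in Python with SciPy) of how the Shapiro-Wilk test can be applied to the within-subject differences of one sequence group; the fourteen difference values are invented placeholders, not the experimental data.

    from scipy import stats

    # Hypothetical within-subject differences (RUN1 - RUN2) for one sequence
    # group; in the experiment there were 14 such values per sequence.
    differences = [-320.0, -150.0, -410.0, -90.0, -275.0, -60.0, -500.0,
                   -210.0, -130.0, -380.0, -45.0, -290.0, -175.0, -240.0]

    w_stat, p_value = stats.shapiro(differences)
    print(f"W = {w_stat:.4f}, p = {p_value:.4f}")

    # As in Tables 4 and 5: if p > 0.05 we cannot reject H0, i.e. normality
    # is not rejected and the t-test based cross-over analysis applies.
    if p_value > 0.05:
        print("Cannot reject normality")
    else:
        print("Normality rejected; consider a permutation test instead")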

4.1.1. Effort normality test

Figs. 4 and 5 plot the differences (RUN1 - RUN2) on the vertical axis against the values, on the horizontal axis, that would be expected if the differences were normally distributed. In particular, the first figure shows the differences for Sequence 1 and the second the differences for Sequence 2. The results of the Shapiro-Wilk test are reported in Table 4. According to the test, the data are normally distributed in both sequences.

[Fig. 4. Normality plot for effort differences in Seq1.]
[Fig. 5. Normality plot for effort differences in Seq2.]

Table 4. Shapiro-Wilk test for effort
Effort        W test value    Prob level    Decision (5%)
Sequence 1    0.9042          0.13          Cannot reject normality
Sequence 2    0.9728          0.9122        Cannot reject normality

4.1.2. Error proneness normality test

Analogously to the previous paragraph, the results of the normality tests are reported for the differences in error proneness collected in both experimental periods. In particular, Figs. 6 and 7 graphically represent the data distributions for each sequence. The test results are reported in Table 5. It can be seen that the assumption of normally distributed data can be accepted in this case as well.

[Fig. 6. Normality plot for error proneness differences in Seq1.]
[Fig. 7. Normality plot for error proneness differences in Seq2.]

Table 5. Shapiro-Wilk test for error proneness
Error proneness    W test value    Prob level    Decision (5%)
Sequence 1         0.9476          0.5249        Cannot reject normality
Sequence 2         0.9312          0.3176        Cannot reject normality

4.2. Cross-over analysis

After providing evidence of the data distribution, the cross-over analysis was carried out. Ref. [33] summarizes the cross-over design as in Table 6, where:

• μj is the average effect for subject j
• μk is the average effect for subject k
• α is the effect of treatment A
• β is the effect of treatment B
• τ is the difference between treatment A and treatment B (i.e. τ = α - β)
• π is the period effect
• λA is the period-by-treatment effect due to sequence A/B
• λB is the period-by-treatment effect due to sequence B/A

Table 6. Expected response for the design (sequence-by-period means)
Sequence    Period 1    Period 2
j (A/B)     μj + α      μj + β + π + λA
k (B/A)     μk + β      μk + α + π + λB

Table 7 shows the statistics that are used to estimate the various model parameters. The treatment effect τ is estimated by adding the cross-over differences (thus removing the period effect) and averaging the result, giving τ - (λA - λB)/2. Thus, if the carry-over effect (λA - λB)/2 is zero or negligible, adding the cross-over differences and taking the average gives an unbiased estimate of the treatment effect (assuming the same number of subjects in each sequence group).

Table 7. Statistics used to estimate effects
Subject    Cross-over difference (A - B)    Period difference (P1 - P2)     Subject total
j          α - β - π - λA = τ - π - λA      α - β - π - λA = τ - π - λA     2μj + α + β + π + λA
k          α - β + π + λB = τ + π + λB      β - α - π - λB = -τ - π - λB    2μk + α + β + π + λB

We agree with [33] that cross-over analysis is more difficult to apply in software engineering than in clinical trials, in that in most cases 'learning' inevitably occurs because software engineering is a human-intensive activity. Moreover, a 'wash-out' period will most likely not be able to 'wash out' the acquired knowledge or skills of the subjects. So, the causes of carryover effects must be examined in detail. The next two paragraphs present the results of the analyses carried out on the dependent variables. In particular, for each variable we provide: a plot of the raw data, a summary of the cross-over analysis, a plot of sequence-by-period means, and a sum-difference plot.
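The paper does not give the computation in code form; the following is a minimal sketch (in Python with NumPy and SciPy, with invented placeholder data and treatment A taken to be S-GQM, as in Sequence 1) of one standard way of obtaining the treatment, period and carryover t-tests of a 2×2 cross-over design from per-subject values.

    import numpy as np
    from scipy import stats

    # Hypothetical per-subject effort values in seconds (placeholders, not
    # the data collected in the experiment).
    # Sequence 1: S-GQM in period 1, NS-GQM in period 2.
    # Sequence 2: NS-GQM in period 1, S-GQM in period 2.
    seq1_p1 = np.array([2600., 2800., 2550., 3000., 2700., 2900.])  # S-GQM
    seq1_p2 = np.array([3800., 4100., 3700., 4300., 3900., 4000.])  # NS-GQM
    seq2_p1 = np.array([3900., 4200., 3800., 4400., 4000., 4100.])  # NS-GQM
    seq2_p2 = np.array([2700., 2900., 2650., 3100., 2800., 3000.])  # S-GQM

    # Carryover (lambda_A - lambda_B): two-sample t-test on subject totals.
    carry = stats.ttest_ind(seq1_p1 + seq1_p2, seq2_p1 + seq2_p2)

    # Treatment (tau): two-sample t-test on the period differences (P1 - P2);
    # half the difference of the group means equals the estimate obtained by
    # averaging the cross-over differences, i.e. tau - (lambda_A - lambda_B)/2.
    d1, d2 = seq1_p1 - seq1_p2, seq2_p1 - seq2_p2
    treat = stats.ttest_ind(d1, d2)
    tau_hat = (d1.mean() - d2.mean()) / 2.0

    # Period (pi): two-sample t-test comparing the cross-over differences
    # (A - B) of the two sequences; their means differ only through the
    # period (and carryover) effects.
    period = stats.ttest_ind(seq1_p1 - seq1_p2, seq2_p2 - seq2_p1)

    print(f"treatment: t = {treat.statistic:.3f}, p = {treat.pvalue:.4f}, "
          f"tau_hat = {tau_hat:.1f} s")
    print(f"period:    t = {period.statistic:.3f}, p = {period.pvalue:.4f}")
    print(f"carryover: t = {carry.statistic:.3f}, p = {carry.pvalue:.4f}")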

4.2.1. Results for effort

The results of the cross-over analysis are summarized in Table 8. The two treatment means in a 2×2 cross-over study are significantly different at the 0.05 significance level (the actual significance level was lower than 0.0001). The design had 14 subjects in Sequence 1 (S-GQM/NS-GQM) and 14 subjects in Sequence 2 (NS-GQM/S-GQM). The average response to treatment S-GQM was 2775 s (i.e. 46 min) and the average response to treatment NS-GQM was 3942.85 s (i.e. 66 min). A preliminary test rejected the assumption of equal period effects at the 0.05 significance level (the actual significance level was 0.0236). Also, a preliminary test rejected the assumption of equal carryover effects at the 0.05 significance level (the actual significance level was 0.0172).

Table 8. Cross-over analysis summary for effort
Parameter    T value (DF=26)    Prob level    Lower 95.0% confidence limit    Upper 95.0% confidence limit
Treatment    10.4848            0.0000        938.9004                        1396.8139
Period       -2.4048            0.0236        -496.8139                       -38.9004
Carryover    2.5459             0.0172        94.9306                         890.7837

The period effect may have been due to the larger number of measurement values used for interpretation in RUN2 (when MT2 was used) than in RUN1 (when MT1 was used), as shown in Table 3, so that more metrics had to be interpreted. In addition, in the NS-GQM interpretation all the available metrics needed to be examined for each goal, leading to proportionally more effort. On the other hand, the 'structured' organization of S-GQM helps to localize interpretation and therefore requires less effort, i.e. in S-GQM the number of available metrics does not influence interpretation complexity. We believe that the carryover effect identified in Table 8 may be attributed to the fact that S-GQM is independent of the acquired skills, whereas this does not hold for NS-GQM: the subjects that used S-GQM first seemed to require less time for interpreting goals in RUN2 when using NS-GQM.

These results are depicted in Fig. 8, which shows the sequence-by-period means plot. Due to tool restrictions, we could only adopt a one-character notation, so 'S' refers to the S-GQM treatment and 'N' to the NS-GQM treatment. The plot shows the mean responses on the vertical axis and the periods on the horizontal axis; the lines connect like treatments. As can be seen, on average more effort is spent carrying out the interpretation with NS-GQM than with S-GQM, in both runs. The S-GQM structure may be more comprehensible because goals are only weakly dependent and have a low complexity of interpretation; in other words, the overall documentation used for goal interpretation is easier to use and understand.

[Fig. 8. Sequence-by-period means plot for effort (mean effort in seconds per period, by treatment, for RUN1 and RUN2).]

Finally, Fig. 9 plots the sum of each subject's two responses on the horizontal axis and the difference between each subject's two responses on the vertical axis. Dot plots of the sums and differences have been added above and to the right of the scatter plot, respectively. Each point represents the sum and difference of a single subject, and different plotting symbols are used to denote the subject's sequence. A horizontal line has been added at zero to provide an easy reference from which to determine whether a difference is positive (favors treatment NS-GQM) or negative (favors treatment S-GQM). As appears from the figure, the general tendency of the data points is in favor of S-GQM. This supports the results shown in Table 8.

[Fig. 9. Sum-difference plot for effort: difference (S - N) versus sum (S + N) for each subject, by sequence.]

4.2.2. Results for error proneness

The same analyses were carried out for the second dependent variable, the number of errors. Table 9 summarizes the results of the cross-over analysis. As can be seen, the two treatment means in a 2×2 cross-over study are significantly different at the 0.05 significance level (the actual significance level was 0.005). The average number of errors with treatment S-GQM was 1.32 and the average with treatment NS-GQM was 2.46. A preliminary test failed to reject the assumption of equal period effects at the 0.05 significance level (the actual significance level was 0.7042). A preliminary test also failed to reject the assumption of equal carryover effects at the 0.05 significance level (the actual significance level was 0.4601). So, no period or carryover effect arose.

Table 9. Cross-over analysis summary for error proneness
Parameter    T value (DF=26)    Prob level    Lower 95.0% confidence limit    Upper 95.0% confidence limit
Treatment    3.0705             0.005         0.3778                          1.9079
Period       0.3838             0.7042        -0.6222                         0.9079
Carryover    0.7499             0.4601        -1.2437                         2.6723

A graphical representation of the sequence-by-period means is given in Fig. 10. Due to tool restrictions, we could only adopt a one-character notation, so 'S' refers to the S-GQM treatment and 'N' to the NS-GQM treatment. As can be seen, throughout both periods more errors are made interpreting the NS-GQM plan than the S-GQM plan. Since the interpretation of NS-GQM is more complex, and the goals are highly dependent, it is more difficult to avoid errors when analyzing the measures and identifying the improvement actions to carry out. It also appears from the figure that there is a slight decrease in NS-GQM error rates in period 2 and a slight increase in error rates for S-GQM. The latter was attributed to the larger number of available measurement values in RUN2 than in RUN1; as a consequence, a higher number of goals were interpretable. The less evident error proneness for NS-GQM may have occurred because the subjects had previously acquired skills in goal interpretation with the structured quality model; this evidently influenced their ability to use the more complex and non-structured approach in the second run. This does not account for the opposite effect observed among subjects in Sequence 2.

[Fig. 10. Sequence-by-period means plot for error proneness.]

Fig. 11 represents the sum-difference plot of each subject's responses. It supports the treatment results summarized in Table 9: as appears from the figure, the general tendency of the data points, for both sequences, is in favor of S-GQM.

[Fig. 11. Sum-difference plot for error proneness: difference (S - N) versus sum (S + N) for each subject, by sequence.]

5. Discussion

This is an internal replication (carried out by the same researchers) of a controlled experiment. The research goal is to assess the validity of the proposed approach (S-GQM) in order to build a body of knowledge around the research hypotheses. Many disciplines systematically adopt meta-analysis as a means for comparing studies and combining their results, although it is not yet a standard technique in the software engineering context [31,34]. We agree that this is an important technique for comparing different studies and have therefore carried out a meta-analysis in order to determine whether the two studies (the original experiment [2] and this replication) produce significantly different results. In particular, effect size estimates have been calculated; further references on effect sizes and their estimation can be found in [35]. There are several definitions and models for estimating effect size. In our case Cohen's d is used [36], calculated as the difference between the treatment means divided by the standard deviation. It has been calculated for both dependent variables in both studies. The results are summarized in Table 10.

Table 10. Effect size estimates in each experiment
Effect size estimate    Original experiment [2]    Replication
Effort                  1.60                       1.82
Errors                  1.04                       1.08

Cohen suggested that effect sizes can be classified into three groups: small (approximately 0.2), medium (approximately 0.4) and large (approximately 0.8). This classification was intended to assist power analysis. The results shown in Table 10 point out the consistency between the two experiments and confirm a large effect size. Given an alpha value of 0.05, the power of both experiments was better than 0.8. So, assuming that a cross-over experiment is used, 24 subjects can be considered sufficient in any future replication to detect the expected effect and achieve adequate power.
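As an illustration of the effect size computation just described, the following minimal sketch (in Python with NumPy) computes Cohen's d from two sets of per-subject error counts; the values are invented placeholders, and the pooled standard deviation is used here as one common choice of denominator.

    import numpy as np

    # Hypothetical per-subject error counts (placeholders, not the collected
    # data): one value per subject under each treatment.
    errors_sgqm = np.array([1, 2, 1, 0, 2, 1, 1, 2, 1, 0, 2, 1, 1, 2], dtype=float)
    errors_nsgqm = np.array([3, 2, 3, 2, 2, 3, 2, 3, 2, 1, 3, 2, 3, 2], dtype=float)

    def cohens_d(a, b):
        """Difference between treatment means divided by the pooled standard deviation."""
        na, nb = len(a), len(b)
        pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
        return (b.mean() - a.mean()) / np.sqrt(pooled_var)

    d = cohens_d(errors_sgqm, errors_nsgqm)
    print(f"Cohen's d = {d:.2f}")  # values around 0.8 or above count as 'large'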


6. Conclusions

This paper has presented an internal replication of the controlled experiment carried out by the authors in [2], and represents a further validation of the multiview framework presented in [1]. The replication was carried out using graduate university students attending a 1 year master course, whereas the original experiment used undergraduate students. This represents an important step for supporting the conclusion validity [12] of the study and a first generalization of results, since in this case our experimental subjects were closer representatives of practitioners. The study examined the same research hypotheses in order to verify the validity of the results of the first experiment. The cross-over analysis carried out on the collected data identified significant differences in subject responses for effort and error rates throughout the entire observation period. It provides evidence that less effort and lower error rates occur for interpretation using S-GQM plans compared to NS-GQM plans. These considerations are also supported by the meta-analysis effect size estimations used to compare the two independent studies. So, given the overall results of the replication, some important conclusions can be made on the efficacy and comprehensibility of a quality model designed with the multiview framework approach. Such results point out the validity of the treatment: in other words, the multiview framework leads to a better structured quality model with lower complexity and fewer dependencies, which is easier to manage during measurement activities.

In the first experiment and in its replication, reported in this paper, we noticed that the errors made were usually the same. This suggests the need for further experimentation to assess whether 'the structured representation model not only requires less effort and is less error prone but also improves conformance to measurement process execution'. The benefit in this case is that higher conformance implies that each improvement made to the quality model or to its interpretation can be easily transferred to the entire structure. In this way, the validated approach (MF) becomes an important means for collecting experience as the measurement process is carried out and changes are made. As future work, the authors are planning to carry out empirical studies directly with practitioners in order to provide further support for the experimental results.

References

[1] M.T. Baldassarre, D. Caivano, G. Visaggio, Multiview framework for goal oriented measurement plan design, Fifth International Conference on Product Focused Software Process Improvement (PROFES), Nara, Japan, April 2004.
[2] M.T. Baldassarre, D. Caivano, G. Visaggio, Comprehensibility and efficiency of multiview framework for measurement plan design, Proceedings of the International Symposium on Empirical Software Engineering (ISESE), Rome, Italy, October 2003.
[3] M.T. Baldassarre, D. Caivano, G. Visaggio, Abstraction sheets for improving comprehensibility of quality models, Second Workshop on Software Quality, Edinburgh, Scotland, May 2004.
[4] J. Barnard, A. Price, Managing code inspection information, IEEE Software 11 (2) (1994) 59-69.
[5] V.R. Basili, G. Caldiera, H.D. Rombach, Goal question metric paradigm, Encyclopedia of Software Engineering, vol. 1, Wiley, London, 1994, pp. 528-532.
[6] V.R. Basili, M.K. Daskalantonakis, R.H. Yacobellis, Technology transfer at Motorola, IEEE Software 11 (2) (1994) 70-76.



[7] V.R. Basili, S. Green, Software process evolution at the SEL, IEEE Software 11 (4) (1994) 58-66.
[8] V. Basili, et al., Building knowledge through families of experiments, IEEE Transactions on Software Engineering 25 (4) (1999) 456-473.
[9] L.C. Briand, C.M. Differding, H.D. Rombach, Practical guidelines for measurement-based process improvement, Software Process: Improvement and Practice 2 (1996) 253-280.
[10] L.C. Briand, S. Morasca, V.R. Basili, An operational process for goal-driven definition of measures, IEEE Transactions on Software Engineering 28 (12) (2002) 1106-1125.
[11] A. Brockers, C. Differding, G. Threin, The role of software process modeling in planning industrial measurement programs, Proceedings of the Third International Software Metrics Symposium (METRICS), Berlin, March 1996, pp. 31-40.
[12] T.D. Cook, D.T. Campbell, Quasi-Experimentation: Design and Analysis Issues for Field Settings, Houghton Mifflin Company, 1979.
[13] M.K. Daskalantonakis, A practical view of software measurement and implementation experiences within Motorola, IEEE Transactions on Software Engineering 18 (11) (1992) 998-1010.
[14] A. Fuggetta, L. Lavazza, S. Morasca, S. Cinti, G. Oldano, E. Orazi, Applying GQM in an industrial software factory, ACM Transactions on Software Engineering and Methodology 7 (4) (1998) 411-488.
[15] A. Gopal, M.S. Krishnan, T. Mukhopadhyay, D.R. Goldenson, Measurement programs in software development: determinants of success, IEEE Transactions on Software Engineering 28 (9) (2002) 865-875.
[16] R.B. Grady, Practical Software Metrics for Project Management and Process Improvement, Hewlett-Packard Professional Books, 1992.
[17] T. Kilpi, Implementing a software metrics program at Nokia, IEEE Software (2001) 72-77.
[18] A. Loconsole, Measuring the requirements management key process area, Proceedings of the 12th European Software Control and Metrics Conference (ESCOM), London, England, April 2001, pp. 67-76.
[19] M.G. Mendonça, V.R. Basili, Validation of an approach for improving existing measurement frameworks, IEEE Transactions on Software Engineering 26 (6) (2000) 484-499.
[20] R.J. Offen, R. Jeffrey, Establishing software measurement programs, IEEE Software (3) 45-53.
[21] R.V. Solingen, E. Berghout, Improvement by goal-oriented measurement: bringing the goal/question/metric approach up to level 5, Proceedings of the E-SEPG, Amsterdam, The Netherlands, June 16-20, 1997.

[22] R.V. Solingen, E. Berghout, Integrating goal-oriented measurement in industrial software engineering: industrial experiences with and additions to the goal/question/metric method (GQM), Proceedings of METRICS, London, England, April 4-6, 2001, pp. 246-258.
[23] R.V. Solingen, F. Latum, M. Olivo, E.W. Berghout, Application of software measurement at Schlumberger RPS: towards enhancing GQM, Proceedings of the European Software Control and Metrics Conference, The Netherlands, May 17-19, 1995.
[24] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering, Kluwer Academic Publishers, Dordrecht, 2002.
[25] M.V. Zelkowitz, D.R. Wallace, Experimental models for validating technology, IEEE Computer (1998) 23-31.
[26] K. Pulford, A. Kuntzmann-Combelles, S. Shirlaw, A Quantitative Approach to Software Management, Addison-Wesley, Reading, MA, 1995.
[27] N.E. Fenton, S.L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, International Thomson Computer Press, 1997.
[28] U.W. Pooch, Translation of decision tables, ACM Computing Surveys 6 (2) (1974) 125-151.
[29] Available at: http://www.econ.kuleuven.ac.be/tew/academic/infosys/research/prologa/prologa.htm
[30] S. Senn, Cross-over Trials in Clinical Research, second ed., Wiley, London, 2002.
[31] J. Miller, Replicating software engineering experiments: a poisoned chalice or the Holy Grail, Information and Software Technology 47 (2005) 233-244.
[32] S.S. Shapiro, M.B. Wilk, An analysis of variance test for normality (complete samples), Biometrika (1965).
[33] B. Kitchenham, J. Fry, S. Linkman, The case against cross-over designs in software engineering, Proceedings of the 11th International Workshop on Software Technology and Engineering Practice (STEP'04), 2004.
[34] J. Miller, Can results from software engineering experiments be safely combined?, Proceedings of METRICS, 1999.
[35] S. Kramer, R. Rosenthal, Effect sizes and significance levels in small sample research, in: R. Hoyle (Ed.), Statistical Strategies for Small Sample Research, Sage Publications, Beverly Hills, CA, 1999.
[36] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, Academic Press, New York, 1977.


