An optimal neural network plasma model: a case study


Chemometrics and Intelligent Laboratory Systems 56 (2001) 39–50
www.elsevier.com/locate/chemometrics

Byungwhan Kim a,*, Sungjin Park b

a Department of Electronic Engineering, Sejong University, 98, Goonja-Dong, Kwangjin-Gu, Seoul 143-747, South Korea
b Department of Electrical Engineering, Chonnam National University, 300, Yongbong-Dong, Buk-Ku, Kwangju 500-757, South Korea

* Corresponding author. Tel.: +82-2-3408-3329; fax: +82-2-3408-3329. E-mail address: [email protected] (B. Kim).

Received 1 September 2000; accepted 31 January 2001

Abstract

Artificial neural networks, particularly the backpropagation neural network (BPNN), have recently been applied to model various plasma processes. Developing a BPNN model, however, is complicated by the presence of several adjustable factors whose optimal values are initially unknown. These may include the initial weight distribution, the number of hidden neurons, the gradient of the neuron activation function, and the training tolerance. A methodology is presented to optimize these factor effects, accomplished by implementing a genetic algorithm (GA) on the best models. Particular emphasis was placed on a qualitative measure of the initial weight distribution, whose magnitude and directionality were varied. Interactions between factors were examined by means of a 2⁴ factorial experiment. Parametric effect analysis revealed the dissimilarity between the best and average prediction characteristics. Both the gradient and the initial weight distribution exerted conflicting effects on the average and best performances. GA-optimized models exhibited about 20% improvement over the experimentally chosen best models. A further improvement of more than 30% was achieved with respect to statistical response surface models. The plasma modeled is an inductively coupled plasma; its experimental data were collected with a Langmuir probe on etch equipment capable of processing 200-mm wafers. © 2001 Published by Elsevier Science B.V.

Keywords: Plasma modeling; Backpropagation neural networks; Genetic algorithm optimization; Langmuir probe

1. Introduction

Plasmas play a crucial role in etching fine patterns and depositing thin films. To manufacture processes (or equipment) with less cost and effort, numerical process simulations are highly desirable, which in turn require accurate plasma models. Historically, plasma models have been developed from first-principles physics involving continuity, momentum balance, and energy balance inside high-frequency, high-intensity electric and/or magnetic fields [1,2]. Physical models attempt to derive self-consistent solutions to complex physical equations by means of computationally intensive numerical simulation methods, which typically produce distribution profiles of electrons and ions within the plasma. Although such simulations are somewhat useful for equipment design and optimization, they are subject to many simplifying assumptions due to the limited understanding of the physical and chemical processes involved, which often results in large discrepancies between model predictions and actual measurements. As a result, other efforts have focused on empirical approaches to plasma modeling, such as the response surface model (RSM) [3].


Despite its popularity at actual manufacturing sites, however, models derived via the RSM are inherently limited in that they attempt to linearize nonlinear plasma process data. As an alternative to both physical models and the RSM, an adaptive learning technique that combines neural networks with statistical experimental design has been employed to model an etch process [4] and demonstrated superior accuracy in data learning and prediction over the RSM. Neural networks differ from the RSM in that they learn complex plasma data adaptively and nonlinearly, without the linearization required in the RSM. Among the many neural networks, the backpropagation neural network (BPNN) [5] has been the most popular for interpreting underlying etch mechanisms due to its proven high prediction accuracy [6–9]. Training factors typically involved in building a BPNN model include the number of hidden neurons, the training tolerance, the initial weight distribution, and the activation function gradient. Owing to its simplicity, the number of hidden neurons is the factor most frequently varied, and its optimal value is determined either experimentally or with optimization techniques [10]. A systematic approach [11] was once presented to optimize the effects of the first three factors together with the learning rate and momentum. However, it is critically limited in that it neglected the various random effects of the initial weight distribution on model performance. Despite an attempt to accommodate this effect [12], that work examined factor effects only on the average prediction performance, not on the optimal or, at least, best one. Moreover, the number of models generated was too small to capture the complexity introduced by the initial weights. In fact, the network can reach many different local minima depending on the particular initial weight distribution. An analytical voting model was presented to identify randomness in network generalization [13]. Although network prediction capability was claimed to depend largely on the size (or range) of the weights rather than on their number [14], it remains unclear to which model characteristic (average or optimal) this argument applies. The applicability of the measure in [15] to nonlinear plasma data also remains questionable. Because of this complexity inherent in the initial weight distribution, it has rarely been attempted to examine interactions between the random initial weight distribution and the other training factors, or to discriminate factor effects between average and optimal models. In the context of function gradients, some improvement in prediction accuracy has been reported with their adjustment [16].

Under random initial weights, this study aims at designing an optimal neural network plasma model for process simulation: modeling principal plasma attributes under variations in source power, chamber pressure, and argon flow rate. A 2⁴ factorial experiment [17] was conducted to simultaneously investigate the effects of four training factors: initial weight distribution, gradient of the neuron activation function, number of hidden neurons, and training tolerance. The objective of this experiment is to evaluate factor effects on network performance as well as to optimize them. Different degrees of factor significance for the average and optimal predictive capabilities are to be discriminated. Optimal models obtained by implementing a genetic algorithm (GA) [18] are compared to both experimentally chosen best models and a statistical RSM. The plasma to be modeled is a multipole inductively coupled plasma [19] generated in an etch system capable of processing 200-mm wafers. Plasma data were collected with a Langmuir probe.

2. Experimental data and neural networks

2.1. Experimental data acquisition

A schematic diagram of the experimental ICP etch system is shown in Fig. 1. A low-pressure planar plasma is generated using an array of six magnetic cores mounted on top of the Al-anodized vacuum chamber. The ferrite cores are equally spaced by 40 mm, and each has three turns of coil. RF power to the coils is delivered by an RF generator (RF 30S). Due to the time-varying magnetic fields between poles of opposite polarity, free electrons in the gas are accelerated in the direction parallel to the fields, and the plasma is thereby produced. A ceramic plate was employed to separate the plasma source from the chamber, which has an area of 430 × 430 mm² and a height of 290 mm. The electrostatic chuck has a diameter of 200 mm and is driven by another RF power generator (RF 5S) at the same 13.56 MHz. Plasma attributes collected using the Langmuir probe include the plasma density (Ne), electron temperature (Te), and plasma potential (Vp).


Fig. 1. A schematic diagram of the multipole inductively coupled plasma etch system.

Fig. 2. A typical architecture of backpropagation neural networks.

To characterize plasma behaviors systematically, a 2³ full factorial experiment with one center point was employed. Additionally, nine experiments were conducted to evaluate model prediction ability. A total of 18 experiments were thus used for model development. The equipment factors varied in the design are source power, pressure, and Ar flow rate; the corresponding experimental ranges are shown in Table 1. Using the probe, a total of 35 measurements were taken across the diameter of the wafer for each experimental trial. Each plasma attribute was averaged over these measurements, and the averaged value was used as the model output (or response).

Table 1
Experimental factors for plasma characterization

Factor          Range       Unit
Source power    500–1500    Watts
Pressure        3–10        mTorr
Ar flow rate    20–100      sccm
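As a concrete illustration of how the nine training trials of Section 2.1 can be constructed, the following is a minimal Python sketch (not from the paper; the dictionary keys are illustrative) that enumerates the 2³ corner points from the Table 1 ranges and appends the single center point.

```python
from itertools import product

# (low, high) bounds taken from Table 1
factors = {
    "source_power_W": (500.0, 1500.0),
    "pressure_mTorr": (3.0, 10.0),
    "ar_flow_sccm":   (20.0, 100.0),
}

# Corner points of the 2^3 design: every low/high combination.
corners = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# One center point: the midpoint of each range.
center = {name: (lo + hi) / 2.0 for name, (lo, hi) in factors.items()}

design = corners + [center]   # nine training trials in total
for trial in design:
    print(trial)
```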

2.2. Basics of neural networks

Among the various ANN paradigms, the BPNN was employed for this plasma modeling due to its proven high accuracy in learning nonlinear process data. A typical architecture of a BPNN is exhibited in Fig. 2. A BPNN consists of several layers of neurons: an input layer, one or more hidden layers, and an output layer. The input layer receives external information, such as that represented by the three adjustable equipment parameters in Table 1. The output layer transmits the data and thus corresponds to the various plasma attributes (electron density, electron temperature, and plasma potential). In this study, the number of neurons in the output layer was set to unity, since each attribute was modeled one at a time. A BPNN also incorporates "hidden" layers of neurons that do not interact with the outside world but assist in performing nonlinear feature extraction on the data provided by the input and output layers. The activation level (or firing strength) of a neuron is determined by a nonlinear sigmoid function:

$$\mathrm{out}_{i,k} = \frac{1}{1 + e^{-\mathrm{int}_{i,k}}}, \qquad (1)$$
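For concreteness, here is a small Python sketch of Eq. (1), with the adjustable activation gradient of Section 3.1.2 included as a divisor of the weighted input (that interpretation is taken from Section 3.1.2); gradient = 1.0 recovers Eq. (1) exactly.

```python
import numpy as np

def sigmoid(weighted_input, gradient=1.0):
    # Eq. (1); the 'gradient' factor divides the weighted input, so
    # values below 1 steepen the activation and values above 1 flatten it.
    return 1.0 / (1.0 + np.exp(-weighted_input / gradient))

# A neuron's firing strength at the same input under three gradients:
print(sigmoid(1.0, 0.5), sigmoid(1.0, 1.0), sigmoid(1.0, 1.5))
```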


where int_{i,k} and out_{i,k} denote the weighted input to the ith neuron in the kth layer and the output from that neuron, respectively. The BP algorithm by which the network is trained begins with a random set of weights (i.e., connection strengths between neurons). An input vector, normalized to lie in the interval between −1 and 1, is then presented to the network, and the output is calculated using the initial weight matrix. Next, the calculated (or predicted) output is compared to the actually measured output, and the squared difference between the two determines the system error. The quantity the network attempts to minimize in weight space is the accumulated error E over all input–output pairs, expressed as:

$$E = 0.5 \sum_{j=1}^{q} (d_j - \mathrm{out}_j)^2, \qquad (2)$$

where q is the number of output neurons, d_j is the desired output of the jth neuron in the output layer, and out_j is the calculated output of that same neuron. In the BP algorithm, this error is minimized via gradient descent optimization, in which the weights are adjusted in the direction that decreases E in Eq. (2). The basic weight update scheme, commonly known as the generalized delta rule [4], is expressed as:

$$W_{i,j,k}(m+1) = W_{i,j,k}(m) + \eta\, \Delta W_{i,j,k}(m), \qquad (3)$$

where W_{i,j,k} is the connection strength between the jth neuron in layer (k−1) and the ith neuron in layer k, and ΔW_{i,j,k} is the calculated change in the weight that minimizes E in Eq. (2), defined as:

$$\Delta W_{i,j,k} = -\frac{\partial E}{\partial W_{i,j,k}}. \qquad (4)$$

Here, m and η denote the iteration number and an adjustable parameter, the so-called "learning rate", respectively. By adjusting the weighted connections recursively using the rule in Eq. (4) for all units in the network, the accumulated error E over all input vectors is minimized.
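To make the training loop concrete, the following is a minimal NumPy sketch of a one-hidden-layer BPNN trained with Eqs. (1)–(4). It is an illustrative reconstruction, not the authors' code: batch updates are assumed, bias terms and the momentum term are omitted, and the toy data stand in for the normalized experimental vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_bpnn(X, d, n_hidden=4, weight_range=1.0, gradient=1.0,
               eta=0.1, tolerance=0.1, max_epochs=20000):
    """One-hidden-layer BPNN trained by the generalized delta rule."""
    W1 = rng.uniform(-weight_range, weight_range, (X.shape[1], n_hidden))
    W2 = rng.uniform(-weight_range, weight_range, (n_hidden, d.shape[1]))

    def act(x):                          # Eq. (1), with activation gradient
        return 1.0 / (1.0 + np.exp(-x / gradient))

    for _ in range(max_epochs):
        h = act(X @ W1)                  # hidden-layer outputs
        out = act(h @ W2)                # network outputs
        err = d - out
        E = 0.5 * np.sum(err ** 2)       # accumulated error, Eq. (2)
        if E < tolerance:                # training tolerance stops training
            break
        # Back-propagated deltas; the division by `gradient` comes from
        # the derivative of act().  Eq. (4) gives the weight change, and
        # Eq. (3) applies it scaled by the learning rate eta.
        delta_out = err * out * (1 - out) / gradient
        delta_hid = (delta_out @ W2.T) * h * (1 - h) / gradient
        W2 += eta * (h.T @ delta_out)
        W1 += eta * (X.T @ delta_hid)
    return W1, W2, E

# Toy usage: nine trials, three inputs normalized to [-1, 1], one response.
X = rng.uniform(-1.0, 1.0, (9, 3))
d = rng.uniform(0.2, 0.8, (9, 1))
W1, W2, E = train_bpnn(X, d)
print("final accumulated error:", E)
```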

3. Results

3.1. Factor effect analysis

As stated previously [11], implementation of the BP algorithm depends on many training factors: training tolerance, initial weight distribution, hidden neurons, momentum, and learning rate. Since the last two were found to have little impact on model prediction accuracy, they were excluded from this study. Instead, another factor, the gradient of the sigmoid function in Eq. (1), was included. The factor effects are examined on both the average and best performances in a comparative fashion.

3.1.1. Initial weight distribution

To measure the effects of the initial weights qualitatively, a number of BPNN models were generated for a given factor combination. To estimate an appropriate limit to this number, it was varied from 100 to 500 in increments of 100 while monitoring variations in prediction performance. Results are shown in Fig. 3, where the initial weights were uniformly distributed between ±1.0. The training tolerance, number of hidden neurons, and activation slope were set to 0.1, 4, and 1.0, respectively. During model development, the BPNN was trained on the nine experiments from the 2³ factorial experiment including one center point, and the trained BPNN was subsequently tested on the other nine experiments not belonging to the training data. The root mean-squared (RMS) prediction error measured on the test data is symbolized as a square in Fig. 3. As expected, prediction performance varies considerably with the initial weight distribution. For comparison, the five best errors, one from each of five intervals, are numerically indicated. The second error, found at the 165th model in the interval [100, 200], has the smallest magnitude. Despite the generation of more models, this error remained the best one. Based on this observation, the limit of model generation was set to 200 in this study. The other RMS error, marked with a triangle, represents the average prediction error, which initially increases dramatically and then seems to stay constant. This is somewhat contradictory to the general understanding that model predictive capability is improved by averaging multiple models.
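The restart protocol just described can be sketched as follows (a hypothetical helper; `train_once` is assumed to train one network from a fresh random initial weight distribution, e.g. the train_bpnn sketch above, and return its test RMS prediction error):

```python
import numpy as np

def best_and_average(train_once, n_models=200):
    # Repeat training from n_models random initial weight distributions
    # and report the best (minimum) and average test RMS prediction
    # error -- the two quantities plotted in Fig. 3.
    errors = np.array([train_once() for _ in range(n_models)])
    return errors.min(), errors.mean()
```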


Fig. 3. Prediction characteristics of the electron density model for random initial weight distribution.


The effects of the initial weight distribution were experimentally examined by varying its magnitude from ±0.2 to ±1.4. For a given magnitude, simulations were repeated 200 times, and the resulting average and best RMS errors are shown in Fig. 4. Fig. 4 reveals that for each model the average performance degrades consistently as the initial weight distribution widens. This can mainly be attributed to the premature saturation phenomenon induced by an excessively widened weight distribution [20,21]. At wider distributions, this degradation becomes more significant. Decreasing the weight distribution, meanwhile, improves the average performance, which partly illustrates the benefit of weight decay [22]. By contrast, the best performance deteriorates significantly at too small a weight distribution. Although this is hardly discernible for the electron temperature model in Fig. 4, it is evident from the fact that the best error (1.083) obtained at ±0.8 increased to 1.995 at ±0.2. It is thus noteworthy that the initial weight distribution exerts conflicting effects on the two performance measures.

Fig. 5. Prediction performances with variations in the directionality and magnitude of initial weight distribution.

From the standpoint of prediction accuracy, bi-directional initial weights are next compared with uni-directional ones. For the latter, the weights were varied either incrementally from 0.2 to 1.4 or decreasingly from −0.2 to −1.4. Results are shown in Fig. 5. For weights distributed in either single direction, the corresponding average performances demonstrate similar behavior, except for the electron density at magnitudes larger than ±1.0. Also noticeable is that the best errors for uni-directional weights are considerably larger than those for bi-directional weights. This suggests that bi-directional initial weights are more advantageous for improving prediction accuracy.
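A sketch of the two initialization schemes compared in Fig. 5 is given below. The exact bounds of the uni-directional distributions are an assumption (the paper only quotes the swept magnitudes), so uniform sampling on one side of zero is used for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(shape, magnitude=1.0, direction="bi"):
    if direction == "bi":     # uniform on [-magnitude, +magnitude]
        return rng.uniform(-magnitude, magnitude, shape)
    if direction == "pos":    # uni-directional, positive weights only
        return rng.uniform(0.0, magnitude, shape)
    if direction == "neg":    # uni-directional, negative weights only
        return rng.uniform(-magnitude, 0.0, shape)
    raise ValueError(f"unknown direction: {direction}")
```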

Fig. 4. Average and best prediction models vs. initial weight distribution.

3.1.2. Activation gradient, hidden neurons, and training tolerance

Although the gradient of the sigmoid function plays a significant role in determining BPNN prediction performance, studies have rarely been conducted to examine its effects. As a divisor of the weighted input int_{i,k}, the gradient can be regarded as another weight regularization factor. In this experiment, the gradient was varied from 0.5 to 1.5 in steps of 0.2, and its effects on model prediction capability are depicted in Fig. 6. An observation common to all models is that the average performance deteriorates drastically as the gradient decreases. This can presumably be attributed to an excessive decrease in some neuron outputs, which makes the weights in the hidden layer hard to update [16]. Inevitably, this may lead to excessively long training times, which actually occurred at gradients less than 0.50 in this study. A similar phenomenon can be conjectured to exist at larger initial weight distributions, since the weights are divided by gradients less than 1. This is clearly exhibited in Fig. 4, which implies that both factors adversely affect the average performance. Generally, the best performance seems to worsen at larger gradients; a similar phenomenon, expected at narrower initial weight distributions, is apparent in Fig. 4.

Fig. 6. Average and best prediction models vs. activation gradient.

The number of hidden neurons acts as a feature extractor and provides an estimate of the number of "conflicts" contained in the input–output mapping, a conflict referring to input and output mappings that require incompatible weight solutions [23]. With too few hidden neurons, the network often fails to learn the relationships; moreover, the computation time to complete the training process becomes too costly. An increase in the number of hidden neurons may either increase or decrease the generalization capability for unseen patterns. Thus, finding an optimal number of hidden units becomes a complicated task. In this study, this number was varied from one to seven in steps of one, and for this range of hidden neurons no difficulty with network convergence was encountered. Resultant prediction errors are depicted in Fig. 7. As seen in Fig. 7, each model behaves quite differently with the number of hidden neurons. Consistent behavior is observed only for the electron temperature model. A significant degradation in the best performance at a smaller number of hidden neurons mainly originates from the networks being over-trained.

Fig. 7. Average and best prediction models vs. hidden neurons.

As a termination criterion for network training, the training tolerance (T) determines the overall quality of the network modeling capability by specifying the required accuracy of the neuron outputs. At each training epoch, it is given by:

$$T = \sqrt{\frac{\sum_{j=1}^{k} (d_j - \mathrm{out}_j)^2}{k - 1}}, \qquad (5)$$

where k represents the total number of test vectors, which is nine in this study. A small tolerance usually leads to a small learning error, but can result in reduced generalization capability as well as a longer training time. Conversely, a larger tolerance shortens the convergence time at the expense of learning accuracy. Thus, the tolerance level should be carefully chosen; in these experiments it was varied from 0.01 to 1.30 in steps of 0.02. Resultant error characteristics are displayed in Fig. 8. For the plasma potential model, increasing the tolerance degrades the average performance. By contrast, little variation is observed for the other models. At larger tolerances, meanwhile, the best performance of each model appears greatly deteriorated.

Fig. 8. Average and best prediction models vs. training tolerance.
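As a worked form of Eq. (5), a small Python sketch of the tolerance check follows (a hypothetical helper; training would halt once T drops below the chosen tolerance level):

```python
import numpy as np

def training_tolerance(d, out):
    # Eq. (5): RMS deviation between desired (d) and calculated (out)
    # outputs over the k vectors of one epoch; k = 9 in this study.
    d, out = np.asarray(d, float), np.asarray(out, float)
    k = d.size
    return float(np.sqrt(np.sum((d - out) ** 2) / (k - 1)))

# Example stop criterion inside a training loop:
#   if training_tolerance(d, out) < 0.05: break
```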

3.2. Model-based optimization via genetic algorithm

A genetic algorithm (GA) was utilized to search for the particular factor setting that minimizes a given best prediction error. To accomplish this, the relationships between the training factors and the best performance were characterized by a 2⁴ full factorial experiment. The experimental factor ranges, chosen from the preceding factor effect analysis, are shown in Table 2. The BPNN was trained on the 16 factorial arrays and subsequently tested on the auxiliary eight data points. For each of the 16 experiments, 200 models were generated repeatedly, and as the model response one best RMS prediction error was selected out of the 200 local minima.

Table 2
Ranges of training factors experimentally chosen from factor effect analysis

Training factor               Electron density    Electron temperature    Plasma potential
                              (10¹¹/cm³)          (eV)                    (V)
Error tolerance               0.03–0.07           0.03–0.07               0.01–0.03
Hidden neurons                3–5                 3–6                     4–6
Initial weight distribution   ±0.8–±1.0           ±1.0–±1.2               ±0.8–±1.0
Sigmoid gradient              0.9–1.3             0.9–1.1                 0.7–0.9


Table 3
Comparison of GA-optimized model, experimentally chosen best model, and statistical RSM

Training factors and models   Electron density    Electron temperature    Plasma potential
                              (10¹²/cm³)          (eV)                    (V)
Initial weight distribution   ±0.983              ±1.200                  ±1.197
Sigmoid gradient              0.895               0.900                   0.925
Hidden neurons                3.15 (3)            5.85 (6)                5.14 (5)
Training tolerance            0.03                0.05                    0.02
GA-optimized model            0.172               0.141                   0.696
Best model                    0.198               0.209                   0.888
RSM                           0.274               0.501                   2.215

On the basis of the best-error models thereby developed, the GA was implemented. In the GA optimization, each training factor was coded as a 40-bit string, resulting in a total chromosome length of 120 bits. During each computational cycle, an initial population of 100 potential solutions was created, each represented by a binary string and manipulated by the genetic operators. Next, the performance of each individual in the population is evaluated, and a selection mechanism is subsequently activated to choose the best strings with the highest fitness for the genetic manipulation process. The crossover operator takes two chromosomes and swaps parts of their genetic information to produce two new chromosomes, based on a specified crossover probability. A mutation probability is likewise given to the mutation operator, which randomly changes a fixed number of bits every generation. The crossover and mutation probabilities used in this optimization were 0.95 and 0.01, respectively. A particular input setting generated by the GA satisfies a given fitness function expressed as:

$$F = \frac{1}{1 + \sum r}$$
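The GA loop can be sketched as follows. Only the quoted settings (40-bit coding per factor, population of 100, crossover probability 0.95, mutation probability 0.01, fitness of the form F = 1/(1 + error)) come from the paper; the binary decoding, fitness-proportional selection, and single-point crossover are assumptions chosen for illustration, and `error_fn` stands in for the best-of-200 BPNN prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)

BITS, POP, P_CROSS, P_MUT = 40, 100, 0.95, 0.01   # settings quoted in the paper

def decode(chrom, bounds):
    # Map each 40-bit slice onto a real value inside its Table 2 range.
    vals = []
    for i, (lo, hi) in enumerate(bounds):
        bits = chrom[i * BITS:(i + 1) * BITS]
        frac = bits @ (2.0 ** -np.arange(1, BITS + 1))  # binary fraction in [0, 1)
        vals.append(lo + frac * (hi - lo))
    return vals

def ga_minimize(error_fn, bounds, generations=50):
    length = BITS * len(bounds)
    pop = rng.integers(0, 2, (POP, length))
    for _ in range(generations):
        # Fitness F = 1 / (1 + error): smaller error means fitter string.
        fit = np.array([1.0 / (1.0 + error_fn(decode(c, bounds))) for c in pop])
        parents = pop[rng.choice(POP, POP, p=fit / fit.sum())]  # fitness-proportional
        for i in range(0, POP - 1, 2):                          # single-point crossover
            if rng.random() < P_CROSS:
                cut = int(rng.integers(1, length))
                parents[[i, i + 1], cut:] = parents[[i + 1, i], cut:]
        flips = rng.random(parents.shape) < P_MUT               # bit-flip mutation
        pop = np.where(flips, 1 - parents, parents)
    best = max(pop, key=lambda c: 1.0 / (1.0 + error_fn(decode(c, bounds))))
    return decode(best, bounds)

# Toy usage with a stand-in error function over the electron-density ranges
# of Table 2 (tolerance, hidden neurons, weight range, gradient):
bounds = [(0.03, 0.07), (3.0, 5.0), (0.8, 1.0), (0.9, 1.3)]
print(ga_minimize(lambda v: sum(v), bounds, generations=5))
```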