
A GENERAL-PURPOSE DYNAMICALLY RECONFIGURABLE SVM

Jonas Gomes Filho, Mario Raffo, Marius Strum, and Wang Jiang Chau
Department of Electronic Systems, University of Sao Paulo
Av. Professor Luciano Gualberto, trav. 3, 158
email: [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper presents a hardware implementation of the Sequential Minimal Optimization (SMO) algorithm for the Support Vector Machine (SVM) training phase. A general-purpose reconfigurable architecture, aimed at partially reconfigurable FPGAs, is developed; it supports training sets of different sizes, with a wide range of numbers of samples and elements. The effects of a fixed-point implementation are analyzed, and area and frequency figures targeting the Xilinx Virtex-IV XC4VLX25 FPGA are provided. The architecture was able to perform the training on different learning benchmarks, and the reconfigurable version saved 22.38% of the FPGA's area.

1. INTRODUCTION

In recent years, the Support Vector Machine (SVM) has been widely used in different applications [1], due to its high classifying capability (generalization performance). The use of SVMs consists of two phases: training and testing. Training an SVM aims to find a hyperplane that separates the positive from the negative examples while maximizing a margin parameter. The margin is calculated by adding the minimum distances from both classes to the separating hyperplane, and finding it requires the solution of a quadratic programming (QP) problem [2], which can be solved by numerical methods [3]. However, the associated numerical solutions have been pointed out as inadequate for large problems, due to the storage of large matrices of size proportional to m² (where m is the number of training samples) and the long training times caused by their slow convergence [3]. As an alternative, the Sequential Minimal Optimization (SMO) algorithm [4] is one of the fastest and easiest-to-implement solutions for the most general case, the non-linear and non-separable one. The algorithm is iterative and adopts an analytical solution, optimizing one pair of Lagrange multipliers per iteration and therefore avoiding the storage of large matrices in memory. This work was partially supported by the National Council for Scientific and Technological Development of Brazil (CNPq).

Important dedicated HW implementations of the training phase of the SVM have been implemented in digital FPGAs [5, 6, 10] to obtain speed-ups over software processing. A first fully hardware implementation of an SVM algorithm was reported in [5], based on a numerical resolution of the QP problem and therefore applicable only to small cases. SMO-based solutions were presented later, first by the authors of [7], as a software-hardware (SW-HW) system and, later, in [10], as a parallel architecture for the computation of multiple pairs of samples. Although faster than a purely software SMO implementation, these solutions were targeted either at small examples or at a specific case. In the last decade, with the technology based on static memory (SRAM), FPGAs have provided a unique aspect of flexibility: the capability for dynamic reconfiguration, which involves altering the programmed design at run-time [8]. Dynamically reconfigurable systems (DRS) provide benefits such as design flexibility and area reduction, while still providing significant speed benefits over purely software solutions. In this work we present a dynamically reconfigurable SVM-SMO architecture, motivated by the sequential nature of some of the SMO tasks, basically the iterations over: 1- working set selection; 2- coefficient pair optimization; and 3- global data updating. To the best of our knowledge, this is the first dynamically reconfigurable architecture developed for SVM training. The system was designed following a modular scheme, defining concurrent and interchangeable logical areas for the reconfiguration and leaving a high-level synthesis system to generate the RTL model of each module [12]. In addition, we propose a general-purpose implementation of the SMO algorithm, which is able, for a given fixed-point data representation, to train a varied set of examples with different numbers of samples and elements. To determine a suitable size of the fixed-point data for an ample set of benchmarks, a study of its effects on the precision and the classification error was carried out. After the RTL model generation and its synthesis, results on area and timing are presented.


In Section 2, the SVM theory and the SMO algorithm are revised, while in Section 3 the digital architecture is presented. In Section 4, different fixed-point representation sizes are compared. Section 5 presents the reconfigurable architecture adopted in this work, Section 6 reports the experimental results and, finally, Section 7 draws the conclusions.

2. BACKGROUND

2.1. Support Vector Machines

The objective of the SVM training phase is to find a classification function based on a set of samples $\{\vec{x}_i, y_i\}_{i=1}^{m}$, where $\vec{x}_i$ is an input pattern represented as a d-dimensional vector and $y_i \in \{+1, -1\}$ is the corresponding output. In order to split the input vectors into two classes, a separating hyperplane is found. The classification function is given by [2]:

$$\hat{\Phi}(\vec{x}) = \vec{w} \cdot \vec{x} + b \qquad (1)$$

where $\vec{w}$ is the normal vector of the separating hyperplane and b is a threshold parameter. Among all possible separating hyperplanes, SVMs find the one that maximizes a margin parameter [1], i.e., the maximum distance from both negative and positive points. In most cases some points fall on the wrong side of the hyperplane, which may lead to classification errors. To reduce errors, the points can be mapped from the d-dimensional input space to an n-dimensional feature space through a nonlinear function φ, and the training problem is solved in its dual formulation:

$$\max_{\alpha} \; L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j K(\vec{x}_i, \vec{x}_j) \qquad (2)$$

with 0 ≤ α_i ≤ C and

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (3)$$

C is an arbitrary parameter which sets the trade-off between the margin parameter and the number of misclassified points, both important for a small generalization error. The advantage of using the dual formulation given by (2) is to avoid knowing explicitly the nonlinear function φ, needing only the inner product of two points in the feature space. This can be done through the kernel function, defined by:

$$K(\vec{x}_i, \vec{x}_j) = \varphi(\vec{x}_i) \cdot \varphi(\vec{x}_j) \qquad (4)$$

Among the many kernel functions that have been proposed for SVM mapping, the Gaussian kernel given by (5) is one of the most applicable kernels for the general case. It can efficiently map the original sample input space into a much higher-dimensional one, increasing the separation margin between the classes.

$$K(\vec{x}_i, \vec{x}_j) = e^{-\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2}} \qquad (5)$$
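For reference, a minimal Python sketch of the Gaussian kernel of Eq. (5) is shown below; the function name, the value of σ, and the example vectors are illustrative only and are not part of the original design.

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # Eq. (5): K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

# Two normalized 3-element samples (values in [0, 1], as assumed in Section 4)
print(gaussian_kernel([0.2, 0.5, 0.1], [0.3, 0.4, 0.0]))
```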

2.2. The Sequential Minimal Optimization

The SMO algorithm iterates over the training set, optimizing one pair of Lagrange multipliers at a time. For the working set selection, two index sets are defined:

$$I_{up} = \{\, i \mid \alpha_i < C,\, y_i = +1 \;\;\text{or}\;\; \alpha_i > 0,\, y_i = -1 \,\} \qquad (6)$$

$$I_{low} = \{\, i \mid \alpha_i < C,\, y_i = -1 \;\;\text{or}\;\; \alpha_i > 0,\, y_i = +1 \,\} \qquad (7)$$

Afterwards, by evaluating every sample i, the working set {i1, i2} is formed by selecting the sample inside I_up which maximizes function (8) and the sample in I_low that minimizes it.

$$F_i = -y_i \frac{\partial L(\alpha)}{\partial \alpha_i} \qquad (8)$$

In the optimization task, two bounds Hi and Lo are calculated based on constraint (3), and then the new values α_i2^new and α_i1^new are calculated:

$$\alpha_{i2}^{new} = \alpha_{i2}^{old} + \frac{y_{i2}\,(b_{low} - b_{up})}{\eta} \qquad (9)$$

$$\alpha_{i1}^{new} = \alpha_{i1}^{old} + y_1 y_2 \,(\alpha_{i2}^{old} - \alpha_{i2}^{new}) \qquad (10)$$

where

$$\eta = \frac{\partial^2 L(\alpha)}{\partial \alpha_{i2}^2} = K(\vec{x}_{i1}, \vec{x}_{i1}) + K(\vec{x}_{i2}, \vec{x}_{i2}) - 2K(\vec{x}_{i1}, \vec{x}_{i2}) \qquad (11)$$

and b_up and b_low are the values of F_i for the pair {i1, i2}.
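A literal Python transcription of the pair update, Eqs. (9)-(11), is sketched below. The kernel matrix K and the clipping to the bounds Hi and Lo mentioned in the text are assumed to be handled elsewhere; the snippet only reproduces the arithmetic of the equations, with illustrative names.

```python
def optimize_pair(alpha, y, K, i1, i2, b_up, b_low):
    # Eq. (11): eta = K(i1,i1) + K(i2,i2) - 2*K(i1,i2)
    eta = K[i1][i1] + K[i2][i2] - 2.0 * K[i1][i2]
    # Eq. (9): update of alpha_i2
    a2_old = alpha[i2]
    a2_new = a2_old + y[i2] * (b_low - b_up) / eta
    # Eq. (10): alpha_i1 moves so that the constraint of Eq. (3) is preserved
    a1_new = alpha[i1] + y[i1] * y[i2] * (a2_old - a2_new)
    return a1_new, a2_new
```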

Finally, the global data update is performed for every sample i, by the following equations:

$$F_i^{new} = F_i^{old} + \Delta F_{i1} + \Delta F_{i2} \qquad (12a)$$

$$\Delta F_{i1} = (\alpha_{i1}^{old} - \alpha_{i1}^{new})\, K(\vec{x}_{i1}, \vec{x}_i) \qquad (12b)$$

$$\Delta F_{i2} = (\alpha_{i2}^{old} - \alpha_{i2}^{new})\, K(\vec{x}_{i2}, \vec{x}_i) \qquad (12c)$$

The initial values of α are 0 and the initial values of F are the same as those of y. The algorithm finishes when b_up − b_low < ε, where ε is a user-defined tolerance parameter. This condition is proved to satisfy the KKT conditions [4].
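The global update and the stopping test translate directly into code; the sketch below follows Eqs. (12a)-(12c) and the convergence condition, with the kernel passed in as a function (a simplification with respect to the hardware, which streams samples from memory). Names are illustrative.

```python
def global_update(F, X, i1, i2, a1_old, a1_new, a2_old, a2_new, kernel):
    # Eqs. (12a)-(12c), applied to every sample i
    for i in range(len(F)):
        dF1 = (a1_old - a1_new) * kernel(X[i1], X[i])   # Eq. (12b)
        dF2 = (a2_old - a2_new) * kernel(X[i2], X[i])   # Eq. (12c)
        F[i] += dF1 + dF2                               # Eq. (12a)
    return F

def converged(b_up, b_low, eps=1e-3):
    # Stopping criterion: b_up - b_low < eps (eps = 0.001 is used in Section 4)
    return (b_up - b_low) < eps
```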

3. DIGITAL ARCHITECTURE

Fig. 1. Modular architecture.

This SVM architecture is designed for general-purpose training, i.e., it supports different numbers of input samples or elements; furthermore, elements may have either real or binary attributes. Based on the SMO algorithm, four separate blocks were developed in a modular design, represented by the gray boxes in Fig. 1. The kernel block is conceptually part of the updater block; however, due to its computational complexity and high usage, a dedicated block was designed for it. The blocks execute the tasks mentioned in Section 2 with minor changes, such as the substitution of the parameter η by the constant 2. This allows a faster computation, since the kernel function is not computed and the division is executed as a simple right shift, while the classification error rate remains unaffected, because the SMO algorithm always converges to an optimal solution based on the tolerance variable ε; only the number of iterations increases slightly, as shown in Section 4.

The Gaussian kernel was adopted for this general-purpose architecture. Since this kernel function is quite complex, Anguita et al. have proposed a hardware-friendly one, given as [9]:

$$K(\vec{x}_i, \vec{x}_j) = 2^{-\gamma \|\vec{x}_i - \vec{x}_j\|_1} \qquad (13)$$

where ‖x_i − x_j‖_1 is the L1 norm and γ = 2^{±p} is an arbitrary parameter with p = 0, 1, 2, ... [9]. The kernel can be implemented, in fixed-point arithmetic, using only shift and add operations through a coordinate rotation digital computer (CORDIC)-like algorithm. In order to adapt this calculation to any number of elements, the following procedure is adopted. First, the L1 norm is calculated as follows:

$$\|\vec{x}_i - \vec{x}_j\|_1 = \sum_{s=1}^{d} |x_{is} - x_{js}| \qquad (14)$$

where d is the number of elements of the input vectors. Then the result is shifted by γ, and the CORDIC-like algorithm is performed.
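A floating-point reference of the hardware-friendly kernel, Eqs. (13)-(14), is given below. The power of two is evaluated directly here; in the architecture the same value is produced with shift-and-add operations through the CORDIC-like scheme of [9]. The parameter p and the sample values are illustrative.

```python
import numpy as np

def hw_friendly_kernel(xi, xj, p=0):
    gamma = 2.0 ** p    # gamma = 2^(+/-p), so scaling by gamma is just a shift in hardware
    l1 = float(np.sum(np.abs(np.asarray(xi, float) - np.asarray(xj, float))))   # Eq. (14)
    return 2.0 ** (-gamma * l1)                                                 # Eq. (13)

print(hw_friendly_kernel([0.2, 0.5, 0.1], [0.3, 0.4, 0.0], p=1))
```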

Although the architecture is able to support a wide range of numbers of input samples and elements, it is limited by the internal memory of the targeted FPGA. The quantity of memory required for storing all data is: m bits for y and mw bits for both α and F, where w is the word size in bits. For the input vectors (x_i), the memory consumption is d·m·(w − 6) bits, since the input data is normalized and does not use the first six bits of the fixed-point word. Therefore, the total memory required in bits can be expressed as:

$$rq = m\,(2w + d\,(w - 6)) \qquad (15)$$

Taking, for instance, some benchmarks from the University of California-Irvine Machine Learning Repository [11], the required storage for training sets can reach megabits. For example, the Adult benchmark is composed of 1605 samples, each one with 123 elements. In that case, adopting a 24-bit word representation, the required memory would be 4.08 Mb. Since such large memory blocks are not available in FPGAs, this issue must be tackled by including external RAM blocks, with a fast interface for the dynamic access required by the SMO algorithm. Since we have targeted the Xilinx Virtex-IV XC4VLX25 FPGA in this work, for this first version of the architecture we have taken the simplifying assumption that the input training set data is limited to 1.29 Mb.
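Eq. (15) is easy to evaluate for a given benchmark; the snippet below does so for the Tic Tac Toe dimensions of Table 1 with the 24-bit word adopted in Section 4, giving roughly 0.51 Mb, which fits within the 1.29 Mb assumption. The function name is illustrative.

```python
def required_memory_bits(m, d, w):
    # Eq. (15): rq = m * (2w + d * (w - 6)) bits
    return m * (2 * w + d * (w - 6))

rq = required_memory_bits(m=958, d=27, w=24)   # Tic Tac Toe, 24-bit words
print(rq, "bits =", rq / 1e6, "Mb")
```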


4. FIXED-POINT REPRESENTATION SIZE

One of the main concerns of an SVM implementation in hardware is the type and size of the numeric representation, which may affect the precision of the computation. In general-purpose computing, variables are stored with 32 or 64 bits in floating-point precision, which is enough to avoid most quantization problems. Dedicated hardware designs, on the other hand, make use of fixed-point registers with limited precision. For this implementation, where the error is propagated m times per iteration, it is advisable to analyze the quantization effects. The technical literature shows only a few works on this problem related to SVMs [4, 14], but they are not applicable to this case. In this paper, a short study, based on experimental data, was carried out in order to obtain a satisfactory fixed-point implementation. Several implementations were tested under a [6.k] fixed-point representation, with 6+k bits. Only 6 bits were assigned to the integer part, since the input data is normalized and belongs to the [0,1] interval, as recommended by [15]. Moreover, with the usual values of the user-defined parameters C and γ, the operations will hardly reach numbers that exceed the representation limits.

For testing the different cases, we fixed the tolerance parameter ε at the widely used value of 0.001 [10]. Table 1 shows the results of the SMO algorithm applied to three benchmarks (column 1), whose numbers of samples and elements are given in column 2. Three hardware fixed-point representations were tested, with the computation of the testing (classification) error, in percentage, and the number of iterations: the [6.14] representation in columns 3 and 4, [6.18] in columns 5 and 6, and [6.26] in columns 7 and 8. A software implementation of the SMO algorithm in Matlab was also simulated, and its results are in columns 9 and 10. Finally, columns 11 and 12 present the best values of the parameters C and γ (for the software implementation). For measuring error and iteration values, we used a ten-fold cross-validation method, where ten folds are stratified from the original set, nine of them are used for training and one for testing; this process is repeated ten times so that every single fold is tested, and the average numbers are reported.

It can be seen in Table 1 that, for the 20-bit codification, the training algorithm could not converge in all cases, as for the Tic-Tac-Toe example. This can be explained by the great number of input samples and iterations, which amplifies the error. Actually, for any k, 0 < k < 14, the tested benchmarks did not converge for the given ε. Results also show that the classification error rate is not affected by the data representation precision of the hardware implementations, even when compared to the software floating-point representation. It can also be observed that, as the representation precision increases, the number of iterations decreases, but this process stabilizes for codifications of 24 bits and larger. Therefore, the solution with the smallest number of bits that maintains the quality of results is the 24-bit codification.

Table 1. Testing error and number of iterations.

Benchmarks      m/d       20-bit Hardware     24-bit Hardware     32-bit Hardware     Software            Parameters
                          T. Error  N. Iter.  T. Error  N. Iter.  T. Error  N. Iter.  T. Error  N. Iter.  C      γ
Dermatology     358/132   2.86%     393       2.86%     379       2.86%     381       2.86%     340       10     0.125
Breast Cancer   569/30    0.86%     515       0.86%     472       0.86%     472       0.86%     466       1      1
Tic Tac Toe     958/27    N/C       N/C       5.68%     2572      5.68%     2568      5.68%     2481      3      0.5
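The effect of the [6.k] format can be reproduced in software with a simple quantizer such as the sketch below; how the 6 integer bits are split between sign and magnitude is not specified in the paper, so the signed layout used here is an assumption made only for illustration.

```python
import numpy as np

def quantize_6k(x, k=18):
    # Round to a [6.k] fixed-point grid: 6 integer bits (one assumed to be the sign)
    # and k fractional bits; k = 18 gives the 24-bit codification adopted in the paper.
    scale = 2.0 ** k
    lo = -(2.0 ** 5)
    hi = (2.0 ** 5) - 1.0 / scale
    return np.clip(np.round(np.asarray(x, float) * scale) / scale, lo, hi)

x = np.random.rand(1000)        # normalized inputs in [0, 1]
for k in (14, 18, 26):
    err = np.max(np.abs(x - quantize_6k(x, k)))
    print(f"[6.{k}] max quantization error = {err:.2e}")
```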

Table 2. Number of slices and relative area occupation.

           Sel.    Opt.    Upd.    Kernel   Others
#Slices    103     322     196     402      416
% Total    7.16    22.38   13.62   27.94    28.91

Fig. 2. Reconfiguration schedule.

Fig. 3. Partitions allocation.

5. RECONFIGURABLE ARCHITECTURE

For the definition of the reconfigurable architecture, values of area consumption for each block were obtained by synthesis and are presented in Table 2. It can be observed that the kernel and the optimization blocks occupy most of the FPGA area. Since there is no direct dependence between them and they are not fired at the same time, it is possible to interchange these two blocks by dynamic reconfiguration, reaching an area saving of 22.38%. Based on these facts, the reconfiguration scheduling illustrated in Fig. 2 is proposed. The figure illustrates a complete iteration execution. In t1, the selection task is performed while the bitstream of the optimizer block is loaded into the FPGA. In t2, the optimization task is executed, while in t3 the FPGA is once more reconfigured to load the kernel block, which will perform the update task together with the updater block in t4, as described in Fig. 1. The partitions corresponding to this time split are shown in Fig. 3.

This reconfiguration plan affects the execution time to some extent. In order to analyze this effect, let us consider the iteration execution time of the architecture without dynamic reconfiguration:

$$it = \frac{m\,(9d + 18) + 17}{f} \qquad (16)$$

where it is the iteration execution time and f is the operation frequency. Consider now the execution time of an implementation using dynamic reconfiguration as described in Fig. 2, which is given by:

$$rit = \frac{m\,(9d + 13) + 17}{f} + 2\,rt \qquad (17)$$

with

$$rt = \frac{Bitstream}{\#Bytes_{recp} \cdot rf_{recp}} \qquad (18)$$

where rit is the iteration execution time with dynamic reconfiguration, rt is the reconfiguration time, Bitstream is the size in bytes of the partial reconfiguration bitstream, #Bytes_recp is the number of bytes of the reconfiguration port, and rf_recp is the frequency of the reconfiguration port. The first term of equation (17) is reduced by 5m with respect to (16) because the selection task is executed in parallel with the reconfiguration process.

The time penalty, i.e., the extra time imposed by a reconfigurable implementation, can be easily calculated from the ratio rit/it. Fig. 4 shows the time penalty as a function of rt for the three benchmarks; a dashed line was placed at 274 µs, which represents the reconfiguration time for our target device, the Xilinx Virtex-IV XC4VLX25 FPGA. This time was estimated considering f = 50 MHz, rf_recp = 100 MHz, a 4-byte reconfiguration port, and a bitstream size calculated from the information given in the datasheet. It can be observed that the time penalty varies according to the benchmark: 16.5%, 6.5%, and 10% for the Breast Cancer, Dermatology, and Tic Tac Toe benchmarks, respectively, showing that the larger the associated computation time, the smaller the penalty. As stated before, the synthesized circuit indicates a 22.38% saving of FPGA area compared to a static solution, without dynamic reconfiguration.
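Eqs. (16)-(18) can be checked with a few lines of Python. The sketch below uses the operating conditions quoted above (f = 50 MHz, rt ≈ 274 µs, obtained from Eq. (18) with a 4-byte port at 100 MHz) and the Breast Cancer dimensions from Table 1, and reads the penalty as the relative extra time (rit − it)/it; this interpretation of the percentages, as well as the variable names, is an assumption of this sketch.

```python
def iteration_time(m, d, f):
    # Eq. (16): static architecture
    return (m * (9 * d + 18) + 17) / f

def reconf_iteration_time(m, d, f, rt):
    # Eq. (17): with dynamic reconfiguration (5m cycles saved, two reconfigurations added)
    return (m * (9 * d + 13) + 17) / f + 2 * rt

f = 50e6
rt = 274e-6                                     # reconfiguration time assumed for the target device
it = iteration_time(m=569, d=30, f=f)           # Breast Cancer (m/d from Table 1)
rit = reconf_iteration_time(m=569, d=30, f=f, rt=rt)
print(f"it = {it*1e3:.3f} ms, rit = {rit*1e3:.3f} ms, penalty = {(rit - it)/it:.1%}")
```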

Fig. 4. Time penalty for 0 ≤ rt ≤ 502 µs.

6. EXPERIMENTAL RESULTS

For validating the architecture, a simulation using a dynamic circuit switching (DCS) technique [16] was carried out. This technique employs isolation switches between the reconfigurable modules and the static modules, allowing one reconfigurable module to be active while the other one is inactive; the technique also simulates the reconfiguration times in between. For that, a control unit must be developed to activate the modules at the right time and manage the reconfiguration delay. The control unit was described in VHDL, based on the reconfiguration schedule shown in Fig. 2. The state machine of the control unit is illustrated in Fig. 5. The enable signals were used as information for knowing when a task starts and, consequently, when another task has finished. States 1 and 4 define the behavior of the isolation switches: in the first state, the control unit activates the optimizer block and deactivates the kernel block after the reconfiguration delay (transition from state 1 to state 2), whereas in the fourth state the optimizer is deactivated and the kernel is activated after waiting for the reconfiguration delay. The first state also executes the selection task, state 3 executes the optimization task, and state 5 executes the update task. Table 3 shows the total training time for software, for hardware (DCS simulation), and the acceleration obtained with this architecture. It can be observed that, when compared to a purely software solution, the hardware solution is at least 12.53 times faster.

Fig. 5. Unit control's state machine.

Table 3. Total training time.

            Breast Cancer   Dermatology   Tic-Tac-Toe
Software    10.44s          13.10s        109.2s
Hardware    0.348s          1.045s        3.761s
Accel.      30              12.53         29.03
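The behaviour of the control unit's schedule (Fig. 2, Fig. 5) can be summarized in a few lines; the sketch below is a purely illustrative Python rendering of the five states described above, not the authors' VHDL, and the state actions are paraphrased from the text.

```python
def control_schedule():
    # One SMO iteration under the DCS scheme; states follow the description of Fig. 5.
    yield (1, "selection task; load optimizer bitstream (kernel region isolated)")
    yield (2, "reconfiguration delay elapsed; optimizer block enabled")
    yield (3, "optimization task")
    yield (4, "optimizer disabled; load kernel bitstream; wait for reconfiguration delay")
    yield (5, "kernel enabled; update task together with the updater block")

for state, action in control_schedule():
    print(f"state {state}: {action}")
```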

7. CONCLUSION

Experimental results show that 24 bits, in the [6.18] format, are enough for coding the input data, even for benchmarks with a great number of operations and iterations. The reconfigurable architecture allowed an area saving of 22.38% at the cost of an acceptable time penalty. In order to improve the architecture and make it even more general, an external memory module may be added and a smart interface designed in order to support the dynamic access required by the SMO algorithm.

8. REFERENCES

[1] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.

[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[3] L. Bottou and C. J. Lin, "Support vector machine solvers," in Large Scale Kernel Machines, MIT Press, Cambridge, MA, pp. 1-27, 2007.

[4] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning, 1st ed., B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, pp. 185-208, 1999.

[5] D. Anguita, A. Boni, and S. Ridella, "A digital architecture for support vector machines: theory, algorithm, and FPGA implementation," IEEE Transactions on Neural Networks, vol. 14, no. 5, pp. 993-1009, 2003.

[6] A. Ghio and S. Pischiutta, "A support vector machine based pedestrian recognition system on resource-limited hardware architectures," in Research in Microelectronics and Electronics, IEEE Conference, pp. 161-163, 2007.

[7] R. Pedersen and M. Schoeberl, "An embedded support vector machine," in Intelligent Solutions in Embedded Systems, International Workshop on, Vienna, Austria, pp. 1-11, 2006.

[8] X. Zhang and K. Ng, "A review of high-level synthesis for dynamically reconfigurable FPGAs," Microprocessors and Microsystems, vol. 24, pp. 199-211, 2000.

[9] D. Anguita, S. Pischiutta, S. Ridella, and D. Sterpi, "Feed-forward support vector machine without multipliers," IEEE Transactions on Neural Networks, vol. 17, pp. 1328-1331, 2006.

[10] R. A. Hernandez, M. Strum, Wang Jiang Chau, and J. A. Q. Gonzalez, "A VLSI implementation of the SVM training," in XV Workshop Iberchip, Buenos Aires, Argentina, pp. 204-209, 2009.

[11] A. Asuncion and D. J. Newman, "UCI Machine Learning Repository," available from: http://www.ics.uci.edu/~mlearn/MLRepository.html, last access: November 2009.

[12] J. A. Q. Gonzalez and J. C. Wang, "Circuit partitioning and HDL restructuring for behavioral simulation of dynamically reconfigurable circuit partitions: a case study," in 2nd International Conference on Electronic Design, Vera Cruz, Mexico, pp. 6-11, 2006.

[13] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, pp. 637-649, 2001.

[14] D. Anguita and G. Bozza, "The effect of quantization on support vector machines with Gaussian kernel," in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), vol. 2, pp. 681-684, 2005.

[15] C. W. Hsu, C. C. Chang, and C. J. Lin, "A practical guide to support vector classification," Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003.

[16] P. Lysaght and J. Stockwood, "A simulation tool for dynamically reconfigurable field programmable gate arrays," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 3, pp. 381-390, 1996.
