Special-purpose computer for holography HORN-2


Computer Physics Communications 93 (1996) 13-20

Tomoyoshi Ito a, Hesham Eldeib a, Kenji Yoshida a, Shinya Takahashi a, Takashi Yabe b, Tomoaki Kunugi c

a Department of Electronic Engineering, Gunma University, Kiryu, Gunma 376, Japan
b Department of Energy Sciences, Tokyo Institute of Technology, Nagatsuda, Yokohama 227, Japan
c Japan Atomic Energy Research Institute, Toukaimura, Ibaraki 319-11, Japan

Received 8 August 1995; revised 19 September 1995

Abstract

We designed and built a special-purpose computer for holography, HORN-2 (HOlographic ReconstructioN). HORN-2 calculates light intensity at a speed of 0.3 Gflops per board with single (32-bit floating point) precision. The cost of one board is 500000 Japanese yen (5000 US dollars). We made three boards. Operating them in parallel, we get about 1 Gflops.

Keywords: Hardware; Hologram; Holography; Special-purpose computer

0010-4655/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0010-4655(95)00125-5

1. Introduction

The study of holography [1] by computer, such as the computer-generated hologram (CGH) [2], has become an active area owing to advances in electronic technology. Simulations of holography, however, require so much computational power that they are not yet of practical use. When a hologram of $L$ grid points is generated in a computer from a virtual image of $M$ points, the calculation cost is proportional to $LM$. For $L = 1000 \times 1000$ and $M = 100 \times 100 \times 100$, for example, it is $O(10^{12})$. It takes several tens of days to make such a hologram on a workstation. Reconstructing an object image from a hologram in a computer poses the same difficulty.

In simulations of holography, the calculation of light intensity is the heaviest part and accounts for almost all of the calculation cost. Therefore, if we accelerate only that part, we accelerate the total calculation. HORN-2, which we designed and built, is a special-purpose computer that calculates light intensity at high speed for simulations of holography.

Fig. 1. Basic structure of the HORN-2 system (the host computer is connected to HORN-2; diagram labels: $A_j$, $R_j$, $R_\alpha$).

The calculation of light intensity is a simple arithmetic operation,

$$I_\alpha = \sum_{j=1}^{N} \frac{A_j}{R_{\alpha j}} \exp(i k R_{\alpha j}), \qquad (1)$$

or, in computer holography, we often use only the real part of Eq. (1) [3,4],

$$I_\alpha = \sum_{j=1}^{N} \frac{A_j}{R_{\alpha j}} \cos(k R_{\alpha j}). \qquad (2)$$

Here, for generating a hologram, the indices $\alpha$ and $j$ denote the hologram and the object, $A_j$ is the intensity of the object, $R_{\alpha j}$ is the distance between the hologram and the object, and $k$ is the wave number of the reference light. For reconstructing an object image, the indices $\alpha$ and $j$ denote the reconstruction area and the hologram, and $A_j$ is the intensity on the hologram.

Fig. 1 shows the basic structure of the HORN-2 system. The special-purpose computer, HORN-2, calculates only Eq. (2). When we need the imaginary part of Eq. (1), we calculate it by shifting the phase by $\pi/2$ in the cosine function. The other tasks, such as setting data and graphics, are handled by a general-purpose computer, such as a workstation or a personal computer, which is connected to HORN-2. Their calculation cost is negligible compared with that of Eq. (2) because it is proportional only to $L + M$.

HORN-2 is the second machine of our HORN hardware. The first machine, HORN-1 [4,5], was constructed in March 1993. HORN-1 had some restrictions because we kept the hardware very simple so that it could be developed easily. Firstly, HORN-1 can deal only with a fixed-size hologram of 400 x 400 grids. Secondly, in the HORN-1 system, the $A_j$ data can be changed only by exchanging ROM (Read Only Memory) chips; that is, we cannot control them from the host computer by software.

HORN-2 improves on HORN-1. HORN-2 can deal with a hologram of any size. All data can be controlled from the host computer by software. Since we control HORN-2 through the host computer, we can use it as a subroutine, just like a hardware subroutine. In the HORN-2 system, we adopted a personal computer as the host computer and GPIB (General Purpose Interface Bus) as the interface.

The number of operations in each term of the summation of Eq. (2) is about 30. Since HORN-2 calculates them in a pipeline at 10 MHz, it has a speed of 0.3 Gflops (30 operations x 10 MHz). We made three HORN-2 boards. Operating them in parallel, we get about 1 Gflops. HORN-2 works in single (32-bit floating point) precision, which is ordinarily used on general-purpose computers.

The HORN system can easily be used in parallel because the cost of communication is small compared with that of calculation. The communication cost increases only as $L + M$, while the calculation cost increases as $LM$. We are planning to develop a highly parallel HORN system for 3D television in cooperation with the Telecommunications Advancement Organization of Japan (TAO), which was founded by the Ministry of Posts and Telecommunications of Japan.

In Section 2, we describe the hardware design of HORN-2. In Section 3, we show the performance. In Section 4, we discuss the development of the parallel HORN system, which is our next project.

Fig. 2. Block diagram of the HORN-2 pipeline.

2. Hardware design

HORN-2 was designed with a pipeline architecture to accelerate the calculation of Eq. (2). The pipeline is completely controlled by the host computer. We operate the HORN-2 system as follows.
(1) The host sets the parameters and data, i.e., $N$, $k$, $A_j$ and $R_j = (x_j, y_j, z_j)$, before the pipeline operates.
(2) The host sets $R_\alpha = (x_\alpha, y_\alpha, z_\alpha)$ in registers on HORN-2.
(3) When the pipeline receives $R_\alpha$, it automatically starts to operate. As soon as the operation ends, the pipeline returns the result, $I_\alpha = \sum_{j=1}^{N} (A_j/R_{\alpha j}) \cos(k R_{\alpha j})$, to the host.
We repeat (2) and (3) for all $\alpha$. On the host side, we can handle HORN-2 as a subroutine for Eq. (2). Therefore, we can adapt HORN-2, with little change, to software that we have used or are using.
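The host-side procedure above can be sketched in software. The following Python fragment is only an illustration, not the authors' host program: `horn2_subroutine` is a hypothetical software stand-in for the board that evaluates Eq. (2) directly, and `make_hologram` is a hypothetical name for the loop over all $\alpha$.

```python
import math

def horn2_subroutine(r_alpha, objects, k):
    """Software stand-in (assumption) for one HORN-2 call: returns I_alpha of Eq. (2)."""
    xa, ya, za = r_alpha
    total = 0.0
    for xj, yj, zj, aj in objects:
        # Distance R_alpha_j between the point alpha and the point j
        r = math.sqrt((xj - xa) ** 2 + (yj - ya) ** 2 + (zj - za) ** 2)
        total += aj / r * math.cos(k * r)
    return total

def make_hologram(grid_points, objects, k):
    """Step (1) fixes `objects` and `k`; steps (2) and (3) are repeated for every alpha."""
    return [horn2_subroutine(r_alpha, objects, k) for r_alpha in grid_points]

# Illustrative data: one object point (x, y, z, A) and two hologram grid points.
objects = [(0.0, 0.0, 1.0, 1.0)]
grid = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
k = 2.0 * math.pi
hologram = make_hologram(grid, objects, k)
```

Because the board is called once per $\alpha$, existing simulation code only needs its inner summation replaced by the call, which is the sense in which HORN-2 acts as a subroutine.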

2.1. Pipeline

Fig. 2 is the block diagram of the HORN-2 pipeline. It is divided into eight steps as follows.
(1) Subtraction: $D_x = x_j - x_\alpha$ and $D_y = y_j - y_\alpha$,
(2) Square: $D_x^2 = D_x \times D_x$ and $D_y^2 = D_y \times D_y$,
(3) Addition: $D_{xy}^2 = D_x^2 + D_y^2$,
(4) Addition: $R^2 = D_{xy}^2 + (\Delta z)^2$,
(5a) Conversion by interpolation table: $\theta = kR/2\pi - \mathrm{int}[kR/2\pi]$,
(5b) Conversion by table: $R^{-1} = 1/\sqrt{R^2}$,
(6a) Conversion by table: $U = \cos(2\pi\theta)$,
(6b) Multiplication: $V = A_j \times R^{-1}$,
(7) Multiplication: $W = U \times V$,
(8) Accumulation: $I_\alpha = \sum_{j=1}^{N} W_j$.

Fig. 3. Interpolation circuit in the pipeline.

Here, in steps (1), (2), (5) and (6), two calculations are executed in parallel. In step (4), we directly use $(\Delta z)^2 = (z_j - z_\alpha)^2$ as input data. This is because a hologram is two-dimensional: $z_\alpha = 0$ in the mode of generating a hologram, and $z_j = 0$ in the mode of reconstructing a 3D image. In the HORN-2 system, therefore, the host directly sends $(\Delta z)^2$ to the pipeline instead of $z_j$ and $z_\alpha$.

For the result $I_\alpha$, a 32-bit data width gives sufficient resolution for us to recognize the 3D structure. Even 16-bit resolution ($2^{16} = 65536$) is enough, so we set the outputs of $A_j$ and $R^{-1}$ to 16 bits. We assigned 32 bits to the data width of $I_\alpha$ only because we used a 32-bit floating point processor chip for that operation. In contrast, we need to keep a high precision for the calculation of $R$, since it determines the range of the calculation area.

Fig. 3 is the block diagram of step (5a), where $\theta$ is obtained from $R^2$ by linear interpolation. Here,

$$\xi = R^2, \qquad (3)$$

$$f(\xi) = \frac{k}{2\pi} \xi^{1/2}. \qquad (4)$$

We divide the input data $\xi$ into $\xi_0$ (upper part) and $\Delta\xi$ (lower part), and then expand $f(\xi)$ as follows:

$$f(\xi) = f(\xi_0) + \left.\frac{df}{d\xi}\right|_{\xi_0} \Delta\xi + \frac{1}{2} \left.\frac{d^2 f}{d\xi^2}\right|_{\xi_0} \Delta\xi^2 + \cdots \qquad (5)$$

$$= \frac{k}{2\pi} \left[ \xi_0^{1/2} + \frac{1}{2} \xi_0^{-1/2} \Delta\xi - \frac{1}{8} \xi_0^{-3/2} \Delta\xi^2 + \cdots \right], \qquad (6)$$

$$\xi = \xi_0 + \Delta\xi. \qquad (7)$$

In the IEEE format which we use, single precision data consist of three parts, i.e., a 1-bit sign, an 8-bit exponent and a 23-bit mantissa. Since $\xi > 0$, we do not use the sign bit. For the exponent, we use only 5 bits, which can express a range of 32 bits ($2^{32}$). This is sufficient for our simulations because the 23-bit precision of the mantissa restricts the dynamic range to 23 bits ($2^{23}$). In the HORN-2 system, we normalize the variables so that $R^2$ lies in $(2^0, 2^{23})$. Therefore, we used 28 bits as the input $\xi$. We assigned the lower 10 bits to $\Delta\xi$, that is, $\Delta\xi \lesssim 2^{-13} \xi_0$. In this case, the third term in Eq. (6) is negligible because the ratio of the third term to the first term is about $2^{-29}$. We can, therefore, calculate $f(\xi)$ by the linear interpolation of Eq. (6).

We assigned the upper 18 bits to $\xi_0$. We obtain $f(\xi_0)$ and $(df/d\xi)|_{\xi_0}$ by looking them up in tables stored in ROM chips. The parameter $k$ (a constant) is built into the tables. When we use another $k$, we rewrite the ROM chips. The symbol int[ ] in step (5a) and in Fig. 3 means the integer part of [ ]. The cosine function is periodic with period $2\pi$, so we can normalize $kR$ by $2\pi$ and subtract the integer part from it in order to reduce the data bus width.
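The accuracy of the linear interpolation can be checked numerically. The sketch below is an illustration under stated assumptions, not a model of the actual ROM contents: it emulates the split into an upper and a lower part by keeping 13 fractional mantissa bits in $\xi_0$ (assuming the upper 18 bits are 5 exponent bits plus 13 mantissa bits), and it evaluates $f(\xi_0)$ and its derivative in floating point instead of reading tables. All function names are hypothetical.

```python
import math

UPPER_MANTISSA_BITS = 13  # assumed: 18 upper bits = 5 exponent bits + 13 mantissa bits

def split(xi):
    """Split xi > 0 into xi0 (leading mantissa bits) and the remainder d_xi."""
    m, e = math.frexp(xi)                      # xi = m * 2**e with 0.5 <= m < 1
    scale = 2 ** (UPPER_MANTISSA_BITS + 1)     # +1: frexp's mantissa has no leading 1.x
    m0 = math.floor(m * scale) / scale
    xi0 = math.ldexp(m0, e)
    return xi0, xi - xi0

def f(xi, k):
    return k / (2.0 * math.pi) * math.sqrt(xi)         # Eq. (4)

def df(xi, k):
    return k / (4.0 * math.pi) / math.sqrt(xi)         # derivative of Eq. (4)

def theta_interp(xi, k):
    """Linear interpolation: first two terms of the expansion of f about xi0."""
    xi0, d_xi = split(xi)
    base = f(xi0, k)
    base -= math.floor(base)                   # table 1 holds f(xi0) - int[f(xi0)]
    return base + df(xi0, k) * d_xi            # table 2 holds df/dxi at xi0

xi, k = 12345.678, 20.0 * math.pi
exact = math.cos(k * math.sqrt(xi))
approx = math.cos(2.0 * math.pi * theta_interp(xi, k))
```

For inputs like these, `approx` and `exact` differ only by the neglected second-order term, which is bounded by roughly $2^{-29}$ of $f(\xi_0)$.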

2.2. Interface

We used GPIB (General Purpose Interface Bus) for the communication between HORN-2 and the host computer. The communication speed of GPIB is about 100 kB/s in normal usage. This is not very fast, but it is sufficient for the HORN-2 system, as shown in Section 3. The host computer sends each datum in the form of an 8-bit address plus 32-bit data. Here, the 8-bit address indicates the memory or the register where the data are stored. Since GPIB can transfer only 8 bits at a time, the host sends each 40-bit word as a sequence of 5 bytes. HORN-2 returns the 32-bit accumulated intensity $I_\alpha$ when the operation ends. The host computer receives it as a sequence of 4 bytes.
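The 5-byte and 4-byte framing can be illustrated with Python's standard `struct` module. This is a sketch only: the text does not specify the byte order or whether the address comes first, so the big-endian, address-first layout below is an assumption, as are the function names and the treatment of every 32-bit payload as an IEEE single-precision float.

```python
import struct

def pack_word(address, value):
    """Pack an 8-bit address and a 32-bit float into a 5-byte sequence (assumed layout)."""
    if not 0 <= address <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    # ">" selects big-endian with no padding: 1 byte of address + 4 bytes of data
    return struct.pack(">Bf", address, value)

def unpack_result(payload):
    """Decode a 4-byte accumulated intensity (assumed big-endian IEEE single)."""
    (value,) = struct.unpack(">f", payload)
    return value

message = pack_word(0x12, 1.0)   # 5 bytes sent from host to board
```

With this framing, each of $x_j$, $y_j$, $z_j$ and $A_j$ costs 5 bytes on the bus and each returned intensity costs 4 bytes, which is the byte count used in the performance estimate of Section 3.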

2.3. Packaging

We packaged HORN-2 on a universal board whose size is 40 cm by 37 cm. Fig. 4 is a top view of HORN-2. The total number of chips is 76. In the pipeline, we used L64134 chips made by Logic Devices Co. for the 32-bit floating point (single precision) processor unit, HM628128 RAM (Random Access Memory) chips made by Hitachi for the memories, and HN27C4096 and HN27C1024 ROM chips made by Hitachi for the tables. Since the L64134 chip has both an ALU (Arithmetic Logic Unit) and a multiplier, it can execute addition, subtraction, multiplication and sum of products. We also used a TMS9914A chip made by Texas Instruments Co. as the GPIB adapter for the interface, and a MACH435 PLD (Programmable Logic Device) chip made by AMD Co. for the control circuit. All chips are mounted on the wire-wrapping board.

The cost of the electronic parts used in HORN-2 is about 500000 Japanese yen (US $5000). We started designing HORN-2 in May 1993 and finished constructing it in March 1994. After that, we made two copies of the original HORN-2.

3. Performance

The calculation of $(A_j/R_{\alpha j})\cos(kR_{\alpha j})$ is equivalent to about 30 floating point operations on a general-purpose computer. Therefore, one HORN-2 board running at 10 MHz has a peak performance of about 0.3 Gflops. At present, we have three HORN-2 machines. The parallel system with three HORN-2 boards has a peak performance of about 1 Gflops. The effective performance is determined by the ratio of the communication cost to the calculation cost.

We can use HORN-2 both for generating a hologram from a virtual 3D object and for reconstructing a 3D image from a hologram. The communication costs of these two cases differ, since we need not send $z_\alpha$ in the former case or $z_j$ in the latter case from the host computer to HORN-2. However, the difference is small. In this section, we discuss the performance of HORN-2 assuming that the communication cost contains both $z_j$ and $z_\alpha$.

Here, in Eq. (2), we let the total number of $j$ be $L$ and the total number of $\alpha$ be $M$. In the case of generating a hologram, $L$ is the number of points of the original object and $M$ is the number of grids of the hologram. In the case of reconstructing a 3D image, $L$ is the number of grids of the hologram and $M$ is the number of points of the reconstructed area.

Fig. 4. Top view of HORN-2.

As described in Section 2.2, the input data are sent to HORN-2 as sequences of 5 bytes. The output data are returned to the host as sequences of 4 bytes. The total amount of input data, $x_j$, $y_j$, $z_j$, $A_j$, $x_\alpha$, $y_\alpha$ and $z_\alpha$, is $(20L + 15M)$ bytes. The total amount of output data, $I_\alpha$, is $4M$ bytes. The communication time is

$$T_{\rm comm} = (20L + 19M) t_{\rm comm}, \qquad (8)$$

where $t_{\rm comm}$ is the time necessary to transfer 1 byte of data between the host and HORN-2. The calculation time by HORN-2 is

$$T_{\rm horn} = L M t_{\rm horn}, \qquad (9)$$

where $t_{\rm horn}$ is the clock period of HORN-2. Since $t_{\rm horn} = 10^{-7}$ sec (10 MHz) and $t_{\rm comm} = 10^{-6}$ sec (100 kB/s) in the HORN-2 system, the ratio of $T_{\rm comm}$ to $T_{\rm horn}$ is

$$\frac{T_{\rm comm}}{T_{\rm horn}} = \frac{200}{M} + \frac{190}{L}. \qquad (10)$$

For $L, M > 2000$, $T_{\rm comm}$ is less than 10 % of $T_{\rm horn}$ and the effective performance is more than 90 % of the peak performance of 0.3 Gflops. There is no communication bottleneck, since $L$ and $M$ are usually more than 2000.

Today's computer display has a resolution of about 1000 x 1000. As an example, we consider a system which has a hologram of 1000 x 1000 grids and an object (or a reconstructed area) of 100 x 100 x 100 points. The total calculation time for generating one hologram (or for reconstructing one image) is

$$T_{\rm total} = T_{\rm horn} + T_{\rm comm} \qquad (11)$$

$$\simeq T_{\rm horn} \simeq 10^6 \times 10^6 \times 10^{-7} = 10^5 \ {\rm (sec)}, \qquad (12)$$

that is, about 1 day.

Next, we also discuss the performance of the parallel system with $n$ boards of HORN-2, since we now have three boards. The calculation time by the system with $n$ boards is

$$T_{\rm horn} = \frac{1}{n} L M t_{\rm horn}. \qquad (13)$$

The calculation speed is, of course, $n$ times faster than that of the one-board system. On the other hand, the communication time is the same as that of the one-board system:

$$T_{\rm comm} = (20L + 19M) t_{\rm comm}. \qquad (14)$$

The input data concerning $j$, namely $x_j$, $y_j$, $z_j$ and $A_j$, can be sent to all HORN-2 boards simultaneously. The data concerning $\alpha$, namely $x_\alpha$, $y_\alpha$, $z_\alpha$ and $I_\alpha$, can be divided into $n$ parts, with $M/n$ points assigned to each HORN-2. As a result, the communication cost of the $n$-board system does not increase compared with that of the one-board system in spite of the parallel processing. This is the great merit of the HORN system, and it is why we can easily use it in parallel. In other words, computer holography is well suited to a special-purpose hardware system. The ratio of $T_{\rm comm}$ to $T_{\rm horn}$ is

$$\frac{T_{\rm comm}}{T_{\rm horn}} = n \left( \frac{200}{M} + \frac{190}{L} \right), \qquad (15)$$

where we substitute $t_{\rm horn} = 10^{-7}$ sec and $t_{\rm comm} = 10^{-6}$ sec. The ratio is $n$ times larger than that of the one-board system. In the case of $n = 3$, which is our real system, for $L, M > 5000$, $T_{\rm comm}$ is less than about 10 % of $T_{\rm horn}$ and the effective performance is more than 90 % of the peak performance of about 1 Gflops.
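The timing model of this section is simple enough to evaluate directly. The following Python sketch restates the communication time, the calculation time and their ratio for $n$ boards using the constants quoted in the text; the function and constant names are hypothetical, and the model ignores any overhead not captured by the formulas.

```python
T_HORN = 1e-7   # clock period of HORN-2 in seconds (10 MHz)
T_COMM = 1e-6   # time to transfer one byte over GPIB in seconds (100 kB/s)

def comm_time(L, M, t_comm=T_COMM):
    """Communication time: 20 bytes per j, 15 + 4 bytes per alpha."""
    return (20 * L + 19 * M) * t_comm

def calc_time(L, M, n=1, t_horn=T_HORN):
    """Calculation time on n boards: one term of Eq. (2) per clock per board."""
    return L * M * t_horn / n

def comm_ratio(L, M, n=1):
    """Ratio of communication time to calculation time for n boards."""
    return comm_time(L, M) / calc_time(L, M, n)

# Example from the text: 1000 x 1000 hologram and 100 x 100 x 100 object points.
single_board_seconds = calc_time(10**6, 10**6)
three_board_ratio = comm_ratio(10**6, 10**6, n=3)
```

For the example with $L = M = 10^6$, the model gives about 39 seconds of communication against about $10^5$ seconds of calculation on one board, so the communication share stays far below one percent even with three boards.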

4. Discussion

It is easy to calculate Eq. (2) in parallel. For a highly parallel HORN system, it is necessary to implement the HORN pipeline in a custom LSI. We are designing such a custom LSI, the HORN chip, with TAO (Telecommunications Advancement Organization of Japan), which was established in 1992 by the Ministry of Posts and Telecommunications of Japan in order to study and develop a 3D TV system.

In a 3D TV system, we will use computer-generated holograms. Firstly, we make holograms from 3D virtual objects in a computer system. Next, we reconstruct the 3D images from the holograms by an optical system. We adopt the HORN system as the processor unit for making holograms. Using the HORN chips, we will construct a highly parallel HORN system, HORN-3. Here, we briefly discuss the performance of HORN-3 with $p$ HORN chips. As discussed in Section 3, the calculation time is

$$T_{\rm horn3} = \frac{1}{p} L M t_{\rm horn3}, \qquad (16)$$

and the communication time is

$$T_{\rm comm3} = (20L + 19M) t_{\rm comm3}, \qquad (17)$$

where $t_{\rm horn3}$ is the clock period of HORN-3 and $t_{\rm comm3}$ is the time necessary to transfer 1 byte of data between the host and HORN-3. We are planning to set $t_{\rm horn3} = 3 \times 10^{-8}$ sec (33 MHz) and $t_{\rm comm3} = 10^{-7}$ sec (10 MB/s), which is easily feasible with today's technology. For example, today's CPUs operate at 100 MHz, and the PCI (Peripheral Component Interconnect) bus, one of the standard interfaces on a personal computer, has a peak performance of over 100 MB/s. As an example, we consider making a hologram of 1000 x 1000 grids ($M = 10^6$) from an object of $10^5$ points ($L = 10^5$). We get

$$T_{\rm horn3} = \frac{3 \times 10^3}{p} \ {\rm (sec)}, \qquad (18)$$

$$T_{\rm comm3} = 1.5 \ {\rm (sec)}. \qquad (19)$$

We can make the hologram in 5 minutes at $p = 10$, in 30 seconds at $p = 100$, and in 5 seconds at $p = 1000$. For a real-time 3D TV system, we need to generate more than 10 holograms per second. A highly parallel HORN system will be able to exceed this performance in the future.
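The scaling of the calculation time with the number of chips can be tabulated with a few lines of Python. This is a rough sketch of the calculation-time formula only: it ignores the communication time, and the function names are hypothetical.

```python
T_HORN3 = 3e-8  # planned clock period of HORN-3 in seconds (33 MHz)

def horn3_calc_time(L, M, p, t_horn3=T_HORN3):
    """Calculation time with p HORN chips (communication not included)."""
    return L * M * t_horn3 / p

def holograms_per_second(L, M, p, t_horn3=T_HORN3):
    """Calculation-limited hologram rate; an upper bound that ignores communication."""
    return 1.0 / horn3_calc_time(L, M, p, t_horn3)

# Example from the text: L = 10**5 object points, M = 10**6 hologram grids.
times = {p: horn3_calc_time(10**5, 10**6, p) for p in (10, 100, 1000)}
```

With these parameters the calculation time falls from about 300 seconds at $p = 10$ to about 3 seconds at $p = 1000$, so a rate above 10 holograms per second would need a correspondingly larger number of chips.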

Acknowledgements

We are grateful to Makoto Taiji and the members of TAO, especially Osamu Nishikawa, Takatsune Okada and Kenji Matsumoto, for their useful discussions.

References

[1] D. Gabor, Nature 161 (1948) 777.
[2] G. Tricoles, Appl. Opt. 26 (1987) 4351.
[3] A.D. Stein, Z. Wang and J.S. Leigh Jr., Computers in Physics 6 (1992) 389.
[4] T. Yabe, T. Ito and M. Okazaki, Jpn. J. Appl. Phys. 32 (1993) L1359.
[5] T. Ito, T. Yabe, M. Okazaki and M. Yanagi, Comput. Phys. Commun. 82 (1994) 104-110.
