ZEMAX Users' Knowledge Base - http://www.zemax.com/kb
Running ZEMAX on a Multi-CPU Computer
http://www.zemax.com/kb/articles/196/1/Running-ZEMAX-on--a-Multi-CPU-Computer/Page1.html
By Mark Nicholson
Published on 24 July 2007
 

Optical design has always pushed the boundaries of what's possible on computers. ZEMAX currently supports up to 16 processors per machine, helping to solve the most difficult problems in optical design.

ZEMAX automatically divides most lengthy calculations, such as ray-tracing, diffraction analysis, and optimization, into multiple parallel tasks. For example, ZEMAX can trace one ray on one processor while tracing another on a second processor, and so on, and then combine the results. ZEMAX supports up to 16 CPUs without any user intervention, giving speed increases of up to 16x over a single processor.

This article discusses multi-CPU operation in more detail, and gives examples of how well performance scales over multiple processors.


Introduction

ZEMAX automatically divides most lengthy calculations, such as ray-tracing, diffraction analysis, and optimization, into multiple parallel tasks, called threads. For example, consider the Geometric Bitmap Image Analysis feature, which is discussed in detail in the article "How to Produce Photo-Realistic Output Images". This feature takes a .jpg or .bmp bitmap image such as this:



and traces rays from each pixel through the optical system to the detector. For a fairly poor imaging system the resulting image is like so:



On a computer with a single CPU, ZEMAX would start at pixel 1 and trace all of its rays, then move on to pixel 2, and so on until all pixels have been traced. On a four-processor machine, however, the task can be cut up like so:



This calculation would run four times faster than the single CPU case, minus a small amount for the overhead of splitting the image up, launching the threads, receiving the thread data when the threads return, and stitching the data back into a single image. Good software engineering can minimize, but not eliminate entirely, this overhead.
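
The split-trace-stitch idea can be sketched in a few lines of Python. This is only an illustration, not ZEMAX's implementation: trace_pixel is a hypothetical stand-in for the real ray-trace, the image is a small list of numbers, and the cut into horizontal strips is an assumption about how the work might be divided. Note also that threads in standard CPython do not speed up pure-Python arithmetic, so the sketch shows the structure of the calculation rather than the speed gain.

```python
from concurrent.futures import ThreadPoolExecutor


def trace_pixel(value):
    """Hypothetical stand-in for tracing all rays from one pixel."""
    return value * 0.5  # pretend the optical system halves the brightness


def trace_strip(strip):
    """Trace every pixel in one horizontal strip of the image."""
    return [[trace_pixel(v) for v in row] for row in strip]


def split_rows(image, parts):
    """Cut the image into `parts` contiguous strips of nearly equal height."""
    base, extra = divmod(len(image), parts)
    strips, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        strips.append(image[start:stop])
        start = stop
    return strips


def trace_image(image, workers=4):
    """Trace the strips in separate threads, then stitch them back together."""
    strips = split_rows(image, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(trace_strip, strips))  # map preserves strip order
    return [row for strip in results for row in strip]


# A tiny 8 x 6 test "image" with made-up pixel values.
image = [[r * 10 + c for c in range(6)] for r in range(8)]
serial = trace_strip(image)               # one thread traces everything
parallel = trace_image(image, workers=4)  # four threads, one strip each
print(parallel == serial)
```

The splitting, the thread launch, and the final stitching step are exactly the overhead items listed above; none of them exist in the single-threaded path.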


ZEMAX Architecture
ZEMAX was written as a multi-threaded application from the very start. This means that every window in ZEMAX, including the main menu window, is actually a separate thread. Both ZEMAX-SE and EE are fully multi-threaded, and they will use all available CPUs without extra payment or licensing.

A thread, in this context, means a package of data required to complete a particular calculation. It's important to distinguish between threads, which are generated by an application, and the allocation of those threads to the available CPUs in the machine, which is handled by the operating system.

Because every window inside ZEMAX is its own thread, you can do things like updating one window while editing the Settings of another window:



In this example, the FFT MTF window is recalculating, but the user is still able to access the Settings of the Field Curvature plot and update it independently. This is why you do not have to wait for one Analysis feature to finish calculating in order to start or modify another in ZEMAX!

If there is only one CPU in the machine, writing highly multi-threaded code has the advantage of very efficient execution. But when the machine has multiple CPUs, it gives a huge advantage: individual threads can be run on different CPUs! Even better, many ZEMAX Analysis windows and Tools are inherently multi-threaded, and so split up over multiple CPUs automatically. The Geometric Bitmap Image Analysis is one such, and we will discuss others in the pages to follow. Best of all, the utilization of the multiple CPUs in the machine is transparent to the user: ZEMAX and the operating system negotiate the optimum threading level to use.

Running ZEMAX on an 8 CPU machine
The author recently took delivery of a Dell 690 workstation, which contains 8 Xeon CPUs:



These are 'real' CPUs, not hyper-threaded ones. (Hyper-threading is an older technology in which spare CPU cycles are scavenged and made to appear like a second processor. While some performance gains are available via hyper-threading, it does not compare to having real CPUs inside the box!)

The machine has 8 GB of RAM, and runs Windows XP64. ZEMAX has a built-in Performance Test feature, found under Tools...Miscellaneous...Performance Test. Performance Test measures the number of ray-surfaces per second and the number of system updates per second that the computer hardware/lens combination is capable of. The ray-surfaces per second figure is measured by tracing a large number of random skew rays through the current optical system, and then dividing the number of rays times the number of surfaces traced through by the elapsed time in seconds. It is the most pertinent measurement of 'ray-tracing speed'.

The system updates per second is calculated by performing many system updates and then dividing the number of system updates performed by the elapsed time in seconds. System updating includes recomputing the pupil positions, field data (such as ray aiming coordinates), lens apertures, index of refraction, solves, and other fundamental checks on the lens that must be performed prior to any ray tracing.
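
The two definitions above reduce to simple arithmetic, sketched here in Python. The function names and the sample numbers are hypothetical and chosen only to show the calculation; they are not measurements from any machine.

```python
def ray_surfaces_per_second(rays, surfaces, elapsed_s):
    """Rays traced times surfaces per ray, divided by elapsed time in seconds."""
    return rays * surfaces / elapsed_s


def system_updates_per_second(updates, elapsed_s):
    """Number of system updates performed, divided by elapsed time in seconds."""
    return updates / elapsed_s


# Hypothetical example: 1,000,000 rays through 12 surfaces in 0.5 seconds.
print(ray_surfaces_per_second(1_000_000, 12, 0.5))  # 24000000.0
# Hypothetical example: 5,000 system updates in 2 seconds.
print(system_updates_per_second(5_000, 2.0))        # 2500.0
```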

The speed will vary tremendously depending upon the system processor, clock speed, and lens complexity. For no particular reason, within ZEMAX Development Corporation we have settled on using the double Gauss sample file provided with ZEMAX (see {ZEMAXroot}\Samples\Sequential\Objectives\Double Gauss 28 degree field.zmx) as the test file for comparing performance on different machines. With the 8 CPU machine and the July 25 2007 release of ZEMAX I obtained:



More than 93 million ray-surfaces per second! This is an amazing ray tracing speed. What's more, the speed scales very nearly linearly with the number of CPUs used to perform the calculation:







Do All Features Use Multiple Processors?

Every feature in ZEMAX is a separate thread, so it is independent of all other features. Not all features are internally multi-threaded, however. There is an overhead in launching and managing threads, in receiving the data from the threads on completion, and in stitching the results back together again.

Also, remember that each thread must include a full copy of all lens data. If every feature spawned multiple internal threads, the amount of memory would quickly rise.

Instead, the computationally demanding features are internally multi-threaded. This includes (but is not limited to) optimization, global optimization, tolerancing, Huygens calculations, diffraction calculations, physical optics and non-sequential ray-tracing. Many individual Analysis features are also internally multi-threaded. ZEMAX manages the multi-threading to ensure optimal use of the resources in the machine for the task in hand.

For example, let's optimize the double Gauss sample file using a default wavefront merit function. Make all radii (except the surfaces with infinite radii) and thicknesses variable. Place an f/# solve on the last radius, and make it f/3. Then build a default merit function like so:



I then optimize it and tell ZEMAX to use all 8 CPUs:



Now damped-least-squares optimization involves taking the derivative of the merit function with respect to each variable, and the calculation is multi-threaded at the variable level, so that each derivative can be computed as its own thread. However, when the optimizer is run, the result is surprising:
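
To make "multi-threaded at the variable level" concrete, here is a minimal Python sketch in which each partial derivative of a merit function is computed as its own task. Everything in it is an assumption made for illustration: the toy merit function, the forward-difference derivative, and the step size are stand-ins, and the source does not say how ZEMAX actually forms its derivatives.

```python
from concurrent.futures import ThreadPoolExecutor


def merit(x):
    """Toy merit function of two variables (a stand-in, not a real lens)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[0] * x[1]


def partial_derivative(f, x, i, h=1e-6):
    """Forward-difference estimate of df/dx[i]; independent of the other variables."""
    shifted = list(x)
    shifted[i] += h
    return (f(shifted) - f(x)) / h


def gradient(f, x, workers=8):
    """Compute every partial derivative as a separate task in a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: partial_derivative(f, x, i), range(len(x))))


grad = gradient(merit, [0.0, 0.0])
print(grad)  # close to the analytic gradient [-2.0, 4.0]
```

Because each derivative needs only its own perturbed copy of the variables, the tasks never wait on one another, which is what makes this step a natural candidate for threading.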



Only one CPU is being used! (13% CPU Usage represents only 1 of the 8 CPUs in the machine being used.) Why?
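
The 13% figure is just one-eighth expressed as a whole percentage. The short Python sketch below shows the arithmetic, assuming (this is not stated in the source) that the display rounds 12.5% up to the nearest whole number. The function name is hypothetical.

```python
import math


def displayed_cpu_percent(busy_cpus, total_cpus):
    """Whole-number CPU usage, rounding halves up (assumed display behaviour)."""
    return math.floor(100 * busy_cpus / total_cpus + 0.5)


print(displayed_cpu_percent(1, 8))  # 13  (1 of 8 CPUs busy: 12.5% rounded up)
print(displayed_cpu_percent(8, 8))  # 100 (all 8 CPUs busy)
```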

Well, in some respects this problem is too simple to benefit from multi-threading, or at least, it is too efficiently implemented in ZEMAX to need to be multi-threaded. This simple macro:

! How long does it take to compute the merit function?
FORMAT 4.3 EXP
cycles = 100
TIMER  # set the timer

FOR i = 1, cycles, 1
      dummy = MFCN() # update the merit function
NEXT i

PRINT "The average time to compute the merit function is: ", ETIM()/cycles, " seconds"

PRINT "Program End"
END

reveals:

Executing C:\Program Files (x86)\ZEMAX\MACROS\QUICKIE.ZPL.
The average time to compute the merit function is: 3.100E-004 seconds
Program End

The merit function takes only 310 microseconds to compute on one processor of this machine! Remember that the merit function in this case is the RMS wavefront error over three fields and three wavelengths. This shows the incredible efficiency of ZEMAX's Gaussian Quadrature and DLS algorithms. The overhead of copying all the data, launching threads, and receiving data back again is not justified when the merit function is computed this quickly. If a DENC (Diffraction Encircled Energy) operand is added to the merit function, the CPU utilization approaches 100%.
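
A rough cost model shows why threading does not pay when each task is this cheap. The Python sketch below is a simplification under stated assumptions: the 310 microsecond figure comes from the macro output above, but the thread overhead, the number of tasks, and the "expensive" merit function time are hypothetical values picked only to illustrate the trade-off.

```python
import math


def serial_time(tasks, task_s):
    """Total time when one CPU runs every task in turn."""
    return tasks * task_s


def threaded_time(tasks, task_s, cpus, overhead_s):
    """Fixed threading overhead plus the time for the busiest CPU's share."""
    return overhead_s + math.ceil(tasks / cpus) * task_s


FAST_MERIT_S = 310e-6  # measured above: one merit function evaluation
SLOW_MERIT_S = 50e-3   # hypothetical: a much more expensive merit function
OVERHEAD_S = 5e-3      # hypothetical: copy data, launch threads, collect results
TASKS = 16             # hypothetical: one evaluation per variable
CPUS = 8

for label, task_s in (("fast", FAST_MERIT_S), ("slow", SLOW_MERIT_S)):
    s = serial_time(TASKS, task_s)
    t = threaded_time(TASKS, task_s, CPUS, OVERHEAD_S)
    print(label, "threading pays off:", t < s)
```

With these assumed numbers the fast merit function finishes sooner on a single CPU, while the slow one benefits from all eight.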



In general, a computation can be divided into parallelizable tasks, which can be spread over multiple CPUs, and serial tasks, which can only take place in a single thread of execution. The CPU utilization shown in Task Manager is a rough representation of CPU usage, and during execution of a multi-threaded task you will see the CPU usage vary from 13% (single CPU in this case) to 100%.
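
The effect of the serial portion on the overall speed-up is usually described by Amdahl's law, which the article does not name but which follows directly from the split described above. In the Python sketch below, the 90% parallel fraction is a hypothetical value used only for illustration.

```python
def amdahl_speedup(parallel_fraction, cpus):
    """Overall speed-up when only `parallel_fraction` of the work uses all CPUs."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


print(amdahl_speedup(1.0, 8))  # 8.0: fully parallel work scales with the CPU count
print(amdahl_speedup(0.0, 8))  # 1.0: fully serial work gains nothing
print(round(amdahl_speedup(0.9, 8), 2))  # hypothetical 90% parallel work
```

Even a small serial portion limits the gain, which is why the observed CPU usage swings between a single CPU and all of them during one calculation.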


Summary
ZEMAX is engineered to exploit all the CPUs in the user's computer without any user intervention. It will automatically determine the optimum number of threads to launch for any given calculation.