Performance Aspects

Through ongoing development of its equation solvers, PERMAS achieves very high computation speed. Both direct and iterative solvers are continuously optimized.

Basic properties

  • Very good multitasking behavior due to a high degree of computer utilization and a low demand for central memory.
  • The central memory size used can be freely configured - without any limitation on the model size.
  • The disk space used can be split across several disks - without any logical partitioning (e.g. for optimum disk utilization in a workstation network).
  • There are practically no limits on the model size, and no explicit limits exist within the software. Even models with many millions of degrees of freedom can be handled.
  • By using well-established libraries like BLAS for matrix and vector operations, PERMAS is adapted to the specific characteristics of hardware platforms and thus provides a very high efficiency.
  • A further increase in computing power has been achieved by an overall parallelization of the software (see also PERMAS-XPU).
  • By the simultaneous use of several disks (so-called disk striping), the I/O performance can be raised beyond that of the single disks. Direct I/O is available for NVMe technology.

Parallelization

Figure: Eigenvalue analysis with MLDR

Finite element analysis is a classical field of high-performance computing, and PERMAS is fully available for parallel computers. A general parallelization approach allows the parallel processing of all time-critical operations and is not limited to the equation solvers. There is only one software version for both sequential and parallel computers.

On shared-memory computers, the parallelization is based on POSIX Threads, i.e. PERMAS is executed in several parallel threads that all share the same memory area. This avoids additional communication between the processors and matches the overall architecture of such systems.
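
The following minimal C sketch illustrates the shared-memory pattern described above: several POSIX threads update disjoint slices of one shared array, so no data has to be copied between workers. It is an illustration of the general principle only, not PERMAS code.

    /* Minimal sketch of shared-memory parallelism with POSIX threads.
     * Illustration only, not PERMAS code.
     * Compile e.g. with: cc -O2 axpy.c -lpthread                       */
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static double x[N], y[N];            /* shared by all threads       */

    typedef struct { size_t lo, hi; } range_t;

    static void *axpy_slice(void *arg)   /* y += 2*x on one slice       */
    {
        range_t *r = (range_t *)arg;
        for (size_t i = r->lo; i < r->hi; ++i)
            y[i] += 2.0 * x[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        range_t   rng[NTHREADS];
        size_t    chunk = N / NTHREADS;

        for (size_t i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 0.0; }

        for (int t = 0; t < NTHREADS; ++t) {
            rng[t].lo = t * chunk;
            rng[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, axpy_slice, &rng[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);

        printf("y[0] = %g\n", y[0]);     /* expect 2.0                  */
        return 0;
    }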

In addition, PERMAS allows asynchronous I/O, which achieves better performance by overlapping CPU and I/O times. Moreover, an NVIDIA GPU may be used (see PERMAS-XPU below); see also the section on Intel® Xeon® Scalable Processors.
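
As an illustration of overlapping CPU and I/O times, the following hedged sketch uses POSIX asynchronous I/O to read the next block of a file while the current block is still being processed. PERMAS' internal I/O layer is not public; the file name and block size here are arbitrary assumptions.

    /* Sketch of overlapping CPU work and I/O with POSIX AIO: the read of
     * the next block is issued before the current block is processed.
     * Illustration only.  Compile e.g. with: cc -O2 aio_overlap.c -lrt  */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)                       /* 1 MiB per block     */

    static double process(const char *buf, ssize_t n)
    {
        double s = 0.0;                           /* dummy CPU work      */
        for (ssize_t i = 0; i < n; ++i) s += buf[i];
        return s;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf[2] = { malloc(BLOCK), malloc(BLOCK) };
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_nbytes = BLOCK;
        const struct aiocb *list[1] = { &cb };

        off_t  offset = 0;
        double total  = 0.0;

        cb.aio_buf    = buf[0];                   /* prefetch 1st block  */
        cb.aio_offset = offset;
        if (aio_read(&cb)) { perror("aio_read"); return 1; }

        for (int cur = 0; ; cur ^= 1) {
            while (aio_error(&cb) == EINPROGRESS) /* wait for block cur  */
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb);
            if (n <= 0) break;                    /* EOF or error        */

            offset       += n;                    /* issue next read ... */
            cb.aio_buf    = buf[cur ^ 1];
            cb.aio_offset = offset;
            aio_read(&cb);

            total += process(buf[cur], n);        /* ... and compute now */
        }
        printf("checksum %g\n", total);
        free(buf[0]); free(buf[1]);
        close(fd);
        return 0;
    }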

Parallelization does not change the sequence of numerical operations in PERMAS, i.e. the results of a sequential analysis and a parallel analysis of the same model on the same machine are identical (if all other parameters remain unchanged).

PERMAS is able to work with a constant, pre-defined amount of memory for each analysis. This also holds for parallel execution. So, several simultaneous sequential jobs, several simultaneous parallel jobs, or any mix of sequential and parallel jobs are possible.

The parallelization is based on a mathematical approach that allows the automatic parallelization of sequentially programmed software. PERMAS therefore remains generally portable, and the main goal has been achieved: one single PERMAS version for all platforms. Parallel PERMAS is available on all UNIX and Windows platforms on which the sequential version is supported.

In recent years, CPU and I/O speeds have grown faster than network speeds, so the gap between them has widened. On distributed-memory machines, acceptable speed-ups through parallelization are therefore more difficult to achieve. For the time being, shared-memory architectures show much better speed-ups with PERMAS.

Parallel execution of PERMAS is very simple: no special commands are necessary, so a parallel run does not differ from a sequential one - except for the shorter run time. Only the number of parallel processes or processors for the PERMAS run has to be defined in advance.

PERMAS on Intel® Xeon® Scalable Processors

PERMAS boosts the performance on Intel® Xeon® Scalable Processors to new levels.

Parallel Performance

The overall performance of PERMAS always depends on the performance of both hardware and software. The close cooperation between Intel and INTES over many years ensures the timely adoption of new hardware features and keeps PERMAS at the forefront of high-performance computing. As a consequence, a new processor release is always accompanied by optimally adapted software. This is what INTES wants to provide to its customers.

We target all customers with a growing need for high-performance FE solutions. Simulation-driven design fosters the trend towards more accurate simulation results, and higher accuracy requires larger and more complex models.

The new INTEL® XEON® SCALABLE PROCESSORS have been supported by PERMAS® from the very beginning. On these processors, PERMAS shows excellent performance, as documented in a joint flyer: up to 56% higher 4-socket performance compared with a previous-generation server.

The leap in performance on these INTEL® XEON® SCALABLE PROCESSORS is mainly due to the AVX-512 instruction set, which perfectly supports the high-level matrix operations in PERMAS. The increased memory bandwidth helps to exploit the speed of the processors even better.
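
The following sketch shows the kind of operation AVX-512 accelerates: one fused multiply-add instruction processes eight double-precision values at once. PERMAS obtains this benefit through optimized BLAS-level kernels; the hand-written intrinsics below are only an illustration of the principle.

    /* Sketch of an AVX-512 fused multiply-add over eight doubles at once.
     * Illustration only.  Compile with: cc -O2 -mavx512f daxpy512.c      */
    #include <immintrin.h>
    #include <stdio.h>

    /* y := a*x + y for n doubles; n is assumed to be a multiple of 8 */
    static void daxpy_avx512(size_t n, double a, const double *x, double *y)
    {
        __m512d va = _mm512_set1_pd(a);
        for (size_t i = 0; i < n; i += 8) {
            __m512d vx = _mm512_loadu_pd(x + i);
            __m512d vy = _mm512_loadu_pd(y + i);
            _mm512_storeu_pd(y + i, _mm512_fmadd_pd(va, vx, vy));
        }
    }

    int main(void)
    {
        double x[16], y[16];
        for (int i = 0; i < 16; ++i) { x[i] = i; y[i] = 1.0; }
        daxpy_avx512(16, 2.0, x, y);              /* y[i] = 2*i + 1 */
        printf("y[3] = %g\n", y[3]);              /* prints 7       */
        return 0;
    }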

Large simulation models mostly run out-of-core. High-speed storage devices such as Intel's NVMe SSDs are addressed directly by PERMAS without an I/O controller, resulting in very efficient I/O and short overall run times. In particular, short access times combined with the direct I/O scheme in PERMAS provide high data transfer rates that optimally feed the processors. A further increase in data transfer can be obtained with striped SSD drives.
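
The sketch below illustrates the general mechanism of such a direct I/O scheme on Linux: opening a file with O_DIRECT bypasses the page cache, so data moves straight between the NVMe device and an aligned application buffer. The actual PERMAS I/O layer is not public; the alignment and chunk size here are assumptions.

    /* Sketch of a direct (unbuffered) read path on Linux with O_DIRECT.
     * Illustration only; alignment and chunk size are assumptions.      */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096                    /* typical device alignment    */
    #define CHUNK (1 << 20)               /* 1 MiB per read              */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;                        /* buffer must be aligned      */
        if (posix_memalign(&buf, ALIGN, CHUNK)) { perror("posix_memalign"); return 1; }

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK)) > 0)
            total += n;                   /* data arrives unbuffered     */

        printf("read %lld bytes with direct I/O\n", total);
        free(buf);
        close(fd);
        return 0;
    }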

Processor systems with several sockets are very suitable to increase throughput for multiple jobs, particularly in combination with high performance SSD drives.

PERMAS-XPU - GPU Accelerator

PERMAS supports NVIDIA Tesla cards. Since 1996, PERMAS has used a unique parallelization concept with run-time parallelization of all matrix operations, based on a dynamically generated task graph of hierarchical block operations. This concept yields excellent speed-ups, especially on shared-memory machines, and ensures bit-identical results independent of the number of cores or the amount of memory used.
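
The following sketch illustrates the principle of task-graph parallelism over matrix blocks, here expressed with OpenMP task dependences: every block operation becomes a task, and the runtime starts a task as soon as its data dependences are satisfied. PERMAS uses its own run-time scheduler with hierarchical blocks; this example only conveys the idea.

    /* Sketch of task-graph parallelism over matrix blocks using OpenMP
     * task dependences.  Illustration only, not the PERMAS scheduler.
     * Compile with: cc -O2 -fopenmp taskgraph.c                          */
    #include <stdio.h>

    #define NB 4                          /* blocks per dimension        */
    #define BS 64                         /* block size                  */

    static double A[NB][NB][BS*BS], B[NB][NB][BS*BS], C[NB][NB][BS*BS];

    /* one block operation: C_ij += A_ik * B_kj */
    static void gemm_block(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < BS; ++i)
            for (int k = 0; k < BS; ++k)
                for (int j = 0; j < BS; ++j)
                    c[i*BS + j] += a[i*BS + k] * b[k*BS + j];
    }

    int main(void)
    {
        for (int i = 0; i < NB; ++i)      /* C stays zero-initialized    */
            for (int j = 0; j < NB; ++j)
                for (int e = 0; e < BS*BS; ++e)
                    A[i][j][e] = B[i][j][e] = 1.0;

        #pragma omp parallel
        #pragma omp single                /* one thread spawns the tasks */
        {
            for (int i = 0; i < NB; ++i)
                for (int j = 0; j < NB; ++j)
                    for (int k = 0; k < NB; ++k) {
                        /* tasks writing the same C block are ordered by
                         * the inout dependence; independent blocks run
                         * in parallel                                   */
                        #pragma omp task depend(in: A[i][k][0], B[k][j][0]) \
                                         depend(inout: C[i][j][0])
                        gemm_block(A[i][k], B[k][j], C[i][j]);
                    }
        }

        printf("C[0][0][0] = %g\n", C[0][0][0]);  /* expect NB*BS = 256  */
        return 0;
    }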

Figure: Contact simulation

During the German MCSimVis project and the European H4H project (2009-2015), this concept was extended by a seamless integration of NVIDIA cards. An NVIDIA card can be used as an additional floating-point accelerator, much like plugging an extra socket of CPU cores into the machine.

The collaborative work of all CPUs plus the GPU acceleration is available for any PERMAS analysis and is not restricted by any hardware resource. PERMAS is known for solving huge FEM simulation problems even on limited hardware: efficiently working with terabyte-sized matrices on a system with only a few gigabytes of memory is no problem for PERMAS. This is supported by asynchronous handling of I/O and computations.
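
As a hedged illustration of the out-of-core principle, the sketch below streams a matrix that is far larger than main memory from a scratch file block-column by block-column, so only a small window is resident at any time. File name, layout, and sizes are arbitrary assumptions; PERMAS' data manager works differently in detail.

    /* Sketch of out-of-core processing: the matrix lives in a scratch
     * file and only one block-column window is in memory at a time.
     * Illustration only.                                                 */
    #include <stdio.h>
    #include <stdlib.h>

    #define NROWS   100000L       /* full matrix: NROWS x NCOLS doubles   */
    #define NCOLS   100000L       /* ~80 GB -- never held in memory       */
    #define BLKCOLS 64            /* in-core window: NROWS x BLKCOLS      */

    int main(void)
    {
        FILE *f = fopen("matrix.bin", "rb");   /* column-major scratch    */
        if (!f) { perror("fopen"); return 1; }

        double *blk = malloc((size_t)NROWS * BLKCOLS * sizeof *blk);
        if (!blk) { perror("malloc"); return 1; }

        double trace = 0.0;                    /* example: sum of diagonal */
        for (long c0 = 0; c0 < NCOLS; c0 += BLKCOLS) {
            long   nc   = (c0 + BLKCOLS <= NCOLS) ? BLKCOLS : NCOLS - c0;
            size_t want = (size_t)NROWS * nc;
            if (fread(blk, sizeof *blk, want, f) != want) break;

            for (long j = 0; j < nc; ++j)      /* diagonal of column c0+j  */
                trace += blk[j * NROWS + (c0 + j)];
        }
        printf("trace = %g\n", trace);
        free(blk);
        fclose(f);
        return 0;
    }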

Thus, the extra speed-up from NVIDIA Tesla cards can be seen even for out-of-core simulations involving petabytes of local I/O. Typically, on standard single- or multi-socket compute servers, an extra Tesla card boosts the PERMAS performance by another factor of 2 to 4, as illustrated by a large contact analysis with an overall speed-up of 1.8 for the whole job.