Running COMSOL in parallel on clusters

Solution Number: 1001
Title: Running COMSOL in parallel on clusters
Platform: All Platforms
Applies to: All Products
Versions: 4.0, 4.0a, 4.1, 4.2, 4.2a
Created: November 21, 2006
Last Modified: January 31, 2012
Categories: Solver, Mesh
Keywords: solver memory parallel cluster

Problem Description

This solution describes how to enable distributed parallelization (cluster jobs) in COMSOL.

Solution

COMSOL supports two modes of parallel operation: shared-memory parallel operations and distributed-memory parallel operations, including cluster support. This solution is dedicated to distributed-memory parallel operations. For shared-memory parallel operations, see Solution 1096.

COMSOL can distribute computations on compute clusters using the MPI model. One large problem can be distributed across many compute nodes. Parametric sweeps can also be distributed, with individual parameter cases assigned to separate cluster nodes.

Cluster computing is supported on Windows (Windows HPC Server 2008 / R2) and Linux, including common schedulers like LSF, PBS, and Sun Grid Engine (SGE, also known as Oracle Grid Engine).

NOTE: To use COMSOL on a compute cluster, you need the Floating Network License (FNL) option.

At the bottom of this page are quick guides that explain how to get started with cluster computing, and how to get more information.

Some useful tips and troubleshooting guides are provided below.

Fundamentals

The following terms occur frequently when describing the hardware for cluster computing and shared-memory parallel computing:

  • Compute node: The compute nodes are where the distributed computing occurs. The COMSOL server resides in a compute node and communicates with other compute nodes using MPI (Message Passing Interface).
  • Host: The host is a physical machine with a network adapter and a unique network address. The host is part of the cluster. It is sometimes referred to as a physical node.
  • Core: The core is a processor core used in shared-memory parallelism by a compute node with multiple processors.

The number of hosts used and the number of compute nodes are usually the same. For some special problem types, such as very small problems with many parameters, it might be beneficial to use more than one compute node on one host.
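As a rough illustration of how hosts, compute nodes, and cores relate, the following shell sketch computes how many cores each compute node could use when several compute nodes share one host. The function name and the example numbers are hypothetical, and the even split is an assumption made for illustration, not something COMSOL prescribes.

```shell
#!/bin/sh
# Hypothetical helper: split the cores of one host evenly among the
# compute nodes placed on it (integer division, any remainder is unused).
cores_per_node() {
    cores_per_host=$1
    nodes_per_host=$2
    if [ "$nodes_per_host" -lt 1 ]; then
        echo "nodes_per_host must be at least 1" >&2
        return 1
    fi
    echo $(( cores_per_host / nodes_per_host ))
}

# Usual case: one compute node per host, using all 8 cores of that host.
cores_per_node 8 1
# Small problem with many parameters: 4 compute nodes share one 8-core host.
cores_per_node 8 4
```

With one compute node per host, the node has all cores of the host available for shared-memory parallelism. Placing more compute nodes on the same host trades cores per node for more parameter cases running at once.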

Cluster distribution, Windows and Linux

Example models for cluster testing are included in the Model Library under COMSOL Multiphysics->Tutorial models.

Troubleshooting

First, make sure that you have the latest release installed. Check that you have COMSOL 4.2a or later, and select Help->Check for Updates to install any software updates. The latest updates are also available here.

Error messages due to communication problems between Linux nodes

If you get error messages, make sure that the compute nodes can access each other over TCP/IP and that all nodes can access the license manager in order to check out licenses. If you use the ssh protocol between the hosts on a Linux cluster, you need to generate the keys in advance to prevent the nodes from asking each other for passwords as soon as communication is initiated:

# generate the keys
ssh-keygen -t dsa
ssh-keygen -t rsa
# copy the public keys to the other machine
ssh-copy-id -i ~/.ssh/id_rsa.pub user@hostname
ssh-copy-id -i ~/.ssh/id_dsa.pub user@hostname
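
Once the keys are distributed, it can help to confirm that every host accepts a login without a password prompt. The following is a minimal sketch of such a check, not part of COMSOL. The function names and the host file name hosts.txt (one host name per line) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the host names in a host file, skipping
# blank lines and lines that start with '#'.
list_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Try a non-interactive login on every host. BatchMode=yes makes ssh fail
# instead of prompting for a password, so a failure means that the key
# setup (or the network path) still needs attention.
check_passwordless_ssh() {
    failed=0
    for host in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (hosts.txt is an assumed file name):
#   check_passwordless_ssh hosts.txt
```

Because every host must be able to reach every other host, you would typically run the check from each host in turn, not only from the head node.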

Check that the nodes can access the license manager

Linux: Log in to each node and run the command
comsol batch -inputfile /usr/local/comsol42a/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph -outputfile out.mph

The command above should be issued on one line. /usr/local/comsol42a is assumed to be your COMSOL installation directory. Make sure you have write permissions for ./out.mph. The command should not produce any error messages. If it does, you may have a license manager connectivity problem.
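
To avoid confusing a file permission problem with a license problem, you can check the input and output locations before running the test. The sketch below is one possible pre-flight check. The function name preflight is hypothetical, and the script only uses standard shell file tests.

```shell
#!/bin/sh
# Hypothetical pre-flight check: confirm that the input model is readable
# and that the output directory is writable before starting a batch run.
preflight() {
    input_file=$1
    output_dir=$2
    if [ ! -r "$input_file" ]; then
        echo "cannot read input file: $input_file" >&2
        return 1
    fi
    if [ ! -d "$output_dir" ] || [ ! -w "$output_dir" ]; then
        echo "cannot write to output directory: $output_dir" >&2
        return 1
    fi
    echo "preflight ok"
}

# Usage, with the installation path assumed in the text above:
#   preflight /usr/local/comsol42a/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph . \
#     && comsol batch -inputfile /usr/local/comsol42a/models/COMSOL_Multiphysics/Equation-Based_Models/point_source.mph -outputfile out.mph
```

If the pre-flight check passes and the batch command still reports errors, license manager connectivity is the more likely cause.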

Windows HPCS: Log in to each node with Remote Desktop and start the COMSOL Desktop GUI. No error messages should be displayed.

Errors on InfiniBand-based Linux clusters

If COMSOL crashes on your InfiniBand-enabled Linux cluster and the error logs contain, for example,

com.comsol.util.exceptions.FlException: error in opening zip file

update to the latest software version. This problem was fixed in COMSOL 4.1 update 1. If you cannot update at this time, try the following workaround:

Before starting COMSOL, set the environment variable I_MPI_RDMA_TRANSLATION_CACHE to the value disable. If this does not help, set I_MPI_DEVICE to ssm or sock.
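
In a Bourne-style shell, the workaround could be applied as sketched below. The variable names and values are those given above. Running these lines in the same shell or job script that later starts COMSOL is an assumption about your setup, and whether the settings reach the other nodes depends on your MPI launcher and scheduler.

```shell
#!/bin/sh
# First workaround from the text: disable the RDMA translation cache.
export I_MPI_RDMA_TRANSLATION_CACHE=disable

# Second workaround, only if the first does not help: switch the MPI device.
# Uncomment one of the two lines below.
# export I_MPI_DEVICE=ssm
# export I_MPI_DEVICE=sock

# Print the active setting so that it appears in the job log.
echo "I_MPI_RDMA_TRANSLATION_CACHE=$I_MPI_RDMA_TRANSLATION_CACHE"

# Then start COMSOL from this same shell so that it inherits the settings.
```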

Problems with the Cluster Computing feature in the model tree

If you get the error message "Process status indicates that process is running", do the following:

  1. Cancel any running jobs in the Windows HPCS Job Manager or in whichever other scheduler you use.
  2. In COMSOL, go to Job Configurations->Batch1->Batch Data->External Process 1.
  3. Select Operation: Cancel Process and then click Run Operation.
  4. Select Operation: Clear status and then click Run Operation.

Cloud

COMSOL is not yet adapted to cloud computing.

Hardware Recommendations

See knowledgebase solution 1116 for cluster hardware recommendations.

See Also

See also COMSOL and Multithreading.

 

Related Files

cluster_install_linux_42a.pptx 753 KB
cluster_install_linux_42a.pdf 1.2 MB
cluster_install_win_42a.pptx 634 KB
cluster_install_win_42a.pdf 610 KB

Disclaimer

COMSOL makes every reasonable effort to verify the information you view on this page. Resources and documents are provided for your information only, and COMSOL makes no explicit or implied claims to their validity. COMSOL does not assume any legal liability for the accuracy of the data disclosed. Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark details.