Systems Operations - National Institute for Computational Sciences

The NICS High Performance Computing Operations group specializes in providing end-to-end, production-ready cluster computing experiences built on cutting-edge, business-class compute hardware, petascale Lustre storage, and InfiniBand-based high performance networking. Highly experienced with both Slurm and Torque/Moab scheduling environments, we are able to accommodate complex scheduling policy requirements while still lowering the barrier to entry to HPC cluster computing for our users. The team offers scalable, shared-environment computing systems as well as bespoke compute enclaves to accommodate multiple independent projects with unique platform or security needs.

Our team develops and provides web-based user portals for accessing and modifying user accounts and project accounting information. These portals aim to serve as a one-stop shop for all accounting needs, from account requests, to project creation and membership requests, to compute and storage auditing. Portal profile pages and flexible user request systems allow many common account operations, such as changing a login shell or requesting membership in a project, to be largely automated. These can be expanded and adapted as projects grow and evolve. All center-wide accounting information is collated within a single, scheduler-agnostic database to provide a unified view of project activity as well as robust auditing and reporting capabilities.
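As an illustration of what "scheduler-agnostic" accounting can look like, the following is a minimal sketch and not a description of the actual NICS database. It assumes Slurm usage arrives as dictionaries keyed by sacct-style field names (JobID, User, Account, AllocCPUS, ElapsedRaw) and that Torque usage arrives in a simplified, hypothetical dictionary layout (jobid, user, account, ncpus, walltime). The UsageRecord type and all function names are invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One job's usage in a common, scheduler-independent shape (hypothetical schema)."""
    scheduler: str
    job_id: str
    user: str
    project: str
    core_seconds: int


def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def from_slurm(row: dict) -> UsageRecord:
    # Assumes sacct-style field names; ElapsedRaw is elapsed time in seconds.
    cores = int(row["AllocCPUS"])
    return UsageRecord("slurm", row["JobID"], row["User"], row["Account"],
                       cores * int(row["ElapsedRaw"]))


def from_torque(row: dict) -> UsageRecord:
    # Assumes a simplified, hypothetical layout rather than the raw Torque log format.
    cores = int(row["ncpus"])
    return UsageRecord("torque", row["jobid"], row["user"], row["account"],
                       cores * hms_to_seconds(row["walltime"]))


def core_hours_by_project(records) -> dict:
    """Sum usage per project across all schedulers, in core-hours."""
    totals = defaultdict(int)
    for record in records:
        totals[record.project] += record.core_seconds
    return {project: secs / 3600 for project, secs in totals.items()}
```

Once records from both schedulers share one shape, reports such as per-project core-hours no longer need to know which scheduler ran a job.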

We maintain relationships with numerous vendors, including AMD, Cray, DDN, Dell, HPE, Intel, Red Hat, and Supermicro.

We have partnered with Dell and Red Hat to implement an “On-Premises Cloud” based on Red Hat’s OpenShift platform. In addition, we have partnered with VMware to build out a redundant, highly available virtualization cluster that runs core services and accommodates requests from users. Using these competencies, we are working with external developers and other companies on hosting scientific gateways. We have also developed a robust and agile core infrastructure to help deploy, monitor, and secure these types of applications.

Below are additional details describing our core competencies:

Cluster Computing

Batch Scheduling with Slurm and Torque/Moab
HPC storage and networking with Lustre and InfiniBand
Enterprise Linux compute environments
Support for GCC and Intel compiler environments
Modular support for application software via Environment Modules
Custom reporting and monitoring

Accessibility

SSH access with mandatory MFA provided by Duo
Globus Endpoints for High Performance data transfer
Open OnDemand to allow browser-based virtual desktop experiences running within the HPC cluster
X11 forwarding via login nodes for remote visualization
User portal for requesting accounts and viewing account information

Security

Bro/Zeek
iptables
Fortinet Firewalls

Virtualization

VMware

Containerization

Kubernetes
Docker
OpenShift

Monitoring

Icinga/Nagios
Prometheus

HPC Networking

Dell Switching
iSCSI Storage Solutions
Dell Compellent

General Infrastructure Services

ISC DHCP
ISC BIND
OpenLDAP
Local RPM repository hosting

Using the Zeek IDS, we monitor for unusual traffic patterns and report on them. We also use Icinga as our monitoring solution, which lets us keep an eye on both hardware and software. Coupled with Graphite and Grafana, it allows us not only to react quickly to problems when they occur, but often to catch a developing problem before it becomes an issue.
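Catching a problem early usually comes down to warning thresholds that trigger before a critical limit is reached. The sketch below shows that idea using the conventional Nagios/Icinga plugin status codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 80%/90% thresholds are illustrative assumptions, not the checks actually deployed at NICS.

```python
import shutil

# Conventional Nagios/Icinga plugin status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value: float, warn: float, crit: float) -> int:
    """Map a measured value to a status; WARNING fires before CRITICAL is reached."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def check_disk(path: str, warn: float = 80.0, crit: float = 90.0):
    """Return (status, message) for filesystem usage at `path` (thresholds are examples)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    percent = 100.0 * usage.used / usage.total
    status = classify(percent, warn, crit)
    return status, f"DISK {STATUS_NAMES[status]} - {path} is {percent:.1f}% full"
```

A plugin built on this would print the message and exit with the returned status code, which the monitoring server then uses to decide whether to notify anyone.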

For additional information, please contact Tabitha Samuel, Group Leader, Systems Operations.