Systems Operations

The NICS High Performance Computing Operations group specializes in providing end-to-end, production-ready cluster computing experiences built on cutting-edge, business-class compute hardware, petascale Lustre storage, and InfiniBand-based high performance networking. Highly experienced with both Slurm and Torque/Moab scheduling environments, we can accommodate complex scheduling policy requirements while still lowering the barrier to entry to HPC cluster computing for our users. The team offers scalable, shared-environment computing systems as well as bespoke compute enclaves to accommodate multiple independent projects with unique platform or security needs.

Our team develops and provides web-based user portals for accessing and modifying user accounts and project accounting information. These portals aim to serve as a one-stop shop for all accounting needs, from account requests, to project creation and membership requests, to compute and storage auditing. Portal profile pages and flexible user request systems allow many common account operations, such as changing a login shell or requesting membership in a project, to be largely automated. These can be expanded and adapted as projects grow and evolve. All center-wide accounting information is collated within a single, scheduler-agnostic database to provide a unified view of project activity as well as robust auditing and reporting capabilities.
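As a rough illustration of what "scheduler-agnostic" accounting can mean, the sketch below maps job records from two schedulers onto one common record type and totals CPU hours per project. It is a minimal example, not a description of our production database: the UsageRecord schema, the function names, and the input field names (modeled loosely on Slurm and Torque accounting output) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """One job's usage in a scheduler-neutral form (hypothetical schema)."""
    scheduler: str
    job_id: str
    user: str
    project: str
    cpu_seconds: int


def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def from_slurm(row: dict) -> UsageRecord:
    # Assumed input: a dict keyed by Slurm-style accounting field names.
    return UsageRecord(
        scheduler="slurm",
        job_id=str(row["JobID"]),
        user=row["User"],
        project=row["Account"],
        cpu_seconds=int(row["ElapsedRaw"]) * int(row["AllocCPUS"]),
    )


def from_torque(row: dict) -> UsageRecord:
    # Assumed input: key/value pairs already parsed from a Torque-style
    # accounting record; the key names here are illustrative.
    return UsageRecord(
        scheduler="torque",
        job_id=str(row["jobid"]),
        user=row["user"],
        project=row["account"],
        cpu_seconds=hms_to_seconds(row["walltime"]) * int(row["ncpus"]),
    )


def cpu_hours_by_project(records) -> dict:
    """Sum CPU hours per project, regardless of originating scheduler."""
    totals = {}
    for rec in records:
        totals[rec.project] = totals.get(rec.project, 0.0) + rec.cpu_seconds / 3600
    return totals
```

Once every record is in the common form, reporting and auditing code only needs to understand UsageRecord, so supporting another scheduler means adding one more converter function.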

We maintain relationships with numerous vendors, including AMD, Cray, DDN, Dell, HPE, Intel, Red Hat, and Supermicro.

We have partnered with Dell and Red Hat to implement an “On-Premises Cloud” based on Red Hat’s OpenShift platform. In addition, we have partnered with VMware to build a redundant, highly available virtualization cluster that runs core services and accommodates requests from users. Building on these competencies, we are working with external developers and other companies to host scientific gateways. We have also developed a robust and agile core infrastructure to help deploy, monitor, and secure these types of applications.

Below are additional details describing our core competencies:

Cluster Computing

  • Batch Scheduling with Slurm and Torque/Moab
  • HPC storage and networking with Lustre and InfiniBand
  • Enterprise Linux compute environments
  • Support for GCC and Intel compiler environments
  • Modular support for application software via Environment Modules
  • Custom reporting and monitoring

Accessibility

  • SSH access with mandatory MFA provided by Duo
  • Globus endpoints for high performance data transfer
  • Open OnDemand to allow browser-based virtual desktop experiences running within the HPC cluster
  • X11 forwarding via login nodes for remote visualization
  • User portal for requesting accounts and viewing account information

Security

  • Bro/Zeek
  • iptables
  • Fortinet Firewalls

Virtualization

  • VMware

Containerization

  • Kubernetes
  • Docker
  • OpenShift

Monitoring

  • Icinga/Nagios
  • Prometheus

HPC Networking

  • Dell Switching
  • iSCSI Storage Solutions
  • Dell Compellent

General Infrastructure Services

  • ISC DHCP
  • ISC BIND
  • OpenLDAP
  • Local RPM repository hosting

Using the Zeek IDS, we monitor for unusual traffic patterns and report on them. We also use Icinga as our monitoring solution, which allows us to keep an eye on both hardware and software. Coupled with Graphite and Grafana, it lets us not only react quickly to problems when they occur, but often catch them before they become an issue.

For additional information, please contact Tabitha Samuel, Group Leader, Systems Operations.