HPC Systems Engineer

Job Name: HPC Systems Engineer
Department: 4111130 - F IT ARS Infrastructure
Job ID: 3436
Job Code: SYS ADM 4 TX (006375)
IAP: Staff Plan (target potential payout of $900, maximum of $1,800)
Bargaining Unit: TX
Job Family: Information Technology
Organization: UCSF Campus BU
Primary Location: San Francisco, CA, United States
Detail URL: https://careers.ucsf.edu/careers/JobDetail/San-Francisco-CA-United-States/2278

Job Description

The CoreHPC team at UCSF is seeking an HPC Systems Engineer to play a key role in the development, maintenance, and day-to-day operations of the Institute’s HPC clusters.

The HPC Systems Engineer will:
- Apply advanced systems infrastructure concepts and skills to the operation and improvement of large-scale, highly complex research cyberinfrastructure (CI) with unique computing, networking, and storage systems designed to address cutting-edge research problems.
- Apply engineering and design skills to develop new CI solutions, and develop and enhance monitoring to maintain the integrity of CI systems.
- Select methods, techniques, and evaluation criteria to develop new CI solutions that address complex research problems.
- Be an active member of the support and maintenance efforts for the CoreHPC cluster: resolving user issues, fixing technical problems, resolving outages, patching, and maintaining system uptime and availability.
- Provide consultation, support, and guidance to researchers on how to address computational problems using standard tools, packages, and approaches.
- Participate in multiple technical projects simultaneously.
- Apply working knowledge of security control frameworks to maintain the integrity of the CI systems and the research performed on them.
- Give presentations to the team and other technical units.
- Evaluate new technologies, including performing moderate to complex cost/benefit analyses.

This position may lead cross-functional technical working groups and projects in support of onboarding research customers or making systems improvements.

Department Overview
Academic Research Systems (ARS) serves the needs of the UCSF research community by providing an integrated repository of HIPAA-compliant clinical and life sciences data and a centralized, secure, professionally managed infrastructure for the storage and management of research data. ARS empowers medical and scientific investigations by offering secure computing environments; data capture, management, and analysis tools; and support services that meet researchers’ needs. The CoreHPC team of Academic Research Systems (ARS) focuses on large-scale, high-performance computational and storage services for UCSF researchers so they can address complex computational, AI, and data science problems.

Qualifications:
REQUIRED QUALIFICATIONS
- Bachelor's degree in a related area such as computer science or engineering and 6+ years of experience with large-scale or HPC systems, or 10+ years of related experience with large-scale or HPC systems.
- Expert knowledge of HPC systems infrastructure design.
- Strong knowledge of high-performance parallel filesystems and storage such as GPFS, Lustre, VAST, DDN, etc.
- Advanced knowledge of computer security best practices and policies, including demonstrated experience securing research cyberinfrastructure systems to meet NIST 800-171 / 800-223, HIPAA, or IS-3 requirements.
- Demonstrated testing and test planning skills. Demonstrated ability to create automated testing.
- Knowledge of HPC job scheduler system design and operation, such as SLURM or PBS.
- Demonstrated skill (5+ years) deploying, managing, and troubleshooting Warewulf (or similar) InfiniBand-based clusters.
- Ability to elicit and communicate technical and non-technical information in a clear and concise manner.
- Self-motivated; works independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
- Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
- Demonstrated advanced knowledge, skills, and abilities associated with system problem identification and resolution. Experience with design, configuration, operation, repair, and tuning of technology systems.
- Advanced experience writing and editing the most complex scripts used to perform system maintenance and administration.
- Ability to write technical documentation in a clear and concise manner. Ability to develop runbooks defining complex technical processes in a clear and concise manner.

PREFERRED QUALIFICATIONS
- Knowledge of the design, development, and application of technology and systems to meet business needs.
- General knowledge of other areas of IT.
- Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
- Demonstrated skills associated with adapting equipment and technology to serve user needs. Demonstrated comprehensive understanding of how system management actions affect other systems, system users, and dependent/related functions.