AI Performance Engineer
Company: Cornelis Networks, Inc.
Location: Austin
Posted on: April 1, 2026
Job Description:
Cornelis Networks delivers the world’s highest performance
scale-out networking solutions for AI and HPC datacenters. Our
differentiated architecture seamlessly integrates hardware, software, and system-level technologies to maximize the efficiency
of GPU, CPU and accelerator-based compute clusters at any scale.
Our solutions drive breakthroughs in AI & HPC workloads, empowering
our customers to push the boundaries of innovation. Backed by
top-tier venture capital and strategic investors, we are committed
to innovation, performance, and scalability, solving the world’s
most demanding computational challenges with our next-generation
networking solutions. We are a fast-growing, forward-thinking team
of architects, engineers, and business professionals with a proven
track record of building successful products and companies. As a
global organization, our team spans multiple U.S. states and six
countries, and we continue to expand with exceptional talent in
onsite, hybrid, and fully remote roles.

We’re seeking an AI Performance Engineer who will optimize training and multi-node inference across next-gen networking silicon and systems: adapters, switches, and the software stack that ties it all together. You’ll partner with architecture, firmware, software, and lighthouse customers to turn lab results into field-proven wins, with an emphasis on distributed serving architectures and P99-aware optimizations.

Key Responsibilities:
- Own end-to-end performance for distributed AI workloads (training and multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand).
- Benchmark, characterize, and tune open-source and industry workloads (e.g., Llama, Mixtral, diffusion, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths.
- Design and optimize distributed serving topologies (sharded/replicated, tensor/pipeline parallel, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) and prefix caching, and token streaming with tight p99/p999 SLOs.
- Optimize inference: validate RDMA/GPUDirect RDMA, congestion control, and collective/point-to-point tradeoffs during inference.
- Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency-SLO dashboards and queuing analysis.
- Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving.
- Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth).
- Evangelize best practices through briefs, READMEs, and conference-level presentations on distributed inference patterns and anti-patterns.

Minimum Qualifications:
- B.S. in CS/EE/CE/Math or a related field.
- 5–7 years running AI/ML workloads at cluster scale.
- Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving.
- Hands-on experience with distributed training beyond single-GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs. sharded, tensor/KV parallel, MoE).
- Practical experience across AI stacks and communication libraries: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe.
- Comfortable with compilers (GCC/LLVM/Intel/oneAPI) and MPI stacks; Python and shell power user.
- Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, and ECN/RED.
- Excellent written and verbal communication; able to turn measurements into persuasion with SLO-driven narratives for inference.

Preferred Qualifications:
- M.S. in CS/EE/CE/Math or a related field.
- Scheduler expertise (SLURM, PBS) and multi-tenant cluster operations.
- Hands-on profiling and tracing of GPU/communication paths (Nsight Systems, Nsight Compute, ROCm tools/rocprof/roctracer/omnitrace, VTune, perf, PCP, eBPF).
- Experience with NeMo, DeepSpeed, Megatron-LM, FSDP, and collective ops analysis (AllReduce/AllGather/ReduceScatter/Broadcast).
- Background in HPC performance engineering or storage (BeeGFS, Lustre, NVMe-oF) for data and checkpoint pipelines.

Location: This is a remote position for employees residing within the United States.
We offer a competitive compensation package that includes equity,
cash, and incentives, along with health and retirement benefits.
Our dynamic, flexible work environment provides the opportunity to
collaborate with some of the most influential names in the
semiconductor industry. At Cornelis Networks, your base salary is
only one component of your comprehensive total rewards package.
Your base pay will be determined by factors such as your skills,
qualifications, experience, and location relative to the hiring
range for the position. Depending on your role, you may also be
eligible for performance-based incentives, including an annual
bonus or sales incentives. In addition to your base pay, you’ll
have access to a broad range of benefits, including medical,
dental, and vision coverage, as well as disability and life
insurance, a dependent care flexible spending account, accidental
injury insurance, and pet insurance. We also offer generous paid
holidays, 401(k) with company match, and Open Time Off (OTO) for
regular full-time exempt employees. Other paid time off benefits
include sick time, bonding leave, and pregnancy disability leave.
Cornelis Networks does not accept unsolicited resumes from
headhunters, recruitment agencies, or fee-based recruitment
services. Cornelis Networks is an equal opportunity employer, and
all qualified applicants will receive consideration for employment
without regard to race, color, religion, sex, sexual orientation,
gender identity or expression, pregnancy, age, national origin,
disability status, genetic information, protected veteran status,
or any other characteristic protected by law. We encourage
applications from all qualified candidates and will accommodate
applicants’ needs under the respective laws throughout all stages
of the recruitment and selection process.