Mark L Callan



Objective


Seeking a principal infrastructure engineering role building and operating large-scale, high-performance compute infrastructure. Over a decade of experience architecting distributed systems, container platforms, hypervisor fleets, and the automation, observability, and delivery pipelines that keep them running reliably at scale. Deep expertise in Linux systems programming, storage and networking, and production performance engineering.



Skills


Primary Languages:
  • Java
  • Python
  • C#
  • C/C++
  • Rust
  • Go
  • JavaScript
  • bash
Operating Environments:
  • Linux - Debian, EL, NixOS
  • Windows
Virtualization:
  • KVM/QEMU
  • Vagrant
  • VirtualBox
  • Hyper-V
Web:
  • HTML/CSS/LESS/JS
  • Dropwizard
  • .NET MVC
  • .NET WebAPI
  • WCF
  • Node.js
  • jQuery
  • Bower
  • Knockout
DevOps:
  • Docker
  • containerd
  • Ansible
  • Maven
  • Gradle
  • Jenkins
  • TeamCity
  • Oracle Deployment Orchestrator (Internal)
Hosting:
  • Oracle Cloud
  • Google Cloud
  • Azure
  • AWS
  • IIS
  • Apache
  • NGINX
  • Kubernetes
Database:
  • MongoDB
  • MS SQL Server
  • MySQL
  • Oracle
Linux Debugging/Performance:
  • ftrace/ptrace
  • sysbench
  • fio
  • iperf
  • htop/top
  • sysfs
  • perf
  • blktrace
  • stress-ng
Some Experience With:
  • AngularJS
  • Ruby
  • Scala
  • Kotlin
  • Groovy
  • SmartNICs
  • RDMA / InfiniBand
  • iSCSI
  • Slurm
  • GPU / CUDA


Experience


September 2022 - Present
Oracle Cloud Infrastructure - Container Instance (Container VM) Data Plane - Seattle, Washington
Principal Software Development Engineer

As a member of the Container Instances team, helped build the service and take it to GA. Designed and built the hypervisor and customer instance availability monitoring systems. Designed and built the hypervisor imaging system. Designed and built the hypervisor image build, test, and release pipelines. Helped build the remote command execution feature for executing commands on customer containers. Helped with performance testing and analysis, and performed complex root cause analysis of issues in user and kernel space. Helped improve observability by ensuring telemetry data collected from hypervisors covered all relevant performance metrics.

  • Remote Execution
    • Designed and built a custom socat-like connection relay, written in Go, for remote command execution. The relay plumbed the front-end connection from the customer-facing remote execution endpoints into the containers running in the customer's container instances on the hypervisor.
    • Helped set up the data plane API that handled incoming remote execution requests from customers and routed them to the customer's containers.
    • Built the implementation on the container instance agent for sending the remote execution requests to containerd, and launching the connection relays for plumbing the connection from the hypervisor to the container instance and from the container instance to the container.
  • Encrypted NFS Storage Integration (Oracle FSS)
    • Architected a TLS tunnel proxy that provides fully encrypted NFS connectivity between container instances and Oracle's File System Service (FSS), satisfying security requirements without modifying the NFS client or server.
    • The tunnel ran as a dedicated process managed by the container instance agent, starting automatically at instance creation. It listened locally for NFS connections and forwarded traffic over a mutual TLS connection to the FSS backend, where it was handed off to the NFS servers.
    • Designed for operational transparency: standard mount -t nfs mounts pointed at the local tunnel endpoint, requiring no changes to mount semantics or application code running inside the instance.
  • Container Instance Service
    • Helped launch the Container Instances (Container Virtual Machine) service for Oracle Cloud Infrastructure. Container instances serve as the back end for serverless Kubernetes and as a publicly available, customer-facing service.
    • Designed and built the Hypervisor Imaging and Testing pipeline, a completely automated process that takes a vanilla Linux OS image, configures it, installs the dataplane services, exports it to a bootable image, tests it using nested virtualization, creates an updater application for updating existing hypervisors, and deploys it to our staging environment.
    • Designed and built a system for monitoring availability of all customer instances.
    • Created the data plane availability monitoring dashboards and customer experience metrics reporting tools.
    • Designed and built the automated pipeline for testing hypervisor images in a Beta testing environment, rolling changes out to our integration testing environment, and carefully releasing to production in stages, with automated validation of each stage.
    • Created the Data Plane canary tests for monitoring production data plane health, as well as release pipeline stage validation.
  • Oracle Cloud Infrastructure Hypervisor Imaging
    • Designed and built the Hypervisor Imaging system for building and testing hypervisor images for Oracle Cloud Infrastructure.
    • Led a team of engineers in building a service that takes a vanilla Linux OS image and a package manifest and builds a bootable Linux image that can be used to launch a hypervisor. Hypervisor images are used to host all Virtual Machine images as well as Container Virtual Machine instances.
    • Designed and built the imaging host fleet monitoring systems.
    • Designed and built the application for updating all existing hypervisors.
    • Designed and built a fully automated testing pipeline for hypervisor images. The pipeline builds and tests the image in a virtualized environment, using paravirtualization and a mocked overlay, and then moves to testing on bare metal, and finishes with staging a deployment of the release candidate image to our staging environment. Tests include baseline performance tests, feature tests, and some chaos testing.

July 2017 - September 2022
Oracle Cloud Infrastructure - VMI Data Plane - Seattle, Washington
Senior Software Development Engineer

As a member of the VMI Dataplane team, assisted in the growth and management of a fleet of hypervisors from 500 to 80k hosts. This included designing, building, and maintaining services that monitor and control the hypervisors; serving as the last line of 24/7 support for the customer service teams working with customers; performing complex root cause analysis of issues causing service interruptions; and continually automating and improving regular operations tasks.

  • OCI Virtual Machine Data Plane Operations Lead
    • In addition to typical engineering work, led VM data plane operations for OCI.
    • Managed day-to-day operations, customer escalations, and ongoing issues. Ensured production issues were tracked, fixes were prioritized, and mitigations for bugs and security issues were delivered in a timely manner.
    • Worked with several high-profile customers to ensure production issues were identified, tracked, and addressed quickly.
  • Virtual Machine Diagnostics
    • Designed, built, and maintained a tool for providing self-service virtual machine diagnostics for internal users.
    • Designed and implemented data plane level telemetry collection processes for providing continuous monitoring of all Virtual Machines.
    • Designed and implemented a system for propagating additional information from the control plane to the data plane to streamline the correlation of data plane telemetry data to specific customer VM instances.
    • Added a front end for the diagnostics tool in the internal DevOps site to allow for self-service VM diagnostics.
    • Designed the tool to be usable by individual engineers to perform custom diagnostics. This also aided in the development of the diagnostics tool itself.
    • Designed and led the development of a deep diagnostics tool for collecting detailed host and instance diagnostic logs and data. The tool collected the data as a bundle and automatically sent it to the Oracle Linux team for in-depth analysis, reducing the time to engage kernel developers on the Oracle Linux team when debugging complex system-level issues.
  • Hosthealth Service
    • Designed, built, and maintained the hosthealth monitoring service that monitors and reports the health of all hypervisors.
    • Pushed metrics directly from hypervisors to replace constant polling. This improved efficiency and reduced latency, which improved the responsiveness of alarms.
    • Used monitoring data to create alarms that helped meet a 15-minute SLA for compliance violations.
    • Set up data visualization aids for quick insight into the health of the hypervisor fleet.
    • During initial rollout, the hosthealth service provided the first alarm for a major compliance issue that had gone unreported for almost 24 hours.
  • Canary Testing Framework
    • On my own initiative, designed, built, and maintained a test framework for writing canary tests.
    • Significantly reduced the effort required to expand production test coverage.
    • Significantly reduced the computing resources required to run tests.
  • Microcontainer image
    • Designed and implemented the build and deployment process for the microcontainer image that runs all Oracle Cloud VMs.
    • Coordinated the effort to get up-to-date, custom, slimmed-down QEMU packages from the Oracle Linux team in California.
  • In-place kernel updates
    • Introduced support for in-place kernel updates as part of the regular hypervisor release.
    • Used Ksplice to apply live patches to the host kernel without rebooting running instances.
    • Eliminated the need to reboot hosts affected by kernel bugs or security patches, dramatically reducing disruption to customer workloads.
  • Other Accomplishments
    • Helped support the growth of the service from ~500 hypervisors to almost 80,000 hypervisors.
    • Helped plan and execute the effort to implement full end-to-end CI/CD for the VMI dataplane hypervisor stack.
    • Acted as release coordinator for a number of hypervisor deployments.
    • During a full-team, emergency, fleet-wide patching effort, root-caused an issue that was preventing patching for the last 10% of the fleet, then fixed it, tested the fix, and deployed it.
    • Made significant contributions to a wide range of internal operations tools.
    • 11th member of the OCI VM team, which has grown to over 100 engineers.

December 2013 - June 2017
Strongbark Inc. - Chicago, Illinois
Technical Director

Provided architecture, development, system administration, DevOps, and management services to a Chicago-based startup, working primarily on a product called INESQUE, a social shopping network.

  • Designed & Built
    • Web front end using Microsoft ASP.NET MVC, Bootstrap, Knockout.js, and a custom single page framework.
    • API used by both a mobile application and the web front end using Microsoft ASP.NET Web API.
    • Data access layer using MongoDB and C#.
    • Data aggregation service for collecting product information from various affiliate advertising networks.
    • Image and video management using Microsoft Azure cloud storage, and Azure Media services for video processing/encoding.
  • Managed
    • All DevOps for the company: Jenkins, Hg, NuGet, Octopus Deploy, Azure, CI/CD.
    • Integration testing and production environments in Azure.
    • Internal IT infrastructure in the Chicago office.
    • Several consultants and independent contractors based in South Africa and Chicago.

June 2012 - January 2014
Banco Popular North America - Rosemont, Illinois
Application Developer/Analyst

As a full-stack web developer, developed .NET web-based and native Windows applications that improved existing business processes throughout the bank.

  • Developed .NET web-based and native Windows applications that improved existing business processes throughout the bank.
  • Designed and implemented an internal workflow platform that allowed applications to integrate with a unified portal for all workflow-based applications.
  • Integrated the internal workflow platform with the existing IBM product Teamworks, which allowed all internal applications company-wide to be accessed from one portal.
  • Helped develop uniform APIs and client libraries for accessing various bank and customer information.
  • Spearheaded many small POC projects, including:
    • Using image recognition libraries to scan large document batches for barcodes.
    • Creating an API to expose the internal, organization-wide document management solution.
    • Integrating single sign-on using Windows authentication to authorize users on Teamworks, the IBM business process modeling and workflow platform.
    • Using Google location services to create a branch locator tool for all Banco Popular bank branches in North America.

November 2011 - June 2012
Geneca - Oak Brook, Illinois
Software Engineer

Worked as a consultant developing web applications using common web technologies and the .NET framework.

  • Worked as a consultant for multiple clients, doing full-stack web and mobile development.
  • Gained experience with the challenges of consulting, including working with difficult clients.

April 2011 - November 2012
Banco Popular North America - Rosemont, Illinois
Application Developer

Developed .NET web-based and native Windows applications that improved existing business processes throughout the bank.

  • Designed and implemented web-based and native Windows applications using WPF, Silverlight, ASP.NET, MVC, C#, JavaScript, jQuery, Entity Framework, WCF, and RESTful web services with Web API and WCF.
  • Helped start more rigorous testing practices in the application development department.
  • Helped facilitate the move to a more agile development methodology.


Education


May 2011
Illinois Institute of Technology - Chicago, Illinois
Bachelor of Science in Information Technology and Management


Personal Interests



References are available upon request.