Evaluating and Deploying State-of-the-Art Hardware to Meet the Challenges of Modern Workloads

At GEICO we are on a journey to modernize our infrastructure entirely. We are building an open-source, cloud-agnostic hybrid stack that runs across public cloud and on-prem private cloud infrastructure without exposing vendor-specific stacks to our application developers. This hybrid stack gives us the flexibility to run workloads wherever we need them, and to migrate significant workloads from the public cloud to our on-prem infrastructure where doing so better serves their cost or latency needs.

Through that process we had to select new colocation facilities (moving from 6 facilities to 3 better-balanced, geo-distributed sites), Open Hardware servers (based on the workload characteristics of our legacy and cloud footprints, leveraging OpenBMC and Redfish for management), Open Network solutions (switches, routers, and our own NOS for those systems), and OpenStack (including Ceph for SDS) to deliver fleet management solutions across our on-prem footprint.

This change is driving cost savings of 30% to 3x per workload relative to the equivalent capacity, latency, and uptime at our current cloud providers. We have also completely redesigned our existing on-prem network and servers, moving from a demilitarized-zone isolated network approach on MPLS circuits to a fully untrusted network (decrypting only where the user, account, or application is allowed access) over direct internet access, and we have profoundly simplified our hardware SKUs (going from over 200 instances in the public cloud down to 5 primary solutions, plus 15 specialty solutions to be phased out as our applications modernize).

In this session we will walk through the hardware selection process, taking our workload characteristics from the cloud and using them to optimize a subset of SKUs for our on-prem cloud.

Interview:

What is the focus of your work?

I run infrastructure engineering for GEICO, which includes our hardware systems (compute, storage, networking, AI, etc.), our workflow automation, provisioning, and fleet management tools for the physical assets, and our full hybrid cloud stack (data protection services, identity and access management tools, OS, runtime, and container management solutions, cluster management, and service mesh across our public and private cloud footprint).

What’s the motivation for your talk?

Making it easier for developers to decode public cloud instances into a physical footprint, helping demystify where private cloud can be more efficient and where public cloud is optimal.

Who is your talk for?

DevOps folks trying to understand the tradeoffs between public and private cloud for overall reliability, security, and efficiency.

What do you want someone to walk away with from your presentation?

A better understanding of why private cloud is becoming increasingly necessary WITH public cloud offerings for enterprise institutions, where the cloud isn’t serving customers well, and how to create a footprint that meets the needs of an actual business.

What do you think is the next big disruption in software?

I hate the word disruption: it feels like a buzzword. I believe the pendulum has swung to where AI and data security require a hybrid approach to infrastructure, and I’m looking to the open-source community to come together to create the right design patterns so that we can run hybrid cloud efficiently and effectively.


Speaker

Rebecca Weekly

VP of Infrastructure @GEICO

Rebecca is VP of Platform and Infrastructure Engineering at GEICO, leading their hybrid cloud transformation to repatriate key workloads, develop and deliver a true hybrid Open Source stack, and modernize their physical infrastructure. She recently led the organization that built, validated, and automated the full lifecycle management of Cloudflare’s compute, network, storage, and AI systems in 300+ cities and 100+ countries, delivering >20% of the world’s Internet traffic. Rebecca is the former Open Compute Project President and Chairperson, a role in which she helped ensure that hyperscale innovation can be scaled to all organizations. She is on Fortune’s 40 Under 40 2020 list of the most influential people in technology and on Business Insider's 2022 Cloudverse100 list of the builders of the next generation of the Internet, and she was voted CloudGirls Trailblazer for women in technology in 2023. In her "spare" time, she is the lead singer of the funk and soul band Sinister Dexter and pursues her passion for dance and choreography. She has two amazing little boys and loves to run (after them, and on her own). Rebecca graduated from MIT with a degree in Computer Science and Electrical Engineering.

Date

Wednesday Nov 20 / 01:35PM PST (50 minutes)

Location

Seacliff ABC

Topics

Hybrid cloud, Infrastructure, Platform Engineering

Slides

Slides are not available

From the same track

Session AI HW/SW optimization

Maximizing Deep Learning Performance on CPUs using Modern Architectures

Wednesday Nov 20 / 11:45AM PST

As deep learning continues to drive advancements across various industries, efficiently navigating the landscape of specialized AI hardware has a huge impact on cost and speed of operation.

Bibek Bhattarai

AI Technical Lead @Intel, Computer Scientist Invested in Hardware-Software Optimization, Building Scalable Data Analytics, Mining, and Learning Systems

Session

High-Resolution Platform Observability

Wednesday Nov 20 / 03:55PM PST

Many observability tools fail to provide us with the relevant insights for understanding hardware health and utilization.

Brian Martin

Co-founder and Software Engineer @IOP Systems, Focused on High-Performance Software and Systems, Previously @Twitter

Session RISC-V

Optimizing Custom Workloads with RISC-V

Wednesday Nov 20 / 02:45PM PST

This talk will explore how RISC-V architecture can accelerate custom workloads, focusing on AI/ML applications. We’ll start by examining the RISC-V ecosystem and its increasing relevance in the software development landscape.

Ludovic Henry

Member of Technical Staff @Rivos, Performance-Minded Engineer, Hardware & Software, Previously @Xamarin, @Microsoft, @Datadog

Session AI/ML

Unleashing Llama's Potential: CPU-Based Fine-Tuning

Wednesday Nov 20 / 10:35AM PST

The generative AI landscape is changing rapidly, with new models appearing on the horizon every few days. However, the hardware and software characteristics of these models share many similar patterns and execution phases.

Anil Rajput

AMD Fellow, Software System Design Eng. Java Committee Chair @SPEC, Architected Industry Standard Benchmarks and Authored Best Practices Guides for Platform Engineering and Cloud

Rema Hariharan

Principal Engineer @AMD, Seasoned Performance Engineer With a Base in Quantitative Sciences and a Penchant for Root-Causing