Fiodar Kazhamiaka
Postdoc, Computer Science, Stanford
<my first name>
Gates Building 422

I work at the Future Data Systems lab at Stanford as a postdoc, advised by Matei Zaharia and Peter Bailis. My research interests are in Systems for Sustainability, Data Science, and AI. My current research spans a range of topics: scalable and efficient systems for machine learning, query systems for data collected by autonomous vehicles, and PV+battery system design and control. I'm also a co-host of the weekly Stanford MLSys Seminar series; catch us live Thursdays at 1:30 pm Pacific!

I completed my PhD in Computer Science at the University of Waterloo under the guidance of Srinivasan Keshav and Catherine Rosenberg at the ISS4E lab. I modelled and optimized energy storage systems, with a focus on renewable energy sources and behind-the-meter applications. Results include new algorithms for operating and sizing PV+storage systems, a framework for solar farm operators to allocate budget between PV panels and batteries to maximize revenue, and tractable Lithium-ion battery models for optimization and simulation. My work was recognized through the 2020 SCS Cheriton Dissertation award and featured on ACM TechNews.

I'm keen on interdisciplinary work, and have had the pleasure of working with academics from power systems, economics, optimization, and electrochemistry disciplines. In my free time, I train and compete internationally as a member of Canada's national beach volleyball team.

Latest News
Aug 10
Career move: joining the newly formed Azure Systems Research Group in September 2023!
July 1
Milestone! 9000 subscribers to the Stanford MLSys Seminars YouTube channel
Dec 10
Our paper on scaling query serving systems was accepted to NSDI '22! Congratulations, Peter!
Aug 25
Our work on solving large resource allocation problems was accepted to SOSP '21! [link]
Aug 12
The Stanford MLSys seminars are now in conjunction with a credited seminar course (CS 528).
May 3
My PhD thesis has been selected for the 2020 UWaterloo SCS Cheriton Dissertation award!
April 20
Our work on sizing multi-roof PV+storage systems was accepted to ACM eEnergy '21! [latest draft]
April 1
Check out our preprint of an algorithm for solving hyper-scale resource allocation problems in seconds!

Recent Projects

Resource Allocation in Computer Systems
Many problems in computer systems can be formulated as optimization problems, from job scheduling in clusters, to traffic engineering in wide-area networks, to load balancing in distributed databases. Computing exact solutions to these problems is canonically considered intractable for large systems (e.g., datacenters), so it's common to deploy fast but non-optimal heuristics. In our recent work on POP (Partitioned Optimization Problems), we show how near-optimal resource allocations can be computed several orders of magnitude faster than the exact solution. This work motivates the use of optimization in real systems. For more, check out our paper at SOSP 2021, as well as our preprint featuring a similar algorithm for resource allocation problems.
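The partition-and-coalesce idea behind POP can be sketched in a few lines. Everything below is made up for illustration (the client/server names, and a round-robin least-loaded assignment standing in for an exact solver); it is a sketch of the idea, not the paper's implementation:

```python
import random

def solve(clients, servers):
    """Toy 'solver': assign each client's unit demand to the
    least-loaded server (stands in for an exact optimizer)."""
    load = {s: 0.0 for s in servers}
    alloc = {}
    for c in clients:
        s = min(load, key=load.get)
        alloc[c] = s
        load[s] += 1.0
    return alloc

def pop(clients, servers, k, seed=0):
    """POP sketch: randomly split clients and servers into k
    sub-problems, solve each independently, and take the union
    of the sub-allocations as the global allocation."""
    rng = random.Random(seed)
    clients, servers = clients[:], servers[:]
    rng.shuffle(clients)
    rng.shuffle(servers)
    alloc = {}
    for i in range(k):
        sub_clients = clients[i::k]   # every k-th client
        sub_servers = servers[i::k]   # every k-th server
        alloc.update(solve(sub_clients, sub_servers))
    return alloc

allocation = pop([f"c{i}" for i in range(100)],
                 [f"s{i}" for i in range(10)], k=5)
```

Because each sub-problem sees only 1/k of the clients and resources, it solves much faster, and when the problem is granular (many small, interchangeable requests) the coalesced allocation stays close to optimal.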

Autonomous Vehicle Query Systems
Modern vehicles with various levels of autonomy (AV) are equipped with high-resolution sensors and processors that measure the state of the world as they drive through it. This data can answer fine-grained queries on the state of the physical world. How many people are in line at my favourite coffee shop? How many cyclists have crossed a specific intersection this year? Where is the nearest open parking spot? To be realized, AV query systems must address challenges around data volume, bias, and privacy. This is an ongoing project; for more, check out our position paper at CIDR 2021.

Heterogeneity-Aware Cluster Scheduling
The end of Moore's Law has brought about an era of specialized accelerators, such as GPUs, TPUs, and FPGAs. How do we extend common notions of fairness and throughput in job scheduling policies to a compute cluster with heterogeneous hardware?
We approach this question with Gavel, a scheduler for DNN training jobs that systematically generalizes a wide range of existing scheduling policies, such as max-min fairness, finish-time fairness, and minimum makespan. Gavel expresses these policies as convex mathematical optimization problems, and extends them to consider the setting with heterogeneous hardware. With Gavel, we can sustain higher job load, and improve end objectives such as makespan and job completion time by over 40% compared to heterogeneity-agnostic policies. For more details, check out our paper at OSDI 2020.
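Gavel's central abstraction, a job's effective throughput under a time-fraction allocation across accelerator types, can be illustrated with made-up numbers. The job names and per-accelerator throughputs below are hypothetical, and the real system solves *for* the allocation via convex optimization rather than taking it as input:

```python
# Hypothetical per-job training throughputs (steps/sec) on each
# accelerator type; in practice these are measured per job.
throughputs = {
    "jobA": {"V100": 100.0, "P100": 60.0, "K80": 20.0},
    "jobB": {"V100": 40.0,  "P100": 35.0, "K80": 30.0},
}

def effective_throughput(job, allocation):
    """With allocation[gpu] giving the fraction of time the job
    spends on each accelerator type, effective throughput is the
    allocation-weighted sum of per-type throughputs."""
    return sum(frac * throughputs[job][gpu]
               for gpu, frac in allocation.items())

# jobA spends half its time on V100s and half on K80s:
rate = effective_throughput("jobA", {"V100": 0.5, "K80": 0.5})
# 0.5 * 100 + 0.5 * 20 = 60 steps/sec
```

A heterogeneity-aware policy then maximizes an objective (e.g., the minimum effective throughput across jobs) over the allocation matrix, subject to each accelerator type's capacity.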

Future-Proof Solar PV and Storage Sizing
Suppose you want to purchase a system with solar panels and battery to power your home. How many panels do you buy? How big of a battery do you need? These questions are coupled, and depend on how often you're willing to go without power. Maybe you have some data to help understand what kind of system would have worked for you in the past, but what can this reliably tell you about the future?
To address this problem, we use a recent advance in empirical multivariate probability concentration bounds to compute a robust least-cost system size for a given load target. This work was presented at ACM eEnergy 2018, where it was voted runner-up for best paper (audience choice).
A refined version of our method was published in a journal in 2019. [link]
We recently extended this work to cover multi-roof settings! Find it at eEnergy 2021 [link]
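As a rough illustration of the sizing question (a brute-force simulation over a historical trace, not the robust concentration-bound method from the papers), one could check each candidate size against how often the home would have gone without power:

```python
def loss_of_load(solar_kw, battery_kwh, solar_trace, load_trace):
    """Fraction of hours with unmet demand, simulated over an hourly
    trace with an idealized lossless battery."""
    charge, unmet = 0.0, 0
    for sun, load in zip(solar_trace, load_trace):
        charge += solar_kw * sun - load       # net energy this hour
        if charge < 0:
            unmet += 1                        # demand not fully met
            charge = 0.0
        charge = min(charge, battery_kwh)     # battery capacity cap
    return unmet / len(load_trace)

def cheapest_sizing(solar_trace, load_trace, target,
                    panel_cost=1000, batt_cost=300):
    """Smallest-cost (panels, battery) pair whose simulated
    loss-of-load stays under the target (hypothetical prices)."""
    best = None
    for kw in range(1, 11):
        for kwh in range(0, 51, 5):
            if loss_of_load(kw, kwh, solar_trace, load_trace) <= target:
                cost = kw * panel_cost + kwh * batt_cost
                if best is None or cost < best[0]:
                    best = (cost, kw, kwh)
    return best
```

The catch, and the point of the robust method, is that a size tuned to one historical trace gives no guarantee about future years; the concentration bounds quantify how much headroom is needed for the target to hold going forward.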

Carbon Explorer: A Holistic Approach for Designing Carbon Aware Datacenters
Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Kiwan Maeng, Manoj Chakkaravarthy, Udit Gupta, David Brooks, and Carole-Jean Wu
Technology companies have been leading the way to a renewable energy transformation, by investing in renewable energy sources to reduce the carbon footprint of their datacenters. In addition to helping build new solar and wind farms, companies make power purchase agreements or purchase carbon offsets, rather than relying on renewable energy every hour of the day, every day of the week (24/7). Relying on renewable energy 24/7 is challenging due to the intermittent nature of wind and solar energy. Inherent variations in solar and wind energy production cause excess or lack of supply at different times. To cope with the fluctuations of renewable energy generation, multiple solutions must be applied. These include capacity sizing with a mix of solar and wind power, energy storage options, and carbon-aware workload scheduling. However, depending on the region and datacenter workload characteristics, the carbon-optimal solution varies. Existing work in this space does not give a holistic view of the trade-offs of each solution and often ignores the embodied carbon cost of the solutions. In this work, we provide a framework, Carbon Explorer, to analyze the multi-dimensional solution space by taking into account the operational and embodied footprint of the solutions to help datacenters operate on renewable energy 24/7. The solutions we analyze include capacity sizing with a mix of solar and wind power, battery storage, and carbon-aware workload scheduling, which entails shifting workloads from times when renewable supply is scarce to times when it is abundant. Carbon Explorer will be open-sourced soon.
Data-Parallel Actors: A Programming Model for Scalable Query Serving Systems
Peter Kraft, Fiodar Kazhamiaka, Peter Bailis, and Matei Zaharia
NSDI (2022)
We present data-parallel actors (DPA), a programming model for building distributed query serving systems. Query serving systems are an important class of applications characterized by low-latency data-parallel queries and frequent bulk data updates; they include data analytics systems like Apache Druid, full-text search engines like ElasticSearch, and time series databases like InfluxDB. They are challenging to build because they run at scale and need complex distributed functionality like data replication, fault tolerance, and update consistency. DPA makes building these systems easier by allowing developers to construct them from purely single-node components while automatically providing these critical properties. In DPA, we view a query serving system as a collection of stateful actors, each encapsulating a partition of data. DPA provides parallel operators that enable consistent, atomic, and fault-tolerant parallel updates and queries over data stored in actors. We have used DPA to build a new query serving system, a simplified data warehouse based on the single-node database MonetDB, and enhance existing ones, such as Druid, Solr, and MongoDB, adding missing user-requested features such as load balancing and elasticity. We show that DPA can distribute a system in < 1K lines of code (> 10× less than typical implementations in current systems) while achieving state-of-the-art performance and adding rich functionality.
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP
Deepak Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen Boyd, and Matei Zaharia
SOSP (2021)
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training
Deepak Narayanan, Keshav Santhanam, Fiodar Kazhamiaka, Amar Phanishayee, and Matei Zaharia
DISPA workshop (2020)
Cloud providers offer instances with similar compute capabilities (for example, instances with different generations of GPUs like K80s, P100s, V100s) across many regions, availability zones, and on-demand and spot markets, with prices governed independently by individual supplies and demands. In this paper, using machine learning model training as an example application, we explore the potential cost reductions possible by leveraging this cross-cloud instance market. We present quantitative results on how the prices of cloud instances change with time, and how total costs can be decreased by considering this dynamic pricing market. Our preliminary experiments show that a) the optimal instance choice for a model is dependent on both the objective (e.g., cost, time, or combination) and the model’s performance characteristics, b) the cost of moving training jobs between instances is cheap, c) jobs do not need to be preempted more frequently than once a day to leverage the benefits from spot instance price variations, and d) the cost of training a model can be decreased by as much as 3.5× compared to a static policy. We also look at contexts where users specify higher-level objectives over collections of jobs, show examples of policies for these contexts, and discuss additional challenges involved in making these cost reductions viable.
Hey, you found me! Hope the rest of your day is this lucky!