Cloud-Native MLOps Deployment Practices for Open-Source LLMs with Python

Yan Yi, Senior Solutions Architect and Technical Lead, AWS

Yan Yi is a Senior Solutions Architect and Tech Lead at Amazon Web Services (AWS), focusing on the architectural design of generative AI solutions and leading the development of AI applications in industries such as media advertising and smart manufacturing. Yan Yi proposed and designed optimization methods for deploying large language models on the AWS platform, helping enterprises address challenges in computing cost, data security, and scenario adaptation, and has led multiple large-scale AI application projects for multinational enterprises that helped clients reduce costs and improve efficiency. Before joining AWS, Yan Yi was Chief Engineer and Big Data Platform Architect at a well-known business intelligence company, leading the design of an analysis system that processed petabytes of data daily and building the end-to-end architecture from data collection and storage through analysis and visualization. Yan Yi has 15 years of experience in the data and AI fields.

Abstract

As open-source large models develop rapidly, quickly turning models such as Qwen, DeepSeek, and Llama into usable API services has become a key challenge for developers. Traditional deployment often involves complex steps, including containerization, infrastructure configuration, and inference engine optimization, which demand both in-depth operations knowledge and a large amount of manual configuration.

Details

This talk shares practical experience in building a unified deployment system for open-source models and introduces a Python-based multi-engine abstraction layer architecture: simple CLI commands trigger the deployment process, a unified container design delivers "build once, deploy anywhere", dynamic configuration tunes the inference engine for each model, and the models are then deployed automatically as OpenAI-compatible API services.

Drawing on practical cases with open-source models such as GPT-OSS, the talk demonstrates how to manage the complexity of multi-engine deployment, tune performance parameters, adapt across platforms, and build standardized interfaces.