Intro
Recently, I did a small experiment to compare the performance of gRPC Python against other languages. The result is very impressive. I would expect interpreted code to be many times slower than native code. Intuitively, performance this good is only possible if the code is mostly native. Let us dive in to see how it is implemented.
gRPC Python implementation
gRPC has 2 major parts. Its serializer is built upon protobuf. The RPC part is built upon HTTP/2.
The gRPC project is driven by the gRFC process. gRPC Python has the following gRFCs:
L13 https://github.com/grpc/proposal/blob/master/L13-python-interceptors.md
L42 https://github.com/grpc/proposal/blob/master/L42-python-metadata-flags.md
L44 https://github.com/grpc/proposal/blob/master/L44-python-rich-status.md
L46 https://github.com/grpc/proposal/blob/master/L46-python-compression-api.md
L54 https://github.com/grpc/proposal/blob/master/L54-python-server-wait.md
L58 https://github.com/grpc/proposal/blob/master/L58-python-async-api.md
L64 https://github.com/grpc/proposal/blob/master/L64-python-runtime-proto-parsing.md
L65 https://github.com/grpc/proposal/blob/master/L65-python-package-name.md
L78 https://github.com/grpc/proposal/blob/master/L78-python-rich-server-context.md
L95 https://github.com/grpc/proposal/blob/master/L95-python-reflection-client.md
Of these, L58 and L64 are the most interesting ones. L58 is the async API. L64 is the runtime proto parsing. They are the core of the implementation.
Async API
L58 is a great document that is fun to read.
The most interesting part of grpcio, the async Python gRPC API, is that it is a wrapper over the gRPC C++ API but uses the Python async model. In other words, it does not use the Python networking stack but the networking stack of the C++ API. How are the two married together?
It turns out that the C++ API has a CompletionQueue interface. It allows clients to poll for the completion status of an operation using a tag.
Any async framework requires a mechanism to yield control. Python uses generators to yield. Therefore, we can imagine that the wrapper just needs to poll the completion queue and yield the results in the generator.
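To make the idea concrete, here is a minimal, stdlib-only sketch of bridging a tag-based completion queue into asyncio. It is not gRPC's actual implementation: FakeCompletionQueue, submit, poll_forever, and stream are hypothetical names, and worker threads stand in for the native networking stack.

```python
import asyncio
import queue
import threading


class FakeCompletionQueue:
    """Hypothetical stand-in for a native completion queue keyed by tags."""

    def __init__(self):
        self._events = queue.Queue()

    def submit(self, tag, payload):
        # Pretend native code does the I/O on another thread, then posts
        # a (tag, result) completion event.
        worker = threading.Thread(
            target=lambda: self._events.put((tag, payload.upper()))
        )
        worker.start()

    def next(self):
        # Block until some operation completes.
        return self._events.get()

    def shutdown(self):
        # Sentinel event that tells the poller to stop.
        self._events.put((None, None))


def poll_forever(cq, loop, pending):
    """Poller thread: resolve the asyncio future registered under each tag."""
    while True:
        tag, result = cq.next()
        if tag is None:
            return
        future = pending.pop(tag)
        # Futures are not thread-safe, so hand the result to the loop thread.
        loop.call_soon_threadsafe(future.set_result, result)


async def stream(cq, pending, messages):
    """Async generator: yields each result once its completion event arrives."""
    loop = asyncio.get_running_loop()
    for tag, message in enumerate(messages):
        future = loop.create_future()
        pending[tag] = future
        cq.submit(tag, message)
        yield await future


async def main():
    cq, pending = FakeCompletionQueue(), {}
    poller = threading.Thread(
        target=poll_forever, args=(cq, asyncio.get_running_loop(), pending)
    )
    poller.start()
    results = [item async for item in stream(cq, pending, ["a", "b", "c"])]
    cq.shutdown()
    poller.join()
    return results
```

Calling asyncio.run(main()) returns ['A', 'B', 'C']. The event loop never blocks on the queue: only the poller thread does, and it wakes the waiting coroutine through the future registered under the matching tag.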
The result is a very elegant API that allows users to implement a streaming server with extreme simplicity.
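As an illustration, a server-streaming handler in the async API can be written as an async generator. The sketch below is not the generated-stub version: to stay self-contained it registers the handler through gRPC's generic-handler functions, and the service name demo.Counter, the method name CountUp, and the raw-bytes message format are all assumptions made for this example.

```python
import asyncio


async def count_up(request: bytes, context):
    """Server-streaming handler: one request in, a stream of responses out."""
    limit = int(request.decode())
    for i in range(limit):
        # Each yield sends one message to the client.
        yield str(i).encode()
        await asyncio.sleep(0)  # let other coroutines run between messages


async def serve(address: str = "[::]:50051") -> None:
    # Imported here so the handler above can be exercised without grpcio.
    import grpc
    from grpc import aio

    # No (de)serializers are passed, so messages travel as raw bytes.
    handler = grpc.method_handlers_generic_handler(
        "demo.Counter",  # hypothetical service name
        {"CountUp": grpc.unary_stream_rpc_method_handler(count_up)},
    )
    server = aio.server()
    server.add_generic_rpc_handlers((handler,))
    server.add_insecure_port(address)
    await server.start()
    await server.wait_for_termination()
```

With grpcio installed, asyncio.run(serve()) starts the server. In a real project the handler would be a method on the servicer class from the generated _pb2_grpc module and would yield protobuf messages instead of bytes.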
asyncio is integrated with the C++ API using Cython.
Cython
I was previously not familiar with Cython. I thought it was just a Python to C++ compiler. Many of the integrations that I have seen were implemented using pybind11 for Python/C++ binding or pyo3 for Python/Rust binding.
It turns out that Cython is also very interesting. It is a very old project dating back more than 15 years. It uses 2 kinds of source files, one with the .pyx extension and the other with the .pxd extension.
The .pyx syntax is a superset of Python. It allows us to add type declarations to Python code so that Cython can generate good C++ code. Remember that Python did not have type hints 15 years ago, so the .pyx syntax was invented to define the Python interface.
The .pxd files play the role of header files: they declare C++ APIs so that Cython code can call them, which lets us expose a C++ API to Python.
The 2 kinds of files are sometimes used together to allow Python and C++ to meet halfway. The tool then generates C++ code that can be compiled into a Python extension. We can inspect the generated code.
Protobuf
asyncio uses the same Protobuf code as the sync code. This implementation uses the C++ protobuf runtime already built into the grpcio-tools C extension to parse the protocol buffers and generate textual Python code in memory. This code is then used to instantiate the modules to be provided to the calling application. This will become clearer when we look at the generated files.
Generated files
The following files are generated from the proto file:
- server_streaming_pb2.pyi
- server_streaming_pb2.py
- server_streaming_pb2_grpc.py
The first 2 files are for protobuf. The last one is for gRPC.
server_streaming_pb2.pyi just defines the interface for Python tools to use. It does not contain any implementation. The implementation is in server_streaming_pb2.py.
The code in server_streaming_pb2.py is not very readable, but we can imagine that it just deserializes some data and loads it into the C++ code. The C++ code then has the data structures needed to parse the protocol buffers.
Let us now look at the gRPC code in server_streaming_pb2_grpc.py. It calls grpc.experimental.unary_stream and passes the serializer/deserializer from the protobuf code. unary_stream is used because this method happens to be a unary-stream call. The other possible call types are unary-unary, stream-unary, and stream-stream.
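The four names follow from two yes/no questions: does the client stream its requests, and does the server stream its responses? The request side comes first in the name. A small sketch of that naming rule, where RpcKind and rpc_kind are hypothetical helpers and not part of the grpc package:

```python
from enum import Enum


class RpcKind(Enum):
    UNARY_UNARY = "unary_unary"      # one request, one response
    UNARY_STREAM = "unary_stream"    # one request, many responses
    STREAM_UNARY = "stream_unary"    # many requests, one response
    STREAM_STREAM = "stream_stream"  # many requests, many responses


def rpc_kind(request_streaming: bool, response_streaming: bool) -> RpcKind:
    """Name the call type: request side first, then response side."""
    request_side = "stream" if request_streaming else "unary"
    response_side = "stream" if response_streaming else "unary"
    return RpcKind(f"{request_side}_{response_side}")
```

A method declared in the proto file with a streamed response and a single request therefore maps to rpc_kind(False, True), which is RpcKind.UNARY_STREAM.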
Isn’t this implementation very elegant?
Some more details
It turns out there are several protobuf runtime implementations for Python, not all of them C++. We can find out which one is in use at runtime through the api_implementation module in google.protobuf.internal.
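One way to run this check, assuming the protobuf package is installed, is to call api_implementation.Type(). Note that the module is internal to protobuf rather than a stable public API, and protobuf_backend is a hypothetical helper name.

```python
def protobuf_backend() -> str:
    """Return the name of the active protobuf implementation."""
    try:
        # Internal module of the protobuf package; not a stable public API.
        from google.protobuf.internal import api_implementation
    except ImportError:
        return "unavailable"  # protobuf is not installed
    return api_implementation.Type()


print(protobuf_backend())
```

The returned name is "upb", "cpp", or "python", depending on which backend the installed protobuf package selected.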
upb is currently the preferred one. We can find more details at https://github.com/protocolbuffers/upb/tree/main/python
Conclusion
gRPC Python is extremely fast because it is mostly built around the C++ API.
The implementation using Cython is very elegant and we can borrow the idea to speed up our own Python code.
Last modified on 2023-07-14