How does the Python Interpreter interpret my Python?¶
Estimated time to read: 26 minutes
- Originally Written: June, 2020
I wanted to understand what happens when I run the Python command.
We'll take a high level look at the stages that Python goes through in order to run our code. First let's start with some references to build a picture.
CPython is the reference implementation of the Python programming language. Written in C and Python, CPython is the default and most widely-used implementation of the language"
https://en.wikipedia.org/wiki/CPython
A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer. "
https://docs.python.org/3/reference/lexical_analysis.html
In CPython, the compilation from source code to bytecode involves several steps:
- Parse source code into a parse tree (Parser/pgen.c)
- Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
- Transform AST into a Control Flow Graph (Python/compile.c)
- Emit bytecode based on the Control Flow Graph (Python/compile.c)"
https://devguide.python.org/compiler/
CPython can be defined as both an interpreter and a compiler as it compiles Python code into bytecode before interpreting it.
So based on the information above we can draw a high level diagram.
Info
This post covers the implementation using CPython. The process may differ for other interpreters.
As you can see from the diagram, CPython will read the code, create tokens, build a parse tree from the tokens, build an abstract syntax tree, use this AST to build a control flow graph, compile this into bytecode, interpret the byte code.
Let's now have a look at each stage to get a better understanding of what is produced. In each stage we will use a helper library to look under the hood.
Example Application¶
We'll use the following code as an example. This is a fairly simple Python script which has two functions, a couple of calls to those functions, and also a couple of print statements.
# Import the datetime library
from datetime import datetime
def whatIsTodaysDate():
return datetime.now()
def concatenateTwoStrings(stringA, stringB):
concat = stringA + stringB
return concat
todaysDate = whatIsTodaysDate()
print(todaysDate)
oneString = concatenateTwoStrings("DEVLIT-4056 ", "How Does The Python Interpreter Interpret My Python")
print(oneString)
1. Tokenize¶
A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer.
The first step that Python takes is to tokenize our source code and we can use the Tokenize module to view these tokens.
import tokenize, token
from tabulate import tabulate
if __name__ == '__main__':
outputTable = []
with open('example_app/example_app.py', 'rb') as f:
for line in tokenize.tokenize(f.readline):
outputTable.append([line.string,token.tok_name[line.type]])
print(tabulate(outputTable, headers=["Symbol","Type"]))
We need to import the Tokenize module, we then read our source code, and for each line we pass it to the tokenize module. We then print these tokens into a table on the screen.
When the program runs we can see a long list of tokens created from our source code.
2. Syntax Trees¶
- Parse source code into a parse tree (Parser/pgen.c)
- Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
Python uses the tokens created in the previous step as the nodes to build a couple of syntax trees. The first tree created is a Parse Tree, or Concrete Syntax Tree (CST).
A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar.
Parse trees concretely reflect the syntax of the input language, making them distinct from the abstract syntax trees used in computer programming
In other words a Parse Tree or CST contains every bit of information about the source code, including newlines, whitespaces, commas, parentheses, comments, and grammatical structure.
Here's an example of the example script that has been built using the LibCST module.
import libcst as cst
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
sourceTree = cst.parse_module(sourceCode)
print(sourceTree)
Python then converts this tree into an Abstract Syntax Tree(AST)
. An AST is a condensed version of the Parse Tree
and as you'll see shortly, only contains key information.
If you're wondering about the purpose of the AST, a
complete traversal of the [AST] allows verification of the correctness of the program
Here's an example of an AST for the statement, result = 10 + 20
. This example was built using the showast
project. Showast
is an IPython/Jupyter notebook plugin for visualizing abstract syntax trees.
https://github.com/hchasestevens/show_ast
As you can see, the tree has two numbers, 10
and 20
, which are added in a binary operation, and the output of this operation is stored in a variable named result.
To view the AST of our example script we can use the AST module, and from this module the "parse" function.
The ast module helps Python applications to process trees of the Python abstract syntax grammar
https://docs.python.org/3/library/ast.html
ast.parse() - Builds an ast from code stored as a string
https://greentreesnakes.readthedocs.io/en/latest/tofrom.html
Here is the source code we can use to build an AST. In this code we open the example_app.py
file, read it into a variable call sourceCode
, build a tree using ast.parse()
, and then pretty print this tree using astunparse.dump()
import ast
import astunparse
if __name__ == '__main__':
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
tree = ast.parse(sourceCode)
print(astunparse.dump(tree))
The resulting tree looks like this.
As you can see, the AST contains a lot less information than the CST, emitting all non critical elements. We can use the showast
project to build a graphical visualization of this AST. As it's quite a large tree I've zoomed in on the second image to see specific details.
What Can I do with the AST?¶
Sometimes it's not only useful to understand how Python works, but also have access to the underlying components such as the AST. Using the ast
module we can traverse the tree with or without a specified order.
Walk AST In No Specified Order - ast.walk(node)¶
The ast.walk(node)
function recursively yield all descendant nodes in the tree starting at node (including node itself), in no specified order. This is useful if you only want to modify nodes in place and don’t care about the context.
https://docs.python.org/3/library/ast.html
import ast
from tabulate import tabulate
if __name__ == '__main__':
outputTable = []
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
tree = ast.parse(sourceCode)
for node in ast.walk(tree):
outputTable.append([type(node).__name__ , node])
print(tabulate(outputTable, headers=["Type","Node"]))
Similar to the previous example, we read the file, build the tree ( ast.parse(sourcecode)
), and then read each node of the tree. We then add the type and address of the node to a variable called outputTable
, and finally print this table.
The result shown here.
Since it's the same example app there's no new information displayed, however by walking the tree we now have access to each node. This means we can filter and print out only the information we require.
We do this by using the Python function, "isinstance(object, classinfo)".
Return True if the object argument is an instance of the classinfo argument, or of a (direct, indirect or virtual) subclass thereof. If object is not an object of the given type, the function always returns False"
Let's walk the tree and only print out the node if it's an instance type ast.FunctionDef
, i.e. one of the two Function Definitions
in our example app.
import ast
import astunparse
from tabulate import tabulate
if __name__ == '__main__':
outputTable = []
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
tree = ast.parse(sourceCode)
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef):
print("Nodetype: {} {} ".format(type(node).__name__,node))
print(astunparse.unparse(node))
Our script also uses the helper function, astunparse.unparse()
, to print out the source code in each FunctionDef
node.
https://astunparse.readthedocs.io/en/latest/
By having access to the AST we can now statically analyse source code without ever having to run the program. For example instead of printing the node and the source code, we might just want to know how many functions exist within the code itself.
import ast
import astunparse
from tabulate import tabulate
if __name__ == '__main__':
outputTable = []
functionCount = 0
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
tree = ast.parse(sourceCode)
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef):
functionCount += 1
print("Function: {}".format(node.name))
print("\nTotal Functions: {}".format(functionCount))
The above examples are very basic but aim to demonstrate what's possible. Here's a collection of static analysis tools which provide linting, formatting, security checks, code quality checks, and more.
https://github.com/mre/awesome-static-analysis
Walk AST With An Order - ast.NodeVisitor¶
ast.NodeVisitor" is a base class that walks the abstract syntax tree and calls a visitor function for every node found.
import ast
import astunparse
class codeTree(ast.NodeVisitor):
def generic_visit(self, node):
print(" Name: {}".format(ast.dump(node)))
ast.NodeVisitor.generic_visit(self, node)
def visit_FunctionDef(self, node):
print("Function Name: {}".format(node.name))
print(astunparse.unparse(node))
self.generic_visit(node)
if __name__ == '__main__':
with open('example_app/example_app.py', 'rb') as f:
sourceCode = f.read()
tree = ast.parse(sourceCode)
codeTree().visit(tree)
In case this looks a little more complex than the walk we previously saw let's break it down into stages.
Starting towards the bottom we can see the same code as before. First read the file and build the tree. Next we can pass the tree into the "codeTree().visit()" function. "codeTree" is a class we've written however there's no "visit" function. This is made possible through inheritance.
https://docs.python.org/3/tutorial/classes.html
Inheritance allows us to define a class that inherits all the methods and properties from another class. To create a class that inherits the functionality from another class, send the parent class as a parameter when creating the child class.
In this example the class, codeTree
, is inheriting all the methods (e.g. visit()
) and properties from the class, NodeVisitor
. Therefore we have access to codeTree().visit(tree)
The final part to decipher is regarding the two functions, generic_visit(self, node)
, and visit_FunctionDef(self, node)
From the AST module document,
The default implementation [of Ast.NodeVisitor.visit(node)] calls the method called self.visit_classname where classname is the name of the node class, or generic_visit() if that method doesn’t exist
This is demonstrating the use of overriding methods in Python.
If you add a method in the child class with the same name as a function in the parent class, the inheritance of the parent method will be overridden.
So in our case, visit_FunctionDef(self, node)
already exists within the parent NodeVisitor
class, however since we've created our own visit_FunctionDef(self, node)
Python will instead use that one.
https://docs.python.org/3/tutorial/classes.html
https://www.w3schools.com/python/python_inheritance.asp
That means if we want to print the function name and source code like the ast.walk()
example, we simply have to create our own visit_FunctionDef
with the print statement and AST will do the rest.
In order to continue the traversal we need to call self.generic_visit(node)
, which will visit the remainder of the nodes.
When we run this script we see the exact same result from the previous ast.walk()
traversal, however this time in order of depth-first traversal. In some use cases an ordered traversal may be required so it's good to understand how to implement both options.
Info
The code which implements this class can be found starting from line 365 in the following repo.
3. Control Flow Graphs¶
From the Python developers guide below, once the AST has been built it is converted to Python Bytecode(more on that later) and this Bytecode is used to create a Control Flow Graph (CFG).
A control flow graph (often referenced by its acronym, CFG) is a directed graph that models the flow of a program using basic blocks that contain the intermediate representation (abbreviated “IR”, and in this case is Python bytecode) within the blocks.
CFGs are usually one step away from final code output. Code is directly generated from the basic blocks (with jump targets adjusted based on the output order) by doing a post-order depth-first search on the CFG following the edges.
To view the CFG of our example app we can use the pycfg
module.
https://www.geeksforgeeks.org/draw-control-flow-graph-using-pycfg-python/
# https://www.geeksforgeeks.org/draw-control-flow-graph-using-pycfg-python/
from pycfg import PyCFG, CFGNode, slurp
import argparse
import tkinter as tk
from PIL import ImageTk, Image
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('pythonfile', help ='The python file to be analyzed')
args = parser.parse_args()
arcs = []
cfg = PyCFG()
cfg.gen_cfg(slurp(args.pythonfile).strip())
g = CFGNode.to_graph(arcs)
g.draw(args.pythonfile + '.png', prog ='dot')
# Draw using tkinter.
root = tk.Tk()
root.title("Control Flow Graph")
img1 = Image.open(str(args.pythonfile) + ".png") # PIL solution
img1 = img1.resize((800, 600), Image.ANTIALIAS)
img = ImageTk.PhotoImage(img1)
background ="gray"
panel = tk.Label(root, height = 600, image = img)
panel.pack(side = "top", fill ="both", expand = "yes")
nodes = g.number_of_nodes() # no. of nodes.
edges = g.number_of_edges() # no. of Edges.
complexity = edges - nodes + 2 # Cyclomatic complexity
frame = tk.Frame(root, bg = background)
frame.pack(side ="bottom", fill ="both", expand = "yes")
#tk.Label(frame, text ="Nodes\t\t"+str(nodes), bg = background).pack()
#tk.Label(frame, text ="Edges\t\t"+str(edges), bg = background).pack()
#tk.Label(frame, text ="Cyclo Complexity\t"+ str(complexity), bg = background).pack()
root.mainloop()
As you can see from the output, Python now has a graph of each step taken to run our example app, from the comment at the start right through to the print statement at the end.
4. Bytecode¶
As per the following talk on Python Bytecode from PyCon 2018:
- Some languages compile directly to CPU instructions
- Some interpret source code directly while running
- Some compile to an intermediate set of instructions, and implement a virtual machine that turns those into CPU instructions while running. That's bytecode.
Another example of a programming language implementing bytecode is Java which implements the Java virtual machine (JVM).
The JVM is a virtual machine that enables a computer to run Java programs as well as programs written in other languages that are also compiled to Java bytecode.
This is not to be confused with a system virtual machine or virtualization using a hypervisor such as VMWare Vsphere. Instead the Python and Java Virtual Machines are essentially middleware applications (CPython) that interpret and execute the bytecode.
In the case of CPython, it has its own set of bytecodes and also emulates a CPU with call stack to execute the bytecode.
https://en.wikipedia.org/wiki/Virtual_machine
https://en.wikipedia.org/wiki/Stack_machine
CPython compiles its source code into "byte code", and for performance reasons, it caches this byte code on the file system whenever the source file has changes. This makes loading of Python modules much faster because the compilation phase can be bypassed. When your source file is foo.py, CPython caches the byte code in a foo.pyc file right next to the source.
As of Python 3.2, the .pyc
files are now located in a subdirectory, __pycache__
.
https://docs.python.org/dev/whatsnew/3.2.html#pep-3147-pyc-repository-directories
We can use the dis
module to look at the bytecode in a human readable format.
https://docs.python.org/3/library/dis.html
import dis
if __name__ == '__main__':
with open('example_app/__pycache__/binary_subtract.cpython-37.pyc', 'rb') as f:
sourceCode = f.read()
dis.dis(sourceCode)
The script to read the Python Bytecode is much shorter than previous scripts and in this case we're reading a file ending with .pyc
.
The output from running this script is quite different from before. In this case we can see the bytecode that Python will use to run our example app. There are LOADs
, CALLs
, POPs
, STOREs
, and more.
We could end it there however we've just learnt that these aren't CPU instructions, rather they're interpreted by some intermediate application (the Python JVM). So as the final step let's have a look at what some of these code's actually do and how they're executed.
To do this we'll use a much smaller example of subtracting two numbers
import dis
def subtractNumbers():
a = 10
b = 20
result = a - b
return result
dis.dis(subtractNumbers)
print(subtractNumbers.__code__.co_consts)
print(subtractNumbers.__code__.co_varnames)
Running this script gives us some LOADs
, STOREs
, SUBTRACT
, and a RETURN
. The documentation for the dis
module documentation contains a lot of useful information we can use to understand what's taking place.
https://docs.python.org/3/library/dis.html
LOAD_CONST
pushes co_consts[consti]
onto the stack
co_consts
is a tuple of constants used in the bytecode
https://docs.python.org/3/library/inspect.html
If you want to see what constants Python knows about you can access these through the functions code object e.g. subtractNumbers.__code__.co_consts
. The code object is possibly a topic for a future post.
Remember it was mentioned that CPython implements a call stack and entries are pushed and popped off this stack.
So in this case, the first instruction is to push the value found co_consts[1] onto the stack. That value will be "10"
STORE_FAST
stores the value on the top of the stack in the local co_varnames[var_num]
co_varnames
is a tuple of names of arguments and local variables
Again you can access these arguments and local variables through the __code__
object. e.g. subtractNumbers.__code__.co_varnames
This instruction takes the value that was pushed, 10
, and stores in the variable found in co_varnames[0]
. That variable is a
.
LOAD_CONST
and STORE_FAST
These are repeats of the steps above however with different indexes, resulting in value 20
being stored in variable b
.
LOAD_FAST
pushes a reference to the local co_varnames[var_num]
onto the stack
Once the values are stored Python pushes references to a
and b
onto the stack to use them later.
BINARY_SUBTRACT
implements TOS = TOS1 - TOS
From the dis
documentation link above,
Binary operations remove the top of the stack (TOS) and the second top-most stack item (TOS1) from the stack. They perform the operation, and put the result back on the stack.
In our example, a
and b
, are removed from the top of the stack, subtracted (a - b)
and the result added back to the stack.
STORE_FAST
Since the outcome of BINARY_SUBTRACT
was added to the top of the stack Python can use STORE_FAST
to store this value in co_varnames[2]
which in our case is the variable, result
.
LOAD_FAST
Before returning Python first needs to push the reference to the variable, result
, to the top of the stack.
RETURN_VALUE
returns with TOS
to the caller of the function.
Now that the result is on the top of the stack Python can return from the function and process further instructions.
As you can see there are quite a number of instructions required to run even a small amount of Python code.
Since we've come this far we might as well go one level deeper and have a look at the actual source code which interprets and executes these instructions.
As the name suggests, CPython is primarily developed in C and you can find all the source code for the interpreter/compiler in the following Github repo.
https://github.com/python/cpython
For this example the file we're concerned with is ceval.c
.
Broken link
This link should have worked in the past. You may need to check the Wayback Machine or the Git commit history
https://github.com/python/cpython/blob/3./Python/ceval.c
You can probably tell by the description that the purpose of this file is to /* Execute compiled code */
. If you scroll down to line 740
you'll find the start of the /* Interpreter main loop */
.
Keep scrolling all the way down to line 1274
and you'll find switch (opcode) {
. This is the relevant section we're after and if you look through you'll see references to the instructions we've just learn about (e.g. LOAD_FAST
, LOAD_CONST
).
This is the actual source code that is implementing the virtual machine with call stack and executing those instructions.
I've included a screenshot of the BINARY_SUBTRACT
implementation and as you can see it takes a couple of values off the top of the stack (TOP
and POP
), subtracts those values, and then adds the difference back onto the stack before returning.
So going back to the original question, how does the Python interpreter interpret my Python? It's just a C application running our Python script (this is one implementation, others may vary).
How easy was that!
References¶
- https://github.com/python/cpython
- https://devguide.python.org/compiler/
- https://docs.python.org/3/library/ast.html
- https://docs.python.org/3/reference/lexical_analysis.html
- https://www.asmeurer.com/brown-water-python/tokens.html
- https://data-flair.training/blogs/python-interpreter/
- https://usi-pl.github.io/cc-2019/notes/03.ast.html
- https://ruslanspivak.com/lsbasi-part7/
- https://www.mattlayman.com/blog/2018/decipher-python-ast/
- https://github.com/adamtornhill/code-maat
- https://mvdwoord.github.io/exploration/2017/08/18/ast_explore.html
- https://opensource.com/article/18/4/introduction-python-bytecode
- https://nedbatchelder.com/blog/201803/is_python_interpreted_or_compiled_yes.html
- https://www.digitalocean.com/community/tutorials/understanding-class-inheritance-in-python-3
- https://hackernoon.com/modifying-the-python-language-in-7-minutes-b94b0a99ce14
- https://tomassetti.me/guide-parsing-algorithms-terminology
- http://flint.cs.yale.edu/cs421/lectureNotes/c04.pdf
- https://repositories.lib.utexas.edu/server/api/core/bitstreams/fd8b3347-6ae6-4e49-916f-c25ddcd36675/content
- https://github.com/coetaur0/staticfg
- https://www.youtube.com/watch?v=_w3a1ledmVE&t=1158s
Broken links
These links should have worked in the past. You may need to check the Wayback Machine or the Git commit history
- https://dev.to/btaskaya/lifecycle-of-a-python-code---cpythons-execution-model-85i