STAT 29000: Project 8 — Spring 2022

Motivation: Python is an interpreted language (as opposed to a compiled language). In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R — also an interpreted language), we can run and evaluate a line of code easily using a repl. In fact, this is the way you’ve been using Python to date — selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch), and creating scripts. You can create powerful CLI’s (command line interface) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks.

Context: This is the first (of two) projects where we will learn about creating and using Python scripts.

Scope: Python

Learning Objectives
  • Write a python script that accepts user inputs and returns something useful.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/coco/unlabeled2017/*.jpg

Questions

Question 1

Up until this point, we have been using Python from within the Jupyter Lab environment. It is pretty straightforward to use — type in some code into a cell and click run. This is similar to using a REPL (read eval print loop), in that you are essentially running one or more lines of Python code at once. Another way to run Python code is in a script form. A Python script is a .py file with Python code written inside of it that performs a series of actions.

Python scripts are easy to use and run. For example, if you had a script called my_script.py, you could run the script by executing the following in a terminal.

python3 my_script.py

This would use the first python3 interpreter in your $PATH environment variable to run the script.

Read this article about the main function in Python scripts. Also, please read this section, paying special attention to the shebang. Lastly, read this section.

In a Python cell in your notebook, determine which Python interpreter is being used by running the following code.

import sys
print(sys.executable)

If you want to find where Python looks for packages, you can use the following code.

import sys
print(sys.path)

Python will look for packages in those locations, in order.

Your output should read: /scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python. This is the absolute path to the Python interpreter we use with this course.

Create a Python script called question01.py in your $HOME directory with the following content.

question01.py
import sys

def main():
    print(f"{sys.executable}")

if __name__ == "__main__":
    main()

Open a terminal in Jupyter Lab, and run the following command to give execute permissions to the script.

chmod +x $HOME/question01.py

Finally, in the terminal, execute the script by running the following command.

python3 $HOME/question01.py

In addition, execute the same command in a bash cell in your notebook.

%%bash

python3 $HOME/question01.py

What is the output? Was this expected? Write 1-2 sentences explaining why the output was expected or not.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The question01.py script.

Question 2

In the previous question, we ran the script, question01.py, by passing it as input to our python3 interpreter. We learned that, depending on the first python3 interpreter in our $PATH environment variable, the output would differ. There is another way to run Python scripts that don’t pass our script as an input. The way to do this is with the shebang. You should have read about the shebang in the linked articles in quesiton (1).

If a shebang is included in the first line of your Python script, there is no longer a need to pass the script as input. Instead, you can simply execute the script by running ./question01.py. This tells the terminal to execute question01.py in the current directory. If you were to simply type question01.py in your terminal, you would get an error saying that the command "question01.py" was not found. The reason is that your shell will look in the directories in your $PATH environment variable in order to find commands. Since question01.py is not in your $PATH, you receive that error. By prepending the ./, this tells the shell to instead look in my current working directory and execute question01.py.

How does the shell know how to execute the script? The shebang! For example, try executing the following script by running ./question02.py (or $HOME/question02.py) in the terminal.

question02.py
#!/usr/bin/python3

import sys

def main():
    print(f"{sys.executable}")

if __name__ == "__main__":
    main()
chmod +x $HOME/question02.py

You’ll notice that the output matches the shebang! Now here is the real test, execute the script from within a bash cell in your notebook.

%%bash

$HOME/question02.py

Aha! Even though we are in our notebook, the shell respected the shebang and used /usr/bin/python3 to execute the script.

If you were to run python3 question02.py from within a bash cell, the output would rever to our course interpreter at /scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python — when passing the script to a particular interpreter, the shebang is ignored.

Okay, in this project, since the focus is on writing Python scripts, it is a good opportunity to have some fun and use some powerful, pre-built models to do fun things. The following code will use an image classification model to identify the content of a photo.

question02_2.py
#!/usr/bin/python3

from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests


def main():

    image = Image.open("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg")

    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])


if __name__ == "__main__":
    main()
chmod +x $HOME/question02_2.py

In a bash cell, execute the script.

%%bash

$HOME/question02_2.py

What happens? Did you expect this? Write 1-2 sentences explaining why you expected the output to be different. Finally, correct the script so that it runs correctly. In another bash cell, run the updated script.

Please ignore any red warnings you receive as a part of your output from running the corrected question02_2.py script.

If you want to see whether or not the results are accurate, you can display the image with the following code inside a code cell.

from IPython import display
display.Image("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg")
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • question02.py script.

  • question02_2.py script.

Question 3

I hope that the previous two questions gave you a pretty solid understanding that if we want packages to be available when running a Python script, we need to use the appropriate shebang or Python interpreter.

Right now, our script, question02_2.py is not very useful. No matter what we do it will continue to load up and analyze the same old image. That is pretty boring. One of the primary things you can do with scripts is read arguments passed to the script and do something. For example, we passed the argument question01.py to the python3 program. In the same way, we could, for example, pass an absolute path to an image to our script and have it print out the output for the given image! Our script would be much more useful. We could do things like:

./my_script.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
./my_script.py /depot/datamine/data/coco/unlabeled2017/000000000013.jpg

Copy your question02_2.py script to a new script called question03.py. Modify question03.py to accept a single argument, an absolute path to an image, and use that argument in place of the default 000000000008.jpg image. Use sys.argv to accomplish this.

Test it out from within a bash cell in your notebook.

%%bash

$HOME/question03.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
$HOME/question03.py /depot/datamine/data/coco/unlabeled2017/000000000013.jpg
$HOME/question03.py

If no argument is passed to question03.py, use the 000000000008.jpg image as a default.

Make sure to import sys so you have access to the sys.argv variable.

import sys
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Typically, scripts or CLIs (command line interfaces) have a bunch of options. For example, you read about the various options of a tool like grep or awk by running the following in a terminal.

man awk
man grep

For example, grep has the option, -i which makes the search case insensitive.

grep -i 'ok' something.txt

In addition, often times options have both a long form or short form. For example grep -f somefile.txt is the same as grep --file=somefile.txt.

Create a new script called question04.py that accepts a single argument called --detailed or -d, in either short form or long form. If the flag is present, instead of using the "google/vit-base-patch16-224" model, which outputs 1 of 1000 classes, it will instead use the "microsoft/beit-base-patch16-224-pt22k-ft22k" model (see here) that will output 1 of the 21841 classes. Some examples of how the script should work are below.

Make sure to give your script executable permissions.

chmod +x $HOME/question04.py
%%bash

./question04.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg --detailed
./question04.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -d
./question04.py --detailed /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
./question04.py -d /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
Output
Predicted class: surfing, surfboarding, surfriding
Predicted class: surfing, surfboarding, surfriding
Predicted class: surfing, surfboarding, surfriding
Predicted class: surfing, surfboarding, surfriding
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

As you can imagine, adding flags and options can blow up the logic and size of your script pretty quickly! Luckily, there is the argparse package. This package takes care of parsing and handling program input. Here are the official docs and here is a decent tutorial on how to use it.

Use the argparse package to parse your arguments instead of using sys.argv. As long as the new version of your script accepts both the long and short forms of the --detailed flag, and uses argparse, you will receive full credit. However, please do you best to make the script as robust as possible! In addition, you can change the behavior of the arguments as long as the details flags work.

In bash cells, show at least 3 examples using your new script, question05.py, which uses argparse instead of sys.argv.

If you want a solid template, you may use the following code to start. You will need to tweak it in order to make it work, however, it is a decent place to start.

import argparse
import pandas as pd

def main():
	parser = argparse.ArgumentParser()
	subparsers = parser.add_subparsers(help="possible commands", dest="command")
	some_parser = subparsers.add_parser("something", help="")
	some_parser.add_argument("-o", "--output", help="directory to output file(s) to")

	if len(sys.argv) == 1:
		parser.print_help()
		sys.exit(1)

	args = parser.parse_args()

	if args.command == "something":
		something()

if __name__ == "__main__":
	main()

An accompanying set of tests would be:

%%bash

./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg --detailed
./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -d
Output
Predicted class: surfing, surfboarding, surfriding
Predicted class: seashore, coast, seacoast, sea-coast
Predicted class: seashore, coast, seacoast, sea-coast
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.