Analyzing Unsplash Photo Performance with Python | Blog

Photo-taking engineers, understanding how photos perform online can provide valuable insights into audience preferences and engagement. In this post, I'll walk you through a Python script I created to analyze the performance of my photos on Unsplash. This script fetches photos, calculates performance scores, identifies outliers, normalizes the scores, and generates a visualization.

You can find the full code in the repository available at lucavallin/uppa.

What Does the Script Do?

Here's a high-level overview of what the script does:

Fetch Photos from Unsplash: Retrieves your photos using the Unsplash API.
Calculate Performance Scores: Computes raw performance scores based on views per day and downloads per view.
Identify Upper Bound Outliers: Uses the Interquartile Range (IQR) method to calculate the upper bound for outliers.
Filter Upper Bound Outliers: Removes photos identified as upper bound outliers.
Recalculate Min and Max Scores: Recalculates the minimum and maximum raw scores after removing outliers.
Normalize Scores: Normalizes the scores of the remaining photos to a scale of 0 to 100.
Output Results: Prints the normalized scores, generates a bar chart, and optionally saves the results to a JSON file.

Why I Wrote This Script

Unsplash provides views and downloads statistics for each photo, but these numbers alone are not very meaningful. To gain deeper insights into the performance of my photos, I needed to consider additional factors:

How many views in how many days?: This helps to understand the rate at which a photo is being viewed.
How many downloads in how many days?: This provides insight into the rate at which a photo is being downloaded.
How many views turn into downloads?: This conversion rate reflects the appeal and effectiveness of a photo.

By taking these aspects into account, the script calculates a more comprehensive performance score, allowing me to better evaluate and compare the success of my photos on Unsplash.

Prerequisites

To get started, ensure you have Python 3.x installed along with the necessary libraries:

requests: This library is used to make HTTP requests to the Unsplash API to fetch your photos.
numpy: This library is used for numerical operations, such as calculating the Interquartile Range (IQR) for identifying outliers and normalizing scores.
matplotlib: This library is used to generate a bar chart for visualizing the normalized scores of your photos.

You can install the required libraries using pip:

pip install requests numpy matplotlib

You'll also need an Unsplash API access key, which you can get by registering as a developer on the Unsplash website.

Setting Up Environment Variables

Before running the script, set up the environment variables for your Unsplash access key and username.

On macOS/Linux

Add the following lines to your .bashrc or .bash_profile:

export UNSPLASH_ACCESS_KEY='your_unsplash_access_key'
export UNSPLASH_USERNAME='your_unsplash_username'

Then, source the file to apply the changes:

source ~/.bashrc  # or source ~/.bash_profile

On Windows

Set the environment variables using the command line or through the System Properties dialog:

setx UNSPLASH_ACCESS_KEY "your_unsplash_access_key"
setx UNSPLASH_USERNAME "your_unsplash_username"

The Script

Here's the full Python script with explanations:

Step 1: Import Libraries and Read Environment Variables

First, we import the necessary libraries and read the access key and username from environment variables.

import os
import requests
import numpy as np
import json
import matplotlib.pyplot as plt

# Read ACCESS_KEY and USERNAME from environment variables
ACCESS_KEY = os.getenv('UNSPLASH_ACCESS_KEY')
USERNAME = os.getenv('UNSPLASH_USERNAME')

# Check if ACCESS_KEY and USERNAME are set
if not ACCESS_KEY or not USERNAME:
    print("Please set the UNSPLASH_ACCESS_KEY and UNSPLASH_USERNAME environment variables.")
    exit()

PHOTOS_FILENAME = 'photos.json'

Step 2: Fetch Photos from Unsplash

We define a function to fetch photos from the Unsplash API with pagination.

def get_unsplash_photos(username, access_key):
    url = f"https://api.unsplash.com/users/{username}/photos"
    all_photos = []
    page = 1
    per_page = 30

    while True:
        params = {
            'client_id': access_key,
            'page': page,
            'per_page': per_page,
            'stats': 'true'
        }
        response = requests.get(url, params=params)
        if response.status_code != 200:
            print(f"Error: {response.status_code}")
            break

        photos = response.json()
        if not photos:
            break

        all_photos.extend(photos)
        page += 1

    return all_photos

Step 3: Save and Load Photos

We create functions to save photos to a JSON file and load them from the file if it exists.

def save_photos_to_file(photos, filename):
    with open(filename, 'w') as f:
        json.dump(photos, f, indent=4)

def load_photos_from_file(filename):
    with open(filename, 'r') as f:
        return json.load(f)

Step 4: Load or Fetch Photos

To save on API requests, which can be limited and take time, we first check if a local file (photos.json) exists. If it does, we load the photos from this file. If not, we fetch the photos from Unsplash and save them to the local file for future use.

if (os.path.exists(PHOTOS_FILENAME)):
    photos = load_photos_from_file(PHOTOS_FILENAME)
    print("Loaded photos from photos.json")
else:
    photos = get_unsplash_photos(USERNAME, ACCESS_KEY)
    if not photos:
        print("No photos retrieved. Please check your username and access key.")
        exit()
    save_photos_to_file(photos, PHOTOS_FILENAME)
    print("Fetched photos from Unsplash API and saved to photos.json")

Step 5: Calculate Raw Performance Scores

Next, we calculate raw performance scores for each photo. This is done by considering both the number of views and the number of downloads the photo has received, and how long the photo has been online. The formula used here combines views per day and downloads per view to provide a balanced measure of performance.

We focus on photos that have been online for at least 30 days to ensure a meaningful analysis, as very recent photos might not have enough data to provide a reliable performance score.

W_v = 1.5
W_d = 0.5
filtered_photos = []

for photo in photos:
    days_online = (np.datetime64('now') - np.datetime64(photo['created_at'])).astype(int) / (24 * 60 * 60)
    if days_online >= 30:
        views = photo['statistics']['views']['total'] or 1  # Avoid division by zero
        downloads = photo['statistics']['downloads']['total']
        raw_score = (views / days_online) * W_v + (downloads / views) * W_d
        photo['days_online'] = days_online
        photo['raw_score'] = raw_score
        filtered_photos.append(photo)

Step 6: Identify and Filter Upper Bound Outliers

To ensure that our analysis is not skewed by exceptionally high-performing photos, we identify and filter out these outliers. We use the Interquartile Range (IQR) method to find the upper bound. The IQR is a measure of statistical dispersion and is used here to detect outliers.

By removing photos with scores above this upper bound, we focus on the majority of photos, providing a more accurate representation of typical performance.

raw_scores = np.array([photo['raw_score'] for photo in filtered_photos])
Q1 = np.percentile(raw_scores, 25)
Q3 = np.percentile(raw_scores, 75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

filtered_photos = [photo for photo in filtered_photos if photo['raw_score'] <= upper_bound]

Step 7: Normalize Scores

After filtering outliers, we normalize the raw scores to a range of 0 to 100. Normalization allows us to compare the performance of photos on a common scale, making it easier to identify top performers.

We first recalculate the minimum and maximum raw scores from the filtered data. Then, we apply the normalization formula to each photo's raw score.

filtered_raw_scores = np.array([photo['raw_score'] for photo in filtered_photos])
min_raw = filtered_raw_scores.min()
max_raw = filtered_raw_scores.max()

for photo in filtered_photos:
    photo['normalized_score'] = ((photo['raw_score'] - min_raw) / (max_raw - min_raw)) * 100
    photo['days_online'] = int(photo['days_online'])

Step 8: Sort and Display Results

We sort the photos by their normalized scores and print the results. This provides an easy way to see which photos are performing the best.

sorted_photos = sorted(filtered_photos, key=lambda x: x['normalized_score'], reverse=True)
for photo in sorted_photos:
    print(f"Photo URL: https://unsplash.com/photos/{photo['id']}, Days Online: {photo['days_online'] }, Views: {photo['statistics']['views']['total']}, Downloads: {photo['statistics']['downloads']['total']}, Normalized Score: {photo['normalized_score']:.2f}")

The output will look something like this:

Photo URL: https://unsplash.com/photos/kEOpys1hHkc, Days Online: 38, Views: 5035, Downloads: 58, Normalized Score: 100.00
Photo URL: https://unsplash.com/photos/wimuq4MFt34, Days Online: 38, Views: 5004, Downloads: 69, Normalized Score: 99.34
Photo URL: https://unsplash.com/photos/fNb_pAywgts, Days Online: 38, Views: 3869, Downloads: 17, Normalized Score: 75.09
Photo URL: https://unsplash.com/photos/D8ZVdmF8glM, Days Online: 114, Views: 11183, Downloads: 97, Normalized Score: 71.92
Photo URL: https://unsplash.com/photos/kHeTz2vhBQM, Days Online: 37, Views: 3395, Downloads: 27, Normalized Score: 67.27
....

Step 9: Generate a Bar Chart

Finally, we generate a bar chart to visualize the normalized scores of the photos. This helps to quickly identify the top-performing photos.

photo_ids = [photo['id'] for photo in sorted_photos]
normalized_scores = [photo['normalized_score'] for photo in sorted_photos]

plt.figure(figsize=(10, 8))
plt.barh(photo_ids, normalized_scores, color='skyblue')
plt.xlabel('Normalized Score')
plt.ylabel('Photo ID')
plt.title('Normalized Scores of Unsplash Photos')
plt.gca().invert_yaxis()  # Invert y-axis to have the highest score at the top
plt.show()

This is one of the charts generated by the script:

Running the Script

To run the script, ensure you have set the environment variables and then execute:

python uppa.py

You can find the full code in the repository available at lucavallin/uppa.

Conclusion

By analyzing the performance of your photos on Unsplash, you can get extra insights into which photos do best on the website. This script provides a fairly good way (I think?) to check photo performance, helping you make "data-driven" decisions for your photography portfolio.