Website Feature Engineering at Scale: PySpark, Python & Snowflake

Introduction and Problem

Imagine you’re staring at a database containing thousands of merchants across multiple countries, each with its own website. Your goal? Identify the top candidates to partner with in a new business proposal. Manually browsing each site is impossible at scale, so you need an automated way to gauge “how good” each merchant’s online presence is. Enter the website quality score: a numeric feature (0-10) that captures key aspects of a site’s professionalism, content depth, navigability, and visible product listings with prices. By integrating this score into your machine learning pipeline, you gain a powerful signal that helps your model distinguish the highest-quality merchants and dramatically improve selection accuracy.

Table of Contents

  • Introduction and Problem
  • Technical Implementation
    • Legal & Ethical Considerations
    • Getting Started
    • Fetch HTML Script in Python
    • Assign a Quality Score Script in PySpark
  • Conclusion
    • Disclaimer

Technical Implementation

Legal & Ethical Considerations

Be a good citizen of the web.

  • This scraper only counts words, links, images, scripts, and simple “contact/about/price” flags; it does not extract or store any private or sensitive data.
  • Throttle responsibly: use modest concurrency (e.g. CONCURRENT_REQUESTS ≤ 10), insert small pauses between batches, and avoid hammering the same domain (see the sketch after this list).
  • Retention policy: once you’ve computed your features or scores, purge raw HTML within a reasonable window (e.g. after 7-14 days).
  • For very large runs, or if you plan to share extracted HTML, consider reaching out to site owners for permission or notifying them of your usage.
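
As a concrete illustration of the throttling bullets, here is a minimal sketch, assuming asyncio, of a wrapper that caps in-flight requests and pauses between batches. CONCURRENT_REQUESTS mirrors the constant used later in the article, while PAUSE_BETWEEN_BATCHES and fetch_one are illustrative names only; a fuller fetching sketch follows Table 1.

import asyncio

CONCURRENT_REQUESTS = 10      # modest concurrency, as suggested above
PAUSE_BETWEEN_BATCHES = 5     # seconds to wait between batches (illustrative name)

async def run_batches(fetch_one, url_batches):
    """Run batches of fetches, never exceeding CONCURRENT_REQUESTS in flight."""
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)

    async def polite(url):
        async with semaphore:          # cap the number of in-flight requests
            return await fetch_one(url)

    results = []
    for batch in url_batches:
        results.extend(await asyncio.gather(*(polite(u) for u in batch)))
        await asyncio.sleep(PAUSE_BETWEEN_BATCHES)   # brief pause before the next batch
    return results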

Getting Started

Here’s the folder structure once you clone the repository https://github.com/lucasbraga461/feat-eng-websites/:

Code block 1. GitHub repository folder structure

├── src
│   ├── helpers
│   │   └── snowflake_data_fetch.py
│   ├── p1_fetch_html_from_websites.py
│   └── process_data
│       ├── s1_gather_initial_table.sql
│       └── s2_create_table_with_website_feature.sql
├── notebooks
│   └── ps_website_quality_score.ipynb
├── data
│   └── websites_initial_table.csv
├── README.md
├── requirements.txt
├── venv
├── .gitignore
└── .env

Your dataset should ideally live in Snowflake. To give an idea of how to prepare it when the data comes from different tables, refer to src/process_data/s1_gather_initial_table.sql; here’s a snippet of it:

Code block 2. s1_gather_initial_table.sql

CREATE OR REPLACE TABLE DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE AS
(
SELECT
   DISTINCT COUNTRY, WEBSITE_URL
FROM DATABASE.SCHEMA.COUNTRY_ARG_DATASET
WHERE WEBSITE_URL IS NOT NULL
) UNION ALL (
  
SELECT
   DISTINCT COUNTRY, WEBSITE_URL
FROM DATABASE.SCHEMA.COUNTRY_BRA_DATASET
WHERE WEBSITE_URL IS NOT NULL
) UNION ALL (
[...]
SELECT
   DISTINCT COUNTRY, WEBSITE_URL
FROM DATABASE.SCHEMA.COUNTRY_JAM_DATASET
WHERE WEBSITE_URL IS NOT NULL
)
;

Here’s what this initial table should look like:

Figure 1. Initial table

Fetch HTML Script in Python

With the data ready, here’s how you call the script, assuming your data is in Snowflake:

Code block 3. p1_fetch_html_from_websites.py using Snowflake dataset

cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA --use_snowflake

  • The Python script expects the Snowflake table to be DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE; this can be adjusted for your use case in the code itself.

That will open a window in your browser asking you to authenticate to Snowflake. Once you authenticate, it’ll pull the data from the designated table and proceed with fetching the website content.
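
Under the hood, browser-based (SSO) authentication like this is typically done with snowflake-connector-python’s externalbrowser authenticator. The snippet below is a minimal, generic sketch of that pattern, not the repository’s snowflake_data_fetch.py helper itself; the account, user, and warehouse values are placeholders.

import snowflake.connector

# Opens a browser window for SSO login; replace the placeholders with your values.
conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user@example.com",
    authenticator="externalbrowser",
    warehouse="YOUR_WAREHOUSE",
)

# Pull the initial table into a pandas DataFrame for the fetch step.
cur = conn.cursor()
cur.execute("SELECT COUNTRY, WEBSITE_URL FROM DATABASE.SCHEMA.WEBSITES_INITIAL_TABLE")
df = cur.fetch_pandas_all()
cur.close()
conn.close()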

If you choose to pull the data from a CSV file instead, omit the --use_snowflake flag and call it this way:

Code block 4. p1_fetch_html_from_websites.py using CSV dataset

cd ~/Document/GitHub/feat-eng-websites
python3 src/p1_fetch_html_from_websites.py -c BRA

GIF 1. Running p1_fetch_html_from_websites.py

Here’s why this script is more powerful at fetching website content compared to a more basic approach; see Table 1:

Table 1. Advantages of this Fetch HTML script compared with a basic implementation

| Technique | Basic Approach | This script (p1_fetch_html_from_websites.py) |
|---|---|---|
| HTTP fetching | Blocking requests.get() calls one-by-one | Async I/O with asyncio + aiohttp to issue many requests in parallel and overlap network waits |
| User-Agent | Single default UA header for all requests | Rotate through a list of real browser UA strings to evade basic bot-detection and throttling |
| Batching | Load & process the entire URL list in one go | Split into chunks via BATCH_SIZE so you can checkpoint, limit memory use, and recover mid-run |
| Retries & Timeouts | Rely on library defaults or crash on slow/unresponsive servers | Explicit MAX_RETRIES and TIMEOUT settings to retry transient failures and bound per-request wait times |
| Concurrency limit | Sequential or unbounded parallel calls (risking overload) | CONCURRENT_REQUESTS + aiohttp.TCPConnector + asyncio.Semaphore to throttle max in-flight connections |
| Event loop | Single loop reused; can hit “bound to different loop” errors when restarting | Create a fresh asyncio event loop per batch to avoid loop/semaphore binding errors and ensure isolation |
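
To make Table 1 concrete, here is a condensed sketch, assuming asyncio and aiohttp, of how user-agent rotation, bounded concurrency, retries with timeouts, batching, and a fresh event loop per batch fit together. It is a simplified illustration, not the repository’s actual p1_fetch_html_from_websites.py.

import asyncio
import random

import aiohttp

USER_AGENTS = [  # a short list of real browser UA strings to rotate through
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]
CONCURRENT_REQUESTS = 10
MAX_RETRIES = 3
TIMEOUT = 15        # seconds per request
BATCH_SIZE = 100

async def fetch(session, semaphore, url):
    """Fetch one URL with UA rotation, bounded concurrency, retries, and a timeout."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            async with semaphore:   # throttle the number of in-flight connections
                headers = {"User-Agent": random.choice(USER_AGENTS)}
                timeout = aiohttp.ClientTimeout(total=TIMEOUT)
                async with session.get(url, headers=headers, timeout=timeout) as resp:
                    return url, await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_RETRIES:
                return url, None               # give up after the last retry
            await asyncio.sleep(2 ** attempt)  # simple backoff between retries

async def fetch_batch(urls):
    """Fetch one batch of URLs with its own connector and semaphore."""
    semaphore = asyncio.Semaphore(CONCURRENT_REQUESTS)
    connector = aiohttp.TCPConnector(limit=CONCURRENT_REQUESTS)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))

def run(urls):
    """Process the URL list in chunks; asyncio.run gives each batch a fresh event loop."""
    results = []
    for i in range(0, len(urls), BATCH_SIZE):
        results.extend(asyncio.run(fetch_batch(urls[i:i + BATCH_SIZE])))
    return results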

It’s generally better to store raw HTML in a proper database (Snowflake, BigQuery, Redshift, Postgres, etc.) rather than in CSV files. A single page’s HTML can easily exceed spreadsheet limits (e.g. Google Sheets caps at 50,000 characters per cell), and managing hundreds of pages would bloat and slow down CSVs. While we include a CSV option here for quick demos or minimal setups, large-scale scraping and feature engineering are far more reliable and performant when run in a scalable data warehouse like Snowflake.
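
If you do route the scraped HTML into Snowflake, one common pattern is the write_pandas helper from snowflake-connector-python. The snippet below is a minimal, generic sketch, not part of the repository’s code; the connection values are placeholders and the dummy DataFrame only mimics the shape of the fetch script’s output.

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Dummy rows shaped like the fetch script's output (values are placeholders).
df = pd.DataFrame({
    "COUNTRY": ["BRA"],
    "WEBSITE_URL": ["https://example.com"],
    "HTML_CONTENT": ["<html>...</html>"],
})

conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user@example.com",
    authenticator="externalbrowser",
)

write_pandas(
    conn,
    df,
    table_name="WEBSITE_SCRAPED_DATA_BRA",   # mirrors the per-country CSV name
    database="DATABASE",
    schema="SCHEMA",
    auto_create_table=True,                  # create the table on first run
)
conn.close()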

Once you run it for, say, BRA, ARG, and JAM, this is how your data folder will look:

Code block 5. Folder structure after running it for ARG, BRA and JAM

├── data
│   ├── website_scraped_data_ARG.csv
│   ├── website_scraped_data_BRA.csv
│   ├── website_scraped_data_JAM.csv
│   └── websites_initial_table.csv

Refer to Figure 2 to see what the first script generates, i.e. the table website_scraped_data_BRA. Note that one of the columns is html_content, which is a very large field since it holds the entire HTML content of the website.

Figure 2. Example of table website_scraped_data_BRA generated with first python script

Assign a Quality Score Script in PySpark

Because each page’s HTML can be massive, and you’ll have hundreds or thousands of pages, you can’t efficiently process or store all that raw text in flat files. Instead, we hand the work off to Snowpark (Snowflake’s PySpark-style DataFrame engine) for scalable feature extraction. See notebooks/ps_website_quality_score.ipynb for a ready-to-run example: just select the Python kernel in Snowflake and import the built-in Snowpark libraries to spin up your session (see Code block 6).

Code block 6. Import libraries and create the Snowpark session

import pandas as pd
from bs4 import BeautifulSoup
import re
from tqdm import tqdm

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, lit, udf
from snowflake.snowpark.context import get_active_session
session = get_active_session()

Each market speaks its own language and follows different conventions, so we bundle all those rules into a simple country-specific config. For each country we define the contact/about keywords and price‐pattern regexes that signal a “good” merchant site, then point the script at the corresponding Snowflake input and output tables. This makes the feature extractor fully data-driven, reusing the same code for every region with just a change of config.

Code block 7. Config file

country_configs = {
   "ARG": {
       "name": "Argentina",
       "contact_keywords": ["contacto", "contáctenos", "observaciones"],
       "about_keywords": ["acerca de", "sobre nosotros", "quiénes somos"],
       "price_patterns": [r'ARS\s?\d+', r'\$\s?\d+', r'\d+\.\d{2}\s?\$'],
       "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_ARG",
       "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_ARG"
   },
   "BRA": {
       "name": "Brazil",
       "contact_keywords": ["contato", "fale conosco", "entre em contato"],
       "about_keywords": ["sobre", "quem somos", "informações"],
       "price_patterns": [r'R\$\s?\d+', r'\d+\.\d{2}\s?R\$'],
       "input_table": "DATABASE.SCHEMA.WEBSITE_SCRAPED_DATA_BRA",
       "output_table": "DATABASE.SCHEMA.WEBSITE_QUALITY_SCORES_BRA"
   },
[...]
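
To see the config in action outside Snowflake, here is a small standalone illustration of how the signals mentioned earlier, word, link, image, and script counts plus contact/about/price flags, can be pulled from a single HTML string using one country’s config entry. It is a simplified sketch for intuition, not the article’s actual UDF (that comes in Code block 9).

import re
from bs4 import BeautifulSoup

def extract_features(html: str, cfg: dict) -> dict:
    """Simplified illustration: count basic elements and flag contact/about/price hits."""
    soup = BeautifulSoup(html, "html.parser")   # "lxml" also works if installed
    text = soup.get_text(separator=" ").lower()
    return {
        "num_words": len(text.split()),
        "num_links": len(soup.find_all("a")),
        "num_images": len(soup.find_all("img")),
        "num_scripts": len(soup.find_all("script")),
        "has_contact": any(k in text for k in cfg["contact_keywords"]),
        "has_about": any(k in text for k in cfg["about_keywords"]),
        # price regexes are case-sensitive (e.g. R$), so match against the raw HTML
        "has_price": any(re.search(p, html) for p in cfg["price_patterns"]),
    }

# Example usage with the BRA entry from country_configs above.
sample_html = "<html><body><a href='/contato'>Contato</a> Produto por R$ 99</body></html>"
print(extract_features(sample_html, country_configs["BRA"]))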

Before we can register and use our Python scraper logic inside Snowflake, we first create a stage, a persistent storage area, by running the DDL in Code Block 8. This creates a named location @STAGE_WEBSITES under your DATABASE.SCHEMA, where we’ll upload the UDF package (including dependencies like BeautifulSoup and lxml). Once the stage exists, we deploy the extract_features_udf there, making it available to any Snowflake session for HTML parsing and feature extraction. Finally, we set the country_code variable to kick off the pipeline for a specific country, before looping through other country codes as needed.

Code block 8. Create a stage to hold the deployed UDFs

-- Run once (e.g. in a SQL cell) to create the stage:
-- CREATE STAGE DATABASE.SCHEMA.STAGE_WEBSITES;

country_code = "BRA"  # country to process first; loop over the other codes as needed
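
For orientation, here is a generic sketch of how a permanent Snowpark UDF can be attached to that stage together with third-party packages, assuming the session created in Code block 6. The function body, name, and types below are placeholders; the article’s actual extract_features_udf is defined in Code block 9.

from snowflake.snowpark.types import StringType, VariantType

def parse_html(html: str) -> dict:
    # Placeholder body; bs4 is resolved from the `packages` list at registration time.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return {"num_links": len(soup.find_all("a"))}

# Register a permanent UDF backed by the stage created in Code block 8.
extract_features_udf = session.udf.register(
    func=parse_html,
    return_type=VariantType(),
    input_types=[StringType()],
    name="EXTRACT_FEATURES_UDF",
    is_permanent=True,
    stage_location="@STAGE_WEBSITES",     # the stage created above
    packages=["beautifulsoup4", "lxml"],  # dependencies uploaded with the UDF
    replace=True,
)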

Now, in this part of the code (refer to Code block 9), we’ll define the UDF extract_features_udf, which extracts information from the HTML content. Here’s what this part of the code does: