The process in a nutshell
This scraping pipeline was developed to collect all the data on the official Italian postal codes (referred to as CAP, Codici di Avviamento Postale) from the Poste Italiane website.
For the so-called “multi-CAP” cities (i.e., cities divided into several postal areas), the CAP boundaries were reconstructed in four steps. First, the information extracted from Poste Italiane (i.e., the list of addresses and house numbers belonging to each CAP code) was cross-referenced with the ANNCSU database. Next, Voronoi polygons were built around the house addresses and labelled with the assigned CAP. The polygons were then aggregated by CAP using the dissolve() operator and cleaned up by removing holes, sliver polygons, and overlaps. Finally, the resulting CAP zones were clipped to the municipality-level boundaries from ISTAT.
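The geometric part of this process (Voronoi cells, dissolve, clip) can be sketched in a few lines of GeoPandas and Shapely. This is only a minimal illustration under stated assumptions, not the project's actual code: the function name build_cap_zones, the column name cap, the sample coordinates, and the CAP values are all made up, and the cleanup of holes, slivers, and overlaps is left out.

```python
import geopandas as gpd
from shapely.geometry import MultiPoint, Point, box
from shapely.ops import voronoi_diagram


def build_cap_zones(addresses, boundary, cap_col="cap"):
    """Hypothetical helper: Voronoi cells -> dissolve by CAP -> clip."""
    # One Voronoi cell per address point; passing the boundary as envelope
    # makes the diagram cover at least the boundary's bounding box.
    sites = MultiPoint(list(addresses.geometry))
    cells = voronoi_diagram(sites, envelope=boundary)
    cells_gdf = gpd.GeoDataFrame(geometry=list(cells.geoms), crs=addresses.crs)

    # Cells are not returned in input order, so each cell gets the CAP of
    # the address it contains through a spatial join.
    # (Assumes the geometry column of `addresses` is named "geometry".)
    joined = gpd.sjoin(
        cells_gdf, addresses[[cap_col, "geometry"]], predicate="contains"
    )

    # Merge all cells that share a CAP into a single zone.
    zones = joined[[cap_col, "geometry"]].dissolve(by=cap_col)

    # Clip the zones to the municipality boundary.
    zones["geometry"] = zones.geometry.intersection(boundary)
    return zones.reset_index()


# Made-up sample data: six address points, two CAP codes, square boundary.
addresses = gpd.GeoDataFrame(
    {"cap": ["00100", "00100", "00100", "00200", "00200", "00200"]},
    geometry=[
        Point(1.0, 1.0), Point(1.2, 2.7), Point(0.4, 1.9),
        Point(3.1, 0.8), Point(2.6, 3.2), Point(3.5, 2.1),
    ],
)
municipality = box(0, 0, 4, 4)

zones = build_cap_zones(addresses, municipality)
print(zones)
```

The spatial join is used because the Voronoi cells come back in no particular order, so they cannot be matched to the input points by position. With real data, the addresses and the boundary would need to share a projected coordinate reference system before the diagram is built.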
The pipeline is built on the following libraries:
- Playwright: An open-source automation library for browser testing and web scraping developed by Microsoft;
- Playwright-reCAPTCHA: A Python library for solving Google reCAPTCHA v2 and v3 with Playwright;
- PolyFuzz: A library for fuzzy string matching, i.e., matching text strings that are similar but not identical;
- Shapely: A specialized library for manipulation and analysis of planar geometric objects;
- GeoPandas: An open-source project that adds support for geographic data to pandas objects.
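Cross-referencing two address sources usually means matching street names that are spelled slightly differently, which is presumably where fuzzy matching comes in. The sketch below illustrates the idea with Python's standard-library difflib as a stand-in; it does not use PolyFuzz or reproduce the pipeline's matching logic. The helper names, the 0.75 threshold, and the street names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(street):
    """Uppercase, drop dots, and collapse repeated whitespace."""
    return " ".join(street.upper().replace(".", " ").split())


def best_match(street, candidates, min_score=0.75):
    """Return (candidate, score) for the most similar candidate,
    or (None, score) if even the best score is below min_score."""
    target = normalize(street)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored, key=lambda pair: pair[0])
    return (match, score) if score >= min_score else (None, score)


# Made-up street names standing in for the two sources.
reference_streets = [
    "VIA GIUSEPPE GARIBALDI",
    "VIA GIUSEPPE MAZZINI",
    "PIAZZA GARIBALDI",
]

matched, score = best_match("Via G. Garibaldi", reference_streets)
unmatched, low_score = best_match("Corso Italia", reference_streets)
print(matched, round(score, 2))
print(unmatched, round(low_score, 2))
```

A similarity threshold keeps unrelated streets from being forced onto the nearest candidate; the right value depends on the data and would need tuning against manually checked matches.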