Scraping the official Italian Addresses from ANNCSU
In this blog post, we will explore the scraping workflow I’ve developed to collect, process and store the official Italian addresses from ANNCSU - Archivio Nazionale dei Numeri Civici delle Strade Urbane.
The toolkit includes an Android smartphone , Python 3.11 (via Termux terminal) and Caffeine Android app. The latter ensures that the device remains awake during the scraping process, preventing it from going into sleep mode. The full pipeline is shown below:
Key Steps
1. JSESSIONID and CAPTCHA bypass (mandatory)
Before diving into the scraping pipeline, it is necessary to solve the CAPTCHA challenge and grab the JSESSIONID value to be used in the subsequent requests. Follow these steps:
- Capture the JSESSIONID value from the Application ► Cookie tab of the browser’s DevTools or from the Chrome/Edge WebDriver (if using Selenium or Playwright, which is preferred);
-
Solve the CAPTCHA challenge manually, or automatically with Selenium or Playwright. In the latter case, resolve the audio challenge and send a POST request to the c.json API with the following payload:
{ code: "RESOLVED_CAPTCHA_TEXT" }
2. Obtaining Municipality Data
The second step is to retrieve the list of all Italian municipalities along with their cadastral codes (codcom). You can achieve this by sending a POST request to the get.json endpoint with the following payload:
{
req: "comdenomparz",
denom: "Magic Word"
}
The Magic Word (undisclosed) cleverly bypasses the autocomplete mechanism, allowing us to obtain all the data in a single POST request!
Note: remember to include the JSESSIONID value in the request headers!
3. Accessing Street Directory of the municipality
To access street directory for a specific municipality, you need to send a single POST request to the get.json endpoint with the following input payload:
{
req: "listodocodcom2",
denom: "Magic Word",
codcom: e.g. H501
}
Note: remember to include the JSESSIONID value in the request headers!
4. Collecting House Addresses
For efficient collection of all the house addresses within the street directory, I recommend sending multiple asynchronous POST requests to the get.json[^5] endpoint with the following parameters:
{
req: "listaccodocodcom2",
denom: e.g. VIA ROMA,
codcom: e.g. H501,
accparz: "Magic Word"
}
Note: remember to include the JSESSIONID value in the request headers!
5. Data cleaning
After retrieving all the house addresses for a specific municipality, perform the following cleaning steps:
- Remove the duplicates;
- Eliminate empty geometries;
- Exclude the points outside the Italian boundaries (Bounding Box);
Finally, write the obtained data into a geoparquet
file in “append” mode with ZSTD compression enabled.
Conclusion
By following these steps, you will be able to obtain the entire dataset (+26M data points!!!) of the Italian addresses into a compact geoparquet
file (~180mb), ready for exploration and mapping.
Click HERE to explore the leaflet map of Florence!!!
Happy scraping!