**Project Overview**

I am building an Australian grocery price comparison app that compares product prices across Woolworths and Coles. There is an existing open-source project that collects grocery price data: https://github.com/tjhowse/aus_grocery_price_database

I would like a developer to extend and improve this project so that it reliably collects, cleans, and structures grocery price data suitable for a product comparison application. The goal is to produce a clean, structured dataset in which the same product across different supermarkets can be compared easily.

---

**Scope of Work**

**1. Improve Data Collection**

- Review and extend the existing spiders/modules in the repository.
- Ensure reliable scraping of Woolworths and Coles product listings.
- Ensure the scraper can run automatically on a weekly schedule (this may extend to daily).
- Handle pagination, category traversal, and basic anti-bot protections where necessary.

**2. Product Normalisation & Matching**

- Implement a data cleaning pipeline that standardises product information.
- Normalise:
  - Product names
  - Brand spelling
  - Size formats (e.g. `2L`, `2000ml`, `2 litre`)
  - Units (`g`, `kg`, `ml`, `L`)
- Create canonical product records so identical products across stores share one product ID.

Example:

Canonical Product: Dairy Farmers Full Cream Milk 2L

Store Variants
  • Woolworths – $3.35
  • Coles – $3.20

This step is critical so the data can power a price comparison app.

**3. Data Storage**

- Persist each scrape as a compressed JSON dataset.
- Store files in a versioned folder structure such as:

```
data/
  2026-03-01/
  2026-03-08/
  2026-03-15/
```

Each dataset should contain:

* canonical_product_id
* store
* product_name
* brand
* size
* price
* promo_price (if available)
* unit_price
* scrape_timestamp

**4. CLI / Workflow Automation**

Provide a simple command to run the entire workflow. Example:

```
make run-scrape
```

or

```
python run_pipeline.py
```

The command should:

1. Run all store scrapers
2. Clean and normalise the data
3. Match products into canonical records
4. Output the dataset
5. Log errors or failures

This command will eventually be scheduled using cron.

**5. Change Detection**

Generate a differential report between runs showing:

* price changes
* new products
* removed products

Example output:

```
Price Changes
-------------
Milk 2L (Coles)         $3.40 → $3.20
Bread 700g (Woolworths) $2.90 → $3.10
```

**6. Documentation**

Update the repository README to include:

* setup instructions
* Python package requirements
* environment variables
* how to run the pipeline
* troubleshooting notes

---

**Acceptance Criteria**

1. Running the pipeline command produces a clean JSON dataset containing:

```
canonical_product_id
store
product_name
brand
size
price
promo_price
unit_price
scrape_timestamp
```

2. A price change report is generated comparing the latest dataset with the previous run.
3. The scraper completes both supermarkets (Woolworths and Coles) without manual CAPTCHA steps.
4. Code integrates cleanly with the existing repository structure.

---

**Technical Requirements**

* Python only
* Reuse the existing repository structure where possible
* Keep dependencies lightweight

Allowed libraries:

* requests
* beautifulsoup4
* pandas

Avoid heavy frameworks.

---

**Nice to Have (Optional)**

* Barcode extraction if available
* Product similarity matching using NLP
* Unit price normalisation

---

**Collaboration**

I am happy to:

* review pull requests regularly
* test interim builds
* provide quick feedback

The goal is to build a reliable grocery price dataset that will power the grocery comparison app.
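To make the reliability requirement in section 1 concrete, a retry-with-backoff wrapper is one lightweight option. This is a sketch under stated assumptions: the injected `fetch` callable is a hypothetical seam (so the policy can be tested offline); in the real pipeline it would wrap `requests.get` with a timeout and a polite User-Agent header.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, backoff=2.0):
    """Call fetch(url), retrying with exponential backoff on any exception.

    `fetch` is injected rather than hard-coded to requests.get so the retry
    policy can be exercised without touching the network.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the pipeline log
            time.sleep(backoff * 2 ** attempt)  # waits 2s, 4s, 8s, ...
```

Keeping the fetch function injectable also makes it easy to swap in per-store request logic later without changing the retry policy.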
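The size and unit normalisation described in section 2 could be sketched as below. The unit table and regex are assumptions to be refined against real scraped data; the canonical units (`ml` for volume, `g` for weight) and the per-100 unit-price convention are illustrative choices, not requirements.

```python
import re

# Map of unit spellings to (canonical unit, multiplier into that unit).
# Assumed starting set; extend as new formats appear in scraped listings.
UNIT_FACTORS = {
    "ml": ("ml", 1.0),
    "l": ("ml", 1000.0),
    "litre": ("ml", 1000.0),
    "litres": ("ml", 1000.0),
    "g": ("g", 1.0),
    "kg": ("g", 1000.0),
}

SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")


def normalise_size(size_text):
    """Parse a size string like '2L', '2000ml' or '2 litre' into
    (amount, canonical_unit), or None if it can't be parsed."""
    match = SIZE_RE.search(size_text)
    if not match:
        return None
    amount, unit = float(match.group(1)), match.group(2).lower()
    if unit not in UNIT_FACTORS:
        return None
    canonical_unit, factor = UNIT_FACTORS[unit]
    return amount * factor, canonical_unit


def unit_price(price, size_text, per=100.0):
    """Price per 100 ml / 100 g, rounded to cents; None if size is unparseable."""
    parsed = normalise_size(size_text)
    if not parsed:
        return None
    amount, _ = parsed
    return round(price / amount * per, 2)
```

With this, `2L`, `2000ml` and `2 litre` all normalise to the same `(2000.0, "ml")` record, which is the property the matching step depends on.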
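For the canonical product IDs in section 2, one possible approach is hashing a normalised brand + name + size key, so the same product yields the same ID on every run. This is a sketch, not a prescription; the developer may prefer fuzzy or NLP-based matching as noted under "Nice to Have".

```python
import hashlib
import re


def normalise_name(text):
    """Lowercase, strip punctuation and collapse whitespace so store-specific
    formatting differences don't produce different keys."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def canonical_product_id(brand, name, size):
    """Derive a stable 12-hex-char ID from the normalised brand|name|size key."""
    key = "|".join([
        normalise_name(brand),
        normalise_name(name),
        size.lower().replace(" ", ""),  # '2 L' and '2L' collapse to '2l'
    ])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
```

A deterministic key has the advantage that re-running the pipeline never reassigns IDs, so week-over-week diffs stay stable.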
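The versioned storage layout in section 3 might be written as follows, using only the standard library. The file name `products.json.gz` is an assumption; the folder naming follows the brief's `data/YYYY-MM-DD/` structure.

```python
import gzip
import json
from datetime import date
from pathlib import Path


def write_dataset(rows, base_dir="data", run_date=None):
    """Write one scrape run as gzip-compressed JSON under data/YYYY-MM-DD/.

    `rows` is a list of dicts following the dataset schema
    (canonical_product_id, store, product_name, ...). Returns the path written.
    """
    run_date = run_date or date.today().isoformat()
    folder = Path(base_dir) / run_date
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "products.json.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(rows, fh)
    return path
```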
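The change detection in section 5 could be sketched as a keyed diff between two runs. Field names follow the dataset schema above; keying on `(canonical_product_id, store)` is what makes the diff meaningful across runs.

```python
def diff_datasets(previous, current):
    """Compare two scrape runs, each a list of dicts with at least
    canonical_product_id, store, product_name and price.

    Returns a report dict with price changes, new products and removed
    products, matching the report categories in the brief.
    """
    prev = {(r["canonical_product_id"], r["store"]): r for r in previous}
    curr = {(r["canonical_product_id"], r["store"]): r for r in current}
    report = {"price_changes": [], "new_products": [], "removed_products": []}
    for key, row in curr.items():
        if key not in prev:
            report["new_products"].append(row)
        elif row["price"] != prev[key]["price"]:
            report["price_changes"].append({
                "product": row["product_name"],
                "store": row["store"],
                "old": prev[key]["price"],
                "new": row["price"],
            })
    for key, row in prev.items():
        if key not in curr:
            report["removed_products"].append(row)
    return report
```

Rendering the `price_changes` entries as `Milk 2L (Coles) $3.40 → $3.20` lines then reproduces the example report format.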