🕷️ Nightcrawler

Nightcrawler is an asynchronous web crawler designed to extract email addresses from websites by recursively navigating through linked pages while staying within the same domain.

🛠️ Tech Stack

Python: Core programming language
Selenium: Web automation and interaction
undetected_chromedriver: Browser automation that bypasses anti-bot measures
MongoDB: Database for storing extracted emails
motor: Asynchronous MongoDB driver for Python
asyncio: Python's asynchronous I/O framework
googlesearch-python: For initial website discovery
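
As a rough illustration of how motor and asyncio can work together to store emails without duplicates, here is a minimal sketch. The database name `nightcrawler`, the collection name `emails`, the field names, and the helper functions are all assumptions made for this example, not the project's actual schema.

```python
from datetime import datetime, timezone


def build_upsert(email, website):
    """Return a (filter, update) pair that inserts an email once per website."""
    normalized = email.strip().lower()
    query = {"email": normalized, "website": website}
    # $setOnInsert only applies when the upsert creates a new document.
    update = {"$setOnInsert": {"first_seen": datetime.now(timezone.utc)}}
    return query, update


async def save_email(collection, email, website):
    """Upsert one email; `collection` is expected to be a motor collection."""
    query, update = build_upsert(email, website)
    await collection.update_one(query, update, upsert=True)


async def main(mongo_url):
    # Imported here so the helpers above can be used without motor installed.
    from motor.motor_asyncio import AsyncIOMotorClient

    client = AsyncIOMotorClient(mongo_url)
    collection = client["nightcrawler"]["emails"]  # hypothetical names
    await save_email(collection, "Info@Example.com", "https://example.com")
```

To try it against a real database, something like `asyncio.run(main(os.environ["MONGO_URL"]))` should work once motor is installed and MongoDB is reachable.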

✨ Features

🔍 Intelligent Web Crawling: Recursively navigates through website pages while staying within the same domain
📧 Email Extraction: Captures email addresses using both direct link scanning and JavaScript content analysis
🧹 Smart Link Filtering: Automatically skips CDNs, static assets, and irrelevant links
🔄 Duplicate Prevention: Skips URLs that have already been visited and avoids storing duplicate emails
⚡ Asynchronous Operation: Uses Python's asyncio for efficient concurrent operations
💾 Database Integration: Stores extracted emails in MongoDB with website source information
🛡️ Anti-Detection Measures: Uses undetected_chromedriver to avoid bot detection
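
The same-domain rule, link filtering, and email extraction described above could look roughly like the sketch below. It uses only the standard library. The extension list, the regular expression, and the function names are assumptions for illustration; the rules in app.py may differ.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumed list of static-asset extensions to skip.
SKIP_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif",
                   ".svg", ".ico", ".pdf", ".zip")

# A deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def should_follow(link, base_url):
    """Return True if `link` is an http(s) page on the same host as `base_url`."""
    target = urlparse(urljoin(base_url, link))
    base = urlparse(base_url)
    if target.scheme not in ("http", "https"):
        return False  # drops mailto:, javascript:, tel:, ...
    if target.netloc.lower() != base.netloc.lower():
        return False  # drops other hosts, including third-party CDNs
    return not target.path.lower().endswith(SKIP_EXTENSIONS)


def extract_emails(page_source):
    """Return the set of lowercased email-like strings found in the text."""
    return {match.lower() for match in EMAIL_RE.findall(page_source)}
```

A pattern this loose can also match asset names such as `logo@2x.png`, so a real crawler would probably filter the results further.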

📁 File Structure

nightcrawler/
│
├── app.py              # Main application file with crawler logic
├── .env                # Environment variables configuration
├── README.md           # Project documentation
└── requirements.txt    # Project dependencies

🔧 Installation

Clone the repository:

git clone https://github.com/yourusername/nightcrawler.git
cd nightcrawler

Install dependencies:

pip install -r requirements.txt

Install Brave Browser if it is not already installed.

⚙️ Environment Configuration

Create a .env file in the root directory with the following variables:

# MongoDB connection string
MONGO_URL=mongodb://username:password@host:port/database

# Optional: Custom browser paths
# BRAVE_PATH=C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe
# USER_DATA_DIR=C:/Users/username/AppData/Local/BraveSoftware/Brave-Browser/User Data

Required environment variables:

MONGO_URL: Your MongoDB connection string
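
A small startup check can make a missing or mistyped MONGO_URL fail early with a clear message. The sketch below is an assumption about how such a check might look; `get_mongo_url` is a hypothetical helper, and it only reads the process environment. Values in .env reach `os.environ` only if something loads the file first (for example a dotenv loader), which this README does not specify.

```python
import os


def get_mongo_url(env=os.environ):
    """Return MONGO_URL, or raise a clear error if it is missing or malformed."""
    url = env.get("MONGO_URL", "").strip()
    if not url:
        raise RuntimeError("MONGO_URL is not set; add it to .env or the environment")
    if not url.startswith(("mongodb://", "mongodb+srv://")):
        raise RuntimeError("MONGO_URL must start with mongodb:// or mongodb+srv://")
    return url
```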

🔌 Configuration

Before running the application, you need to:

Create and configure the .env file (see above)
Verify the Brave browser path is correct for your system
Check the user data directory path is valid for your system

You can customize these paths directly in the code or via environment variables:

import os
# Fall back to typical Windows install locations when the variables are not set.
brave_path = os.getenv("BRAVE_PATH", "C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe")
userdatadir = os.getenv("USER_DATA_DIR", r"C:\Users\yourusername\AppData\Local\BraveSoftware\Brave-Browser\User Data")
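
To verify both paths before launching the browser, a check along these lines may help. `check_browser_paths` is a hypothetical helper written for this example and is not part of app.py.

```python
from pathlib import Path


def check_browser_paths(brave_path, user_data_dir):
    """Return a list of human-readable problems; an empty list means both paths exist."""
    problems = []
    if not Path(brave_path).is_file():
        problems.append(f"Brave executable not found: {brave_path}")
    if not Path(user_data_dir).is_dir():
        problems.append(f"User data directory not found: {user_data_dir}")
    return problems
```

Calling `check_browser_paths(brave_path, userdatadir)` at startup and printing any returned messages would surface a wrong path before the crawl begins.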

🚀 Usage

Run the application:

python app.py

The application will:
- Search Google (via googlesearch-python) for the target website using the term "SITE NAME Website"
- Start crawling from the first search result
- Extract and store email addresses
- Display a count of pages visited when complete
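
The crawl-and-count flow above can be sketched as a breadth-first loop with a visited set. This is an illustration under assumptions, not the code in app.py: it uses an explicit queue rather than recursion, and `fetch` is a hypothetical coroutine that returns the links and emails found on one page (in the real project that work would be done through the browser).

```python
import asyncio
from collections import deque
from urllib.parse import urljoin, urlparse


async def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of a single host.

    `fetch` is a hypothetical coroutine: fetch(url) -> (links, emails).
    Returns (visited_urls, emails_found).
    """
    host = urlparse(start_url).netloc
    visited, emails = set(), set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # never fetch the same page twice
        visited.add(url)
        links, found = await fetch(url)
        emails.update(found)  # a set drops duplicate addresses
        for link in links:
            absolute = urljoin(url, link).split("#")[0]  # resolve and drop fragment
            if urlparse(absolute).netloc == host and absolute not in visited:
                queue.append(absolute)
    return visited, emails
```

With a suitable `fetch`, `visited, emails = asyncio.run(crawl(start_url, fetch))` followed by `print(len(visited))` would give the page count reported at the end of a run.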