Nightcrawler
March 2025 • Harshit Raj
Web Crawler, Email Extraction, Python, Selenium, Scraping, Automation, Data Mining
Nightcrawler
Nightcrawler is an asynchronous web crawler designed to extract email addresses from websites by recursively navigating through linked pages while staying within the same domain.
Tech Stack
- Python: Core programming language
- Selenium: Web automation and interaction
- undetected_chromedriver: Browser automation that bypasses anti-bot measures
- MongoDB: Database for storing extracted emails
- motor: Asynchronous MongoDB driver for Python
- asyncio: Python's asynchronous I/O framework
- googlesearch-python: For initial website discovery
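As a rough illustration of how motor and asyncio from the stack above might fit together, here is a minimal sketch of storing an email without creating duplicates. The helper names (`build_upsert`, `save_email`), the database name `nightcrawler`, the collection name `emails`, and the `email`/`website` field names are assumptions made for this example, not necessarily what app.py uses.

```python
import asyncio
import os
from typing import Dict, Tuple


def build_upsert(email: str, website: str) -> Tuple[Dict, Dict]:
    """Return (filter, update) documents that insert an email only once."""
    normalized = email.strip().lower()
    # On insert, MongoDB copies the email from the filter; $setOnInsert adds
    # the source website and leaves already stored documents untouched.
    return {"email": normalized}, {"$setOnInsert": {"website": website}}


async def save_email(collection, email: str, website: str) -> bool:
    """Upsert one email; return True if it was newly stored."""
    query, update = build_upsert(email, website)
    result = await collection.update_one(query, update, upsert=True)
    return result.upserted_id is not None


async def main() -> None:
    # Imported here so the helpers above can be used without motor installed.
    from motor.motor_asyncio import AsyncIOMotorClient

    client = AsyncIOMotorClient(os.environ["MONGO_URL"])
    collection = client["nightcrawler"]["emails"]  # hypothetical names
    is_new = await save_email(collection, "Info@Example.com", "https://example.com")
    print("stored" if is_new else "already known")


# To try it against a real database: asyncio.run(main())
```

Using an upsert keyed on the normalized address lets the database enforce uniqueness, so the crawler does not need a separate lookup before each write.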
Features
- Intelligent Web Crawling: Recursively navigates through website pages while staying within the same domain
- Email Extraction: Captures email addresses using both direct link scanning and JavaScript content analysis
- Smart Link Filtering: Automatically skips CDNs, static assets, and irrelevant links
- Duplicate Prevention: Prevents re-crawling of visited URLs and storing duplicate emails
- Asynchronous Operation: Uses Python's asyncio for efficient concurrent operations
- Database Integration: Stores extracted emails in MongoDB with website source information
- Anti-Detection Measures: Uses undetected_chromedriver to avoid bot detection
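The email extraction feature could be approximated as follows. This is a sketch under assumptions: the regular expression is a simplified pattern rather than a full address validator, and the function names `emails_from_links` and `emails_from_text` are hypothetical. In the real crawler the hrefs and page text would come from Selenium; here they are plain strings so the logic can be tried in isolation.

```python
import re
from typing import Iterable, Set
from urllib.parse import unquote, urlsplit

# Simplified pattern (an assumption); it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_from_links(hrefs: Iterable[str]) -> Set[str]:
    """Collect addresses from mailto: links (direct link scanning)."""
    found = set()
    for href in hrefs:
        parts = urlsplit(href)
        if parts.scheme.lower() != "mailto":
            continue
        # A mailto path may hold several comma-separated addresses.
        for address in unquote(parts.path).split(","):
            address = address.strip().lower()
            if EMAIL_RE.fullmatch(address):
                found.add(address)
    return found


def emails_from_text(text: str) -> Set[str]:
    """Collect addresses from rendered page text or script content."""
    return {match.lower() for match in EMAIL_RE.findall(text)}


hrefs = ["mailto:Sales@Example.com?subject=Hi", "https://example.com/about"]
text = "Contact us at info@example.com. Or write to Info@Example.com!"
print(sorted(emails_from_links(hrefs) | emails_from_text(text)))
```

Lowercasing every address and collecting results in a set means the same email found on several pages is kept only once.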
File Structure
nightcrawler/
│
├── app.py             # Main application file with crawler logic
├── .env               # Environment variables configuration
├── README.md          # Project documentation
└── requirements.txt   # Project dependencies
Installation
- Clone the repository:
git clone https://github.com/yourusername/nightcrawler.git
cd nightcrawler
- Install dependencies:
pip install -r requirements.txt
- Install Brave Browser if not already installed.
Environment Configuration
Create a .env file in the root directory with the following variables:
# MongoDB connection string
MONGO_URL=mongodb://username:password@host:port/database
# Optional: Custom browser paths
# BRAVE_PATH=C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe
# USER_DATA_DIR=C:/Users/username/AppData/Local/BraveSoftware/Brave-Browser/User Data
Required environment variables:
- MONGO_URL: Your MongoDB connection string
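To fail early when the required variable is missing, the application could validate its settings at startup. The sketch below assumes the values from .env have already been loaded into the process environment (for example by a loader such as python-dotenv, which is an assumption and is not listed in the tech stack above). The function name `load_settings` is hypothetical.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read crawler settings and insist that MONGO_URL is present."""
    env = os.environ if env is None else env
    mongo_url = env.get("MONGO_URL", "").strip()
    if not mongo_url:
        raise RuntimeError("MONGO_URL is not set; add it to .env or the environment")
    return {
        "mongo_url": mongo_url,
        # Optional overrides; None means "use the default in the code".
        "brave_path": env.get("BRAVE_PATH"),
        "user_data_dir": env.get("USER_DATA_DIR"),
    }


settings = load_settings({"MONGO_URL": "mongodb://localhost:27017/nightcrawler"})
print(settings["mongo_url"])
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to test without touching the real environment.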
Configuration
Before running the application, you need to:
- Create and configure the .env file (see above)
- Verify the Brave browser path is correct for your system
- Check the user data directory path is valid for your system
You can customize these paths directly in the code or via environment variables:
brave_path = os.getenv("BRAVE_PATH", "C:/Program Files/BraveSoftware/Brave-Browser/Application/brave.exe")
userdatadir = os.getenv("USER_DATA_DIR", r"C:\Users\yourusername\AppData\Local\BraveSoftware\Brave-Browser\User Data")
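Since both paths vary between machines, a quick check before launching the browser can save a confusing startup error. This is a minimal sketch; `check_browser_paths` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import List


def check_browser_paths(brave_path: str, user_data_dir: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not Path(brave_path).is_file():
        problems.append(f"Brave executable not found: {brave_path}")
    if not Path(user_data_dir).is_dir():
        problems.append(f"User data directory not found: {user_data_dir}")
    return problems


# Example with placeholder paths; substitute the values resolved above.
for problem in check_browser_paths("/no/such/brave", "/no/such/profile"):
    print(problem)
```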
Usage
- Run the application:
python app.py
- The application will:
  - Search for the target website using the term "SITE NAME Website"
  - Start crawling from the first search result
  - Extract and store email addresses
  - Display a count of pages visited when complete
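The crawl phase described above can be sketched as a loop over a queue of URLs with a visited set. Note the assumptions: this version uses an iterative queue rather than recursion, and the page loader is passed in as a `fetch` coroutine that returns `(links, emails)`, standing in for the Selenium-driven page load. The names `crawl`, `fetch`, and `max_pages` are hypothetical.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Iterable, Set, Tuple
from urllib.parse import urldefrag, urljoin, urlsplit

Fetch = Callable[[str], Awaitable[Tuple[Iterable[str], Iterable[str]]]]


async def crawl(start_url: str, fetch: Fetch, max_pages: int = 100) -> Tuple[Set[str], Set[str]]:
    """Visit same-domain pages once each; return (visited URLs, emails)."""
    domain = urlsplit(start_url).netloc.lower()
    queue = deque([start_url])
    visited: Set[str] = set()
    emails: Set[str] = set()
    while queue and len(visited) < max_pages:
        url = urldefrag(queue.popleft())[0]  # ignore #fragments
        if url in visited:
            continue
        visited.add(url)
        links, found = await fetch(url)
        emails.update(found)
        for link in links:
            absolute = urldefrag(urljoin(url, link))[0]
            same_domain = urlsplit(absolute).netloc.lower() == domain
            if same_domain and absolute not in visited:
                queue.append(absolute)
    return visited, emails


async def demo_fetch(url: str):
    # Tiny in-memory site used only for this demonstration.
    site = {
        "https://example.com/": (["/about", "https://other.net/x"], ["a@example.com"]),
        "https://example.com/about": (["/"], ["b@example.com"]),
    }
    return site.get(url, ([], []))


visited, emails = asyncio.run(crawl("https://example.com/", demo_fetch))
print(f"Visited {len(visited)} pages, found {len(emails)} emails")
```

The `max_pages` cap is an added safeguard for the example so a large site cannot keep the loop running indefinitely.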
Customization
- Modify the getWebsite() function to change how target websites are selected
- Adjust CDN_KEYWORDS and STATIC_EXTENSIONS lists to refine link filtering
- Update browser options and preferences in the configuration section
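To show how the two filter lists might be applied, here is a sketch of a link filter. The list contents below are illustrative assumptions, not the values shipped in app.py, and `should_crawl` is a hypothetical function name. The keyword match is a plain substring test, so it is deliberately crude.

```python
from urllib.parse import urlsplit

# Example values only (assumptions); tune these for the sites being crawled.
CDN_KEYWORDS = ["cdn", "cloudflare", "googleapis", "gstatic"]
STATIC_EXTENSIONS = [".css", ".js", ".png", ".jpg", ".svg", ".pdf", ".zip"]


def should_crawl(url: str, base_domain: str) -> bool:
    """Return True if the URL looks like a same-domain HTML page."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # skips mailto:, tel:, javascript:, and similar
    host = parts.netloc.lower()
    path = parts.path.lower()
    if any(keyword in host + path for keyword in CDN_KEYWORDS):
        return False
    if path.endswith(tuple(STATIC_EXTENSIONS)):
        return False
    return host == base_domain.lower()


for candidate in [
    "https://example.com/about",
    "https://example.com/static/app.js",
    "https://cdn.example.com/page",
]:
    print(candidate, should_crawl(candidate, "example.com"))
```

Checking the extension against the parsed path rather than the full URL keeps query strings such as `?v=2` from hiding a static asset.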
Deployment
Local Deployment
- Ensure Python 3.7+ is installed
- Install Brave browser
- Set up a MongoDB database (local or cloud-based)
- Create and configure the .env file
- Run with python app.py
Cloud Deployment
For cloud deployment, consider:
- Containerizing the application with Docker
- Setting up environment variables for sensitive information
- Using a cloud service that supports Python applications
- Ensuring your deployment environment has access to a compatible browser
Ethics and Legal Considerations
When using this tool, please:
- Respect website terms of service
- Follow robots.txt guidelines
- Implement reasonable rate limiting
- Consider privacy laws regarding email collection and storage
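Two of the points above, robots.txt and rate limiting, can be sketched with the standard library alone. This is one possible approach under assumptions: the user agent string, the `build_robots` helper, and the `RateLimiter` class are hypothetical names, and the robots.txt content is passed in as text so the example runs without network access.

```python
import asyncio
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "nightcrawler-sketch"  # hypothetical user agent string


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Space out requests so they start at least min_interval seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._next_allowed = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        delay = self._next_allowed - now
        # Reserve the next slot before sleeping so concurrent callers queue up.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            await asyncio.sleep(delay)


robots = build_robots("User-agent: *\nDisallow: /private/\n")
print(robots.can_fetch(USER_AGENT, "https://example.com/about"))
print(robots.can_fetch(USER_AGENT, "https://example.com/private/data"))
```

In a crawler, `await limiter.wait()` would be called before each page load, and any URL for which `can_fetch` returns False would be skipped.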
License
[Add your license information here]