RustCrawler is designed following Rust best practices with a focus on modularity, testability, and maintainability.
- Client Module: HTTP client configuration isolated from business logic
- Models Module: Data structures and validation separate from implementation
- Crawlers Module: Each crawler type has its own module
- Utils Module: Reusable utility functions for I/O and display
The Crawler trait defines a common interface for all crawlers:
```rust
pub trait Crawler {
    fn analyze(&self, client: &HttpClient, url: &str)
        -> Result<CrawlerResults, Box<dyn std::error::Error>>;
    fn name(&self) -> &str;
}
```

This allows:
- Easy addition of new crawler types
- Polymorphic handling of crawlers
- Consistent API across all crawlers
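As a minimal, self-contained sketch of the polymorphic handling (with placeholder `HttpClient` and `CrawlerResults` types standing in for the real structs), crawlers can be stored as trait objects and driven through the common interface:

```rust
use std::error::Error;

// Placeholder types standing in for the real client and results structs.
pub struct HttpClient;
pub struct CrawlerResults;

pub trait Crawler {
    fn analyze(&self, client: &HttpClient, url: &str)
        -> Result<CrawlerResults, Box<dyn Error>>;
    fn name(&self) -> &str;
}

struct SeoCrawler;

impl Crawler for SeoCrawler {
    fn analyze(&self, _client: &HttpClient, _url: &str)
        -> Result<CrawlerResults, Box<dyn Error>> {
        Ok(CrawlerResults)
    }
    fn name(&self) -> &str { "SEO Crawler" }
}

// Trait objects let heterogeneous crawlers be run uniformly.
pub fn run_all(crawlers: &[Box<dyn Crawler>], client: &HttpClient, url: &str)
    -> Result<Vec<String>, Box<dyn Error>> {
    let mut names = Vec::new();
    for crawler in crawlers {
        crawler.analyze(client, url)?;
        names.push(crawler.name().to_string());
    }
    Ok(names)
}
```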
- Uses `Result<T, E>` for all fallible operations
- Custom error types where appropriate
- Clear error messages for user-facing operations
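A hypothetical custom error type (the project's actual variants may differ) illustrates the pattern of pairing `Result` with user-facing messages:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical error type; shown only to illustrate the pattern of
// custom errors with clear, user-facing Display messages.
#[derive(Debug)]
pub enum CrawlerError {
    InvalidUrl(String),
    FetchFailed(String),
}

impl fmt::Display for CrawlerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CrawlerError::InvalidUrl(url) => write!(f, "invalid URL: {}", url),
            CrawlerError::FetchFailed(msg) => write!(f, "fetch failed: {}", msg),
        }
    }
}

impl Error for CrawlerError {}

// A fallible operation returns Result, forcing callers to handle failure.
pub fn check_scheme(url: &str) -> Result<(), CrawlerError> {
    if url.starts_with("http://") || url.starts_with("https://") {
        Ok(())
    } else {
        Err(CrawlerError::InvalidUrl(url.to_string()))
    }
}
```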
- Unit tests in each module
- Test coverage for:
- Model validation
- URL parsing
- Crawler creation
- Data structure operations
- 13 tests currently passing
Purpose: HTTP client configuration and management
Key Components:
- `HttpClient`: Wrapper around `reqwest::blocking::Client`
- 30-second timeout configuration
- Reusable across multiple requests
Tests: Client creation and configuration
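A dependency-free sketch of the wrapper pattern (the real `HttpClient` would hold a `reqwest::blocking::Client` built with this timeout; here only the configuration is kept so the example compiles on its own):

```rust
use std::time::Duration;

// Sketch only: the real struct wraps reqwest::blocking::Client.
pub struct HttpClient {
    timeout: Duration,
}

impl HttpClient {
    pub fn new() -> Self {
        // 30-second timeout prevents requests from hanging indefinitely.
        HttpClient { timeout: Duration::from_secs(30) }
    }

    pub fn timeout(&self) -> Duration {
        self.timeout
    }
}
```

Constructing the client once and passing it by reference to each crawler is what allows connection reuse across requests.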
Purpose: Core data structures and validation
Key Components:
- `UrlInfo`: Server response metadata
- `CrawlerResults`: Analysis results container
- `CrawlerSelection`: User's crawler choices
- `validate_url()`: URL validation function
Tests: URL validation, selection logic, results manipulation
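A hypothetical sketch of what `validate_url()` might check using plain string matching (the real implementation may use a full URL parser):

```rust
// Hypothetical sketch of validate_url(); shown for illustration only.
pub fn validate_url(url: &str) -> Result<(), String> {
    if !(url.starts_with("http://") || url.starts_with("https://")) {
        return Err(format!("URL must start with http:// or https://: {}", url));
    }
    // Require a non-empty host after the scheme.
    let rest = url.splitn(2, "://").nth(1).unwrap_or("");
    if rest.is_empty() || rest.starts_with('/') {
        return Err(format!("URL has no host: {}", url));
    }
    Ok(())
}
```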
Purpose: I/O and display utilities
Key Components:
- `get_url_input()`: User input for URL
- `get_yes_no_input()`: Boolean prompts
- `display_results()`: Formatted output of crawler results
Tests: Input logic verification
Purpose: Crawler trait and shared functionality
Key Components:
- `Crawler` trait definition
- `fetch_page_content()`: Common HTTP GET operation
- `fetch_page_with_timing()`: GET with performance metrics
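The timing idea behind `fetch_page_with_timing()` can be sketched without the HTTP layer: run a fallible operation and report how long it took. The real function performs the GET itself; here the operation is passed in as a closure so the example stays self-contained:

```rust
use std::time::{Duration, Instant};

// Generic timing wrapper: runs a fallible operation and measures its
// wall-clock duration. The real fetch_page_with_timing() would wrap an
// HTTP GET here instead of an arbitrary closure.
pub fn with_timing<T, E>(
    op: impl FnOnce() -> Result<T, E>,
) -> (Result<T, E>, Duration) {
    let start = Instant::now();
    let result = op();
    (result, start.elapsed())
}
```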
Purpose: SEO analysis implementation
Features:
- Title tag validation (length, presence)
- Meta description checking
- H1 heading verification
- Canonical URL detection
- Robots meta tag analysis
- Internal link validation (up to 10 links)
Tests: Crawler creation, link extraction
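A hypothetical helper in the spirit of the title validation, using the same string-matching approach the crate currently relies on (a commonly cited SEO guideline puts titles at roughly 10-60 characters; the crate's actual thresholds may differ):

```rust
// Hypothetical sketch: extract the <title> text by string matching and
// flag whether its length falls in a commonly cited SEO range.
pub fn title_check(html: &str) -> Option<(String, bool)> {
    let start = html.find("<title>")? + "<title>".len();
    let end = html[start..].find("</title>")? + start;
    let title = html[start..end].trim().to_string();
    let ok_length = (10..=60).contains(&title.len());
    Some((title, ok_length))
}
```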
Purpose: Performance analysis implementation
Features:
- Response time measurement
- Compression detection (brotli, gzip, deflate)
- Page size analysis
- Script and stylesheet counting
Tests: Crawler creation
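Compression detection boils down to inspecting the `Content-Encoding` response header; a minimal sketch (note that brotli appears on the wire as the token `br`):

```rust
// Sketch of compression detection from a Content-Encoding header value.
// Returns the human-readable name the crawler would report.
pub fn detect_compression(content_encoding: Option<&str>) -> Option<&'static str> {
    let value = content_encoding?.to_ascii_lowercase();
    let tokens: Vec<&str> = value.split(',').map(str::trim).collect();
    if tokens.contains(&"br") {
        Some("brotli")
    } else if tokens.contains(&"gzip") {
        Some("gzip")
    } else if tokens.contains(&"deflate") {
        Some("deflate")
    } else {
        None
    }
}
```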
Purpose: Accessibility analysis implementation
Features:
- HTML lang attribute checking
- Image alt attribute validation
- ARIA attribute detection
- Semantic HTML5 tag verification
- Form label association
- Skip navigation link detection
Tests: Crawler creation, alt attribute checking
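A hypothetical sketch of the alt-attribute check, again using the string matching the crate currently employs rather than a real HTML parser:

```rust
// Hypothetical sketch: count <img> tags that lack an alt attribute.
// String matching is approximate; a proper HTML parser would be more robust.
pub fn imgs_missing_alt(html: &str) -> usize {
    html.match_indices("<img")
        .filter(|&(idx, _)| {
            // Inspect only this tag, up to its closing '>'.
            let tag_end = html[idx..].find('>').map_or(html.len(), |e| idx + e);
            !html[idx..tag_end].contains("alt=")
        })
        .count()
}
```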
Purpose: Application entry point
Responsibilities:
- User interaction flow
- Orchestration of crawlers
- Display coordination
Design: Thin layer that uses library modules
- Create a new file in `src/crawlers/`
- Implement the `Crawler` trait
- Add a module declaration in `src/crawlers/mod.rs`
- Export it in `src/lib.rs`
- Update `main.rs` to include it in the selection menu
Example:
```rust
pub struct SecurityCrawler;

impl Crawler for SecurityCrawler {
    fn name(&self) -> &str {
        "Security Crawler"
    }

    fn analyze(&self, client: &HttpClient, url: &str)
        -> Result<CrawlerResults, Box<dyn std::error::Error>> {
        // Implementation goes here
        todo!()
    }
}
```

- Rust Naming Conventions: snake_case for functions, PascalCase for types
- Documentation: Inline documentation for all public APIs
- Module Organization: Clear hierarchy and logical grouping
- Error Handling: Proper Result types, no unwrap in production paths
- Testing: Unit tests for testable logic
- Type Safety: Strong typing, minimal use of `String` where specific types work
- Ownership: Proper use of references vs. owned values
- Trait Usage: Polymorphism through traits rather than inheritance
- HTTP Client Reuse: Single client instance for all requests
- Timeout Configuration: 30-second timeout to prevent hangs
- Limited Link Checking: Only checks first 10 internal links to avoid excessive requests
- Blocking I/O: Uses blocking client for simplicity (async could be added later)
- Async/Await: Convert to async for better concurrency
- Parallel Crawling: Run multiple crawlers concurrently
- HTML Parsing: Use a proper HTML parser (e.g., the `scraper` crate) instead of string matching
- Configuration: External config file for timeouts, limits, etc.
- Reporting: JSON/HTML output options
- Integration Tests: End-to-end tests with mock servers
- Error Recovery: Retry logic for transient failures