# ByteBasket
A distributed data pipeline demonstrating real-time processing and analytics of e-commerce events using modern big data technologies.
```mermaid
flowchart LR
    subgraph Sources
        Web["Website Events"]
        App["Mobile App Events"]
        Cart["Shopping Cart Events"]
        Orders["Order Processing"]
    end

    subgraph Kafka["Apache Kafka"]
        Stream1["User Activity Stream"]
        Stream2["Transaction Stream"]
    end

    subgraph Processing
        RT["Real-time Processing"]
        Batch["Batch Processing"]
        ML["ML Pipeline"]
    end

    subgraph Storage
        Cassandra["Cassandra"]
        HDFS["HDFS"]
    end

    Web --> Stream1
    App --> Stream1
    Cart --> Stream1
    Orders --> Stream2
    Stream1 --> RT
    Stream2 --> RT
    Stream1 --> Batch
    Stream2 --> Batch
    RT --> Cassandra
    Batch --> HDFS
    HDFS --> ML
    ML --> Cassandra

    subgraph Analytics
        direction TB
        A["Real-time Dashboards"]
        B["Recommendation Engine"]
        C["Fraud Detection"]
        D["Analytics"]
    end

    Cassandra --> Analytics
```
## Tech Stack
- Apache Kafka: Event streaming and message queuing
- Apache Spark: Distributed data processing
- Apache Cassandra: NoSQL database for real-time queries
- HDFS: Distributed storage for large datasets
- Python: Primary implementation language
## System Components

### Event Producers
- User activity tracking
- Shopping cart events
- Transaction processing
- Mock data generation for testing (see the producer sketch below)
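
A minimal sketch of what a mock producer in `src/producers/user_events.py` might look like. The topic name and event fields are assumptions, and JSON serialization is used here for brevity; the development guidelines below call for Protocol Buffers:

```python
import json
import random
import time
import uuid

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",
)

def emit_mock_event() -> None:
    # Key by user ID so every event for a user lands on the same partition.
    user_id = f"user-{random.randint(1, 100)}"
    event = {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "event_type": random.choice(["page_view", "search", "add_to_cart"]),
        "timestamp": time.time(),
    }
    producer.send("user-activity", key=user_id, value=event)

if __name__ == "__main__":
    for _ in range(100):
        emit_mock_event()
    producer.flush()
```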
### Data Processing
- Real-time stream processing with Spark Streaming (see the streaming sketch after this list)
- Basic ML pipeline for recommendations
- Fraud detection system
- Analytics computations
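
As a sketch of the real-time path (`stream_processor.py`), the job below reads the activity stream from Kafka with Structured Streaming and counts events per type in one-minute windows. The topic name and schema are assumptions matching the producer sketch above, and the `spark-sql-kafka` connector package must be on the Spark classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("bytebasket-stream").getOrCreate()

# Schema of the JSON events emitted by the mock producer sketch.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("timestamp", DoubleType()),  # epoch seconds
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-activity")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", col("timestamp").cast("timestamp"))
)

# Event counts per type over one-minute tumbling windows.
counts = events.groupBy(
    window(col("event_time"), "1 minute"),
    col("event_type"),
).count()

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```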
### Storage Layer
- Cassandra tables optimized for real-time queries (see the schema sketch below)
- HDFS structure for efficient data access
- Data retention policies
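
A sketch of what "optimized for real-time queries" means here: partition by user and cluster by time, so a user's recent events are a single-partition read. Keyspace and table names are illustrative:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute(
    """
    CREATE KEYSPACE IF NOT EXISTS bytebasket
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """
)

# Partition key user_id groups a user's events on one node; clustering by
# event_time DESC keeps the most recent events first within the partition.
session.execute(
    """
    CREATE TABLE IF NOT EXISTS bytebasket.user_events (
        user_id text,
        event_time timestamp,
        event_type text,
        payload text,
        PRIMARY KEY (user_id, event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
    """
)
```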
### Analytics
- Real-time metrics dashboard (see the query sketch after this list)
- Historical trend analysis
- Product recommendations
- Transaction monitoring
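
Reading those rows back for a dashboard is then a single-partition query (names match the schema sketch above; the user ID is a placeholder):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("bytebasket")

# Clustering order (event_time DESC) makes "latest N events" a fast read.
rows = session.execute(
    "SELECT event_time, event_type FROM user_events WHERE user_id = %s LIMIT 20",
    ("user-42",),
)
for row in rows:
    print(row.event_time, row.event_type)
```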
## Setup and Development

### Prerequisites
- Python 3.12+
- Docker & Docker Compose
- uv package manager
### Project Initialization

```bash
# Create new project
uv init bytebasket
cd bytebasket

# Install dependencies
uv add kafka-python==2.0.2
uv add pyspark
uv add cassandra-driver
uv add grpcio-tools==1.66.1
uv add protobuf==5.27.2
uv add pytest
```
### Kafka Setup

```bash
# Start Kafka cluster (KRaft mode)
docker compose up -d

# Verify Kafka cluster status
docker exec kafka1 kafka-topics.sh --bootstrap-server localhost:9092 --list
```
### Project Structure

```text
bytebasket/
├── .venv/
├── .python-version
├── pyproject.toml
├── uv.lock
├── docker-compose.yml
├── config/
│   ├── kafka.yml
│   ├── spark.yml
│   └── cassandra.yml
├── src/
│   ├── producers/
│   │   ├── __init__.py
│   │   ├── user_events.py
│   │   ├── cart_events.py
│   │   └── transaction_events.py
│   ├── processors/
│   │   ├── __init__.py
│   │   ├── stream_processor.py
│   │   ├── fraud_detector.py
│   │   └── recommender.py
│   ├── storage/
│   │   ├── __init__.py
│   │   ├── cassandra_client.py
│   │   └── hdfs_client.py
│   └── analytics/
│       ├── __init__.py
│       └── metrics.py
├── tests/
│   ├── test_producers.py
│   ├── test_processors.py
│   └── test_storage.py
└── notebooks/
    └── analysis.ipynb
```
## Kafka Configuration
- 4 partitions per topic (scalable; see the topic-creation sketch below)
- Replication factor: 1 (development) / 3 (production)
- Key-based partitioning for user consistency
- Exactly-once semantics enabled
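
A sketch of creating the two topics with these settings via `kafka-python`'s admin client (topic names are assumptions; replication factor 1 matches the development setting):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 4 partitions per topic; replication_factor=1 for development (3 in production).
admin.create_topics([
    NewTopic(name="user-activity", num_partitions=4, replication_factor=1),
    NewTopic(name="transactions", num_partitions=4, replication_factor=1),
])
admin.close()
```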
## Development Guidelines
- All messages use Protocol Buffers serialization
- Producers must enable idempotence and `acks=all`
- Consumers implement at-least-once processing
- Use atomic writes for output files (see the sketch after this list)
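
A minimal sketch of the last two guidelines, assuming a JSON-serialized `user-activity` topic and an illustrative output file: offsets are committed only after the write succeeds (at-least-once), and the write itself goes through a temp file plus `os.replace` so readers never observe a partial file:

```python
import json
import os
import tempfile

from kafka import KafkaConsumer

def atomic_write(path: str, data: str) -> None:
    # Write to a temp file in the same directory, then rename into place;
    # os.replace is atomic on POSIX filesystems.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp, path)

consumer = KafkaConsumer(
    "user-activity",
    bootstrap_servers="localhost:9092",
    enable_auto_commit=False,  # commit manually, only after processing succeeds
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    atomic_write("latest_event.json", json.dumps(message.value))
    consumer.commit()  # at-least-once: a crash before this line means reprocessing
```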
## Development Status
ByteBasket is a proof-of-concept demonstrating distributed systems architecture; it is not hardened for production (see the considerations below).
## Production Considerations
- Increase replication factor to 3
- Enable SSL/TLS encryption (see the client sketch below)
- Implement proper ACLs
- Configure message retention policies
- Set up monitoring and alerting
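
For the TLS item, a `kafka-python` client only needs the SSL listener address and certificate paths (the broker address and all paths below are placeholders):

```python
from kafka import KafkaProducer

# Assumes the brokers expose an SSL listener (9093 is a common convention).
producer = KafkaProducer(
    bootstrap_servers="broker1:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/certs/ca.pem",        # CA that signed the broker certs
    ssl_certfile="/etc/kafka/certs/client.pem",  # client certificate
    ssl_keyfile="/etc/kafka/certs/client.key",   # client private key
)
```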