An intelligent content aggregation and analysis platform that automatically collects, processes, and generates insights from multiple sources.
- Multi-source data collection (Twitter, WeChat, Podcasts, Videos)
- Automatic content extraction and cleaning
- Duplicate content detection
- Support for RSS feeds
- Intelligent content tagging and categorization
- Content quality scoring
- Automated summarization
- Daily report generation
- In-depth research report creation
- Podcast script generation
- Text-to-Speech conversion
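
The duplicate content detection listed above can be approximated in a few lines. The sketch below is only an illustration: this README does not say how the project detects duplicates, so the hashing approach and the names `fingerprint` and `drop_duplicates` are assumptions, not the project's API.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(items):
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for text in items:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Exact hashing only catches copies that differ in case or whitespace. Catching near-duplicates (reposts with small edits) would need a similarity measure instead.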
- Prerequisite: uv installed
- Clone the repository
git clone https://github.com/yuanzhi-code/extractor.git
cd extractor
- Initialize the virtual environment and install project dependencies
uv venv && uv sync --group dev
- Set up the database
uv run alembic upgrade head
- Configure your sources
cp data/rss_sources.json.example data/rss_sources.json
# Edit rss_sources.json with your sources
TODO
Add your RSS sources in data/rss_sources.json:
{
  "sources": [
    {
      "name": "Example Tech Blog",
      "url": "https://example.com/feed",
      "description": "Tech news and updates"
    }
  ]
}
Refer to .env.example and create a .env file:
For the project to run properly, set MODEL_PROVIDER and the related environment variables shown in .env.example. For example, if you choose deepseek as the model provider:
MODEL_PROVIDER="deepseek"
DEEPSEEK_API_KEY="sk-xxxxxxx"
DEEPSEEK_MODEL="deepseek-chat"
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
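
The configuration described earlier (the RSS sources file and the model provider variables) can be sanity-checked before the first run. The following is a minimal sketch, not part of the project: the function names are hypothetical, treating `name` and `url` as required keys is an assumption, and the `<PROVIDER>_API_KEY` / `<PROVIDER>_MODEL` naming is extrapolated from the deepseek example, so other providers may use different variables (check .env.example).

```python
import json
from pathlib import Path

# Assumption: every source needs at least a name and a URL.
REQUIRED_SOURCE_KEYS = ("name", "url")


def validate_sources(data):
    """Return the "sources" list, raising ValueError if it is malformed."""
    sources = data.get("sources") if isinstance(data, dict) else None
    if not isinstance(sources, list):
        raise ValueError('expected a top-level "sources" list')
    for index, source in enumerate(sources):
        if not isinstance(source, dict):
            raise ValueError(f"source #{index} is not a JSON object")
        missing = [key for key in REQUIRED_SOURCE_KEYS if not source.get(key)]
        if missing:
            raise ValueError(f"source #{index} is missing: {', '.join(missing)}")
    return sources


def load_sources(path="data/rss_sources.json"):
    """Read and validate the RSS sources file."""
    return validate_sources(json.loads(Path(path).read_text(encoding="utf-8")))


def missing_model_env(env):
    """List unset variables for the chosen provider.

    Assumption: variables follow the deepseek pattern from the example,
    i.e. <PROVIDER>_API_KEY and <PROVIDER>_MODEL.
    """
    provider = env.get("MODEL_PROVIDER", "")
    if not provider:
        return ["MODEL_PROVIDER"]
    prefix = provider.upper()
    wanted = [f"{prefix}_API_KEY", f"{prefix}_MODEL"]
    return [name for name in wanted if not env.get(name)]
```

Calling `load_sources()` from the repository root and `missing_model_env(os.environ)` (after `import os`) covers both files. Note that `os.environ` only reflects variables already exported to the process; values that live only in `.env` must be loaded first.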