This project scrapes the arXiv site for papers, then uses an LLM (the Qwen 2.5 0.5b model) to filter them and show the user a word cloud and a bar chart of extracted keywords. Additionally, the top 5 papers are shown based on the scores they receive.
- anyang-scrapy: includes the Scrapy codebase to scrape the target site
- arXiv scraping API: `http://export.arxiv.org/api/query?search_query=all:<add_keyword_here>&start=0&max_results=<n>`
  - Example (keyword => machine learning): http://export.arxiv.org/api/query?search_query=all:machine+learning&start=0&max_results=50
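For reference, the example query above can be reproduced in Python. This is only an illustrative sketch (the helper name is not from the repo); it shows how the keyword, `start`, and `max_results` parameters map onto the URL:

```python
from urllib.parse import quote_plus

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(keyword: str, start: int = 0, max_results: int = 50) -> str:
    """Build an arXiv export API query URL for a free-text keyword search."""
    # quote_plus turns spaces into '+', matching the example URL above
    return (
        f"{ARXIV_API}?search_query=all:{quote_plus(keyword)}"
        f"&start={start}&max_results={max_results}"
    )

print(build_arxiv_query("machine learning"))
# → http://export.arxiv.org/api/query?search_query=all:machine+learning&start=0&max_results=50
```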
- anyang-flask: includes the Flask backend to manage services and integrate the system components
- anyang-llm: includes initial code to interact with the Qwen LLM
Team Members:
- Duguma Yeshitla
- 천영학
- 김현주
- 이채림
- Apache Kafka
- Zookeeper
- Redis
- Celery
- Ollama
- MariaDB
- Docker
- Kafka UI
- Flower for Celery
- DBeaver
- Install Ollama on the device following the steps at https://ollama.com/download.
- The default Ollama instance is not accessible from external IPs, so it has to be stopped. Here are the steps to achieve that.
- Make sure the default Ollama instance is not running. By default it runs on localhost at port 11434, and it may block the new Ollama instance you will be running.

```bash
sudo systemctl status ollama
sudo systemctl stop ollama
sudo systemctl disable ollama
```

- Also make sure no other Ollama instance is running; if there is one, kill it.

```bash
# check for existing ollama instances
ps aux | grep ollama
# stop existing ollama instances if there are any
sudo kill -9 <process_id>
```

- Run `ollama serve` bound to all interfaces (i.e. accessible from anywhere).

```bash
OLLAMA_HOST=0.0.0.0:11500 ollama serve
```

- Check that the models needed are available on the 0.0.0.0 instance; if not, pull them in this mode. The `ollama serve` command above must be running for the following to work.

```bash
# check the list of models
OLLAMA_HOST=0.0.0.0:11500 ollama list
# pull a model if it does not exist (e.g. qwen3:0.6b)
OLLAMA_HOST=0.0.0.0:11500 ollama pull <model:size>
# for this project we need the Qwen 2.5 0.5b model
OLLAMA_HOST=0.0.0.0:11500 ollama pull qwen2.5:0.5b
```
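Once the server is up, the backend can talk to it over Ollama's REST API (the `/api/generate` endpoint and its `model`/`prompt`/`stream` fields are part of Ollama's documented HTTP API; the helper names and the server address here are illustrative, not the project's actual code):

```python
import json
import urllib.request

# the serve address configured above (illustrative; adjust to your deployment)
OLLAMA_URL = "http://192.168.0.224:11500"

def build_generate_request(prompt: str, model: str = "qwen2.5:0.5b") -> dict:
    """Payload for Ollama's /api/generate endpoint (stream=False returns one JSON object)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    """Send a one-shot prompt to the Ollama server and return the response text."""
    payload = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```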
- Make sure that Docker is installed first; if not, follow this link (for Ubuntu): https://docs.docker.com/engine/install/ubuntu/
- Create the directory for the deployment files.

```bash
mkdir mariadb
cd mariadb
```

- Save the Docker Compose and environment variable files.

`docker-compose.yml`

```yaml
services:
  mariadb:
    image: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
      MYSQL_USER: ${MYSQL_USER}
      MYSQL_PASSWORD: ${MYSQL_PASSWORD}
      MYSQL_DATABASE: ${MYSQL_DATABASE}
    ports:
      - "3376:3306"
    volumes:
      - "./data.sql:/docker-entrypoint-initdb.d/1.sql"
```

`.env`

```bash
MYSQL_ROOT_PASSWORD=scrapy_root_pwd
MYSQL_USER=scrapy_user
MYSQL_PASSWORD=scrapy_user_pwd
MYSQL_DATABASE=scrapy_db
```

- Deploy the MariaDB Docker instance. Based on the settings above, it will create the database and user needed for the project.

```bash
docker compose up -d
```
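Note that the compose file maps host port 3376 to MariaDB's default 3306, so clients must connect on 3376. As an illustrative sketch (not the project's actual code; the variable names follow the files above and the project's deployment `.env`), application code might assemble its connection parameters like this:

```python
import os

def mariadb_conn_params() -> dict:
    """Connection parameters matching the Docker Compose mapping above.

    Defaults mirror the .env shown above; real deployments should
    override them via environment variables.
    """
    return {
        "host": os.environ.get("DB_HOST", "127.0.0.1"),
        # 3376 is the HOST port; inside the container MariaDB still listens on 3306
        "port": int(os.environ.get("DB_PORT", "3376")),
        "user": os.environ.get("MYSQL_USER", "scrapy_user"),
        "password": os.environ.get("MYSQL_PASSWORD", "scrapy_user_pwd"),
        "database": os.environ.get("MYSQL_DATABASE", "scrapy_db"),
    }
```

These parameters can then be passed to any MySQL-compatible driver (e.g. `pymysql.connect(**mariadb_conn_params())`).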
- Make sure the Java JDK is installed.

```bash
sudo apt update
sudo apt install default-jdk
```

- Download & extract the Kafka server from the official site (https://kafka.apache.org/downloads).

```bash
wget https://dlcdn.apache.org/kafka/3.9.0/kafka_{version}.tgz
tar -xvf kafka_{version}.tgz
mv kafka_{version} kafka
# update folder permissions
chmod 777 -R kafka
```
- Create the data directories inside the kafka folder.

```bash
cd kafka
mkdir -p data/{zookeeper,kafka}
chmod 777 -R data
```
- Edit the Zookeeper and Kafka configuration files.

```bash
vim config/zookeeper.properties
```

```properties
dataDir=/home/user/kafka/data/zookeeper
```

```bash
vim config/server.properties
```

```properties
log.dirs=/home/user/kafka/data/kafka
listeners=PLAINTEXT://<ip>:9092
```
- Run the Zookeeper and Kafka instances.

```bash
cd /kafka/bin
# use screen to run in background
./zookeeper-server-start.sh ../config/zookeeper.properties
```

```bash
cd /kafka/bin
# use screen to run in background
./kafka-server-start.sh ../config/server.properties
```
- Create topics.

```bash
cd /kafka/bin
./kafka-topics.sh --create --bootstrap-server <ip>:9092 --replication-factor 1 --partitions 1 --topic <topic-name>
# check the list of kafka topics
./kafka-topics.sh --list --bootstrap-server <ip>:9092
```
- Run a producer and a consumer to test the Kafka setup.

```bash
# producer
cd /kafka/bin
./kafka-console-producer.sh --bootstrap-server <ip>:9092 --topic <topic-name>
```

```bash
# consumer
cd /kafka/bin
./kafka-console-consumer.sh --bootstrap-server <ip>:9092 --topic <topic-name>
```
- Use the following commands to stop Zookeeper and Kafka.

```bash
cd /kafka/bin
./zookeeper-server-stop.sh
./kafka-server-stop.sh
```
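In the application itself, messages on these topics would be produced and consumed programmatically (e.g. via the third-party `kafka-python` package, passing helpers like these as `value_serializer`/`value_deserializer`). The JSON message format below is an assumption for illustration, not confirmed from the repo:

```python
import json

def serialize(record: dict) -> bytes:
    """Encode a record as UTF-8 JSON bytes before sending it to a topic."""
    return json.dumps(record, ensure_ascii=False).encode("utf-8")

def deserialize(raw: bytes) -> dict:
    """Decode a message received from a topic back into a dict."""
    return json.loads(raw.decode("utf-8"))

# round-trip example with a hypothetical paper record
msg = serialize({"title": "Attention Is All You Need", "score": 5})
print(deserialize(msg)["title"])
```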
- Make sure that docker is installed first.
- Create the directory for the deployment files.
```bash
mkdir redis
cd redis
```

- Save the Docker Compose file.

`docker-compose.yml`

```yaml
services:
  redis:
    image: redis:latest
    command: ["redis-server"]
    ports:
      - 6379:6379
```

- Deploy the Redis Docker instance.

```bash
docker compose up -d
```
- Make sure that docker is installed first.
- Create the directory for the deployment files.
```bash
mkdir kafka_ui
cd kafka_ui
```

- Save the Docker Compose file.

`docker-compose.yml`

```yaml
services:
  kafka-ui:
    container_name: kafka-ui
    image: provectuslabs/kafka-ui:master
    ports:
      - 8080:8080
    environment:
      KAFKA_CLUSTERS_0_NAME: testing_cluster
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: 192.168.0.223:9092
```

- Deploy the Kafka-UI Docker instance.

```bash
docker compose up -d
```

- Kafka-UI can be accessed on port 8080 at the IP it is deployed on.

- Make sure that docker is installed first.
- Create the directory for the deployment files.
```bash
mkdir flower
cd flower
```

- Save the Docker Compose and environment variable files.

`docker-compose.yml`

```yaml
services:
  flower:
    image: mher/flower
    container_name: flower
    env_file:
      - ${ENV_FILE:-.env}
    environment:
      - FLOWER_PORT=5555
      - FLOWER_PERSISTENT=True
      - FLOWER_STATE_SAVE_INTERVAL=10000
      - FLOWER_DB=/etc/db/flower.db
      - TZ=Asia/Seoul
    ports:
      - "5555:5555"
    volumes:
      - ./flower/storage:/etc/db/
    user: root
```

`.env`

```bash
# CELERY_BROKER_URL="redis://<redis_server_ip>:<port>/0"
CELERY_BROKER_URL="redis://192.168.0.223:6379/0"
```

- Deploy the Flower Docker instance.

```bash
docker compose up -d
```

- Flower can be accessed on port 5555 at the IP it is deployed on.
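The `CELERY_BROKER_URL` above follows the `redis://<host>:<port>/<db>` scheme that both Celery and Flower expect. A small illustrative helper (not from the repo) makes the structure explicit:

```python
def redis_broker_url(ip: str, port: int = 6379, db: int = 0) -> str:
    """Build a Redis broker URL in the form Celery and Flower expect."""
    # trailing /<db> selects the Redis logical database (0 by default)
    return f"redis://{ip}:{port}/{db}"

print(redis_broker_url("192.168.0.223"))
# → redis://192.168.0.223:6379/0
```

This is the same value that would be passed as `broker=` when constructing the Celery app in the backend.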

- Make sure that docker is installed first.
- Copy and unzip the project code zip file in the device.
unzip anyang-project-main.zip cd anyang-project-main - The only change that needs to be maid is to set the Ollama, Redis & Kafka server IPs & ports in the
.envfile.# DB_HOST='<maria_db_server_ip>' DB_HOST='192.168.0.222' DB_USER='scrapy_user' DB_PWD='scrapy_user_pwd' DB_NAME='scrapy_db' DB_PORT=3376 # KAFKA_BROKER_IP='<kafka_broker_server_ip>' KAFKA_BROKER_IP='192.168.0.223' KAFKA_BROKER_PORT=9092 # CELERY_REDIS_IP='<redis_server_ip>' CELERY_REDIS_IP='192.168.0.223' CELERY_REDIS_PORT=6379 # OLLAMA_SERVER_IP='<ollama_server_ip>' OLLAMA_SERVER_IP='192.168.0.224' OLLAMA_SERVER_PORT=11500
- Deploy Anyang Flask & Celery docker instances.
# need to build this one because its a custom image docker compose up -d --build - The system can be accessed on port 5004 with the IP it's deployed with.
- Database table creation race condition between Flask & Celery on the first deployment.
  - This occurs because both Flask & Celery work with the same application context (which is needed for integration), and there is code that creates all the tables in the database if they don't exist. So on the initial deployment of the project there can be a race condition between Flask & Celery to create the tables, which may cause either the Flask or the Celery Docker container to fail. The simple fix is to start them up again using the following command.

```bash
docker compose up -d
```
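A more permanent fix would be to make table creation tolerant of the race. This is a minimal sketch of one such retry loop, assuming a `create_all`-style callable such as SQLAlchemy's `db.create_all()`; it is not the project's actual code:

```python
import time

def create_tables_with_retry(create_all, attempts: int = 3, delay: float = 2.0) -> None:
    """Retry table creation so whichever of Flask/Celery loses the race recovers.

    `create_all` stands in for e.g. SQLAlchemy's db.create_all() (illustrative).
    """
    for i in range(attempts):
        try:
            create_all()
            return
        except Exception:
            if i == attempts - 1:
                raise  # give up after the last attempt
            # give the other container time to finish its DDL, then retry
            time.sleep(delay)
```

With this in place, the container that hits "table already exists" (or a mid-creation conflict) simply retries instead of crashing.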