Fantasia
Functional ANnoTAtion based on embedding space SImilArity
FANTASIA is an advanced pipeline designed for automatic functional annotation of protein sequences using state-of-the-art protein language models. It integrates deep learning embeddings and similarity searches in vector databases to associate Gene Ontology (GO) terms with proteins.
FANTASIA v4
Script
The following script is an IFB adaptation of the `cesga_array.sh` script provided by Francisco Miguel Perez Canales:
ifb_array.sh
#!/bin/bash
#SBATCH -p gpu
#SBATCH -c 32
#SBATCH --gres={{ gpu_profile_ex }}:1
#SBATCH --mem=64G
#SBATCH -t 12:00:00
#SBATCH --job-name=fantasia_job
#SBATCH --output=fantasia_%j.out
#SBATCH --error=fantasia_%j.err
# One array task per line of experiments.tsv; adjust the range to your file
#SBATCH --array=1
################################################################################
# FANTASIA job script for IFB (SLURM GPU partition)
#
# This script launches the full FANTASIA functional annotation pipeline using:
#
# - PostgreSQL with pgvector (run via Apptainer container)
# - RabbitMQ (run via Apptainer container)
# - FANTASIA (run via Apptainer container with GPU support)
#
# All services are isolated and launched locally within the job node,
# using bind mounts to persistent storage on LUSTRE.
#
# NOTE: This script is intended to be submitted via `sbatch` on the IFB SLURM cluster.
################################################################################
# =======================
# Load input parameters
# =======================
PARAM_FILE="./experiments.tsv"
IFS=$'\t' read -r INPUT OUTPUT EXTRA <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$PARAM_FILE")"
if [ -z "$INPUT" ] || [ -z "$OUTPUT" ]; then
echo "ERROR: invalid entry on line $SLURM_ARRAY_TASK_ID of the parameter file"
exit 1
fi
echo "Input file: $INPUT"
echo "Output name: $OUTPUT"
echo "Extra params: $EXTRA"
STORE=$PWD
# =======================
# Define working directories
# =======================
PROJECT_DIR="$STORE/FANTASIA"
EXECUTION_DIR="$STORE/fantasia"
SHARED_MEM_DIR="/tmp/fantasia"
POSTGRESQL_DATA="$SHARED_MEM_DIR/data"
POSTGRESQL_SOCKET="$SHARED_MEM_DIR/socket"
RABBITMQ_DATA_DIR="$STORE/fantasia_rabbitmq"
CONFIG_FILE="$PROJECT_DIR/fantasia/config.yaml"
# Apptainer container images
PGVECTOR_IMAGE="$PROJECT_DIR/pgvector.sif"
RABBITMQ_IMAGE="$PROJECT_DIR/rabbitmq.sif"
FANTASIA_IMAGE="$PROJECT_DIR/fantasia.sif"
DB_NAME="BioData"
PG_PORT=5432
# =======================
# Environment sanity check
# =======================
echo "Current working directory:"
pwd
ls -l
# Clone FANTASIA repo if missing
if [ ! -d "$PROJECT_DIR" ]; then
echo "Cloning FANTASIA from GitHub..."
git clone https://github.com/CBBIO/FANTASIA.git "$PROJECT_DIR"
fi
cd "$PROJECT_DIR" || exit 1
# =======================
# Build containers if not present
# =======================
if [ ! -f "$PGVECTOR_IMAGE" ]; then
echo "Building pgvector container..."
apptainer build "$PGVECTOR_IMAGE" docker://pgvector/pgvector:pg16
fi
if [ ! -f "$RABBITMQ_IMAGE" ]; then
echo "Building RabbitMQ container..."
apptainer build "$RABBITMQ_IMAGE" docker://rabbitmq:management
fi
if [ ! -f "$FANTASIA_IMAGE" ]; then
echo "Building FANTASIA container..."
apptainer build --disable-cache "$FANTASIA_IMAGE" docker://frapercan/fantasia:latest
fi
# =======================
# Launch PostgreSQL (pgvector)
# =======================
echo "Starting PostgreSQL (pgvector)..."
PGDATA_HOST="$STORE/pgdata_pgvector"
PG_RUN="$STORE/pg_run"
rm -rf "$PGDATA_HOST"
mkdir -p "$PGDATA_HOST" "$PG_RUN"
nohup apptainer run \
--env POSTGRES_USER=usuario \
--env POSTGRES_PASSWORD=clave \
--env POSTGRES_DB="$DB_NAME" \
--bind "$PGDATA_HOST:/var/lib/postgresql/data" \
--bind "$PG_RUN:/var/run/postgresql" \
"$PGVECTOR_IMAGE" > postgres.log 2>&1 &
# =======================
# Launch RabbitMQ
# =======================
echo "Starting RabbitMQ..."
rm -rf "$RABBITMQ_DATA_DIR"
mkdir -p "$RABBITMQ_DATA_DIR"
nohup apptainer run \
--bind "$RABBITMQ_DATA_DIR:/var/lib/rabbitmq" \
"$RABBITMQ_IMAGE" > rabbitmq.log 2>&1 &
# =======================
# Wait for services to become available
# =======================
echo "Waiting for services to initialize..."
sleep 30
# =======================
# Run FANTASIA pipeline
# =======================
mkdir -p "$EXECUTION_DIR"
if [ ! -f "$CONFIG_FILE" ]; then
echo "ERROR: Configuration file not found at $CONFIG_FILE"
exit 1
fi
echo "Found configuration file: $CONFIG_FILE"
echo "Listing FANTASIA contents:"
ls -l "$PROJECT_DIR/fantasia/"
apptainer exec --nv --bind "$EXECUTION_DIR:/fantasia" "$FANTASIA_IMAGE" \
fantasia initialize
apptainer exec --nv --bind "$EXECUTION_DIR:/fantasia" "$FANTASIA_IMAGE" \
fantasia run --input "$INPUT" --prefix "$OUTPUT" $EXTRA
# =======================
# Cleanup services on exit
# =======================
cleanup() {
echo "Cleaning up background services..."
if pgrep -f "$PGDATA_HOST" > /dev/null; then
echo "Stopping PostgreSQL..."
pkill -f "$PGDATA_HOST"
fi
if pgrep -f "rabbitmq-server" > /dev/null; then
echo "Stopping RabbitMQ..."
pkill -f "rabbitmq-server"
fi
if [[ -d "$SHARED_MEM_DIR" ]]; then
echo "Removing shared memory directory: $SHARED_MEM_DIR"
rm -rf "$SHARED_MEM_DIR"
fi
}
# Stop the background services before the job ends
cleanup
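The fixed `sleep 30` used above is a blunt wait; it can instead poll the service until it answers. A minimal sketch, where `wait_for` is a hypothetical helper and the commented `pg_isready` call assumes that binary ships in the pgvector image:

```shell
# wait_for CMD...: retry CMD up to 30 times, 2 s apart; return 0 on first success.
wait_for() {
    for _ in $(seq 1 30); do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        sleep 2
    done
    return 1
}

# In ifb_array.sh the call would look like (assumption: pg_isready is in the image):
#   wait_for apptainer exec "$PGVECTOR_IMAGE" pg_isready -h 127.0.0.1 -p "$PG_PORT"
wait_for true && echo "service ready"
```

The helper returns as soon as the probe succeeds, so healthy services no longer cost a full 30-second pause.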
Full example of usage
Prerequisites
You will need a SLURM account with access to the GPU partition.
Inputs
sample.fasta from github.com/CBBIO/FANTASIA
>DFRU1_DN0_c0_g1_i4.p1
MSGVPVPVPTLVDQKIAGKKVVVFSKTSCGYCTKVKNVLKQYILKPDDYEVVEMDTDYSSDMSAMQDYLQELTGARSVPR
VFVNGKSIGGHDDTVARHKAGKLGLEKKI
>Q9CQV8 10090
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLILNATQAESKVFY
LKMKGDYFRYLSEVASGENKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN
>P62259 10090
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW
RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF
YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF
YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE
EQNKEALQDVEDENQ
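Before wiring a FASTA file into `experiments.tsv`, it is worth confirming how many sequences it contains; each record starts with a `>` header. A small illustration using a throwaway two-sequence file (the file path and contents here are made up):

```shell
# Create a tiny FASTA for illustration.
cat > /tmp/check.fasta <<'EOF'
>seq1
MSGVPVP
>seq2
MDDRE
EOF

# Count records: one '>' header line per sequence.
grep -c '^>' /tmp/check.fasta   # → 2
```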
experiments.tsv
/shared/projects/admin/2025-10_test_fantasia/sbatch_strategy/sample.fasta sample
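The job script splits each line with `IFS=$'\t'`, so the two columns must be separated by a real tab character, not spaces. A quick way to generate a line and confirm it parses the same way `ifb_array.sh` will (the path here is illustrative):

```shell
# Write one experiment per line: input FASTA path, output prefix (tab-separated).
printf '%s\t%s\n' "/path/to/sample.fasta" "sample" > /tmp/experiments.tsv

# Re-read line 1 exactly as the job script does.
IFS=$'\t' read -r INPUT OUTPUT EXTRA <<< "$(sed -n '1p' /tmp/experiments.tsv)"
echo "INPUT=$INPUT"    # → INPUT=/path/to/sample.fasta
echo "OUTPUT=$OUTPUT"  # → OUTPUT=sample
```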
Run
sbatch ifb_array.sh
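Since the script reads one line of `experiments.tsv` per array task, the `--array` range can also be passed on the command line to match the file. A sketch that derives the range automatically (the two-line file is illustrative, and the final `sbatch` command is only echoed here, not executed):

```shell
# Two illustrative experiments, tab-separated.
printf '%s\t%s\n' "/path/a.fasta" "a" > /tmp/experiments.tsv
printf '%s\t%s\n' "/path/b.fasta" "b" >> /tmp/experiments.tsv

# One array task per line; grep -c '' counts the lines.
N=$(grep -c '' /tmp/experiments.tsv)
echo "sbatch --array=1-${N} ifb_array.sh"   # → sbatch --array=1-2 ifb_array.sh
```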