CHEESE On-Prem

Whether you are concerned about the privacy of your databases and search queries, or you want to find new insights in an interesting molecular space, we provide you with the On-Prem option, which makes it possible to run the CHEESE pipeline on your own premises and with your own data.

For on-prem deployment there are two major steps :

  1. Inference : Starting from a database of molecules, compute molecular representations and build search indexes.
  2. App creation : Turn the inference outputs into an API and a UI.

CHEESE Inference

This part covers computing the embeddings and building the indexes required by the API. You can either run the inference script on your own machine/server or create an AWS EC2 instance for it.

How to create an EC2 instance for inference?

  1. Go to your AWS account and launch a new instance.
  2. Choose an AWS instance with at least the following resources.
  3. Amazon Machine Image (AMI) : If your account allows it, select the following deep learning AMI : Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04) 20240116.

Please note that the deep learning AMIs come with pre-installed CUDA drivers. In case the above AMI is not available on your AWS account, you can select the following one instead : Ubuntu Server 22.04 LTS (HVM), SSD Volume Type.

  • Instance Type : g4dn.xlarge or similar --> 4vCPUs, 16 GB RAM, 1 GPU with 16 GB of GPU RAM

  • Storage : 64 GB minimum, gp2

  • Networking (Optional) : Choose a VPC and a Subnet for private networking.

  • Security : Create a key pair for logging into the instance, or choose a pre-existing one.

  • Download the private key, save it somewhere secure, and test connectivity to the instance over SSH following the steps provided by AWS.

In case you have problems with SSH, test connectivity by selecting the instance in the "Instances" dashboard, clicking Connect, and choosing "Connect with EC2 Instance Connect"; this opens a shell in the AWS console where you can test the instance. There you can also reset the ubuntu user's password.
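As a sketch, a typical connection command looks like the following; the key filename and IP address here are placeholders to replace with your own values:

```shell
# Placeholders: substitute your downloaded key and your instance's public IP.
KEY_FILE="cheese-key.pem"
INSTANCE_IP="203.0.113.10"

# SSH rejects private keys with loose permissions, so restrict them first:
# chmod 400 "$KEY_FILE"

# The connection command then looks like this (composed and printed here):
SSH_CMD="ssh -i $KEY_FILE ubuntu@$INSTANCE_IP"
echo "$SSH_CMD"
```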

For more information please refer to the AWS Documentation.

How to setup the environment for inference?

  1. In your instance home directory (/home/ubuntu by default), create a folder and name it cheese.
  2. Download the assets folder to your local machine.
  3. Copy the assets to the instance together with your input files by running the bash script assets/copy-files.sh on your local machine. Modify the file locations and instance IP address in the script as needed.
  4. In case you chose the Ubuntu Server 22.04 LTS (HVM), SSD Volume Type AMI, you need to install Docker and the CUDA drivers.
    • Go to the assets folder by running cd /home/ubuntu/cheese/assets
    • Change the permissions of the assets folder by running : chmod -R 777 .
    • To install Docker run : ./install-docker.sh
    • To install CUDA Drivers run : ./install-cuda-drivers.sh
    • Reboot the machine : sudo reboot
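As an illustration of what the copy in step 3 above amounts to, a manual equivalent with scp might look like this; the key file, IP address, and input filename are assumptions to replace with your own (copy-files.sh automates this for you):

```shell
# Hypothetical placeholders; substitute your own key, IP, and input file.
KEY_FILE="cheese-key.pem"
INSTANCE_IP="203.0.113.10"
INPUT_FILE="data.csv"

# Copy the assets folder and your input file into /home/ubuntu/cheese
# on the instance (command composed and printed here for illustration):
SCP_CMD="scp -i $KEY_FILE -r assets $INPUT_FILE ubuntu@$INSTANCE_IP:/home/ubuntu/cheese/"
echo "$SCP_CMD"
```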

Important !! : Please run all of the commands below from the created cheese folder : cd /home/ubuntu/cheese

How to get a license file?

  1. On your instance, run the generate-key.sh bash script inside the cheese/assets folder to get your license key.
  2. Copy the license key and send it to us.
  3. We will give you a JSON license file that you should put inside the cheese folder.

How to run CHEESE inference?

Run the following bash script inside the cheese/assets folder on the instance : bash run-cheese-inference.sh --password <password> --input_file <input_file> --license_file <license_file> where :

  • <password> : a password to download the Docker images from our Azure CR (e.g XXXX)
  • <input_file> : your input filename containing your custom database of molecules in .csv, .smi or .txt format (e.g data.csv). The input file should contain lines of molecules in SMILES format and their IDs in the format SMILES<sep>ID, where <sep> is the delimiter. The delimiter chosen for inference should be consistent with the delimiter specified later in the CHEESE App. Here is an example of an input CSV file :
smiles,id
C[C@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
CC(NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
C[C@@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
  • <license_file> : Your license file (e.g test_license.json)

Please note that both <input_file> and <license_file> are the names of files that should lie in the cheese folder. You should expect the results in the cheese/output folder.
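Before running inference, it can help to sanity-check that every line of the input file splits into exactly two fields (SMILES and ID) with your chosen delimiter. A minimal sketch using the comma-delimited example above:

```shell
# Recreate the example input file from above (comma-delimited).
cat > data.csv <<'EOF'
smiles,id
C[C@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
CC(NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
C[C@@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
EOF

SEP=","
# Count lines that do NOT split into exactly SMILES + ID:
BAD=$(awk -F"$SEP" 'NF != 2' data.csv | wc -l)
echo "malformed lines: $BAD"
```

Adjust SEP to whatever delimiter your file actually uses; a non-zero count points at lines to fix before inference.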

CHEESE App

This part is about converting the outputs of the inference to a running API and UI. Please note that this step requires only the output files from the CHEESE Inference.

How to setup the environment for the CHEESE App?

  1. Follow the same steps as in the inference guide to create an instance for the API and UI. The instance doesn't need a GPU in principle; you can reuse the previous instance or use any instance with similar CPU, RAM and storage.
  2. For the security group assigned to the instance, create custom inbound and outbound rules for ports 9001, 9002 and 9003 (or any ports that you will use for deployment).

  1. In your instance home directory (/home/ubuntu by default), create a folder and name it cheese.
  2. Download the assets folder to your local machine.
  3. Copy the assets to the instance together with your input files by running the bash script assets/copy-files.sh on your local machine. Modify the file locations and instance IP address in the script as needed.

How to get a license file?

You can get the license file using the same procedure explained in the CHEESE inference or use the same license file.

How to run CHEESE App?

The CHEESE App consists of 3 Docker containers :

  • cheese-database : A local database server for the app
  • cheese-ui : The CHEESE UI
  • cheese-api : The CHEESE API

Follow these steps to run the App :

  1. Create a /data folder

  2. (Optional) Copy the contents of the output folder from the cheese-inference results there for your custom database, e.g /data/custom_database_output

  3. To enable ZINC and Enamine search, you should also have the following folders : /data/enamine_real and /data/zinc15. For optimal ZINC and Enamine search speed, the /data folder should be mounted on an EBS General Purpose SSD (e.g gp2 or gp3) with 1.7 TB of storage. Please contact us to provide you with the processed data.

  4. Modify the template YAML configuration file /home/ubuntu/cheese/assets/config_file.yaml :

  5. The paths to the ZINC and Enamine folders are specified by default. You can modify them if needed.

  6. Specify the name and path of your custom database, as well as the delimiter and index_type used in the cheese-inference. For example :

OUTPUT_DIRECTORIES: 
  ENAMINE-REAL: "/data/enamine_real" 
  ZINC15: "/data/zinc15"
  MyDatabase: "/data/custom_database_output"

DELIMITERS:
  ENAMINE-REAL: "\t" 
  ZINC15: ","
  MyDatabase: ","

INDEX_TYPES:
  ENAMINE-REAL: "clustered"
  ZINC15: "clustered"
  MyDatabase: "in_memory"
  • Specify the device used for embeddings computation (cpu or cuda), as well as the API and UI URLs. For example :
DEVICE: "cuda"
API_URL: "http://10.196.1.1:9001"
UI_URL: "http://10.196.1.1:9003"
  7. Move the license and configuration files to the /data folder

  8. Modify the bash script assets/run-cheese.sh to specify :

    • The deployment ports for the database server (default 9001), API (default 9002) and UI (default 9003)

    • The path to the license file (e.g /data/test_license.json)

    • The path to the configuration file (e.g /data/config_file.yaml)

  9. Run the following bash script inside the cheese/assets folder on the instance : bash run-cheese.sh --password <password> --ip <ip_address> where :

    • <password> : a password to download the Docker images from our Azure CR (e.g XXXX)
    • <ip_address> : your instance's IP address (e.g 10.196.1.1)
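Before launching, it can be worth listing every path your configuration references so you can confirm each one exists under /data. A minimal sketch against a fragment of the example configuration above:

```shell
# Recreate a fragment of the example configuration from above.
cat > config_file.yaml <<'EOF'
OUTPUT_DIRECTORIES:
  ENAMINE-REAL: "/data/enamine_real"
  ZINC15: "/data/zinc15"
  MyDatabase: "/data/custom_database_output"
EOF

# Extract every /data path referenced by the config; each printed path
# should exist on the instance before you run the app.
grep -o '"/data/[^"]*"' config_file.yaml | tr -d '"'
```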

You should expect the UI running on port 9003 and the API on port 9002. Example : http://10.196.1.1:9003 and http://10.196.1.1:9002
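Once the containers are up, a quick reachability check from another machine could look like the following sketch; the URL reuses the example IP and port above, and the command is only composed and printed here:

```shell
# Example address from above; replace with your own instance IP and port.
API_URL="http://10.196.1.1:9002"

# Fetch just the HTTP status code to confirm the API answers
# (the same check works for the UI on port 9003):
CHECK_CMD="curl -s -o /dev/null -w %{http_code} $API_URL"
echo "$CHECK_CMD"
```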