This is page describes the course project that will serve as the final exam for this course. Please submit all your efforts for this project (all files and data and results) in `ECL2017S/exams/final/`

directory in your private repository for this course. Don’t forget to push your answers to your remote Github repository by the end of the semester.

Inside the directory for the project (`ECL2017S/exams/final/`

) create three other folders: `data`

, `src`

, `results`

. The `data`

folder contains the input data for this project. The `src`

folder should contain all your codes that you write for this project, and the `results`

folder should contain all the results generated by your code.

Our goal in this project is to fit a mathemtical model of the growth to living cells to data for the growth of a tumor mass in the brain of a rat. You can download the MATLAB data file for this project from here. Write a Python code, set of separate codes that performs the following tasks one after the other, and output all the results to the `results`

folder described above. If you have multiple Python codes each in a separate file, then write a `main.py`

Python code, such that if the user runs

```
./main.py ../data/cells.mat ../results/
```

then all the necessary Python codes to generate all the results will be called by this `main.py`

code. The first command line argument to this code is the path to the input MATLAB data file containing data for this project, and the second command line tells the code where to write all the output and results of the project.

### Data structure of the input MATLAB file

The input file, is a 4-dimensional double-precision MATLAB matrix `cells(:,:,:,:)`

, corresponding to dimensions `cells(y,x,z,time)`

. This data is collected from MRI imaging of the rat’s brain almost every other day for a period of two weeks. For example, `cells(:,:,:,1)`

contains the number of cells at each point in space (y,x,z) at the first time point, or, `cells(:,:,10,1)`

represents a (XY) slice of MRI at $z=1$ and $t=1$.

Now write a set of Python codes that perform the following tasks.

### Data reduction and visualization

**1. ** First write a code that reads the input MATLAB file and converts the data to a 4-D NumPy array.

**2. ** Write Python codes that generate figures as similar as possible to the following figures (specific color-codes of the curves and figures do not matter, focus more on the format of the plots and its parts).

and finally, a plot that shows the time evolution of the total number of tumor cells at all time points available in the input data. The time points are $T=[10, 12, 14, 15, 16, 18, 20]$ in units of days.

**Note**. There is no need to regenerate the error bars in the plot above for this project.

### The mathematical model of tumor growth

**3. ** Now our goal is to fit the time evolution of the growth of this tumor, using a mathematical model, and use the maximum likelihood approach and Markov Chain Monte Carlo Technique to find the best-fit parameters of the model. The model we use is called the Gompertizan growth model,

where $N(t)$ is the number of tumor cells at time $t$, and $a$, $b$, and $c$ are the parameters that we would like to find their best values given the input tumor cell data. This Gompertzian growth model is our **physical model** for this problem.

#### Combining the mathematical model with a regression model

Now, if the model was ideally perfect in describing the data, the curve of the model predicion would pass through all the points in the growth curve plot in the above, providing a prefect description of data. This is however, never the case, as it is famously said **all models are wrong, but some are useful**. In other words, the model prediction never matches observation perfectly. Therefore, we have to seek for the parameter values of the model that can get us as close as possible to data. To do so, we define a **statistical model** in addition to the **physical model** described above. In other words, we have to define a statistical regression model (the renowned least-squares method) that gives us the probability $\pi(N_{obs}|N(t))$ of observing individual data points at each of the given times,

Note that our statistical model given above is a Gaussian probability density function, with its mean parameter represented by the output of our physical model, $N(t|a,b,c)$, and its standard deviation represented by $\sigma$, which is unknown, and we seek to extremize it.

We have seven data points, so the overall probability of observing all of data $\mathcal{D}$ together given the parameters of the model, $\mathcal{L}(\mathcal{D}|a,b,c,\sigma)$, is the product of their invidiual probabilities of observations given by the above equation,

More often, you would want to work with $\log\mathcal{L}$ instead of $\mathcal{L}$, so the above equation becomes,

**4. ** Now the goal is to use an optimization algorithm, such as Markov Chain Monte Carlo available in Python via PyMc package, to find the most likely set of parameters of the model $a,b,c,\sigma$ that give the best prediction of the available data. Use the pymc package, or any other method you wish to obtein the best parameters, then redraw the above tumor evolution curve and show the result from the model as well. You can also alternatively use my own package for MCMC sampling, if which case, please inform me and I will instruct you how to use it.