Iterative Knockout for Feature Importance in Neural Networks [Python] [Keras]

Python package/scripts for determining feature importance and high-order interactions in a recurrent neural network in Keras.

Description


This repo contains Python scripts for determining feature importance and high-order interactions in a recurrent neural network in Keras. The idea is to train a model, then selectively nullify features and recalculate accuracy; if the accuracy changes significantly, the nullified feature(s) are considered important. The scripts contain functions to recalculate accuracy after nullifying features one at a time, in n-sized chunks of adjacent features in the sequence (a chosen gram size), or, for high-order interactions, over all combinations of features. These functions work for data formatted for standard timestep processing, i.e. (sample size, # timesteps, # feature categories). Later, I may extend this method to convnets to find feature importance in images, but I suspect that will be too computationally demanding.

Pipeline


As stated earlier, the data should be in the format (n, # timesteps, # feature categories); e.g. if you were modelling text, it would be (# passages, length of passages, one-hot encoding of the word that appears at each timestep). The Data script contains some functions for generating random data in this format. The Models script currently builds a simple bidirectional LSTM with binary output, so if you use this model your labels will need to be binary and in the format (n,).
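To make the expected shapes concrete, here is a purely illustrative way to build a toy one-hot dataset of that form (the variable names are my own, not part of the package):

import numpy as np

n, timesteps, categories = 100, 10, 4
# one random category per timestep per sample, expanded into one-hot vectors
idx = np.random.randint(categories, size=(n, timesteps))
features = np.eye(categories)[idx]            # shape (100, 10, 4)
labels = np.random.randint(2, size=(n,))      # binary labels, shape (n,)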

There are two ways to knock out data. One is to knock out data and retrain the model after each knockout, then compare model accuracy. Because this requires retraining for every knockout, it is not feasible for high-order interactions: a sample with 10 timesteps requires over 1,000 iterations. Instead, I propose that you split your data (e.g. with sklearn's train_test_split) into two numpy arrays and use one of the arrays to train the model.
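A minimal sketch of that split, assuming the features and labels arrays from the sketch above (placeholder names, not the package's own):

from sklearn.model_selection import train_test_split

# half the data trains the model; the other half is reserved for the knockout process
train_features, knockout_features, train_labels, knockout_labels = train_test_split(
    features, labels, test_size=0.5)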

After the model has been trained, use the second array for the knockout process. First, use the trained model to generate predictions, and run Mean_Log_Loss from the Metrics script on those predictions to calculate a baseline accuracy for the model. Then (also in the Metrics script) use your desired feature knockout: Single_Iterative_Knockout, N_Gram_Iterative_Knockout, or High_Order_Iterative_Knockout. This performs the knockout process accordingly and generates predictions for each iteration. It then calculates the change in accuracy using the same cost function (mean log loss) as before and returns a list of accuracy changes and a list of indices; a higher number means that iteration had a more significant effect on the model's accuracy.
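Conceptually, the single-feature version of that loop looks something like the sketch below. This is an illustration of the idea, not the package's exact implementation (loss_fn stands in for Mean_Log_Loss as described in the Usage section):

import numpy as np

def single_knockout_sketch(features_knockout, model, labels, baseline, loss_fn):
    changes = []
    for t in range(features_knockout.shape[1]):      # one timestep at a time
        knocked = features_knockout.copy()
        knocked[:, t, :] = 0                         # nullify that timestep for every sample
        preds = model.predict(knocked).reshape(len(labels),).tolist()
        changes.append(abs(loss_fn(preds, labels) - baseline))   # change vs. baseline accuracy
    return changes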

Usage


Metrics

Mean_Log_Loss(predictions, labels, limit=10)

  • Description: Calculates the accuracy of predictions using mean log loss as the cost.
  • predictions: A list of predictions as output by the model.
  • labels: An array of labels that correspond, sequentially, to the predictions.
  • limit: An integer; 10^-limit is added to the difference between predictions and labels to avoid taking the log of 0, which happens when Keras predicts a value essentially identical to the label.
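The exact formula isn't spelled out above, so the following is only my reading of the description: presumably the mean negative log of the absolute prediction error, floored at 10^-limit (base-10 logs are an assumption on my part, though they would be consistent with the baseline value printed in the example further down):

import numpy as np

def mean_log_loss_sketch(predictions, labels, limit=10):
    # assumed interpretation: average of -log10(|prediction - label| + 10**-limit)
    eps = 10 ** -limit
    errors = np.abs(np.asarray(predictions, dtype=float) - np.asarray(labels, dtype=float))
    return float(np.mean(-np.log10(errors + eps)))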

Single_Iterative_Knockout(features_knockout, model, labels, baseline)

  • Description: Calculates feature importance, one feature at a time.
  • features_knockout: An array of features that have been set aside to perform the knockout on.
  • model: A trained Keras model for predicting features_knockout.
  • labels: An array of labels to correspond with features_knockout.
  • baseline: A float outputted from Mean_Log_Loss.

N_Gram_Iterative_Knockout(features_knockout, model, labels, baseline, gram_size=2)

  • Description: Calculates feature importance in sequential chunks of features of a determined length.
  • features_knockout: An array of features that have been set aside to perform the knockout on.
  • model: A trained Keras model for predicting features_knockout.
  • labels: An array of labels to correspond with features_knockout.
  • baseline: A float outputted from Mean_Log_Loss.
  • gram_size: Number of adjacent features to knock out together in each iteration.
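For reference, the knocked-out windows are contiguous runs of timesteps. A quick sketch that reproduces the index strings returned in the example further down ('0:2', '1:3', …, '6:8' for gram_size = 3 on 10 timesteps); purely illustrative:

gram_size, timesteps = 3, 10
windows = [list(range(start, start + gram_size))
           for start in range(timesteps - gram_size)]
# [[0, 1, 2], [1, 2, 3], ..., [6, 7, 8]] -- reported as '0:2', '1:3', ..., '6:8'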

High_Order_Iterative_Knockout(features_knockout, model, labels, baseline)

  • Description: Calculates feature importance for all possible combinations of features.
  • features_knockout: An array of features that have been set aside to perform the knockout on.
  • model: A trained Keras model for predicting features_knockout.
  • labels: An array of labels to correspond with features_knockout.
  • baseline: A float outputted from Mean_Log_Loss.
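The combinations appear to be enumerated smallest-first, exactly as in the index list printed in the example below. A sketch of that enumeration with itertools (my own reconstruction from the printed output):

from itertools import combinations

timesteps = 10
combos = [list(c) for size in range(1, timesteps)
          for c in combinations(range(timesteps), size)]
# [[0], [1], ..., [9], [0, 1], [0, 2], ..., [1, 2, 3, 4, 5, 6, 7, 8, 9]]
# 2**10 - 2 = 1022 subsets, hence the "over 1,000 iterations" noted in the Pipeline section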

Data

Create_Features(sample_size, sequence_length, feature_length)

  • Description: Creates random features based on parameters.
  • sample_size: Desired sample size.
  • sequence_length: Desired sequence length.
  • feature_length: Desired feature length.

Create_Labels(sample_size)

  • Description: Creates random, binary labels.
  • sample_size: Desired sample size.

Sample_Data(sample_size, sequence_length, target)

  • Description: A function to generate data that emulates DNA and targets a specified site for association with a positive label.
  • sample_size: Desired sample size.
  • sequence_length: Desired sequence length.
  • target: An integer of a timestep to be highly associated with a positive label.
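I haven't reproduced the package's generator line for line, but a minimal stand-in with the same contract (random one-hot "nucleotide" features, label 1 exactly when a "G" occupies the target site) could look like this; which column plays the role of "G" is an arbitrary choice in this sketch:

import numpy as np

def sample_data_sketch(sample_size, sequence_length, target, n_nucleotides=4):
    # column 1 arbitrarily stands in for "G" here
    idx = np.random.randint(n_nucleotides, size=(sample_size, sequence_length))
    features = np.eye(n_nucleotides)[idx]              # (sample_size, sequence_length, 4)
    labels = (idx[:, target] == 1).astype(int)         # 1 iff "G" sits at the target site
    return features, labels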

Models

Create_RNN(input_shape)

  • Description: Creates a simple bidirectional LSTM.
  • input_shape: A tuple specifying the input shape, excluding the batch dimension, e.g. (# timesteps, # feature categories).
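The exact layer definitions live in the Models script; a model consistent with the model.summary() printed in the example below would look roughly like this (the layer widths follow from the printed parameter counts, while the dropout rates and activations are my assumptions):

from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, Dropout, GlobalMaxPooling1D

def create_rnn_sketch(input_shape):
    # four stacked bidirectional LSTMs, global max pooling, then a small dense head
    model = Sequential()
    model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=input_shape))
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(256, return_sequences=True)))
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(256, return_sequences=True)))
    model.add(Dropout(0.5))
    model.add(Bidirectional(LSTM(128, return_sequences=True)))
    model.add(GlobalMaxPooling1D())
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(8, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    return model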

Example


Here is an example of how to perform the iterative knockout using some sample data that I made to emulate DNA. The features have n samples, sequence length x, and 4 possible nucleotides (A, G, C, U). The labels are binary; think of it as having a disease (0) or not having a disease (1). In this example I target a particular site: if a “G” appears at that site, the person is automatically labelled a 1, and if the sequence does not contain a “G” at the target site, it is labelled a 0. Thus the target site should be highly important to the accuracy of the model.

In [1]:
from Data import Sample_Data
from Models import Create_RNN
from keras.callbacks import EarlyStopping
from Metrics import Mean_Log_Loss, Single_Iterative_Knockout, N_Gram_Iterative_Knockout, High_Order_Iterative_Knockout
import numpy as np


# Set some parameters - index 5 (the 6th feature) will be highly associated with a positive label.
sample_size = 1000
sequence_length = 10
target = 5

# Generate data that will be used to train the model.
features, phenotype = Sample_Data(sample_size = sample_size, sequence_length = sequence_length, target = target)
print("features shape: ", features.shape)
print("labels shape: ", phenotype.shape)

# Generate some separate data that will be used for predictions.
features_test, phenotype_test = Sample_Data(sample_size = sample_size, sequence_length = sequence_length, target = target)

# Train the model
model = Create_RNN(features.shape[1:])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics =['accuracy'])
model.summary()
model.fit(features, phenotype,
                 batch_size=32,
                 validation_split=0.2,
                 epochs=50,
                 callbacks=[EarlyStopping(patience=4, monitor='val_loss')],
                 verbose=0)

# Get a baseline accuracy for the model
predictions = model.predict(features_test).reshape(sample_size,).tolist()
accuracy = Mean_Log_Loss(predictions = predictions, labels = phenotype_test)
print("baseline accuracy: ",accuracy,"\n")
Using TensorFlow backend.
features shape:  (1000, 10, 4)
labels shape:  (1000,)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional_1 (Bidirection (None, 10, 256)           136192    
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 256)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 10, 512)           1050624   
_________________________________________________________________
dropout_2 (Dropout)          (None, 10, 512)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 10, 512)           1574912   
_________________________________________________________________
dropout_3 (Dropout)          (None, 10, 512)           0         
_________________________________________________________________
bidirectional_4 (Bidirection (None, 10, 256)           656384    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 256)               0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448     
_________________________________________________________________
dropout_5 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dropout_6 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 8)                 264       
_________________________________________________________________
dropout_7 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 9         
=================================================================
Total params: 3,436,913
Trainable params: 3,436,913
Non-trainable params: 0
_________________________________________________________________
baseline accuracy:  9.527926866054859 

In [2]:
# Generate single knockout predictions (the 5th index should be significantly higher than others).

single_iterative_knockout = Single_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test)
print("single iterative knockout accuracy change: ")
print(single_iterative_knockout)
single iterative knockout accuracy change: 
[0.035108970935823436, 0.03437247355481432, 0.0013010734957124015, 0.0008233878212315915, 0.1768184358479452, 3.798454558710884, 0.014289971848889138, 0.0628687005193882, 0.0910105793758742, 0.06518626163800967]

Notice that the target feature (index 5, the 6th timestep) is roughly one to three orders of magnitude higher than the others.

In [3]:
# Generate knockout predictions for a specified gram size - I'll use 3.

n_gram_knockout, index = N_Gram_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test, gram_size = 3)
print("n-gram knockout accuracy change: ")
print( n_gram_knockout,"\n")
print("index: ")
print(index)
n-gram knockout accuracy change: 
[0.04639988736460943, 0.027510534638317097, 0.20111947926918639, 2.9743778062897963, 4.689957969618205, 5.048124117883249, 0.31644427887241555] 

index: 
['0:2', '1:3', '2:4', '3:5', '4:6', '5:7', '6:8']

In the n-gram knockout, the iterations that knock out the target feature (‘3:5’, ‘4:6’, and ‘5:7’) are about an order of magnitude higher than the others. In this case, though, increasing the gram size will decrease the significance of containing the target feature, since there is no association between the other features and a positive phenotype in my sample data.

In [4]:
# Generate high-order iterative knockouts

high_order_knockout, index = High_Order_Iterative_Knockout(features_knockout = features_test, model = model, baseline = accuracy, labels = phenotype_test)
print("high-order knockout accuracy change: ")
print( high_order_knockout,"\n")
print("index: ")
print(index)
high-order knockout accuracy change: 
[0.04639988736460943, 0.04639988736460943, 0.04639988736460943, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.04639988736460943, 0.04639988736460943, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.04639988736460943, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 0.04639988736460943, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 
0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 0.027510534638317097, 0.19488173720568724, 3.374308469595647, 0.052744375314214054, 0.021825068367524736, 0.08098510601248954, 0.037281014078338615, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 
0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 0.1843035110784097, 2.8088940414296495, 0.07287416234685118, 0.04626442049646151, 0.11138933160063935, 0.10936076886751422, 3.2698276957628627, 0.21968805385313317, 0.3666712373227057, 0.22230182448790003, 0.1713302152511229, 4.734180187251919, 3.587362501682242, 3.8294607064979944, 3.1624288135366916, 0.058666419199772335, 0.2443440964969703, 0.20717623728584655, 0.12392009172628704, 0.09646807570000782, 0.19132259936976403, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 
0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 2.8547430207062945, 0.22634134145111062, 0.4258830698531426, 0.17796891004468307, 0.12885996909805364, 4.139879116486767, 3.101583556579527, 3.192691031955686, 2.7153131343598096, 0.10326163841709324, 0.21587628772912737, 0.23772187707904102, 0.20051989744865395, 0.19039203963900242, 0.2543550890908328, 4.589429408365294, 3.7375721227714562, 3.8310065575323113, 3.0131735230689243, 0.47874715854687366, 0.1147516393373742, 0.3312577684624962, 0.42477877108902184, 0.3489568684626274, 0.18006912643432926, 5.08718343631219, 5.223274698710005, 4.294880209017686, 4.138878833943083, 3.412256736956966, 3.4387810790804547, 0.23199265140118186, 0.2389331534427157, 0.31208246678238183, 0.37878847948599415, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 
5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.011326184333655, 3.3117475583521054, 3.339840921294705, 2.655441854777508, 0.48810310907683174, 0.15497940659520104, 0.3360603418561823, 0.42916747713236525, 0.32766921702108753, 0.05705086328592124, 4.558700492244825, 4.669377151267385, 3.8570187767339146, 3.5944074116594242, 2.912443437564617, 2.9390324634322456, 0.14171135538763302, 0.28671645392682876, 0.2958867779738412, 0.6366051749635133, 5.179024168056652, 5.07305140439398, 4.261627682482974, 4.422894010316348, 3.446866642270577, 3.5197406973039413, 0.21811468689800861, 0.44897167215508205, 0.3011154605226807, 0.3034038331691189, 5.426047095767712, 4.434496848332272, 4.570317689284098, 3.8020605047401324, 0.3434449264755397, 
4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.479078899831331, 4.595816747894877, 4.346907651406751, 3.7668426144412477, 3.9010662657712603, 3.006074844714525, 3.083025112948973, 0.2622363171505402, 0.5417426421013278, 0.42989756112238986, 0.2147564552259169, 4.963530222873048, 4.0619534732783436, 4.150392074036729, 3.1149282590369136, 0.6926671256785912, 5.613381572963104, 4.720526925571363, 4.613128077309473, 3.956539323891146, 0.44631103406578276, 4.509252502431731, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.479078899831331, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.479078899831331, 4.479078899831331, 5.0122864081563305, 4.238077980583448, 4.068178043485734, 3.474812298397164, 0.01389200829499515, 4.16032057025792, 4.911473800911488, 4.479078899831331, 4.479078899831331, 4.479078899831331] 

index: 
[[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [0, 6], [0, 7], [0, 8], [0, 9], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [1, 9], [2, 3], [2, 4], [2, 5], [2, 6], [2, 7], [2, 8], [2, 9], [3, 4], [3, 5], [3, 6], [3, 7], [3, 8], [3, 9], [4, 5], [4, 6], [4, 7], [4, 8], [4, 9], [5, 6], [5, 7], [5, 8], [5, 9], [6, 7], [6, 8], [6, 9], [7, 8], [7, 9], [8, 9], [0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 1, 5], [0, 1, 6], [0, 1, 7], [0, 1, 8], [0, 1, 9], [0, 2, 3], [0, 2, 4], [0, 2, 5], [0, 2, 6], [0, 2, 7], [0, 2, 8], [0, 2, 9], [0, 3, 4], [0, 3, 5], [0, 3, 6], [0, 3, 7], [0, 3, 8], [0, 3, 9], [0, 4, 5], [0, 4, 6], [0, 4, 7], [0, 4, 8], [0, 4, 9], [0, 5, 6], [0, 5, 7], [0, 5, 8], [0, 5, 9], [0, 6, 7], [0, 6, 8], [0, 6, 9], [0, 7, 8], [0, 7, 9], [0, 8, 9], [1, 2, 3], [1, 2, 4], [1, 2, 5], [1, 2, 6], [1, 2, 7], [1, 2, 8], [1, 2, 9], [1, 3, 4], [1, 3, 5], [1, 3, 6], [1, 3, 7], [1, 3, 8], [1, 3, 9], [1, 4, 5], [1, 4, 6], [1, 4, 7], [1, 4, 8], [1, 4, 9], [1, 5, 6], [1, 5, 7], [1, 5, 8], [1, 5, 9], [1, 6, 7], [1, 6, 8], [1, 6, 9], [1, 7, 8], [1, 7, 9], [1, 8, 9], [2, 3, 4], [2, 3, 5], [2, 3, 6], [2, 3, 7], [2, 3, 8], [2, 3, 9], [2, 4, 5], [2, 4, 6], [2, 4, 7], [2, 4, 8], [2, 4, 9], [2, 5, 6], [2, 5, 7], [2, 5, 8], [2, 5, 9], [2, 6, 7], [2, 6, 8], [2, 6, 9], [2, 7, 8], [2, 7, 9], [2, 8, 9], [3, 4, 5], [3, 4, 6], [3, 4, 7], [3, 4, 8], [3, 4, 9], [3, 5, 6], [3, 5, 7], [3, 5, 8], [3, 5, 9], [3, 6, 7], [3, 6, 8], [3, 6, 9], [3, 7, 8], [3, 7, 9], [3, 8, 9], [4, 5, 6], [4, 5, 7], [4, 5, 8], [4, 5, 9], [4, 6, 7], [4, 6, 8], [4, 6, 9], [4, 7, 8], [4, 7, 9], [4, 8, 9], [5, 6, 7], [5, 6, 8], [5, 6, 9], [5, 7, 8], [5, 7, 9], [5, 8, 9], [6, 7, 8], [6, 7, 9], [6, 8, 9], [7, 8, 9], [0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 2, 5], [0, 1, 2, 6], [0, 1, 2, 7], [0, 1, 2, 8], [0, 1, 2, 9], [0, 1, 3, 4], [0, 1, 3, 5], [0, 1, 3, 6], [0, 1, 3, 7], [0, 1, 3, 8], [0, 1, 3, 9], [0, 1, 4, 5], [0, 1, 4, 6], [0, 1, 4, 7], [0, 1, 4, 8], [0, 1, 4, 9], [0, 1, 5, 6], [0, 1, 5, 7], [0, 1, 5, 8], [0, 1, 5, 9], [0, 1, 6, 7], [0, 1, 6, 8], [0, 1, 6, 9], [0, 1, 7, 8], [0, 1, 7, 9], [0, 1, 8, 9], [0, 2, 3, 4], [0, 2, 3, 5], [0, 2, 3, 6], [0, 2, 3, 7], [0, 2, 3, 8], [0, 2, 3, 9], [0, 2, 4, 5], [0, 2, 4, 6], [0, 2, 4, 7], [0, 2, 4, 8], [0, 2, 4, 9], [0, 2, 5, 6], [0, 2, 5, 7], [0, 2, 5, 8], [0, 2, 5, 9], [0, 2, 6, 7], [0, 2, 6, 8], [0, 2, 6, 9], [0, 2, 7, 8], [0, 2, 7, 9], [0, 2, 8, 9], [0, 3, 4, 5], [0, 3, 4, 6], [0, 3, 4, 7], [0, 3, 4, 8], [0, 3, 4, 9], [0, 3, 5, 6], [0, 3, 5, 7], [0, 3, 5, 8], [0, 3, 5, 9], [0, 3, 6, 7], [0, 3, 6, 8], [0, 3, 6, 9], [0, 3, 7, 8], [0, 3, 7, 9], [0, 3, 8, 9], [0, 4, 5, 6], [0, 4, 5, 7], [0, 4, 5, 8], [0, 4, 5, 9], [0, 4, 6, 7], [0, 4, 6, 8], [0, 4, 6, 9], [0, 4, 7, 8], [0, 4, 7, 9], [0, 4, 8, 9], [0, 5, 6, 7], [0, 5, 6, 8], [0, 5, 6, 9], [0, 5, 7, 8], [0, 5, 7, 9], [0, 5, 8, 9], [0, 6, 7, 8], [0, 6, 7, 9], [0, 6, 8, 9], [0, 7, 8, 9], [1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 6], [1, 2, 3, 7], [1, 2, 3, 8], [1, 2, 3, 9], [1, 2, 4, 5], [1, 2, 4, 6], [1, 2, 4, 7], [1, 2, 4, 8], [1, 2, 4, 9], [1, 2, 5, 6], [1, 2, 5, 7], [1, 2, 5, 8], [1, 2, 5, 9], [1, 2, 6, 7], [1, 2, 6, 8], [1, 2, 6, 9], [1, 2, 7, 8], [1, 2, 7, 9], [1, 2, 8, 9], [1, 3, 4, 5], [1, 3, 4, 6], [1, 3, 4, 7], [1, 3, 4, 8], [1, 3, 4, 9], [1, 3, 5, 6], [1, 3, 5, 7], [1, 3, 5, 8], [1, 3, 5, 9], [1, 3, 6, 7], [1, 3, 6, 8], [1, 3, 6, 9], [1, 3, 7, 8], [1, 3, 7, 9], [1, 3, 8, 9], [1, 4, 5, 6], [1, 4, 5, 7], [1, 4, 5, 8], [1, 4, 5, 9], [1, 4, 6, 7], [1, 4, 6, 8], [1, 4, 6, 9], [1, 4, 7, 8], [1, 4, 7, 9], [1, 4, 8, 9], [1, 
5, 6, 7], [1, 5, 6, 8], [1, 5, 6, 9], [1, 5, 7, 8], [1, 5, 7, 9], [1, 5, 8, 9], [1, 6, 7, 8], [1, 6, 7, 9], [1, 6, 8, 9], [1, 7, 8, 9], [2, 3, 4, 5], [2, 3, 4, 6], [2, 3, 4, 7], [2, 3, 4, 8], [2, 3, 4, 9], [2, 3, 5, 6], [2, 3, 5, 7], [2, 3, 5, 8], [2, 3, 5, 9], [2, 3, 6, 7], [2, 3, 6, 8], [2, 3, 6, 9], [2, 3, 7, 8], [2, 3, 7, 9], [2, 3, 8, 9], [2, 4, 5, 6], [2, 4, 5, 7], [2, 4, 5, 8], [2, 4, 5, 9], [2, 4, 6, 7], [2, 4, 6, 8], [2, 4, 6, 9], [2, 4, 7, 8], [2, 4, 7, 9], [2, 4, 8, 9], [2, 5, 6, 7], [2, 5, 6, 8], [2, 5, 6, 9], [2, 5, 7, 8], [2, 5, 7, 9], [2, 5, 8, 9], [2, 6, 7, 8], [2, 6, 7, 9], [2, 6, 8, 9], [2, 7, 8, 9], [3, 4, 5, 6], [3, 4, 5, 7], [3, 4, 5, 8], [3, 4, 5, 9], [3, 4, 6, 7], [3, 4, 6, 8], [3, 4, 6, 9], [3, 4, 7, 8], [3, 4, 7, 9], [3, 4, 8, 9], [3, 5, 6, 7], [3, 5, 6, 8], [3, 5, 6, 9], [3, 5, 7, 8], [3, 5, 7, 9], [3, 5, 8, 9], [3, 6, 7, 8], [3, 6, 7, 9], [3, 6, 8, 9], [3, 7, 8, 9], [4, 5, 6, 7], [4, 5, 6, 8], [4, 5, 6, 9], [4, 5, 7, 8], [4, 5, 7, 9], [4, 5, 8, 9], [4, 6, 7, 8], [4, 6, 7, 9], [4, 6, 8, 9], [4, 7, 8, 9], [5, 6, 7, 8], [5, 6, 7, 9], [5, 6, 8, 9], [5, 7, 8, 9], [6, 7, 8, 9], [0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 3, 6], [0, 1, 2, 3, 7], [0, 1, 2, 3, 8], [0, 1, 2, 3, 9], [0, 1, 2, 4, 5], [0, 1, 2, 4, 6], [0, 1, 2, 4, 7], [0, 1, 2, 4, 8], [0, 1, 2, 4, 9], [0, 1, 2, 5, 6], [0, 1, 2, 5, 7], [0, 1, 2, 5, 8], [0, 1, 2, 5, 9], [0, 1, 2, 6, 7], [0, 1, 2, 6, 8], [0, 1, 2, 6, 9], [0, 1, 2, 7, 8], [0, 1, 2, 7, 9], [0, 1, 2, 8, 9], [0, 1, 3, 4, 5], [0, 1, 3, 4, 6], [0, 1, 3, 4, 7], [0, 1, 3, 4, 8], [0, 1, 3, 4, 9], [0, 1, 3, 5, 6], [0, 1, 3, 5, 7], [0, 1, 3, 5, 8], [0, 1, 3, 5, 9], [0, 1, 3, 6, 7], [0, 1, 3, 6, 8], [0, 1, 3, 6, 9], [0, 1, 3, 7, 8], [0, 1, 3, 7, 9], [0, 1, 3, 8, 9], [0, 1, 4, 5, 6], [0, 1, 4, 5, 7], [0, 1, 4, 5, 8], [0, 1, 4, 5, 9], [0, 1, 4, 6, 7], [0, 1, 4, 6, 8], [0, 1, 4, 6, 9], [0, 1, 4, 7, 8], [0, 1, 4, 7, 9], [0, 1, 4, 8, 9], [0, 1, 5, 6, 7], [0, 1, 5, 6, 8], [0, 1, 5, 6, 9], [0, 1, 5, 7, 8], [0, 1, 5, 7, 9], [0, 1, 5, 8, 9], [0, 1, 6, 7, 8], [0, 1, 6, 7, 9], [0, 1, 6, 8, 9], [0, 1, 7, 8, 9], [0, 2, 3, 4, 5], [0, 2, 3, 4, 6], [0, 2, 3, 4, 7], [0, 2, 3, 4, 8], [0, 2, 3, 4, 9], [0, 2, 3, 5, 6], [0, 2, 3, 5, 7], [0, 2, 3, 5, 8], [0, 2, 3, 5, 9], [0, 2, 3, 6, 7], [0, 2, 3, 6, 8], [0, 2, 3, 6, 9], [0, 2, 3, 7, 8], [0, 2, 3, 7, 9], [0, 2, 3, 8, 9], [0, 2, 4, 5, 6], [0, 2, 4, 5, 7], [0, 2, 4, 5, 8], [0, 2, 4, 5, 9], [0, 2, 4, 6, 7], [0, 2, 4, 6, 8], [0, 2, 4, 6, 9], [0, 2, 4, 7, 8], [0, 2, 4, 7, 9], [0, 2, 4, 8, 9], [0, 2, 5, 6, 7], [0, 2, 5, 6, 8], [0, 2, 5, 6, 9], [0, 2, 5, 7, 8], [0, 2, 5, 7, 9], [0, 2, 5, 8, 9], [0, 2, 6, 7, 8], [0, 2, 6, 7, 9], [0, 2, 6, 8, 9], [0, 2, 7, 8, 9], [0, 3, 4, 5, 6], [0, 3, 4, 5, 7], [0, 3, 4, 5, 8], [0, 3, 4, 5, 9], [0, 3, 4, 6, 7], [0, 3, 4, 6, 8], [0, 3, 4, 6, 9], [0, 3, 4, 7, 8], [0, 3, 4, 7, 9], [0, 3, 4, 8, 9], [0, 3, 5, 6, 7], [0, 3, 5, 6, 8], [0, 3, 5, 6, 9], [0, 3, 5, 7, 8], [0, 3, 5, 7, 9], [0, 3, 5, 8, 9], [0, 3, 6, 7, 8], [0, 3, 6, 7, 9], [0, 3, 6, 8, 9], [0, 3, 7, 8, 9], [0, 4, 5, 6, 7], [0, 4, 5, 6, 8], [0, 4, 5, 6, 9], [0, 4, 5, 7, 8], [0, 4, 5, 7, 9], [0, 4, 5, 8, 9], [0, 4, 6, 7, 8], [0, 4, 6, 7, 9], [0, 4, 6, 8, 9], [0, 4, 7, 8, 9], [0, 5, 6, 7, 8], [0, 5, 6, 7, 9], [0, 5, 6, 8, 9], [0, 5, 7, 8, 9], [0, 6, 7, 8, 9], [1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [1, 2, 3, 4, 7], [1, 2, 3, 4, 8], [1, 2, 3, 4, 9], [1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 5, 8], [1, 2, 3, 5, 9], [1, 2, 3, 6, 7], [1, 2, 3, 6, 8], [1, 2, 3, 6, 9], [1, 2, 3, 7, 8], [1, 2, 3, 7, 9], [1, 2, 3, 8, 9], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 
4, 5, 8], [1, 2, 4, 5, 9], [1, 2, 4, 6, 7], [1, 2, 4, 6, 8], [1, 2, 4, 6, 9], [1, 2, 4, 7, 8], [1, 2, 4, 7, 9], [1, 2, 4, 8, 9], [1, 2, 5, 6, 7], [1, 2, 5, 6, 8], [1, 2, 5, 6, 9], [1, 2, 5, 7, 8], [1, 2, 5, 7, 9], [1, 2, 5, 8, 9], [1, 2, 6, 7, 8], [1, 2, 6, 7, 9], [1, 2, 6, 8, 9], [1, 2, 7, 8, 9], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 5, 8], [1, 3, 4, 5, 9], [1, 3, 4, 6, 7], [1, 3, 4, 6, 8], [1, 3, 4, 6, 9], [1, 3, 4, 7, 8], [1, 3, 4, 7, 9], [1, 3, 4, 8, 9], [1, 3, 5, 6, 7], [1, 3, 5, 6, 8], [1, 3, 5, 6, 9], [1, 3, 5, 7, 8], [1, 3, 5, 7, 9], [1, 3, 5, 8, 9], [1, 3, 6, 7, 8], [1, 3, 6, 7, 9], [1, 3, 6, 8, 9], [1, 3, 7, 8, 9], [1, 4, 5, 6, 7], [1, 4, 5, 6, 8], [1, 4, 5, 6, 9], [1, 4, 5, 7, 8], [1, 4, 5, 7, 9], [1, 4, 5, 8, 9], [1, 4, 6, 7, 8], [1, 4, 6, 7, 9], [1, 4, 6, 8, 9], [1, 4, 7, 8, 9], [1, 5, 6, 7, 8], [1, 5, 6, 7, 9], [1, 5, 6, 8, 9], [1, 5, 7, 8, 9], [1, 6, 7, 8, 9], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 5, 8], [2, 3, 4, 5, 9], [2, 3, 4, 6, 7], [2, 3, 4, 6, 8], [2, 3, 4, 6, 9], [2, 3, 4, 7, 8], [2, 3, 4, 7, 9], [2, 3, 4, 8, 9], [2, 3, 5, 6, 7], [2, 3, 5, 6, 8], [2, 3, 5, 6, 9], [2, 3, 5, 7, 8], [2, 3, 5, 7, 9], [2, 3, 5, 8, 9], [2, 3, 6, 7, 8], [2, 3, 6, 7, 9], [2, 3, 6, 8, 9], [2, 3, 7, 8, 9], [2, 4, 5, 6, 7], [2, 4, 5, 6, 8], [2, 4, 5, 6, 9], [2, 4, 5, 7, 8], [2, 4, 5, 7, 9], [2, 4, 5, 8, 9], [2, 4, 6, 7, 8], [2, 4, 6, 7, 9], [2, 4, 6, 8, 9], [2, 4, 7, 8, 9], [2, 5, 6, 7, 8], [2, 5, 6, 7, 9], [2, 5, 6, 8, 9], [2, 5, 7, 8, 9], [2, 6, 7, 8, 9], [3, 4, 5, 6, 7], [3, 4, 5, 6, 8], [3, 4, 5, 6, 9], [3, 4, 5, 7, 8], [3, 4, 5, 7, 9], [3, 4, 5, 8, 9], [3, 4, 6, 7, 8], [3, 4, 6, 7, 9], [3, 4, 6, 8, 9], [3, 4, 7, 8, 9], [3, 5, 6, 7, 8], [3, 5, 6, 7, 9], [3, 5, 6, 8, 9], [3, 5, 7, 8, 9], [3, 6, 7, 8, 9], [4, 5, 6, 7, 8], [4, 5, 6, 7, 9], [4, 5, 6, 8, 9], [4, 5, 7, 8, 9], [4, 6, 7, 8, 9], [5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 6], [0, 1, 2, 3, 4, 7], [0, 1, 2, 3, 4, 8], [0, 1, 2, 3, 4, 9], [0, 1, 2, 3, 5, 6], [0, 1, 2, 3, 5, 7], [0, 1, 2, 3, 5, 8], [0, 1, 2, 3, 5, 9], [0, 1, 2, 3, 6, 7], [0, 1, 2, 3, 6, 8], [0, 1, 2, 3, 6, 9], [0, 1, 2, 3, 7, 8], [0, 1, 2, 3, 7, 9], [0, 1, 2, 3, 8, 9], [0, 1, 2, 4, 5, 6], [0, 1, 2, 4, 5, 7], [0, 1, 2, 4, 5, 8], [0, 1, 2, 4, 5, 9], [0, 1, 2, 4, 6, 7], [0, 1, 2, 4, 6, 8], [0, 1, 2, 4, 6, 9], [0, 1, 2, 4, 7, 8], [0, 1, 2, 4, 7, 9], [0, 1, 2, 4, 8, 9], [0, 1, 2, 5, 6, 7], [0, 1, 2, 5, 6, 8], [0, 1, 2, 5, 6, 9], [0, 1, 2, 5, 7, 8], [0, 1, 2, 5, 7, 9], [0, 1, 2, 5, 8, 9], [0, 1, 2, 6, 7, 8], [0, 1, 2, 6, 7, 9], [0, 1, 2, 6, 8, 9], [0, 1, 2, 7, 8, 9], [0, 1, 3, 4, 5, 6], [0, 1, 3, 4, 5, 7], [0, 1, 3, 4, 5, 8], [0, 1, 3, 4, 5, 9], [0, 1, 3, 4, 6, 7], [0, 1, 3, 4, 6, 8], [0, 1, 3, 4, 6, 9], [0, 1, 3, 4, 7, 8], [0, 1, 3, 4, 7, 9], [0, 1, 3, 4, 8, 9], [0, 1, 3, 5, 6, 7], [0, 1, 3, 5, 6, 8], [0, 1, 3, 5, 6, 9], [0, 1, 3, 5, 7, 8], [0, 1, 3, 5, 7, 9], [0, 1, 3, 5, 8, 9], [0, 1, 3, 6, 7, 8], [0, 1, 3, 6, 7, 9], [0, 1, 3, 6, 8, 9], [0, 1, 3, 7, 8, 9], [0, 1, 4, 5, 6, 7], [0, 1, 4, 5, 6, 8], [0, 1, 4, 5, 6, 9], [0, 1, 4, 5, 7, 8], [0, 1, 4, 5, 7, 9], [0, 1, 4, 5, 8, 9], [0, 1, 4, 6, 7, 8], [0, 1, 4, 6, 7, 9], [0, 1, 4, 6, 8, 9], [0, 1, 4, 7, 8, 9], [0, 1, 5, 6, 7, 8], [0, 1, 5, 6, 7, 9], [0, 1, 5, 6, 8, 9], [0, 1, 5, 7, 8, 9], [0, 1, 6, 7, 8, 9], [0, 2, 3, 4, 5, 6], [0, 2, 3, 4, 5, 7], [0, 2, 3, 4, 5, 8], [0, 2, 3, 4, 5, 9], [0, 2, 3, 4, 6, 7], [0, 2, 3, 4, 6, 8], [0, 2, 3, 4, 6, 9], [0, 2, 3, 4, 7, 8], [0, 2, 3, 4, 7, 9], [0, 2, 3, 4, 8, 9], [0, 2, 3, 5, 6, 7], [0, 2, 3, 5, 6, 8], [0, 2, 3, 5, 6, 9], [0, 2, 3, 5, 7, 8], [0, 2, 3, 5, 7, 9], [0, 2, 
3, 5, 8, 9], [0, 2, 3, 6, 7, 8], [0, 2, 3, 6, 7, 9], [0, 2, 3, 6, 8, 9], [0, 2, 3, 7, 8, 9], [0, 2, 4, 5, 6, 7], [0, 2, 4, 5, 6, 8], [0, 2, 4, 5, 6, 9], [0, 2, 4, 5, 7, 8], [0, 2, 4, 5, 7, 9], [0, 2, 4, 5, 8, 9], [0, 2, 4, 6, 7, 8], [0, 2, 4, 6, 7, 9], [0, 2, 4, 6, 8, 9], [0, 2, 4, 7, 8, 9], [0, 2, 5, 6, 7, 8], [0, 2, 5, 6, 7, 9], [0, 2, 5, 6, 8, 9], [0, 2, 5, 7, 8, 9], [0, 2, 6, 7, 8, 9], [0, 3, 4, 5, 6, 7], [0, 3, 4, 5, 6, 8], [0, 3, 4, 5, 6, 9], [0, 3, 4, 5, 7, 8], [0, 3, 4, 5, 7, 9], [0, 3, 4, 5, 8, 9], [0, 3, 4, 6, 7, 8], [0, 3, 4, 6, 7, 9], [0, 3, 4, 6, 8, 9], [0, 3, 4, 7, 8, 9], [0, 3, 5, 6, 7, 8], [0, 3, 5, 6, 7, 9], [0, 3, 5, 6, 8, 9], [0, 3, 5, 7, 8, 9], [0, 3, 6, 7, 8, 9], [0, 4, 5, 6, 7, 8], [0, 4, 5, 6, 7, 9], [0, 4, 5, 6, 8, 9], [0, 4, 5, 7, 8, 9], [0, 4, 6, 7, 8, 9], [0, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 7], [1, 2, 3, 4, 5, 8], [1, 2, 3, 4, 5, 9], [1, 2, 3, 4, 6, 7], [1, 2, 3, 4, 6, 8], [1, 2, 3, 4, 6, 9], [1, 2, 3, 4, 7, 8], [1, 2, 3, 4, 7, 9], [1, 2, 3, 4, 8, 9], [1, 2, 3, 5, 6, 7], [1, 2, 3, 5, 6, 8], [1, 2, 3, 5, 6, 9], [1, 2, 3, 5, 7, 8], [1, 2, 3, 5, 7, 9], [1, 2, 3, 5, 8, 9], [1, 2, 3, 6, 7, 8], [1, 2, 3, 6, 7, 9], [1, 2, 3, 6, 8, 9], [1, 2, 3, 7, 8, 9], [1, 2, 4, 5, 6, 7], [1, 2, 4, 5, 6, 8], [1, 2, 4, 5, 6, 9], [1, 2, 4, 5, 7, 8], [1, 2, 4, 5, 7, 9], [1, 2, 4, 5, 8, 9], [1, 2, 4, 6, 7, 8], [1, 2, 4, 6, 7, 9], [1, 2, 4, 6, 8, 9], [1, 2, 4, 7, 8, 9], [1, 2, 5, 6, 7, 8], [1, 2, 5, 6, 7, 9], [1, 2, 5, 6, 8, 9], [1, 2, 5, 7, 8, 9], [1, 2, 6, 7, 8, 9], [1, 3, 4, 5, 6, 7], [1, 3, 4, 5, 6, 8], [1, 3, 4, 5, 6, 9], [1, 3, 4, 5, 7, 8], [1, 3, 4, 5, 7, 9], [1, 3, 4, 5, 8, 9], [1, 3, 4, 6, 7, 8], [1, 3, 4, 6, 7, 9], [1, 3, 4, 6, 8, 9], [1, 3, 4, 7, 8, 9], [1, 3, 5, 6, 7, 8], [1, 3, 5, 6, 7, 9], [1, 3, 5, 6, 8, 9], [1, 3, 5, 7, 8, 9], [1, 3, 6, 7, 8, 9], [1, 4, 5, 6, 7, 8], [1, 4, 5, 6, 7, 9], [1, 4, 5, 6, 8, 9], [1, 4, 5, 7, 8, 9], [1, 4, 6, 7, 8, 9], [1, 5, 6, 7, 8, 9], [2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 8], [2, 3, 4, 5, 6, 9], [2, 3, 4, 5, 7, 8], [2, 3, 4, 5, 7, 9], [2, 3, 4, 5, 8, 9], [2, 3, 4, 6, 7, 8], [2, 3, 4, 6, 7, 9], [2, 3, 4, 6, 8, 9], [2, 3, 4, 7, 8, 9], [2, 3, 5, 6, 7, 8], [2, 3, 5, 6, 7, 9], [2, 3, 5, 6, 8, 9], [2, 3, 5, 7, 8, 9], [2, 3, 6, 7, 8, 9], [2, 4, 5, 6, 7, 8], [2, 4, 5, 6, 7, 9], [2, 4, 5, 6, 8, 9], [2, 4, 5, 7, 8, 9], [2, 4, 6, 7, 8, 9], [2, 5, 6, 7, 8, 9], [3, 4, 5, 6, 7, 8], [3, 4, 5, 6, 7, 9], [3, 4, 5, 6, 8, 9], [3, 4, 5, 7, 8, 9], [3, 4, 6, 7, 8, 9], [3, 5, 6, 7, 8, 9], [4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6], [0, 1, 2, 3, 4, 5, 7], [0, 1, 2, 3, 4, 5, 8], [0, 1, 2, 3, 4, 5, 9], [0, 1, 2, 3, 4, 6, 7], [0, 1, 2, 3, 4, 6, 8], [0, 1, 2, 3, 4, 6, 9], [0, 1, 2, 3, 4, 7, 8], [0, 1, 2, 3, 4, 7, 9], [0, 1, 2, 3, 4, 8, 9], [0, 1, 2, 3, 5, 6, 7], [0, 1, 2, 3, 5, 6, 8], [0, 1, 2, 3, 5, 6, 9], [0, 1, 2, 3, 5, 7, 8], [0, 1, 2, 3, 5, 7, 9], [0, 1, 2, 3, 5, 8, 9], [0, 1, 2, 3, 6, 7, 8], [0, 1, 2, 3, 6, 7, 9], [0, 1, 2, 3, 6, 8, 9], [0, 1, 2, 3, 7, 8, 9], [0, 1, 2, 4, 5, 6, 7], [0, 1, 2, 4, 5, 6, 8], [0, 1, 2, 4, 5, 6, 9], [0, 1, 2, 4, 5, 7, 8], [0, 1, 2, 4, 5, 7, 9], [0, 1, 2, 4, 5, 8, 9], [0, 1, 2, 4, 6, 7, 8], [0, 1, 2, 4, 6, 7, 9], [0, 1, 2, 4, 6, 8, 9], [0, 1, 2, 4, 7, 8, 9], [0, 1, 2, 5, 6, 7, 8], [0, 1, 2, 5, 6, 7, 9], [0, 1, 2, 5, 6, 8, 9], [0, 1, 2, 5, 7, 8, 9], [0, 1, 2, 6, 7, 8, 9], [0, 1, 3, 4, 5, 6, 7], [0, 1, 3, 4, 5, 6, 8], [0, 1, 3, 4, 5, 6, 9], [0, 1, 3, 4, 5, 7, 8], [0, 1, 3, 4, 5, 7, 9], [0, 1, 3, 4, 5, 8, 9], [0, 1, 3, 4, 6, 7, 8], [0, 1, 3, 4, 6, 7, 9], [0, 1, 3, 4, 6, 8, 9], [0, 1, 3, 4, 7, 8, 9], [0, 1, 3, 5, 6, 7, 8], [0, 
1, 3, 5, 6, 7, 9], [0, 1, 3, 5, 6, 8, 9], [0, 1, 3, 5, 7, 8, 9], [0, 1, 3, 6, 7, 8, 9], [0, 1, 4, 5, 6, 7, 8], [0, 1, 4, 5, 6, 7, 9], [0, 1, 4, 5, 6, 8, 9], [0, 1, 4, 5, 7, 8, 9], [0, 1, 4, 6, 7, 8, 9], [0, 1, 5, 6, 7, 8, 9], [0, 2, 3, 4, 5, 6, 7], [0, 2, 3, 4, 5, 6, 8], [0, 2, 3, 4, 5, 6, 9], [0, 2, 3, 4, 5, 7, 8], [0, 2, 3, 4, 5, 7, 9], [0, 2, 3, 4, 5, 8, 9], [0, 2, 3, 4, 6, 7, 8], [0, 2, 3, 4, 6, 7, 9], [0, 2, 3, 4, 6, 8, 9], [0, 2, 3, 4, 7, 8, 9], [0, 2, 3, 5, 6, 7, 8], [0, 2, 3, 5, 6, 7, 9], [0, 2, 3, 5, 6, 8, 9], [0, 2, 3, 5, 7, 8, 9], [0, 2, 3, 6, 7, 8, 9], [0, 2, 4, 5, 6, 7, 8], [0, 2, 4, 5, 6, 7, 9], [0, 2, 4, 5, 6, 8, 9], [0, 2, 4, 5, 7, 8, 9], [0, 2, 4, 6, 7, 8, 9], [0, 2, 5, 6, 7, 8, 9], [0, 3, 4, 5, 6, 7, 8], [0, 3, 4, 5, 6, 7, 9], [0, 3, 4, 5, 6, 8, 9], [0, 3, 4, 5, 7, 8, 9], [0, 3, 4, 6, 7, 8, 9], [0, 3, 5, 6, 7, 8, 9], [0, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 8], [1, 2, 3, 4, 5, 6, 9], [1, 2, 3, 4, 5, 7, 8], [1, 2, 3, 4, 5, 7, 9], [1, 2, 3, 4, 5, 8, 9], [1, 2, 3, 4, 6, 7, 8], [1, 2, 3, 4, 6, 7, 9], [1, 2, 3, 4, 6, 8, 9], [1, 2, 3, 4, 7, 8, 9], [1, 2, 3, 5, 6, 7, 8], [1, 2, 3, 5, 6, 7, 9], [1, 2, 3, 5, 6, 8, 9], [1, 2, 3, 5, 7, 8, 9], [1, 2, 3, 6, 7, 8, 9], [1, 2, 4, 5, 6, 7, 8], [1, 2, 4, 5, 6, 7, 9], [1, 2, 4, 5, 6, 8, 9], [1, 2, 4, 5, 7, 8, 9], [1, 2, 4, 6, 7, 8, 9], [1, 2, 5, 6, 7, 8, 9], [1, 3, 4, 5, 6, 7, 8], [1, 3, 4, 5, 6, 7, 9], [1, 3, 4, 5, 6, 8, 9], [1, 3, 4, 5, 7, 8, 9], [1, 3, 4, 6, 7, 8, 9], [1, 3, 5, 6, 7, 8, 9], [1, 4, 5, 6, 7, 8, 9], [2, 3, 4, 5, 6, 7, 8], [2, 3, 4, 5, 6, 7, 9], [2, 3, 4, 5, 6, 8, 9], [2, 3, 4, 5, 7, 8, 9], [2, 3, 4, 6, 7, 8, 9], [2, 3, 5, 6, 7, 8, 9], [2, 4, 5, 6, 7, 8, 9], [3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4, 5, 6, 8], [0, 1, 2, 3, 4, 5, 6, 9], [0, 1, 2, 3, 4, 5, 7, 8], [0, 1, 2, 3, 4, 5, 7, 9], [0, 1, 2, 3, 4, 5, 8, 9], [0, 1, 2, 3, 4, 6, 7, 8], [0, 1, 2, 3, 4, 6, 7, 9], [0, 1, 2, 3, 4, 6, 8, 9], [0, 1, 2, 3, 4, 7, 8, 9], [0, 1, 2, 3, 5, 6, 7, 8], [0, 1, 2, 3, 5, 6, 7, 9], [0, 1, 2, 3, 5, 6, 8, 9], [0, 1, 2, 3, 5, 7, 8, 9], [0, 1, 2, 3, 6, 7, 8, 9], [0, 1, 2, 4, 5, 6, 7, 8], [0, 1, 2, 4, 5, 6, 7, 9], [0, 1, 2, 4, 5, 6, 8, 9], [0, 1, 2, 4, 5, 7, 8, 9], [0, 1, 2, 4, 6, 7, 8, 9], [0, 1, 2, 5, 6, 7, 8, 9], [0, 1, 3, 4, 5, 6, 7, 8], [0, 1, 3, 4, 5, 6, 7, 9], [0, 1, 3, 4, 5, 6, 8, 9], [0, 1, 3, 4, 5, 7, 8, 9], [0, 1, 3, 4, 6, 7, 8, 9], [0, 1, 3, 5, 6, 7, 8, 9], [0, 1, 4, 5, 6, 7, 8, 9], [0, 2, 3, 4, 5, 6, 7, 8], [0, 2, 3, 4, 5, 6, 7, 9], [0, 2, 3, 4, 5, 6, 8, 9], [0, 2, 3, 4, 5, 7, 8, 9], [0, 2, 3, 4, 6, 7, 8, 9], [0, 2, 3, 5, 6, 7, 8, 9], [0, 2, 4, 5, 6, 7, 8, 9], [0, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 9], [1, 2, 3, 4, 5, 6, 8, 9], [1, 2, 3, 4, 5, 7, 8, 9], [1, 2, 3, 4, 6, 7, 8, 9], [1, 2, 3, 5, 6, 7, 8, 9], [1, 2, 4, 5, 6, 7, 8, 9], [1, 3, 4, 5, 6, 7, 8, 9], [2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 2, 3, 4, 5, 6, 7, 9], [0, 1, 2, 3, 4, 5, 6, 8, 9], [0, 1, 2, 3, 4, 5, 7, 8, 9], [0, 1, 2, 3, 4, 6, 7, 8, 9], [0, 1, 2, 3, 5, 6, 7, 8, 9], [0, 1, 2, 4, 5, 6, 7, 8, 9], [0, 1, 3, 4, 5, 6, 7, 8, 9], [0, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9]]

In the high-order knockout, iterations that knock out the target feature should show a higher change in accuracy. Since there are no high-order interactions in my sample data, this association should dwindle at higher orders, e.g. a {3, 5} pair will have a higher value than a {3, 5, 7, 8, 9} quintuplet.

Installation


I might build a wheel later, but for now just download the dependencies (NumPy, Keras, and TensorFlow) and import the scripts.

Genome Wide Association Study & Manhattan Plot Tutorial

Using Gemma to calculate association between phenotypes and genetic variants. Manhattan plots in Python as well.

Overview

A genome-wide association study (GWAS) is a statistical analysis that identifies links between genetic variants and a given phenotype. This could be a disease[1], morphology[2], intelligence[3], etc. Essentially, we are trying to find regions on chromosomes across the entire genome that may be causal or indicative of a trait of interest.

There are many statistical methods for modelling GWA[4], but in this case I will stick to a simple linear model (a Wald test) using the software Gemma[5].

Data

The data necessary to conduct a Wald test is fairly simple: you need genetic data, preferably in plink, bimbam, or VCF format, and a phenotypic measurement. Databases like NCBI’s dbGaP and the European Genome-phenome Archive have what we need, but they restrict access to researchers only. In this example, I’ll use some data I found from a study published in Nature that is available here[6]. It contains some other data as well, but the set I will use is the Chinese pharmacogenomic data in the ./Public/Genomics folder (arbitrarily selected, tbh). This data is in plink binary format, comprising a .bed (binary genotype data), .bim (SNP site data), and .fam (phenotype and familial data) file.

The data contains 4,032 SNPs from 106 Chinese individuals. The phenotype is a binary classification for “adverse drug reaction.”[7] One thing to note is that Gemma reads the phenotype from the 6th column of the .fam file, but the data comes with it in the 5th column, so I wrote a python script to flip it. Another important detail is that the phenotype should be coded as 0 and 1 instead of the 1 and 2 it comes downloaded with.

In [1]:
import pandas as pd

df =pd.read_table('/Users/macuser/Desktop/GWASStuff/China_Pharm.fam',sep=' ',header=None)

print(df.head(10))

df = df[[0,1,2,3,5,4]]

print(df.head(10))

df.to_csv('/Users/macuser/Desktop/China_PharmFam.csv',sep=' ',index=False,header=False)
# Change the csv extension to a text extension manually
           0          1  2  3  4  5
0  M11072707  M11072707  0  0 -9  2
1  M11072306  M11072306  0  0 -9  2
2  M11081312  M11081312  0  0 -9  2
3  M11061605  M11061605  0  0 -9  1
4  M11071301  M11071301  0  0 -9  2
5  M11081715  M11081715  0  0 -9  2
6  M11080306  M11080306  0  0 -9  2
7  M11072311  M11072311  0  0 -9  2
8  M11081304  M11081304  0  0 -9  2
9  M11060903  M11060903  0  0 -9  1
           0          1  2  3  5  4
0  M11072707  M11072707  0  0  2 -9
1  M11072306  M11072306  0  0  2 -9
2  M11081312  M11081312  0  0  2 -9
3  M11061605  M11061605  0  0  1 -9
4  M11071301  M11071301  0  0  2 -9
5  M11081715  M11081715  0  0  2 -9
6  M11080306  M11080306  0  0  2 -9
7  M11072311  M11072311  0  0  2 -9
8  M11081304  M11081304  0  0  2 -9
9  M11060903  M11060903  0  0  1 -9
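The 0/1 recoding mentioned above isn't shown in this snippet; one hypothetical way to add it before the df.to_csv line (after the swap, the column labelled 5 holds the 1/2 phenotype):

df[5] = df[5].replace({1: 0, 2: 1})   # recode plink's 1/2 case/control coding to 0/1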

Running Gemma

After cleaning the data, move to the folder that contains the data of interest, call the Gemma executable from wherever it’s stored, and run the following line of code:

[GWASGemma: screenshot of the Gemma command]
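Based on the flags broken down below and the file prefixes used elsewhere in this post (China_Pharm for the plink files, China_Pharm_Out for the results), the invocation is presumably along the lines of:

gemma -bfile China_Pharm -lm 1 -o China_Pharm_Out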

To break it down:

  1. -bfile loads the plink binary data; follow this with the prefix of the data (meaning all three files must share the same prefix)
  2. -lm performs a linear model on the data; 1 denotes a Wald test
  3. -o sets the output file prefix for the results, which will include a text file with p-values and a log file

This will perform a Wald test on each site to determine whether it is significantly associated with the given phenotype, in this case adverse reaction to drugs. It will output a text file that contains a p-value for every SNP.

Manhattan Plot

A Manhattan plot is a type of scatter plot that visualizes each SNP’s p-value, or rather its -log10, against its position on the chromosome. Here is a python script to generate a Manhattan plot with matplotlib, which I tweaked from a Stack Overflow user’s answer, here:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('expand_frame_repr', False)
df = pd.read_table('/Users/macuser/Desktop/GWASStuff/output/China_Pharm_Out.assoc.txt', sep = '\t')
# print(df.head(10))

# -log10 transform of the Wald p-values for plotting
df['p_adj'] = -np.log10(df.p_wald)
df.chr = df.chr.astype('category')

# print(df.head(10))

# running index so SNPs are plotted in genome order along the x-axis
df['ind'] = range(len(df))
df_grouped = df.groupby('chr')

# print(df_grouped.head(10))

fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['#E24E42', '#008F95']
x_labels = []
x_labels_pos = []

# plot each chromosome in an alternating color and centre its x-axis label on the chromosome
for num, (name, group) in enumerate(df_grouped):
    group.plot(kind='scatter', x='ind', y='p_adj', color=colors[num % len(colors)], ax=ax, s=5)
    x_labels.append(name)
    x_labels_pos.append((group['ind'].iloc[-1] - (group['ind'].iloc[-1] - group['ind'].iloc[0])/2))

ax.set_xticks(x_labels_pos)
ax.set_xticklabels(x_labels)
ax.set_xlim([0, len(df)])
ax.set_ylim([0, 6])
ax.set_xlabel('Chromosome')
plt.xticks(fontsize = 8,rotation=60)
plt.yticks(fontsize = 8)

# xticks = ax.xaxis.get_major_ticks()
# xticks[0].set_visible(False)

plt.show()

[ManhattanPlot: Manhattan plot of -log10(p_wald) by chromosome position]

As you can see, there are some areas around the beginning of chromosome 7 and in the middle of chromosome 6 that seem to have a larger relative contribution to adverse drug reactions (tbh a p-value of 0.001 is not that significant in GWAS terms; you’re really looking for something on the order of 10^-6). The significant SNPs on chromosome 7 correspond to the gene ABCB5, a protein that participates in ATP-dependent transmembrane transport.

Outro

Alright, that’s all for today.

References

  1. Sud A, et al. Genome-wide association studies of cancer: current insights and future perspectives. Nature Reviews Cancer 17, pages 692–704 (2017).
  2. Liu F, et al. A Genome-Wide Association Study Identifies Five Loci Influencing Facial Morphology in Europeans. PLOS Genetics 8(9): e1002932.
  3. Davies G, Tenesa A, Payton A, et al. Genome-wide association studies establish that human intelligence is highly heritable and polygenic. Molecular psychiatry. 2011;16(10):996-1005.
  4. Hayes B. Overview of Statistical Methods for Genome-Wide Association Studies (GWAS). Methods Mol Bio 2013;1019:149-69.
  5. Xiang Zhou and Matthew Stephens (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics. 44: 821–824.
  6. Woie-Yuh S, et al. Establishing multiple omics baselines for three Southeast Asian populations in the Singapore Integrative Omics Study. Nature Comms 8: 653 2017.
  7. Brunham, L. R. et al. Pharmacogenomic diversity in Singaporean populations and Europeans. Pharmacogenomics J. 14, 555–563 (2014).

Competing Endogenous RNA: A Mechanism for Cancer to Regulate Gene Expression

Overview

Competing endogenous RNA (ceRNA) is a post-transcriptional regulatory mechanism that is an emerging focal point in the field of molecular biology. As our knowledge on ceRNA grows, we also uncover tumorigenic exploitations of this mechanism. In this write-up, I’ll explain ceRNA and its role in cancer.

Understanding ceRNA

ceRNA is a recently discovered phenomenon by which the expression of one RNA influences the expression of another by competitively reacting with a limited pool of microRNAs (miRNAs)[1].

miRNAs are small (~20 nucleotides in length) RNAs that bind to messenger RNAs (those destined to encode proteins) and result in their degradation or limited translation[2]. The miRNA is incorporated into a larger protein complex, RISC, which can cleave mRNA that has a region corresponding to the incorporated miRNA[3]. These corresponding regions are referred to as miRNA recognition elements (MREs).

Multiple mRNAs may carry the same MREs, creating competition between mRNAs over which one will bind to the miRNA and get destroyed. This is the process underlying ceRNA. Essentially, there is a limited number of miRNAs and a limited number of mRNAs in the cytoplasm. So, if a miRNA is used in the degradation of one mRNA, call it mRNA-A, then it cannot be used in the degradation of another mRNA, say mRNA-B. It should follow, then, that if mRNA-A is upregulated at the transcriptional level, mRNA-B will be upregulated at the translational level, because the miRNA consumed by mRNA-A is no longer available to degrade mRNA-B. In this example, mRNA-A and mRNA-B would be ceRNAs, as they compete for degradation amidst a limited pool of miRNAs.

ceRNA in Cancer

As with many normal cellular mechanisms, cancerous cells have been found to take advantage of ceRNA to increase expression of potentially tumorigenic pathways such as angiogenesis or epithelial-mesenchymal transitioning[4,5]. Hypothetically, a tumor cell could develop a mutation in a gene that acts as a ceRNA of VEGF, a signalling protein in angiogenesis. The VEGF ceRNA would then compete with the VEGF mRNA for miRNAs, resulting in less degradation of VEGF mRNA, more VEGF protein, and more angiogenesis.

ceRNAInfo

In this hypothetical example, the cancer cell is able to indirectly upregulate angiogenesis by increasing expression of a ceRNA. Note that cancer can also regulate gene expression through ceRNA in the opposite manner, that is, reducing the expression of a tumor suppressor's ceRNA so that more miRNAs are present to limit the tumor suppressor's expression. It is important to note that miRNAs can be both tumorigenic and tumor suppressive[6].

The role of ceRNA in cancer has garnered a lot of attention in recent years as sequencing throughput and computational power have increased enough to identify such interactions.

Some actual ceRNA interactions in cancer include:

  1. PTEN and CNOT6L, VAPA, and ZEB2[7]
  2. BRAF and BRAFP1[8]
  3. KRAS and KRAS1P[9]

miRNAs in Cancer Therapy

Just as cancer exploits ceRNA to control gene expression, so can medicine, via a relatively new class of drugs called antimiRs. AntimiRs act similarly to ceRNAs in that they "sponge" miRNAs that may lead to tumorigenesis by binding to tumor suppressor mRNAs and causing their degradation. One such example is anti-miR-21. Its target, miR-21, is a microRNA that targets a number of tumor suppressors, such as PTEN (mentioned above)[10], and, consequently, is commonly found in a variety of cancer types[11]. Intuitively, anti-miR-21 works by sponging up miR-21 in cells, preventing it from binding to tumor suppressor mRNAs and allowing the tumor suppressors to do their job.

Outro

The role of ceRNA in cancer is just emerging, and more studies are identifying these networks. With each discovery of a ceRNA network, a potential therapeutic target is also identified. Understanding how cancer takes advantage of cellular pathways leads to more precise, personalized treatment of cancer and a better prognosis.

References

  1. Kartha RV, Subramanian S. Competing endogenous RNAs (ceRNAs): new entrants to the intricacies of gene regulation. Frontiers in Genetics. 2014;5:8. doi:10.3389/fgene.2014.00008.
  2. Eulalio , et al. Getting to the root of miRNA-mediated gene silencing. Cell. 2008;Jan 11;132(1):9-14. DOI:10.1016/j.cell.2007.12.024.
  3. Pratt AJ, et al. The RNA-induced silencing complex: A versatile gene-silencing machine. J of Bio Chem. 2009;284 (27):17897-17901. doi:10.1074/jbc.R900012200.
  4. Gao S, et al. IGF1 3’UTR functions as a ceRNA in promoting angiogenesis by sponging miR-29 family in osteocarcinoma. J mol histo. 2016 Apr;47(2):135-43. doi: 10.1007/s10735-016-9659-2.
  5. Chang H, Liu Y, Xue M, et al. Synergistic action of master transcription factors controls epithelial-to-mesenchymal transition. Nucleic Acids Research. 2016;44(6):2514-2527. doi:10.1093/nar/gkw126.
  6. Baohong Z, et al. microRNAs as oncogenes and tumor suppressors. Dev Bio. 2007 Feb; 302:1 1-12. doi: doi.org/10.1016/j.ydbio.2006.08.028.
  7. Tay Y, et al. Coding-independent regulation of the tumor suppressor PTEN by competing endogenous mRNAs. Cell 2011;147(2):344-357. doi:10.1016/j.cell.2011.09.029.
  8. Florian K, et al. The BRAF pseudogene functions as a competitive endogenous RNA and induces lymphoma in vivo. Cell 2015;161(2):319-331. doi:10.1016/j.cell.2015.02.043.
  9. Poliseno L, et al. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 2010;465(7301):1033-1038. doi:10.1038/nature09144.
  10. Meng F, et al. MicroRNA-21 regulates expression of the PTEN tumor suppressor gene in human hepatocellular cancer. Gastroenterology 2007; 133 (2): 647-58. doi:10.1053/j.gastro.2007.05.022.
  11. Feng Y-H, Tsao C-J. Emerging role of microRNA-21 in cancer. Biomedical Reports. 2016;5(4):395-402. doi:10.3892/br.2016.747.

Linkage Disequilibrium, Human Population Diversity, and the Hemoglobin Gene

Introduction

Linkage disequilibrium is a somewhat rudimentary, but effective, way to measure the evolution of a population. It is, essentially, a quantification of the association between alleles (which, in this case, includes single nucleotide variants) as they exist in a population. In a fixed, non-evolving population there should, theoretically, be no association, and all sites should assort randomly into the gamete pool. However, if there is selective pressure on a population, one or many sites may confer an advantage. These would appear more frequently, and a statistical "dependence" would be observed. By observing this link, one can assume, a priori, that the population is evolving, that is, that some selective pressure is acting on the population to elicit non-random assortment of alleles.

One factor that may impose such selective pressure on human populations is disease[1], which is why there is so much variance at the HLA locus[2]. Malaria is one such disease that has led to increased rates of sickle cell disease in African populations[3]. What malaria (or, really, Plasmodium falciparum) does is infect red blood cells, where it multiplies until the cell ruptures[4]. As a result, African populations developed mutations in their hemoglobin that prevent malaria from infecting RBCs, but cause sickle cell disease[5]. I have a theory that an African population will have more linkage disequilibrium among SNPs in the hemoglobin beta gene than, say, a European population.

Phenom-Malaria-parasite-631
Photo Credit: Smithsonian Magazine

In this markdown, I will walk through the process of mapping linkage disequilibrium using an open data source, Ensembl (1000 Genomes)[6], and R.

The Data

I got this data from Ensembl using their Data Slicer app. The gene of interest, HBB, is located on chromosome 6 at 29909037-29913639. In this study, I will use Kenya for the African population, as it is the population with the highest malaria infection rate that is included in Ensembl[7], and Finland for the comparative population, because it seems pretty far from Kenya. After downloading a variant call format (VCF) file, I used vcftools[8] to convert the files to a "0-1-2" matrix. This software vectorizes each site: instead of "A/A" or "A/C", each genotype becomes 0, 1, or 2.
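
The conversion itself is a one-liner per population. The input VCF name below is just a placeholder for whatever the Data Slicer export was called; the output prefix matches the files read into R further down:

vcftools --vcf LWK_slice.vcf --012 --out Data/LWK/HLAA_LWK_Matrix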

From here, I can use R to map the disequilibrium.

Mapping Linkage Disequilibrium

Loading some packages that will help with the analysis:

library(snpStats)
library(Matrix)
library(LDheatmap)

Afterwards, convert to a SNPMatrix that the LDHeatmap can read:

Dat.Fin = read.table("Data/FIN/HLAA_FIN_Matrix.012") # Read Matrix as table
Dat.Fin = Dat.Fin[,-1] # Remove Index Column
Dat.Fin = as.matrix(Dat.Fin) #Convert table back to Matrix
Dat.Fin = as(Dat.Fin,"SnpMatrix") #Convert Matrix to SNPMatrix

# Repeat for Kenyan Population

Dat.Kenya = read.table("Data/LWK/HLAA_LWK_Matrix.012")
Dat.Kenya = Dat.Kenya[,-1]
Dat.Kenya = as.matrix(Dat.Kenya)
Dat.Kenya = as(Dat.Kenya,"SnpMatrix")

The vcftools software also outputs position information in a separate file, which I will use as a reference for genetic distance (it should be the same sites for both populations, so I only have to do this once; you could use this to label the SNPs, but I think it would look like clutter).

labels = readLines("Data/FIN/HLAA_FIN_Matrix.012.pos") #Read position information
labels = substr(labels, start=3, stop=11) #Parse out ASCII
labels = strtoi(labels) # Convert to integer

Now Generate the heatmaps:

color_spectrum = colorRampPalette(c("Red", "Yellow")) #Create a color spectrum

LDheatmap(Dat.Fin,
          genetic.distances = labels,
          color = color_spectrum(5),
          title = "Finnish HBB LD")

LDheatmap(Dat.Kenya,
          genetic.distances = labels,
          color = color_spectrum(5),
          title = "Kenyan HBB LD")

FinHBB

KenyaHBB

Discussion

Overall, I see less difference than I thought I would. There are more low/moderately linked sites (yellow) in the Kenyan population, but highly associated sites (red) are about the same. Maybe the hemoglobin gene is so well conserved that sickle cell disease is the result of a single point mutation…? Perhaps you can try this out with some other loci and populations and see if you find anything interesting.

References

  1. Karlsson, Elinor K., Dominic P. Kwiatkowski, and Pardis C. Sabeti. “Natural Selection and Infectious Disease in Human Populations.” Nature reviews. Genetics 15.6 (2014): 379–393.
  2. Dendrou C. "HLA variation and disease." Nature Reviews Immunology (2018).
  3. Grosse SD, Odame I, Atrash HK, Amendah DD, Piel FB, Williams TN. Sickle Cell Disease in Africa: A Neglected Cause of Early Childhood Mortality. American Journal of Preventive Medicine. 2011;41(6):S398-S405. doi:10.1016/j.amepre.2011.09.013.
  4. Mohandas N, An X. Malaria and Human Red Blood Cells. Medical microbiology and immunology. 201(4):593-598 2012.
  5. Gouagna L, et al. “Genetic variation in human HBB is associated with Plasmodium Falciparum transmission.” Nature Genetics. 42, 328-331 2010.
  6. Ensembl 2017. Nucleic Acids Research 45 Database issue:D635-D642 2017.
  7. Center for Disease Control and Prevention. “Malaria Maps.” Malaria and Travelers.
  8. Petr Danecek, 1000 Genomes Project Analysis Group, et al. Bioinformatics, 2011

Environmental DNA Detection: New Method for Monitoring Manatees

Researchers use a new wildlife sampling technique to identify site occupancy of manatees in a more accurate, less intrusive, and cheaper manner.

Listed by the IUCN as a vulnerable species, the West Indian Manatee is a prime candidate for conservation efforts[1]. However, designating protected areas for manatees can be difficult if you don't know where they are. Aerial surveys currently conducted by the Florida Fish and Wildlife Conservation Commission are successful at near-shore population counts[2], but have trouble identifying manatees in murky waters, like brackish streams and rivers obstructed by foliage.

In response to this, scientists from the United States Geological Survey have developed a new method for detecting the presence of manatees: environmental DNA (eDNA) detection.

The pipeline for eDNA detection is as follows:

  1. A species of interest sheds tissue into its environment
  2. These tissues contain DNA
  3. The DNA contains unique signatures that can identify the existence of the species of interest

EDNA.png

In this case, researchers identified regions in the cytochrome b gene (involved in the electron transport chain) as a unique marker for the manatee genus (Trichechus), and developed a corresponding primer. The samples were then amplified via droplet digital PCR, and the results were fed through a site-occupancy model to determine the statistical probability that a manatee had been through the sampled water, based on the amount of target DNA amplified.

A common problem in the field of ecology is the link between detecting a species and a species occupying a particular area. For instance, just because you did not find a species at a site does not mean that it is absent there. Instead of modelling a site as a binary presence/absence label, site-occupancy models estimate a fluid probability that a species occupies a site based on repeated surveying.

In this case, the researchers found the West Indian Manatee occupying a variety of habitats including the Wakulla River, Lake Wimico, and Guantanamo Bay. More importantly, in my opinion, they established eDNA detection as an effective method for surveying wildlife populations.

References

IUCN 2017. The IUCN Red List of Threatened Species. Version 2017-3.  available from: http://www.iucnredlist.org

Florida Fish and Wildlife Conservation Commission. Manatee Aerial Surveys. Population Monitoring. Available From: http://myfwc.com/research/manatee/research/population-monitoring/aerial-surveys/

Original Study

Hunter M, et al. Surveys of environmental DNA (eDNA): a new approach to estimate occurrence in Vulnerable manatee populations. Endangered Species Research.  March, 2018. https://doi.org/10.3354/esr00880.

How Genetic Similarity is Calculated and a Little on What the NASA Twin Study Really Means

You've probably heard statistics claiming that human and chimpanzee DNA is over 99% similar, or that humans share 50% of their DNA with a banana, or whatever the case. More recently, you might have heard news articles from sources such as CNN[1] or Time[2] claim that an astronaut's (Scott Kelly's) DNA changed 7% compared to his twin after spending a year in space. Obviously, this isn't true. What the articles meant to say, or should have said, was that 7% of the astronaut's genes were differentially expressed compared to his twin's.

space-shuttle-992_1280

 

Essentially, environmental factors, such as light or chemicals, can cause the body (or its cells) to produce more of a particular gene product in response. For instance, turtle embryos exposed to heat upregulate sox9, a gene involved in sex determination[3]. The NASA twin study write-up (the full study is not published yet, I believe) in particular noted that IGF-1, a protein related to bone and muscle density, steadily increased over the one-year stay in space[4].

This study, from what I can gather without having access to the full methods, looked at the astronaut's expression levels for every gene, compared them to the expression levels of his twin, and noted that 7% of the genes had differential expression levels after landing.

The buzz that these headlines garnered made me think that an explanation as to how genetic similarity is calculated may be interesting if not useful. So here’s a write-up on how researchers come up with statistics like I mentioned in the opening paragraph.

Gene Orthology and Alignment

When comparing sequences between two species, you cannot approach it like you were reading a book. Over the course of speciation, genes can move (shoutout to Barbara)[5], chromosomes can fuse and break, etc. One cannot just grab a chromosome from a chimp and start comparing base pairs to another arbitrary chromosome from a human.

It'd be like a school question that asks you to list an order of events: you mess one up, causing the subsequent answers to be off by one, and the teacher marks all of them wrong. A more biological example would be to say that humans are only 23/24ths similar to chimps because we have one less chromosome than they do, even though our chromosome 2 is simply a fusion and is present, almost identically, in chimps as two separate chromosomes[6].

There has to be some sort of alignment such that you are reading off base pairs from orthologous genes, that is, genes in different species that evolved from a common ancestral gene by speciation. This is accomplished through the use of an alignment algorithm, like BLAST[7], which finds regions of similarity between DNA sequences.
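
To give a feel for what "finding regions of similarity" means, here is a toy sketch using Biopython's pairwise2 aligner on made-up sequences. This is my own illustration, not the tooling from any of the studies cited here; BLAST itself is a heuristic command-line tool built for searching whole databases.

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Made-up sequences standing in for an orthologous region in two species
human = "ACGTTGCAACGT"
chimp = "ACGTAGCAACGT"

# Simple global alignment scoring 1 per matching base and 0 otherwise;
# the aligner lines the sequences up before any base-by-base comparison
best = pairwise2.align.globalxx(human, chimp)[0]
print(format_alignment(*best))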

vpbDk
Simplified diagram of homology subtypes

Know Which Parts of the Genome Are Considered

Another thing to consider is which genes are present in the analysis. Is it just transcribed regions? What about just protein-coding regions? Over 99% of the genome is not translated[8] (converted to proteins), and only ca. 10% even has a known biological function (at least, we haven't found out what the rest does yet)[9]. Are the researchers taking into consideration the entire genome, or just parts of it? There is no single correct way of doing this; it's just something that needs to be considered. For instance, the mouse genome is 85% similar to the human genome when considering protein-coding regions only, but only 50% similar when considering non-coding regions[10].

Consensus and Polymorphism

As we know, there is genetic variation within species, not just between them. Not every human is alike, nor is every chimp, or mouse, or whatever. So do geneticists just pick a random individual from each species of interest and use their genomes for comparison? Typically, no. They generate a consensus sequence, a sort of sampling whereby the modal nucleotide at each site is considered the "correct", or consensus, nucleotide for the entire sample.

Once a consensus sequence is derived and the genes are aligned, then you can start reading the DNA, nucleotide by nucleotide, and identifying single nucleotide polymorphisms (SNPs). A SNP is a position where a nucleotide varies from the reference. Imagine you're reading across a sequence and you see a guanine. You look at the same spot on another sequence and see a cytosine; BOOM! you've got a SNP and a tally for genetic dissimilarity.
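
To make that concrete, here is a minimal Python sketch with made-up toy sequences (not real genomic data) showing how a modal consensus and a per-site SNP tally turn into a percent-similarity figure:

from collections import Counter

# Toy aligned sequences from four individuals of one species
sample = ["ACGTACGT",
          "ACGTACGA",
          "ACCTACGT",
          "ACGTACGT"]

# Consensus: the modal nucleotide at each aligned position
consensus = "".join(Counter(col).most_common(1)[0][0] for col in zip(*sample))

# Compare the consensus against a reference sequence from another species and tally SNPs
reference = "ACGTACCT"
snps = sum(1 for a, b in zip(consensus, reference) if a != b)
similarity = 1 - snps / len(reference)

print(consensus)                   # ACGTACGT
print(snps, round(similarity, 3))  # 1 0.875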

Looking at Some Frequently Cited Statistics

Here I’m going to take a look at the methods behind the studies of some popular statistics. Let’s begin:

  1. “Humans and Chimpanzees share 98% of Their DNA.” – This study from The Chimpanzee Sequence and Analysis Consortium[11] used BLASTZ aligned sequences from a couple chimps. They removed small (<15kb) insertions and deletions, and only compared 13,454 orthologous genes out of the 19,277 annotated human genes.
  2. “Humans are 99.9% Genetically Identical to Each Other.”[12] – This statistic comes from The Thousand Genomes Project Consortium and compares 2,504 individuals from 26 populations. The entire genome is sampled and they found that an average individual has about 4.1 million SNPs. This works out to around 99.87% similarity, considering the human genome is about 3,000,000,000 base pairs in length.
  3. “Humans share 50% of their DNA with bananas.” I cannot find a primary source on this :/ .

In Summary

Hopefully, this write-up gave some insight into what it means for species to be genetically similar to each other. Concerning the NASA twin study: no, Kelly's DNA did not change after being in space for a year. Outside of an insignificant number of mutagenic events, which happen at the level of individual cells and not across the genome as a whole, his DNA remained the same. One interesting finding from the study, however, is that his telomeres became significantly longer while in space, but returned to normal after landing. Overall, a 7% difference in gene expression is what we saw, but I guess that's not as interesting as suggesting that he is no longer twins with his twin[13], o_O .

References

  1. https://www.cnn.com/2018/03/14/health/scott-kelly-dna-nasa-twins-study/index.html
  2. http://time.com/5201064/scott-kelly-mark-nasa-dna-study/
  3. Gilbert SF. “Environmental Sex Determination”. Developmental Biology. 6th edition. 2000. Available from: https://www.ncbi.nlm.nih.gov/books/NBK9989/
  4. Edwards M. “NASA Twin Study Confirms Preliminary Findings”. NASA Human Research Strategic Communications. 2018 Available From: https://www.nasa.gov/feature/nasa-twins-study-confirms-preliminary-findings
  5. Ravindran S. "Barbara McClintock and the Discovery of Jumping Genes". Proceedings of the National Academy of Sciences. 2012.
  6. Yunis JJ. "The Origin of Man: A Chromosome Pictorial Legacy". American Association for the Advancement of Science. 1982. https://doi.org/10.1126/science.7063861
  7. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. 1990.
  8. International Human Genome Sequencing Consortium (Feb 2001). “Initial sequencing and analysis of the human genome”. Nature. 409 (6822): 860–921. doi:10.1038/35057062.
  9. Ponting CP. “What Fraction of the Human Genome is Functional?”. Genome Research. 2011.
  10. National Human Genome Research Institute. “Why Mouse Matters”. Mouse Sequencing Consortium. 2000. Available From: https://www.genome.gov/10001345/importance-of-mouse-genome/
  11. The Chimpanzee Sequence and Analysis Consortium. “Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome”. Nature 437:69-87. 2005. doi:10.1038/nature04072
  12. The 1000 Genomes Project Consortium. “A global reference for human genetic variation.” Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.
  13. https://www.independent.ie/world-news/north-america/nasa-twins-no-longer-identical-after-space-flight-alters-dna-36711004.html

Photo Credit

R.A. Jensen, Orthologs and paralogs – we need to get it right, Genome Biology, 2001.Figure 1. Available From: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138949/

Tracking the Flu with Twitter & Machine Learning

Scraping, geolocating, and classifying tweets to survey cases of the flu.

Overview

With the amount of shit people feel the need to share with the internet, I am wondering if I can use Twitter to survey cases of the flu. To do this, I can use a Twitter API wrapper, tweepy, to scrape tweets and their locations based on key words. However, I will need a way to determine if a tweet is referencing a case of the flu, or is using the word in some other context. That's where the machine learning comes in. In this notebook, I use a multinomial Naive Bayes classifier to pinpoint cases of the flu self-reported over Twitter.

network-3154913_1920

Training Data

Here, I'm going to stream all tweets with the word "flu" in them, cut out some of the baggage attached to those tweets, then append the tweets to a CSV that I'll later go through and classify. Once the tweets start rolling in, you can get an idea of how you want to do your classifications.

I ended up having three categories: "Accept", "Reject", and "Other". The "Accept" category is for the tweets I want, the tweets that admit to having the flu. I made the "Other" category because I noticed that swine and bird flu came up very frequently. I was worried that, with so many rejected tweets mentioning bird or swine flu, the classifier would end up accepting basically any tweet that did not contain "bird" or "swine".
The script below shows how I scraped and parsed the data.

In [1]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
import re
import time
import os
import sys

module_path = os.path.abspath(os.path.join('/Users/macuser/PycharmProjects/td2'))

if module_path not in sys.path:
    sys.path.append(module_path)
    
from myAccess import consumerKey, consumerSecret, accessToken, accessSecret

#AccessCodes
consumerKey = consumerKey
consumerSecret = consumerSecret
accessToken = accessToken
accessSecret = accessSecret

#Scraping/Parsing Tweets for Training
class Listener(StreamListener):  
    def on_data(self, raw_data):
        try:
            jsonData = json.loads(raw_data) #Convert Tweet data to json object
            text = jsonData['text'] #Parse out the tweet from the json object
            if not jsonData['retweeted'] and 'RT @' not in text and jsonData['lang'] == 'en': #Excludes Retweets
                text = re.sub(r"(http\S+|#\b)", "", text) #Gets rid of links and the # in front of words
                text = " ".join(filter(lambda x:x[0]!='@', text.split())) #Gets rid of the @ in front of mentions
                text = text.encode('unicode-escape').decode(encoding='utf-8').replace('\\', ' ') #Converts Emojis to a String
                text = text.replace('u2026','') #Gets rid of ellipses
                text = text.replace('"',"'") #Replaces " with ' so that it doesn't break the field when read from CSV
                print(text)
                with open('tdTraining.csv', 'a') as file: #Write to CSV
                     file.write('\n'+'"'+text+'"'+',') #Adds quotes around tweet to set the field
        except BaseException as e:
            print("Failed ondata,", str(e))
            time.sleep(0.1)
    def on_error(self, status_code):
        print(status_code)

#Access
auth = OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessSecret)

#Initiate Streaming
twitterStream = Stream(auth, Listener())
twitterStream.filter(track = ["flu"])
/Users/macuser
High fever and flu is not good at 15 xb0C, i have deceased
I have got hay fever + I've got a cold + I've got the flu
RT APHealthScience: Flu vaccine was about 42 percent effective last winter, but it did a poor job protecting older 
Could flu during pregnancy raise risk for autism? - Researchers found no evidence that laboratory-diagnosis alo...
About damn time. Now you don't have open the box and put your bird flu cover hands im the carton 
AP: RT APHealthScience: Flu vaccine was about 42 percent effective last winter, but it did a poor job protecting o 
80+ but do it in flu season and wait til he gets the flu where he kills himself after like 3 turns
Could flu during pregnancy raise risk for autism?
AP: RT APHealthScience: Flu vaccine was about 42 percent effective last winter, but it did a poor job protecting o 

Training the Classifier

I ended up classifying about 3,000 tweets over the course of about a week. Generally, your corpus should contain something like 50,000 documents, probably more for tweets since they are so short. Nonetheless, I got sick of classifying and decided to move on.

First thing to do is load our tweets and their classifications. I’ll store them in a pandas dataframe.

In [2]:
import pandas as pd
trainingData = pd.read_csv('/Users/macuser/PycharmProjects/td2/tdTraining.csv', quotechar='"')
print(trainingData.head(10))
                                                text     cat
0  Last night's thread on cat flu. For all catsof...   Other
1      Fucking flu!!!  U0001f620 U0001f612 U0001f637  Accept
2                                       this flu gtg  Accept
3  Ohhhh...its hapend to me also...flu really dis...  Accept
4  Diabetes Treatments - Top 7 Diabetes Friendly ...  Reject
5  Liked on YouTube: Number 1 Natural Remedy for ...  Reject
6  My nose is red. Flu, go away please. I need to...  Accept
7  I wish I can sleep, but this stomach flu ain't...  Accept
8  Late summer and early fall before Chef was mur...  Reject
9  No problem ever went away because it was ignor...  Reject

Now we need to build a pipeline. First, the text needs to be tokenized, which we can do with scikit-learn's CountVectorizer. The gram size is the number of words that get chunked together, e.g. for

“The sky is blue”

  1. unigrams: “The”, “sky”, “is”, “blue”
  2. bigrams: “The sky”, “sky is”, “is blue”
  3. trigrams: “The sky is”, “sky is blue”

Generally, a gram size of two is a good trade-off between accuracy and CPU cost, but play around with it nonetheless.
Second, convert word counts to frequencies with TfidfTransformer. This accounts for the over-appearance of words in large bodies of text. This is not a huge deal for us since tweets are limited to 140 characters, but it is almost always done when building text classifiers and it still improves accuracy by a few percentage points.
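
If you want to see exactly which tokens a given gram size produces, a quick standalone check like this (separate from the pipeline below) does the trick:

from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams for a single toy sentence
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(["The sky is blue"])
print(vect.get_feature_names())  # get_feature_names_out() in newer scikit-learn versions
# ['blue', 'is', 'is blue', 'sky', 'sky is', 'the', 'the sky']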

Next is choosing the algorithm and, subsequently, its parameters. Most text classifiers use either the multinomial Naive Bayes algorithm or a support vector classifier; others to experiment with include random forests, Bernoulli Naive Bayes, or stochastic gradient descent.

Since I'm using the mNB classifier, I'll talk about it some. The mNBc looks at each token independently and associates it with a class based on the frequency with which it appears in that class. This factors into the conditional probability: the likelihood that a text belongs to a class given the overall association its tokens have with that class. The other thing the mNBc takes into account is the prior distribution of classes. This simply means that the classifier will be biased toward assigning texts to classes that appeared more often than others in the training data.

Scikit-learn's MultinomialNB will automatically calculate priors, but I found that my custom priors led to 2-3% higher accuracy. The other parameter, alpha, is a smoothing parameter that adds an equal and small probability for never-before-seen words to associate with all classes.
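
As a toy illustration of what the priors do (a made-up four-tweet corpus, not my training data), you can watch the predicted probabilities shift as class_prior changes:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["i have the flu", "i caught the flu today",
        "flu shots available here", "get your flu shot"]
cats = ["Accept", "Accept", "Reject", "Reject"]

X = CountVectorizer().fit_transform(docs)

# None lets MultinomialNB fit the priors from the data; the second run skews them toward Reject.
# Classes are ordered alphabetically, so the prior list is [Accept, Reject].
for priors in (None, [0.2, 0.8]):
    clf = MultinomialNB(alpha=1, class_prior=priors).fit(X, cats)
    print(priors, clf.predict_proba(X[:1]))  # probabilities for "i have the flu"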

Here’s the full pipeline for the classifier:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB(alpha=1,class_prior=[.27,.45,.28]))])

Now split the data into training and testing sets, then train/test the classifier.

In [4]:
from sklearn import cross_validation
featureTrain, featureTest, labelTrain, labelTest = cross_validation.train_test_split(
    trainingData['text'], trainingData['cat'],test_size=0.20)
fit = text_clf.fit(featureTrain, labelTrain)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Getting the results.

In [5]:
accuracy = text_clf.score(featureTest, labelTest)
predictions = text_clf.predict(featureTest)
crossTable = pd.crosstab(labelTest, predictions, rownames=['Actual:'], colnames=['Predicted:'], margins=True)
falsePos = sum(crossTable['Accept'].drop('All').drop('Accept')) / crossTable['Accept']['All']
falseNeg = (crossTable['Other']['Accept'] + crossTable['Reject']['Accept'])/\
           (crossTable['Other']['All'] + crossTable['Reject']['All'])
print("Accuracy:",accuracy)
print(crossTable)
print("False Positives:",falsePos,'\n',"False Negatives:",falseNeg)
Accuracy: 0.794019933555
Predicted:  Accept  Other  Reject  All
Actual:                               
Accept         146     10      22  178
Other           13    161      30  204
Reject          27     22     171  220
All            186    193     223  602
False Positives: 0.215053763441 
 False Negatives: 0.0769230769231

I like to throw all this in a for loop to get an average accuracy over n runs and add a confusion matrix.

In [6]:
accuracy = []
fP = []
fN = []
for i in range(100):
    featureTrain, featureTest, labelTrain, labelTest = cross_validation.train_test_split(
        trainingData['text'], trainingData['cat'], test_size=0.20)
    fit = text_clf.fit(featureTrain, labelTrain)
    test = text_clf.score(featureTest, labelTest)
    predictions = text_clf.predict(featureTest)
    crossTable = pd.crosstab(labelTest, predictions, rownames=['Actual:'], colnames=['Predicted:'], margins=True)
    falsePos = sum(crossTable['Accept'].drop('All').drop('Accept')) / crossTable['Accept']['All']
    falseNeg = (crossTable['Other']['Accept'] + crossTable['Reject']['Accept'])/\
           (crossTable['Other']['All'] + crossTable['Reject']['All'])
    accuracy.append(test)
    fP.append(falsePos)
    fN.append(falseNeg)

print("Accuracy:",sum(accuracy)/float(len(accuracy)))
print("False Positives:",sum(fP)/float(len(fP)))
print("False Negatives:",sum(fN)/float(len(fN)))
Accuracy: 0.799368770764
False Positives: 0.218822638591
False Negatives: 0.0650916583851

Roughly 80%; that's OK considering our small sample size (one of the benefits of the mNBc is that it works well on small sample sizes). You can use the false positives/negatives to tweak your priors. If you think there are too many false positives for a particular class, lower the prior distribution for that class.

Let’s test out the classifier.

In [7]:
text = "I have the flu" #Should be.......Accept
text2 = "My sister has the flu"#.........Reject
text3 = "I had the flu last month."#.....Reject
text4 = "Top 10 ways to treat the flu."#.Reject
text5 = "Dog flu outbreak."#.............Other
print(text_clf.predict([text,text2,text3,text4,text5]))
['Accept' 'Reject' 'Reject' 'Other' 'Other']

Looks like the classifier is working pretty well, but I did throw it some softball questions.
The next step is to save the classifier to a pickle object so I don't have to retrain it.

In [8]:
import pickle
saveClassifier = open("tdMNB.pickle","wb")
pickle.dump(fit, saveClassifier)
saveClassifier.close()

Now that we have a trained classifier, we can start classifying some tweets.

Streaming Tweets

Now we are at the point where we can start filling out our database. The following script is similar to the "Training Data" script, only now we need to get location information, classify the tweet, and save to a database instead of a CSV. I'm using a module called geopy to find latitude and longitude coordinates from the location information provided with the tweet. If the location is valid and in the US, we then classify the tweet. If it returns an "Accept" classification, it is saved to a SQLite database with its coordinates, the datetime of the tweet, and the tweet ID (for cross-referencing and as a primary key).

Edit: I did this about 8 months ago and have no idea why I used a sql db over a csv, lol.

In [9]:
from tweepy import Stream, OAuthHandler
from tweepy.streaming import StreamListener
import os
import sys

module_path = os.path.abspath(os.path.join('/Users/macuser/PycharmProjects/td2'))

if module_path not in sys.path:
    sys.path.append(module_path)
    
import time, json, sqlite3, re, myAccess
from geopy.geocoders import Nominatim
from time import mktime
from datetime import datetime
import pickle

#ConnectingDB
conn = sqlite3.connect('twitterdemiologyDB.db')
c = conn.cursor()

#CreatingTable
c.execute('CREATE TABLE IF NOT EXISTS Main(id TEXT, date DATETIME, lat TEXT, lon TEXT)')

#AccessCodes
consumerKey = myAccess.consumerKey
consumerSecret = myAccess.consumerSecret
accessToken = myAccess.accessToken
accessSecret = myAccess.accessSecret

#Loading Classifier
classifierF = open("tdMNB.pickle","rb")
classifier = pickle.load(classifierF)
classifierF.close()

#Scraping/ParsingTweets
class Listener(StreamListener):

    def on_data(self, raw_data):
        try:
            jsonData = json.loads(raw_data)
            #Converting date to datetime format:
            date = jsonData['created_at']
            date2 = str(date).split(' ')
            date3 = date2[1]+' '+date2[2]+' '+date2[3]+' '+date2[5]
            datetime_object = time.strptime(date3, '%b %d %H:%M:%S %Y')
            dt = datetime.fromtimestamp(mktime(datetime_object))
            #Parsing out ID, the tweet itself, and location:
            tweetID = jsonData['id_str']
            pretweet = jsonData['text']
            userInfo = jsonData['user']
            location = userInfo['location']
            if jsonData['lang'] == 'en' and location != 'Midwest' and location!= 'Whole World' and location != 'Earth':
                #print(dt, pretweet, location)
                geolocator = Nominatim()
                geolocation = geolocator.geocode(location)
                try:
                    #The 2-5 len range helps to remove inaccurate/unspecific locations
                    if "United States of America" in geolocation.address[::] and 5>= len(geolocation.address.split(",")) > 2 :
                        lat = geolocation.latitude
                        lon = geolocation.longitude
                        print(geolocation.address, '\n', lat, lon)
                        if not jsonData['retweeted'] and 'RT @' not in pretweet:
                            tweet = re.sub(r"(http\S+|#\b)", "", pretweet)
                            tweet = " ".join(filter(lambda x: x[0] != '@', tweet.split()))
                            tweet = str(tweet.encode('unicode-escape')).replace('\\', ' ')
                            print(tweet)
                            classification = classifier.predict([tweet])[0]
                            if classification == 'Accept':
                                print("Tweet Accepted")
                                c.execute('INSERT INTO Main(id, date, lat, lon) VALUES(?,?,?,?)',
                                          (tweetID, dt, lat, lon))
                                conn.commit()
                            else:
                                print("Tweet Rejected")
                        else:
                            print("Is a Retweet")
                    else:
                        print("Location not in USA")
                except Exception as e:
                    print("Invalid locaion:", str(e))
        except BaseException as e:
            print("Failed ondata,",str(e))
            time.sleep(0.1)
    def on_error(self, status_code):
        print(status_code)

#Access
auth = OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessSecret)

#InitiateStreaming
twitterStream = Stream(auth, Listener())
twitterStream.filter(track=['flu'])
/Users/macuser
Location not in USA
Location not in USA
Location not in USA
Location not in USA
Invalid locaion: 'NoneType' object has no attribute 'address'

Most of the tweets that come in are rejected due to invalid locations. Of the valid locations, about 30% are classified as "Accept" and enter the database. I get about 100 usable tweets/day (it's currently summer, so I'll probably get a lot more come winter).

Mapping the Data

What I'm going to do here is plot the data onto a heatmap. This loop will create heatmaps one day at a time, each containing the past 7 days of data. I used the module gmplot to create the heatmaps.

In [10]:
import sqlite3
import gmplot
from datetime import timedelta
from datetime import datetime



#ConnectingToDB
conn = sqlite3.connect('twitterdemiologyDB.db')
c = conn.cursor()

#QueryingData
scanner = str('2017-06-22 10:59:25')
for i in range(10): #Number of days that the data spans. I just put 10 because I don't want a
                    #bunch of html files in my dir (I already made them)
    tail = str(datetime.strptime(scanner, '%Y-%m-%d %H:%M:%S') - timedelta(hours=168))
    c.execute('SELECT lat, lon FROM Main WHERE date > "{x}" AND date < "{y}"'.format(x=tail,y = scanner))
    latArray = []
    lonArray = []
    for row in c.fetchall():
        if row[0] and row[1]:
            latArray.append(float(row[0]))
            lonArray.append(float(row[1]))
    #MappingData
    file = 'flu' + str(scanner) + '.html'
    gmap = gmplot.GoogleMapPlotter.from_geocode("United States")
    gmap.heatmap(latArray, lonArray)
    gmap.draw(file)
    scanner = str(datetime.strptime(scanner, '%Y-%m-%d %H:%M:%S') + timedelta(hours=24))

Now I want to take these html files and make a gif out of them to show a timelapse of flu cases over time. There are packages, such as imgkit, that will take a screenshot of html files and save them as PNGs, but I could not get them to work with the gmplot maps (they worked for other html files, though), so I just manually screenshotted them with a Chrome add-on.
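
If you do end up with a folder of PNG screenshots, a small imageio script like this will stitch them into a GIF (the folder and filenames here are placeholders for wherever you saved the screenshots):

import glob
import imageio

# Read the screenshots in date order and write them out as an animated GIF
frames = [imageio.imread(path) for path in sorted(glob.glob('screenshots/flu*.png'))]
imageio.mimsave('flu_timelapse.gif', frames, duration=0.8)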

Here is the final result:

tdCrop

 

Environment vs Inheritance – Air Pollutants Account for More Variation in Gene Expression than Ancestry, Transcriptome Analysis Reveals

Researchers at the Ontario Institute for Cancer Research use RNA-sequence methods to generate transcription profiles of over 1000 French Canadians and find correlations between a number of environmental factors, chiefly air composition.

Overview

Advancements in molecular sequencing allow for high-throughput analysis of genomic information[1]. As a result, clinics around the world are using sequencing methods, coupled with clinical history, to generate profiles and find links between genetics and illness[2]. In this study, researchers compare environmental vs ancestral associations with transcriptional profiles.

Transcriptional Profiling

A person's transcriptome is the sum total of the RNA that their cells produce. It is, loosely, a way of quantifying your body's production (or transcription) of a particular gene. Transcriptional activity is something that can be turned on or off, like a switch, in response to a number of factors. Commonly, these factors are grouped into two categories, genetic and environmental. Genetic, or ancestral, factors boil down to a person's DNA. Simply put, some people may naturally have higher expression of a gene because it is a trait they inherited. It is, outside of mutagenesis, not malleable, but it still contributes to the expression profile. Environmental factors are any number of external stimuli (chemical, radiological, auditory, etc.) that affect the rate at which the body produces, or transcribes, a particular gene. A key aspect of this study is to determine whether environmental factors contribute more to a person's transcriptional profile than their ancestry.

SO2Transcriptome

Environmental vs Ancestral Effects on Transcriptome

The study sampled individuals from 3 locations with varying environmental dynamics (Montreal, Quebec City, and Saguenay) and determined ancestry via genotyping of ca. 2 million SNPs. Essentially, individuals from each locale were cohorted by regional ancestry, i.e.:

Montreal: Native, Quebec City migrant, Saguenay migrant

Quebec City: Native, Montreal migrant, Saguenay migrant

Saguenay: Native, Montreal migrant, Quebec City migrant

Principal component analysis on the transcriptomes revealed that a person's ancestry had marginal contributions to their profile in comparison to locality. For example, two people with Montreal ancestry living in different cities would differ more in their profiles than two people living in Montreal, one with Quebec City ancestry and one with Montreal ancestry. Overall, 170 differentially expressed genes were identified between locations.

figure2
Figure 2 from original study showing variation in expression is greater between environmental cohorts than ancestral ones.

Sulfur Dioxide is a Chief Contributor in Transcriptome Variation

The 3 locations selected have particular environmental conditions, due to a number of factors such as population density and urbanization. This study considered several exposures ranging from number of parks, to food availability, to atmospheric composition. Co-inertia analysis on randomly separated cohorts of these environmental exposures pinpointed sulfur dioxide as a chief contributor to transcriptome variation.

Emitted as a byproduct from the combustion of fossil fuels, SO2 levels are higher in industrialized zones, like Montreal, in this case. Health effects include respiratory problems and lung damage[3], but what does this look like at the molecular level?

Further experimentation involving direct exposure of blood samples to controlled levels of SO2 pinpointed the associated expression changes. Within the 170 differentially expressed genes, a number of affected pathways were identified, chiefly those involved with the transport of oxygen or white blood cells, including the hemoglobin complex, chemokine-mediated signaling, and leukocyte chemotaxis.

Takeaways

The key takeaway from this study, I think, is not the correlation between pollutants and gene expression. Not that these factors are unimportant within a public health context, just that this has been established before and is well known. The principal finding for me is that environmental factors account for more transcriptome variation than hereditary ones. Within the context of the nature vs nurture paradigm, nurture gets awarded a point here.

References

  1. Reuter JA, Spacek D, Snyder MP. High-Throughput Sequencing Technologies. Molecular cell. 2015;58(4):586-597. doi:10.1016/j.molcel.2015.05.004.
  2. Casamassimi A, Federico A, Rienzo M, Esposito S, Ciccodicola A. Transcriptome Profiling in Human Diseases: New Advances and Perspectives. International Journal of Molecular Sciences. 2017;18(8):1652. doi:10.3390/ijms18081652.
  3. Kampa M, Castanas E. Human Health Effects of Air pollution. Environmental Pollution. Volume 151, Issue 2, January 2008, Pages 362-367. doi:https://doi.org/10.1016/j.envpol.2007.06.012

Original Study

Favé M-J, et.al. Gene-by-environment interactions in urban populations modulate risk phenotypes. Nature Communications 9, Article number: 827 (2018). doi:10.1038/s41467-018-03202-2

Cancer Cells Send MicroRNA In Exosomes to Macrophages, Rewiring Them for Tumorigenesis

Researchers at the National Cancer Institute in Maryland pinpoint miR-1246, a micro RNA, as a potential therapeutic target in mutant p53 colorectal cancers.

Recent studies have identified exosomal signalling as a route for cancer cells to communicate with neighboring stromal and leukocytic cells[1]. The tumors evolve to release exosomes packed with proteins and functional RNA into the extracellular matrix. Nearby cells take up the exosomes, which release their contents into the cytoplasm. Like many oncogenic phenotypes, this is an exploitation of a normal, healthy activity whereby cells communicate the need for nutrition, the presence of disease, etc., with the rest of the body[2]. Cancer cells, however, can package their exosomes with tumor-supporting agents to aid in the processes of angiogenesis, immune suppression, and metastasis.

The researchers in this study have discovered one of these agents in colorectal cancer- miR-1246.

3718

When growing macrophages and mutant p53 colorectal cancer cell lines in a membrane-separated culture, the researchers observed significant upregulation of IL-10 and VEGF in the macrophages.

These cytokines are hallmarks of tumor-supporting macrophages and have been targets of therapy in the past. IL-10 is anti-inflammatory and acts as an immunosuppressant, keeping the body from detecting and killing the cancer cells[3]. VEGF, meanwhile, is the key player in angiogenesis, the formation of blood vessels, a process that cancer cells exploit to fuel their nutritional needs while rapidly expanding. VEGF inhibitors like Avastin have already been approved for clinical use by the FDA[4], but there is always a need to find more therapeutic targets.

Using mass spectrometry and Western blotting on the exosomes yielded unique signatures in the mutant p53 cell line. Subsequent RNA extraction and profiling from clinical colorectal tumor samples identified a number of microRNAs associated with the cancer cells, the most notable being miR-1246.

This microRNA, 73 base pairs in length[5], is an emerging agent in cancer studies. Recently it has been associated with liver and lung cancers[6,7]. Now, in this study, it has been identified not only in colon cancer, but in exosomal signalling as well. The study shows that, after being taken up by the macrophages, miR-1246 likely acts as a transcription factor, or upstream regulator, of the tumorigenic proteins IL-10 and VEGF.

In addition to miR-1246, the researchers identified another therapeutic target, hnRNPA2B1, a sorting protein involved in packaging microRNA into exosomes. RNA-protein pull-down assays showed a coupling between the two agents.

The study explores the quickly evolving territory of targeting microRNAs and manipulating exosomes in cancer therapy. Being able to identify and knock out another route that cancer uses to thrive is key to implementing personalized medicine plans. Cancers vary wildly from patient to patient as they accumulate different mutations and exploit varying pathways to thrive. Identifying another tumorigenic agent means identifying another target for therapy. With high-throughput sequencing methods, profiling tumors and identifying targets will, in my opinion, be the future of cancer treatment.

References

  1. Soung YH, Ford S, Zhang V, Chung J. Exosomes in Cancer Diagnostics. Mok SC, ed. Cancers. 2017;9(1):8. doi:10.3390/cancers9010008.
  2. Kalyanaraman B. Teaching the basics of cancer metabolism: Developing antitumor strategies by exploiting the differences between normal and cancer cell metabolism. Redox Biology. 2017;12:833-842. doi:10.1016/j.redox.2017.04.018.
  3. Hyun-Jeong Ko, Seong-Ryeol Kim, Jae-Hyoung Song and Bo-Ra Lee. Interleukin-10 Attenuate Tumor Growth by Inhibiting Interleukin-6/Signal transducer and activator of transcription 3 Axis on Myeloid-Derived Suppressor Cells.
  4. Food and Drug Administration. FDA Approval for Bevacizumab. National Cancer Institute. Updated 2014.
  5. Gene [Database]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] –. Gene ID: 100302142, Homo sapiens MIR1246. Available from: https://www.ncbi.nlm.nih.gov/nuccore/NM_001349333.1.
  6. Chai S, et al. Octamer 4/microRNA-1246 signaling axis drives Wnt/β-catenin activation in liver cancer stem cells. American Association for the Study of Liver Diseases: Hepatology. 2016. doi:10.1002/hep.28821.
  7. Dexiao Yuan, et al. Extracellular miR-1246 promotes lung cancer cell proliferation and enhances radioresistance by directly targeting DR5. Oncotarget. 2016. doi:10.18632/oncotarget.9017.

Original Study

Tomer Cooks, Ioannis S. Pateras, Lisa M. Jenkins, Keval M. Patel, Ana I. Robles, James Morris, Tim Forshew, Ettore Appella, Vassilis G. Gorgoulis & Curtis C. Harris. Mutant p53 cancers reprogram macrophages to tumor supporting macrophages via exosomal miR-1246. Nature Communicationsvolume 9, Article number: 771 (2018) . doi:10.1038/s41467-018-03224-w

An Endangered Animal For Every State

Fun little infographic I made. Full disclosure, apart from being listed as threatened/endangered, these choices were fairly arbitrary. I tried to get a good mix of phyla (obviously biased towards mammals lol).

EndangeredInfo1
EndangeredInfo2
EndangeredInfo3
EndangeredInfo4
EndangeredInfo5

Photo Credit:

1. AL Carlton Ward Jr
2. AK Ray Bulson
3. AZ ZakVTA – Flickr
4. AS Jim Rathert
5. CA dustandfog – Flickr
6. CO Johan Spaedtke
7. CT Dr. Anthony Swineheart
8. DE Ned Smith Center for Nature and Art
9. FL National Geographic Kids
10. GA Sandy Sharkey
11. HI Brocken Inaglory
12. ID David Moskowitz
13. IL Missouri Department of Conservation
14. IN Pinterest – Unidentified
15. IA Victoria Kaufman
16. KS Doug Hommert
17. KY National Park Service
18. LA Katie Steiger-Meister
19. ME Tom Barnes
20. MD National Park Service
21. MA Pittsburgh Post-Gazette
22. MI Dawn Villella
23. MN Karen Hollingsworth
24. MS Sean Graham
25. MO Chicago Tribune
26. MT Nathan Rupert
27. NE Kimberly Fraser
28. NV Monty Rickard
29. NH Helen Briggs
30. NJ Ohio Department of Natural Resources
31. NM Arizona Game and Fish
32. NY Kirstin Breisch Russell
33. NC Heather Paul
34. ND The Travel Edition – WordPress
35. OH Jeol Trick
36. OK Merlin D. Tuttle
37. OR NWT Species at Risk
38. PA Mary Holland
39. RI Michael Milicia
40. SC Morgan Wolf
41. SD Doug Buckland
42. TN Dave Herasimtshuk
43. TX Shutterstock
44. UT Shutterstock
45. VT Conservation Northwest
46. VA Sean McCann
47. WA Ryan Wolt
48. WV John McCoy
49. WI Todd Rosenberg
50. WY Ryan Moehring

Status sourced from state FWS department, may vary from national and IUCN status.
Other Information sourced from IUCN and FWS and others.