Pandas, for each groupby group, enumerate over column of strings and convert to counter dictionary
I'm trying to automate building a networkx graph for any input pandas dataframe.
The dataframe looks like this:
FeatureID  BC      chrom  pos       ftm_call
1_1_1      GCTATT  12     25398138  NRAS_3
1_1_1      GCCTAT  12     25398160  NRAS_3
1_1_1      GCCTAT  12     25398073  NRAS_3
1_1_1      GATCCT  12     25398128  NRAS_3
1_1_1      GATCCT  12     25398107  NRAS_3
Here's the algorithm I need to sort out:
So far, here is what I have:
import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict
# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df = hamming_df[["FeatureID", "BC", "chrom", "pos"]]
# initiate graphs
G = nx.DiGraph(name="G")
KRAS = nx.DiGraph(name="KRAS")
NRAS_3 = nx.DiGraph(name="NRAS_3")
# list of reference graphs
ref_graph_list = [G, KRAS, NRAS_3]
def add_basecalls(row):
    basecall = row.BC.astype(str)
    target = row.name[1]
    pos = row["pos"]
    chrom = row["chrom"]
    # initialize counter dictionary
    d = defaultdict()
    # select graph that matches ftm call
    graph = [f for f in ref_graph_list if f.graph["name"] == target]
stuff = hamming_df.groupby(["FeatureID", "ftm_call"])
stuff.apply(add_basecalls)
But this doesn't pull out the barcodes as strings that I can enumerate over; it pulls them out as a Series, and I'm stuck.
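To illustrate why a Series shows up here: a groupby hands over each group as a whole DataFrame, so `BC` is a column of barcodes rather than one string. The sketch below rebuilds the sample rows inline (instead of reading test_data.txt) and iterates over the groups directly; the names `seen` and `bc_column` are only for illustration.

```python
import pandas as pd

# Rows copied from the sample table above.
df = pd.DataFrame({
    "FeatureID": ["1_1_1"] * 5,
    "BC": ["GCTATT", "GCCTAT", "GCCTAT", "GATCCT", "GATCCT"],
    "chrom": [12] * 5,
    "pos": [25398138, 25398160, 25398073, 25398128, 25398107],
    "ftm_call": ["NRAS_3"] * 5,
})

seen = []
for (feature_id, ftm_call), grp in df.groupby(["FeatureID", "ftm_call"]):
    # grp is a DataFrame, so grp["BC"] is a Series holding every barcode in the group
    bc_column = grp["BC"]
    for barcode in bc_column:
        # each element of the Series is an ordinary string that can be enumerated
        seen.append((ftm_call, barcode, list(enumerate(barcode))[:2]))
```

Iterating over the Series (or over the rows of the group) is what gives access to the individual barcode strings.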
Desired output is a graph containing the following information, example shown for the first BC "GCTATT" with fictitious counts:
FeatureID  chrom  pos       Nucleotide  Weight
1_1_1      12     25398138  G           10
1_1_1      12     25398138  C           22
1_1_1      12     25398139  T           12
1_1_1      12     25398140  A           15
1_1_1      12     25398141  T           18
1_1_1      12     25398142  T           22
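A rough sketch of how such a weight table could be accumulated with a `Counter`. It rests on one assumption that is not stated above: letter i of a barcode is taken to sit at genomic position `pos + i`. The helper name `count_nucleotides` is hypothetical, and the rows are the sample rows from the question, so every count comes out as 1 rather than the fictitious weights in the table.

```python
from collections import Counter

# (FeatureID, BC, chrom, pos) tuples copied from the sample table in the question.
rows = [
    ("1_1_1", "GCTATT", 12, 25398138),
    ("1_1_1", "GCCTAT", 12, 25398160),
    ("1_1_1", "GCCTAT", 12, 25398073),
    ("1_1_1", "GATCCT", 12, 25398128),
    ("1_1_1", "GATCCT", 12, 25398107),
]

def count_nucleotides(rows):
    """Count each (FeatureID, chrom, position, nucleotide) combination."""
    weights = Counter()
    for feature_id, barcode, chrom, pos in rows:
        for offset, nucleotide in enumerate(barcode):
            # ASSUMPTION: letter i of the barcode sits at genomic position pos + i
            weights[(feature_id, chrom, pos + offset, nucleotide)] += 1
    return weights

weights = count_nucleotides(rows)
```

Each key of `weights` corresponds to one line of the desired table, with the count as the Weight column.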
Thanks in advance!
You got it, great idea.
– SummerEla
Aug 21 at 19:21
1 Answer
You probably need an additional apply with axis=1 to parse the rows for each group:
import pandas as pd
import numpy as np
import networkx as nx
from collections import defaultdict
# initiate graphs
GRAPHS = {
    "G": nx.DiGraph(name="G"),
    "KRAS": nx.DiGraph(name="KRAS"),
    "NRAS_3": nx.DiGraph(name="NRAS_3"),  # notice that test_data.txt has "NRAS_3" not "KRAS_3"
}
WEIGHT_DICT = defaultdict()
def update_weight_for_row(row, target_graph):
    pos = row["pos"]
    chrom = row["chrom"]
    for letter in row.BC:
        print(letter)
        # now you have access to letters in BC per row
        # and can update graph weights as desired
def add_basecalls(grp):
    # select graph that matches ftm_call
    target = grp.name[1]
    target_graph = GRAPHS[target]
    grp.apply(lambda row: update_weight_for_row(row, target_graph), axis=1)
# read in test basecalls
hamming_df = pd.read_csv("./test_data.txt", sep="\t")
hamming_df2 = hamming_df[["FeatureID", "BC", "chrom", "pos"]] # Why is this line needed?
stuff = hamming_df.groupby(["FeatureID", "ftm_call"])
stuff.apply(lambda grp: add_basecalls(grp))
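As one possible way to fill in the placeholder in `update_weight_for_row`, here is a minimal sketch that stores the counts as a `weight` attribute on graph nodes. The node layout `(chrom, position, nucleotide)` and the rule that letter i sits at `pos + i` are assumptions, since the question does not spell out the graph structure. The sketch also iterates over the groups and rows directly instead of nesting `apply` calls, and builds the sample rows inline rather than reading test_data.txt.

```python
import networkx as nx
import pandas as pd

GRAPHS = {"NRAS_3": nx.DiGraph(name="NRAS_3")}

def update_weight_for_row(row, target_graph):
    """Add 1 to the weight of every (chrom, position, nucleotide) node hit by this row."""
    for offset, letter in enumerate(row.BC):
        # ASSUMPTION: node = (chrom, position, nucleotide); letter i sits at pos + i
        node = (row.chrom, row.pos + offset, letter)
        if node not in target_graph:
            target_graph.add_node(node, weight=0)
        target_graph.nodes[node]["weight"] += 1

# Rows copied from the sample table in the question.
hamming_df = pd.DataFrame({
    "FeatureID": ["1_1_1"] * 5,
    "BC": ["GCTATT", "GCCTAT", "GCCTAT", "GATCCT", "GATCCT"],
    "chrom": [12] * 5,
    "pos": [25398138, 25398160, 25398073, 25398128, 25398107],
    "ftm_call": ["NRAS_3"] * 5,
})

for (feature_id, ftm_call), grp in hamming_df.groupby(["FeatureID", "ftm_call"]):
    target_graph = GRAPHS[ftm_call]  # select the graph that matches ftm_call
    for row in grp.itertuples(index=False):
        update_weight_for_row(row, target_graph)
```

After the loop, `GRAPHS["NRAS_3"].nodes(data="weight")` yields each node together with its accumulated count.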
Thank you so much! It was driving me nuts. PS: To answer your question about hamming_df2, I subset the dataframe to just the necessary columns because it's large.
– SummerEla
Aug 21 at 22:02
Can you post a desired output?
– user3483203
Aug 21 at 18:52