Working with Pandas Dataframe Subtraction and Indexes

Working with Pandas Dataframe Subtraction and Indexes



I have several text files, with some chronological metrics on real estate data over time. I would like to import this data into a dataframe, and then calculate a new dataframe that is the difference between a metric for a specific city, relative to the average of all the cities, at that date. For a given date, I would also like the output to be NaN if the individual city had no data, even if there is a relevant average for the other cities.



A couple issues I've encountered or am concerned about



--> I am not sure how the best way is to avoid a situation where some Memphis data = 'NaN', but the average of multiple cities was 17 so Memphis - Average = -17. My desired behavior is for Memphis (NaN) minus Average (17) = NaN.



--> My column order changes to alphabetical when doing a subtraction of my average from the individual cities, and I don't know why. While I have manually checked, I am concerned I'm doing something wrong in terms of indexing, so that I may be subtracting misaligned columns or rows without knowing it.



--> I am extracting off some text data (a 'Record' string and the 'City' name) from my original dataset, and concatenating it back onto the finalized relative city. This seems a bit inefficient, and similarly, I am concerned this could be adding indexing risk.


import glob
import pandas as pd

df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) for f in glob.glob('city Text Files/*.txt')], ignore_index = True)

df_headers = df.loc[:,['Record','City']]
df_average = df.groupby(['Date'], as_index=False).mean()

df_rel_cities_wip = (df.set_index('Date')-df_average.set_index('Date').reindex(df.Date)).reset_index()
df_rel_cities_wip = df_rel_cities_wip.drop(['Record','City'], axis = 1) #these otherwise stay behind as blank columns because they're strings
df_rel_cities_wip = df_rel_cities_wip.round(6) #Gets rid of floating point almost-zeros
df_rel_cities = pd.concat([df_headers, df_rel_cities_wip], axis=1, sort=False) #Bolts city names and alpha record numbers back to the relative dataset



EDIT - ADDING SOME SAMPLE DATA TO ILLUSTRATE
First dataframe (df) would be the original records. For various reasons, there may not be viable $/sqft or days-on-market (DOM) available for Memphis on January 3.


Record City Date $/SQFT DOM
M12 Memphis 01/01/2018 100 18
M13 Memphis 01/02/2018 112 73
M14 Memphis 01/03/2018 NaN NaN
D73 Dallas 01/01/2018 300 36
D74 Dallas 01/02/2018 320 53
D75 Dallas 01/03/2018 325 43
A19 Atlanta 01/01/2018 255 11
A20 Atlanta 01/03/2018 263 18



The average metric on each day for all these cities is as follows (df_average)


Date $/SQFT DOM
01/01/2018 218.3 21.7
01/02/2018 216.0 63.0
01/03/2018 294.0 30.5



And my final dataframe (df_rel_cities) would show the difference between the 'city average' on each date, and the actual. Note that M14 Memphis is NaN to begin - so the desired outcome is to see NaN as the difference versus the industry, given that Memphis had problematic data that day. When I run my code above, I find it alphabetizes my column order as well.


Record City Date $/SQFT DOM
M12 Memphis 01/01/2018 -118.3 -3.7
M13 Memphis 01/02/2018 -104.0 10.0
M14 Memphis 01/03/2018 NaN NaN
D73 Dallas 01/01/2018 81.7 14.3
D74 Dallas 01/02/2018 104.0 -10.0
D75 Dallas 01/03/2018 31.0 12.5
A19 Atlanta 01/01/2018 36.7 -10.7
A20 Atlanta 01/03/2018 -31.0 -12.5





Can you create some sample data?
– jezrael
Aug 27 at 5:03




1 Answer
1



I believe need:


df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True)
for f in glob.glob('city Text Files/*.txt')])

#get only numeric columns
cols = df.select_dtypes(np.number).columns

#create DataFrame with same size as original with means
df_average = df.groupby('Date')[cols].transform('mean')
print (df_average)
$/SQFT DOM
0 218.333333 21.666667
1 216.000000 63.000000
2 294.000000 30.500000
3 218.333333 21.666667
4 216.000000 63.000000
5 294.000000 30.500000
6 218.333333 21.666667
7 294.000000 30.500000

#substract only numeric columns
df[cols] = (df[cols] - df_average ).round(1)
print (df)
Record City Date $/SQFT DOM
0 M12 Memphis 01/01/2018 -118.3 -3.7
1 M13 Memphis 01/02/2018 -104.0 10.0
2 M14 Memphis 01/03/2018 NaN NaN
3 D73 Dallas 01/01/2018 81.7 14.3
4 D74 Dallas 01/02/2018 104.0 -10.0
5 D75 Dallas 01/03/2018 31.0 12.5
6 A19 Atlanta 01/01/2018 36.7 -10.7
7 A20 Atlanta 01/03/2018 -31.0 -12.5





I receive an "InvalidIndexError: Reindexing only valid with uniquely valued Index objects" warning from the last line with .reset_index(). I think this is because by setting the index to 'Date', which repeats for each city in the dataset, that the concatenation has trouble
– The Rhyno
Aug 29 at 3:00





@TheRhyno - Yes, problem was with duplicated DatetimeIndex. For avoid this problems is best use GroupBy.transform, check edited answer.
– jezrael
Aug 29 at 5:25



GroupBy.transform






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

𛂒𛀶,𛀽𛀑𛂀𛃧𛂓𛀙𛃆𛃑𛃷𛂟𛁡𛀢𛀟𛁤𛂽𛁕𛁪𛂟𛂯,𛁞𛂧𛀴𛁄𛁠𛁼𛂿𛀤 𛂘,𛁺𛂾𛃭𛃭𛃵𛀺,𛂣𛃍𛂖𛃶 𛀸𛃀𛂖𛁶𛁏𛁚 𛂢𛂞 𛁰𛂆𛀔,𛁸𛀽𛁓𛃋𛂇𛃧𛀧𛃣𛂐𛃇,𛂂𛃻𛃲𛁬𛃞𛀧𛃃𛀅 𛂭𛁠𛁡𛃇𛀷𛃓𛁥,𛁙𛁘𛁞𛃸𛁸𛃣𛁜,𛂛,𛃿,𛁯𛂘𛂌𛃛𛁱𛃌𛂈𛂇 𛁊𛃲,𛀕𛃴𛀜 𛀶𛂆𛀶𛃟𛂉𛀣,𛂐𛁞𛁾 𛁷𛂑𛁳𛂯𛀬𛃅,𛃶𛁼

How do I collapse sections of code in Visual Studio Code for Windows?

ャフサォクコ ケウ,コ,ワ メ,ロスョノ゙,クネ,フムカヤヲニ,エコ゚ツ ウイオン゙ケワサネォキモュキォウイノンコチ゚メヌナイゥフュ,カヒウネェ ネ,ホノケ,ムュキ ッボーミュハ,チ ツス ィ メウイマヤ,゙ウチ ヅ ロ,ォジヌェ ャヌット ェ,マャ,チナエヒネソキツテ トホヲヲミーァ