Working with Pandas Dataframe Subtraction and Indexes

I have several text files, with some chronological metrics on real estate data over time. I would like to import this data into a dataframe, and then calculate a new dataframe that is the difference between a metric for a specific city, relative to the average of all the cities, at that date. For a given date, I would also like the output to be NaN if the individual city had no data, even if there is a relevant average for the other cities.

A couple issues I've encountered or am concerned about

--> I am not sure how the best way is to avoid a situation where some Memphis data = 'NaN', but the average of multiple cities was 17 so Memphis - Average = -17. My desired behavior is for Memphis (NaN) minus Average (17) = NaN.

--> My column order changes to alphabetical when doing a subtraction of my average from the individual cities, and I don't know why. While I have manually checked, I am concerned I'm doing something wrong in terms of indexing, so that I may be subtracting misaligned columns or rows without knowing it.

--> I am extracting off some text data (a 'Record' string and the 'City' name) from my original dataset, and concatenating it back onto the finalized relative city. This seems a bit inefficient, and similarly, I am concerned this could be adding indexing risk.

import glob import pandas as pd df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) for f in glob.glob('city Text Files/*.txt')], ignore_index = True) df_headers = df.loc[:,['Record','City']] df_average = df.groupby(['Date'], as_index=False).mean() df_rel_cities_wip = (df.set_index('Date')-df_average.set_index('Date').reindex(df.Date)).reset_index() df_rel_cities_wip = df_rel_cities_wip.drop(['Record','City'], axis = 1) #these otherwise stay behind as blank columns because they're strings df_rel_cities_wip = df_rel_cities_wip.round(6) #Gets rid of floating point almost-zeros df_rel_cities = pd.concat([df_headers, df_rel_cities_wip], axis=1, sort=False) #Bolts city names and alpha record numbers back to the relative dataset

EDIT - ADDING SOME SAMPLE DATA TO ILLUSTRATE
First dataframe (df) would be the original records. For various reasons, there may not be viable $/sqft or days-on-market (DOM) available for Memphis on January 3.

Record City Date $/SQFT DOM M12 Memphis 01/01/2018 100 18 M13 Memphis 01/02/2018 112 73 M14 Memphis 01/03/2018 NaN NaN D73 Dallas 01/01/2018 300 36 D74 Dallas 01/02/2018 320 53 D75 Dallas 01/03/2018 325 43 A19 Atlanta 01/01/2018 255 11 A20 Atlanta 01/03/2018 263 18

The average metric on each day for all these cities is as follows (df_average)

Date $/SQFT DOM 01/01/2018 218.3 21.7 01/02/2018 216.0 63.0 01/03/2018 294.0 30.5

And my final dataframe (df_rel_cities) would show the difference between the 'city average' on each date, and the actual. Note that M14 Memphis is NaN to begin - so the desired outcome is to see NaN as the difference versus the industry, given that Memphis had problematic data that day. When I run my code above, I find it alphabetizes my column order as well.

Record City Date $/SQFT DOM M12 Memphis 01/01/2018 -118.3 -3.7 M13 Memphis 01/02/2018 -104.0 10.0 M14 Memphis 01/03/2018 NaN NaN D73 Dallas 01/01/2018 81.7 14.3 D74 Dallas 01/02/2018 104.0 -10.0 D75 Dallas 01/03/2018 31.0 12.5 A19 Atlanta 01/01/2018 36.7 -10.7 A20 Atlanta 01/03/2018 -31.0 -12.5

Can you create some sample data?
– jezrael
Aug 27 at 5:03

1 Answer
1

I believe need:

df = pd.concat([pd.read_csv(f, na_values = ['#VALUE!', '#DIV/0!'], keep_default_na = True) for f in glob.glob('city Text Files/*.txt')]) #get only numeric columns cols = df.select_dtypes(np.number).columns #create DataFrame with same size as original with means df_average = df.groupby('Date')[cols].transform('mean') print (df_average) $/SQFT DOM 0 218.333333 21.666667 1 216.000000 63.000000 2 294.000000 30.500000 3 218.333333 21.666667 4 216.000000 63.000000 5 294.000000 30.500000 6 218.333333 21.666667 7 294.000000 30.500000 #substract only numeric columns df[cols] = (df[cols] - df_average ).round(1) print (df) Record City Date $/SQFT DOM 0 M12 Memphis 01/01/2018 -118.3 -3.7 1 M13 Memphis 01/02/2018 -104.0 10.0 2 M14 Memphis 01/03/2018 NaN NaN 3 D73 Dallas 01/01/2018 81.7 14.3 4 D74 Dallas 01/02/2018 104.0 -10.0 5 D75 Dallas 01/03/2018 31.0 12.5 6 A19 Atlanta 01/01/2018 36.7 -10.7 7 A20 Atlanta 01/03/2018 -31.0 -12.5

I receive an "InvalidIndexError: Reindexing only valid with uniquely valued Index objects" warning from the last line with .reset_index(). I think this is because by setting the index to 'Date', which repeats for each city in the dataset, that the concatenation has trouble
– The Rhyno
Aug 29 at 3:00

@TheRhyno - Yes, problem was with duplicated DatetimeIndex. For avoid this problems is best use GroupBy.transform, check edited answer.
– jezrael
Aug 29 at 5:25

GroupBy.transform

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt