Home Convert variable to multiple columns with Pandas

# Convert variable to multiple columns with Pandas

Computarch
1#
Computarch Published in 2018-02-13 23:45:26Z
 I have a legacy datafile that contains data in the following format: SURVEY NUM TEMPORAL WHS 1 Byz WHS 1 Byz_Um WHS 1 IAII WHS 1 L_Isl WHS 1 L_Rom WHS 1 Mod WHS 1 Nab WHS 2 Byz WHS 2 Mod WHS 2 Unk WHS 2 MP WHS 3 Byz WHS 3 Nab WHS 3 LMP WHS 3 UP WHS 4 LMP WHS 4 MP WHS 4 UP WHS 5 Byz WHS 5 Unk WHS 5 LMP  etc.. Essentially, the column "NUM" is a unique identifier that relates to a specific site, and the column "TEMPORAL" is a value associated with that site. For whatever reason, the original file repeates this over several lines for sites with mutltiple temporal occupations (this archaeological data). I would like to use Pandas to conver this to something like so: SURVEY NUM Byz Byz_Um IAII L_Isl LMP L_Rom Nab MP Mod Unk UP WHS 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 WHS 2 1 0 0 0 0 0 0 0 0 1 1 0 0 1 WHS 3 1 0 0 0 0 0 0 0 1 1 1 0 1 0 WHS 4 0 0 0 0 0 0 0 0 0 0 0 1 0 1 WHS 5 1 0 0 0 0 0 0 1 0 0 0 0 1 0  Where a 1 is placed into a new column if that TEMPORAL period exists. I tried using df.pivot with "NUM" as the index and "TEMPORAL" as the columns, but that did not work. There are several thousand sites in this database, so doing it manually is not a possibility. Any ideas?
piRSquared
2#
 np.bincount and pd.factorize i, r = pd.factorize(df['SURVEY']) j, c = pd.factorize(df['NUM TEMPORAL']) n, m = len(r), len(c) b = np.bincount(i * m + j, minlength=n * m).reshape(n, m) pd.DataFrame(b, r, c) Byz Byz_Um IAII L_Isl L_Rom Mod Nab Unk MP LMP UP WHS 1 1 1 1 1 1 1 1 0 0 0 0 WHS 2 1 0 0 0 0 1 0 1 1 0 0 WHS 3 1 0 0 0 0 0 1 0 0 1 1 WHS 4 0 0 0 0 0 0 0 0 1 1 1 WHS 5 1 0 0 0 0 0 0 1 0 1 0  pd.get_dummies and pd.DataFrame.dot pd.get_dummies(df['SURVEY']).T.dot(pd.get_dummies(df['NUM TEMPORAL'])) Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk WHS 1 1 1 1 0 1 1 0 1 1 0 0 WHS 2 1 0 0 0 0 0 1 1 0 0 1 WHS 3 1 0 0 1 0 0 0 0 1 1 0 WHS 4 0 0 0 1 0 0 1 0 0 1 0 WHS 5 1 0 0 1 0 0 0 0 0 0 1  pd.crosstab pd.crosstab(*df.values.T) col_0 Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk row_0 WHS 1 1 1 1 0 1 1 0 1 1 0 0 WHS 2 1 0 0 0 0 0 1 1 0 0 1 WHS 3 1 0 0 1 0 0 0 0 1 1 0 WHS 4 0 0 0 1 0 0 1 0 0 1 0 WHS 5 1 0 0 1 0 0 0 0 0 0 1  Or pd.crosstab(df['SURVEY'], df['NUM TEMPORAL']) NUM TEMPORAL Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk SURVEY WHS 1 1 1 1 0 1 1 0 1 1 0 0 WHS 2 1 0 0 0 0 0 1 1 0 0 1 WHS 3 1 0 0 1 0 0 0 0 1 1 0 WHS 4 0 0 0 1 0 0 1 0 0 1 0 WHS 5 1 0 0 1 0 0 0 0 0 0 1  reconstruction pd.Series(1, df.values.T.tolist()).unstack(fill_value=0) Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk WHS 1 1 1 1 0 1 1 0 1 1 0 0 WHS 2 1 0 0 0 0 0 1 1 0 0 1 WHS 3 1 0 0 1 0 0 0 0 1 1 0 WHS 4 0 0 0 1 0 0 1 0 0 1 0 WHS 5 1 0 0 1 0 0 0 0 0 0 1  Explanation df.values.T.tolist() passes a list of length 2, consisting of two levels. Those levels get interpreted as a MultiIndex. 1 becomes the default value. unstack with fill_value=0 reshapes accordingly
Wen
3#
 By using get_dummies df.set_index(['SURVEY','NUM']).TEMPORAL.str.get_dummies().sum(level=1) Out[127]: Byz Byz_Um IAII LMP L_Isl L_Rom MP Mod Nab UP Unk NUM 1 1 1 1 0 1 1 0 1 1 0 0 2 1 0 0 0 0 0 1 1 0 0 1 3 1 0 0 1 0 0 0 0 1 1 0 4 0 0 0 1 0 0 1 0 0 1 0 5 1 0 0 1 0 0 0 0 0 0 1 
Computarch
4#
 Thanks to piRSquared for leading me to the correct answer. The following code did exactly what I wanted: df2 = pandas.crosstab(df['NUM'], [df['SURVEY'], df['TEMPORAL']]) 
 You need to login account before you can post.
Processed in 0.337567 second(s) , Gzip On .