Using this implementation of parallelization raises an ImportError: cannot import name 'Parallel' from 'multiprocessing'.
The following code tries to parallelize the "denominator" function and should give me the sum of the fields "basalareap", "basalareas", "basalaread" in a new column. When I instead import the whole library via from multiprocessing import *, the process starts but never finishes.
What is wrong with my syntax?
import numpy as np
from multiprocessing import cpu_count, Parallel
import pandas as pd

# Some example dataframe
np.random.seed(4)
layer = pd.DataFrame(np.random.randint(0, 25, size=(10, 4)),
                     columns=list(['basalareap', 'notofinterest', 'basalareas', 'basalaread']))

### Filter fields by selecting columns of interest
fields = ["basalareap", "basalareas", "basalaread"]
# In reality the data is a geodataframe, so it would be:
# layer = layer[fields + ['geometry']]
# but here:
layer = fields
data = layer

def denom():
    data['denominator'] = data[["basalareap", "basalareas", "basalaread"]].sum(axis=1)

cores = cpu_count()
partitions = cores

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, denom)
I'm using Windows 10 and Python 3.7.4
Answer
You need to be more careful with your code.
When you write layer = fields you replace your dataframe with a plain list of column names. Your function denom also doesn't return anything, but it has to return a dataframe given how you use it in the multiprocessing step.
Finally, you never use Parallel anywhere in your code (it doesn't exist in multiprocessing anyway); what you actually use is Pool, so that is what you should import. When something doesn't work, check for typos and inconsistencies like these first.
import numpy as np
from multiprocessing import Pool
import pandas as pd

# Some example dataframe
np.random.seed(4)
layer = pd.DataFrame(np.random.randint(0, 25, size=(1000, 4)),
                     columns=list(['basalareap', 'notofinterest', 'basalareas', 'basalaread']))

### Filter fields by selecting columns of interest
fields = ["basalareap", "basalareas", "basalaread"]
# Check the pandas docs if you have doubts about the syntax. This is how you
# keep only a list of columns.
layer = layer[fields]

def denom(df):
    df['denominator'] = df[["basalareap", "basalareas", "basalaread"]].sum(axis=1)
    # You need to return a dataframe for use in the parallelize function below
    return df

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# On Windows you need to put a guard in the main module to avoid creating
# subprocesses recursively
if __name__ == '__main__':
    test = parallelize_dataframe(df=layer, func=denom)
    print(test.head())
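If you want to keep your original idea of using every core, you can pass cpu_count() as n_cores. A minimal sketch, assuming the corrected code above is in the same module:

from multiprocessing import cpu_count

if __name__ == '__main__':
    # Split the work across all available cores instead of the default of 4
    test = parallelize_dataframe(df=layer, func=denom, n_cores=cpu_count())
    print(test.head())

Note that for a simple row-wise sum like this, the plain vectorized call layer[fields].sum(axis=1) is usually faster than multiprocessing, since every chunk has to be pickled and sent to a worker process; parallelizing pays off mainly for heavier per-row computations.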