Using this implementation of parallelization raises an ImportError: cannot import name 'Parallel' from 'multiprocessing'.
The following code tries to parallelize the "denominator" function and should give me the sum of the fields "basalareap", "basalareas", "basalaread" in a new column. When I instead import the whole library via from multiprocessing import *, the process starts but never finishes.
What is wrong with my syntax?
import numpy as np
from multiprocessing import cpu_count, Parallel
import pandas as pd

# Some example dataframe
np.random.seed(4)
layer = pd.DataFrame(np.random.randint(0, 25, size=(10, 4)),
                     columns=list(['basalareap', 'notofinterest', 'basalareas', 'basalaread']))

### Filter fields by selecting columns of interest
fields = ["basalareap", "basalareas", "basalaread"]
# In reality the data is a geodataframe, so it would be:
# layer = layer[fields + ['geometry']]
# but here:
layer = fields
data = layer

def denom():
    data['denominator'] = data[["basalareap", "basalareas", "basalaread"]].sum(axis=1)

cores = cpu_count()
partitions = cores

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, denom)
I'm using Windows 10 and Python 3.7.4
Answer
You need to be more careful with your code.
When you write layer = fields you replace your dataframe with a plain list of column names. Your function denom also doesn't return anything, but it has to return a dataframe given how you use it in the multiprocessing step.
Finally, you never use Parallel anywhere in your code (it doesn't exist in multiprocessing anyway); what you actually use is Pool, so that is what you should import. When something doesn't work, check for typos and inconsistencies like these first.
import numpy as np
from multiprocessing import Pool
import pandas as pd

# Some example dataframe
np.random.seed(4)
layer = pd.DataFrame(np.random.randint(0, 25, size=(1000, 4)),
                     columns=list(['basalareap', 'notofinterest', 'basalareas', 'basalaread']))

### Filter fields by selecting columns of interest
fields = ["basalareap", "basalareas", "basalaread"]
# Check the pandas docs if you have doubts about the syntax. This is how you
# keep only a list of columns.
layer = layer[fields]

def denom(df):
    df['denominator'] = df[["basalareap", "basalareas", "basalaread"]].sum(axis=1)
    # You need to return a dataframe for use in the parallelize function below
    return df

def parallelize_dataframe(df, func, n_cores=4):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# On Windows you need to put a guard in the main module to avoid creating
# subprocesses recursively
if __name__ == '__main__':
    test = parallelize_dataframe(df=layer, func=denom)
    print(test.head())
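If you want to keep your original idea of using every core, you can pass cpu_count() as n_cores. A minimal sketch, assuming the corrected code above is in the same module:

from multiprocessing import cpu_count

if __name__ == '__main__':
    # Split the work across all available cores instead of the default of 4
    test = parallelize_dataframe(df=layer, func=denom, n_cores=cpu_count())
    print(test.head())

Note that for a simple row-wise sum like this, the plain vectorized call layer[fields].sum(axis=1) is usually faster than multiprocessing, since every chunk has to be pickled and sent to a worker process; parallelizing pays off mainly for heavier per-row computations.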