Which machine learning or R model to use based on data available

mlsops · March 1, 2023, 5:44pm

Hello,

I am new to the predictive analytics field. For my studies, I am investigating financial data. I need help framing my question properly and figuring out which model to use, based on the variables identified, to answer the question below, so I am hoping to pick your expert brains.

Goal:
Ultimately I would like to predict how much money should be provided to three organizations based on how well they perform in helping their clients find employment.
The more clients these organizations help, the more money they get. I have a chart that shows how much money will be provided to each organization if they can get at least 80% of their clients to be employed after 1 month of providing service to the clients. If the client is still employed after 3 months then the organization gets more money, and if the client is still employed after 12 months then the organization gets even more money.
Basically, a client comes to an organization for help finding a job. The organization helps the client (e.g., resume help or finding employment for them), and once the client lands a job, the organization's work is done. However, the organization does not receive money right away just because the client landed a job. It receives $85 if the client is still employed in the same place 1 month after landing the job; then, exactly 3 months from that date, it receives a further $115 if the client is still employed in the same place; and then, exactly 12 months from that date, it receives a further $1000 if the client is still employed in the same place.

I have 1 year's worth of data from these 3 organizations that shows how many clients came to them for employment help and, of those clients, how many were employed in the same place after 1, 3, and 12 months. I also have the total payments that went out to these 3 organizations based on the number of clients each organization has helped land a job.

What I want to know is, based on the client intake (number of clients that came to the 3 organizations for help) and client outcome (number of clients that landed and stayed in same jobs 1,3,12 months after receiving service) from this year, how can I predict what the dollar amount will be for next year that needs to be given to the 3 organizations?

I hope my question makes sense. Please feel free to ask me anything if you need more clarification. Based on your expertise, what model in R can I use to do this kind of analysis?

Thank you

technocrat · March 1, 2023, 7:54pm

Framing the question is always the hardest part of data analysis. A framework can help, based on school algebra: f(x) = y, where

x is the data at hand
y is the result to be derived
f is one or more functions to transform x into y

In R, each of these is represented by an object with various properties. Objects can be, and usually are, composite. Functions are objects that act on other objects, including other functions.
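As a minimal sketch of the x, f, y idea in R (the organization names and the flat $100 rate are made up purely for illustration):

```r
# x: the data at hand (hypothetical counts of clients placed by three organizations)
x <- c(org_a = 40, org_b = 25, org_c = 60)

# f: a function object; here a flat per-client payment (illustrative rate only)
f <- function(placed, rate = 100) placed * rate

# y: the result derived by applying f to x
y <- f(x)

# Functions can act on other functions: Reduce() receives `+` as an argument
total <- Reduce(`+`, y)
```

Here `y` is a named numeric vector with one payout per organization, and `total` is their sum.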

To make this concrete, the 12 month data you described can be represented by a data frame object that might look something like the following faked data

# source object
x <- data.frame(
  org_id = 1:20,
  intake = c(
    1098, 859, 680, 729, 619, 733, 499,
    858, 1100, 768, 702, 787, 1010, 825, 846, 515, 940, 678,
    1395, 532),
   q1_30 = c(
    49, 65, 25, 74, 18,
    100, 47, 24, 71, 89, 37, 20, 26, 3, 41, 27, 36, 5,
    34, 87
  ), 
  q1_90 = c(
    58, 97, 42, 24, 30, 43, 15, 22,
    100, 8, 36, 68, 86, 18, 69, 4, 50, 49, 26, 6
  ), q1_360 = c(
    6,
    2, 3, 21, 99, 58, 10, 40, 5, 33, 49, 73, 29, 76,
    84, 9, 35, 16, 69, 96
  ), q2_30 = c(
    82, 24, 18, 69, 55,
    40, 21, 57, 42, 98, 13, 53, 54, 83, 32, 80, 60, 29,
    81, 73
  ), q2_90 = c(
    85, 43, 58, 72, 29, 55, 38, 1, 13,
    78, 5, 73, 95, 16, 99, 42, 100, 57, 96, 25
  ), q2_360 = c(
    63,
    32, 81, 14, 6, 47, 43, 62, 37, 80, 31, 34, 96, 86,
    38, 88, 84, 15, 89, 42
  ), q3_30 = c(
    87, 60, 12, 26,
    41, 65, 66, 56, 24, 25, 61, 62, 14, 34, 94, 32, 27,
    10, 57, 28
  ), q3_90 = c(
    37, 10, 5, 35, 78, 14, 28, 54,
    90, 31, 43, 52, 81, 59, 27, 30, 89, 75, 88, 73
  ),
  q3_360 = c(
    17, 62, 13, 95, 63, 49, 61, 1, 100, 33,
    28, 2, 31, 8, 80, 3, 98, 12, 65, 30
  ), q4_30 = c(
    51,
    60, 95, 37, 47, 56, 70, 16, 10, 71, 73, 25, 3,
    82, 53, 28, 79, 57, 80, 61
  ), q4_90 = c(
    43, 22, 26,
    54, 97, 84, 55, 96, 58, 45, 37, 100, 79, 85, 34,
    25, 65, 14, 49, 11
  ), q4_360 = c(
    28, 18, 55, 42,
    36, 20, 79, 61, 71, 27, 5, 86, 95, 41, 11, 38,
    73, 23, 89, 19))

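# intermediate object: placements as a share of intake for each organization
x1 <- data.frame(org_id = x$org_id)
x1$placement <- rowSums(x[3:14])  # columns 3 to 14 hold the quarterly counts
x1$score <- round(x1$placement/x$intake,2)

# intermediate object: organizations meeting the 80% threshold, intake column dropped
eligible <- x[which(x1$score >= 0.8),][-2]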

# y
# per-client payment at each milestone, as stated in the question
pay30 <- 85
pay90 <- 115
pay360 <- 1000
# column positions in eligible (org_id is column 1; intake was dropped)
q30 <-  c(2,5,8,11)
q90 <-  c(3,6,9,12)
q360 <- c(4,7,10,13)

y <- data.frame(org_id = eligible$org_id)
y$pay_out30 <- eligible[q30] * pay30
y$pay_out90 <- eligible[q90] * pay90
y$pay_out360 <-eligible[q360] * pay360
y$total <- rowSums(y[2:4])
y
#>   org_id pay_out30.q1_30 pay_out30.q2_30 pay_out30.q3_30 pay_out30.q4_30
#> 1      5            1530            4675            3485            3995
#> 2      6            8500            3400            5525            4760
#> 3      7            3995            1785            5610            5950
#> 4     10            7565            8330            2125            6035
#> 5     12            1700            4505            5270            2125
#> 6     17            3060            5100            2295            6715
#> 7     20            7395            6205            2380            5185
#>   pay_out90.q1_90 pay_out90.q2_90 pay_out90.q3_90 pay_out90.q4_90
#> 1            3450            3335            8970           11155
#> 2            4945            6325            1610            9660
#> 3            1725            4370            3220            6325
#> 4             920            8970            3565            5175
#> 5            7820            8395            5980           11500
#> 6            5750           11500           10235            7475
#> 7             690            2875            8395            1265
#>   pay_out360.q1_360 pay_out360.q2_360 pay_out360.q3_360 pay_out360.q4_360
#> 1             99000              6000             63000             36000
#> 2             58000             47000             49000             20000
#> 3             10000             43000             61000             79000
#> 4             33000             80000             33000             27000
#> 5             73000             34000              2000             86000
#> 6             35000             84000             98000             73000
#> 7             96000             42000             30000             19000
#>    total
#> 1 244595
#> 2 218725
#> 3 225980
#> 4 215685
#> 5 242295
#> 6 342130
#> 7 221390

The first data frame serves as x, and the last data frame, y, is calculated from x. y was designed before populating it.

Four quarters are insufficient to forecast except by the seasonal naive method: the future will be like the past or present, only a little bit more or a little bit less. Seasonal naive takes the last data point from the same period a year earlier. Turning that forecast into a judgmental prediction requires taking a position on the changes in seasonally adjusted employment rates to be expected in the period ahead.
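A minimal base R sketch of the seasonal naive idea follows. The quarterly payout figures and the 5% adjustment are hypothetical placeholders, and `seasonal_naive` is a helper written here for illustration, not a function from any package (the forecast package offers `snaive()` for `ts` objects if a packaged version is preferred).

```r
# Hypothetical total payouts per fiscal quarter for one organization
last_year <- c(Q1 = 52000, Q2 = 61000, Q3 = 48000, Q4 = 57000)

# Seasonal naive: forecast each future period with the value observed
# in the same period of the most recent complete cycle
seasonal_naive <- function(history, period = 4, h = period) {
  last_cycle <- tail(history, period)
  rep(last_cycle, length.out = h)
}

next_year <- seasonal_naive(last_year)

# Judgmental adjustment: an assumed 5% rise in placements (a guess, not an estimate)
adjustment <- 1.05
next_year_adjusted <- next_year * adjustment

sum(next_year)           # unadjusted annual forecast
sum(next_year_adjusted)  # annual forecast after the assumed adjustment
```

With only one year of history, the unadjusted forecast is simply last year repeated; all of the forecasting judgment sits in the choice of `adjustment`.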

After making a prediction, any available historical period before the year being used should be considered as a reality check.

Attention should be given to carry-over rules. For example, does a client entering the system in December 2022 and placed in 2023Q2 enter into the scoring?

Finally, consider how the payout scheme can be gamed by organizations playing with timing, such as entering employment begun on the last day of Q1 as the first day of Q2.

mlsops · March 1, 2023, 9:32pm

@technocrat
Thank you so much for this detailed response. That's a really good point you brought up about the quarters. There are 4 fiscal year quarters. The 1 year data I have available is for the fiscal year 2021-2022:
April 1, 2021-June 30, 2021 - Q1
July 1, 2021-Sept 30, 2021 - Q2
Oct 1, 2021-Dec 31, 2021 - Q3
Jan 1, 2022-March 31, 2022 - Q4
You are correct: a client who entered employment in Q2 and was employed for an entire month will result in the organization receiving the 1-month payment in Q2. If the client reaches their 12-month point next fiscal year (2022-2023), then the organization will receive that 12-month payment next fiscal year.
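To illustrate how a payment can land in a different fiscal year, here is a rough sketch that maps milestone dates to the fiscal quarters listed above. The start date is hypothetical, `fiscal_quarter` is a helper made up for this example, and it assumes all three milestones are counted from the job start date.

```r
# Fiscal year runs April 1 to March 31, per the quarter definitions above
fiscal_quarter <- function(d) {
  m <- as.integer(format(d, "%m"))
  y <- as.integer(format(d, "%Y"))
  fy_start <- ifelse(m >= 4, y, y - 1)   # Jan-Mar belong to the prior fiscal year
  q <- ((m - 4) %% 12) %/% 3 + 1         # Apr-Jun = 1, ..., Jan-Mar = 4
  paste0("FY", fy_start, "-", fy_start + 1, " Q", q)
}

# Hypothetical client who started a job on 2021-08-15
start <- as.Date("2021-08-15")

# Dates 1, 3 and 12 months after the start (assumption: all measured from start)
monthly <- seq(start, by = "month", length.out = 13)
milestones <- monthly[c(2, 4, 13)]

payment_quarter <- fiscal_quarter(milestones)
payment_quarter
#> [1] "FY2021-2022 Q2" "FY2021-2022 Q3" "FY2022-2023 Q2"
```

In this example the 1-month and 3-month payments fall in the data year, while the 12-month payment carries over into the following fiscal year.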

In terms of gaming the system (haha I am sure that happens) unfortunately I have to go with what the data shows. How would you suggest I lay down the model? Would providing a snapshot of what the data looks like be helpful to you?

I have a scenario analysis that may be helpful:
If 10 clients walk into an organization's door today and, after receiving service from that organization, all of them land jobs and stay employed for all 12 months, then the organization is eligible to receive $12,000 for serving those clients:
1 month = $85
3 months = $115
12 months = $1000
Total = $1200 per client

However, based on the 1 year data available, what I see is:
80% of the clients were employed at 1 month
60% were employed at 3 months
20% were employed at 12 months

and so, the organizations are eligible to receive their payment for those months depending on which quarter the client was employed.
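The arithmetic above can be sketched in R as follows. This assumes the 80/60/20 percentages apply to all 10 clients in the scenario and ignores which quarter each payment falls in; the variable names are only illustrative.

```r
# Payment per client at each milestone, from the scenario above
payment <- c(m1 = 85, m3 = 115, m12 = 1000)

# Share of clients still employed at each milestone, from the 1-year data
retention <- c(m1 = 0.80, m3 = 0.60, m12 = 0.20)

clients <- 10

# Best case: every client stays employed through all three milestones
max_payout <- clients * sum(payment)

# Expected case: weight each milestone payment by the observed share
expected_payout <- clients * sum(payment * retention)

max_payout       # 12000
expected_payout  # about 3370
```

Per client, that is 0.80 × 85 + 0.60 × 115 + 0.20 × 1000 = 337, compared with the $1200 maximum.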

How would I use the seasonal naive model for this analysis? I would much appreciate your input on this. Thank you

technocrat · March 1, 2023, 10:34pm

My mock-up didn't capture this. Instead, I used as eligibility the total clients placed divided by the total walking in the door. Since everyone reflected in the quarterly data was employed for at least thirty days, I don't think that should be a problem. It also avoids getting into an unnecessary level of detail that fails to change the outcome even though it more closely tracks what's happening.

My data were generated randomly, so they are not realistic. But the same calculations should work for your actual data if they are aggregated the same way. If, however, payments are calculated individually for each client there could be some adjustment needed, because x would not be as assumed.

With the data available, which is limited to only the four quarters, this is more of a descriptive than a modeling exercise because payout is completely determined according to the quarterly input, number of clients, and the payout levels. It is possible to do scenario planning by varying assumptions made for future periods as to the quarterly data expected as some proportion of the historical (perhaps on an agency-by-agency basis), client intake levels (perhaps based on general economic conditions expected), assessing the budgetary impact of changing payout unit amounts, etc.
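A rough sketch of that kind of scenario planning is below. The baseline intake of 1000 is hypothetical, the 80/60/20 retention shares are the ones quoted earlier in the thread, and the plus or minus 10% changes are arbitrary assumptions chosen only to show the mechanics.

```r
# Payment per client at each milestone
payment <- c(m1 = 85, m3 = 115, m12 = 1000)

# Baseline assumptions (intake is hypothetical; rates are the shares quoted in the thread)
base_intake <- 1000
base_rates <- c(m1 = 0.80, m3 = 0.60, m12 = 0.20)

# Every combination of assumed changes in intake and in retention rates
scenarios <- expand.grid(
  intake_change = c(-0.10, 0, 0.10),
  rate_change   = c(-0.10, 0, 0.10)
)

# Total payout under each scenario; rates are capped at 100%
scenarios$payout <- mapply(function(intake_change, rate_change) {
  intake <- base_intake * (1 + intake_change)
  rates  <- pmin(base_rates * (1 + rate_change), 1)
  intake * sum(payment * rates)
}, scenarios$intake_change, scenarios$rate_change)

scenarios
```

Changing the values in `payment` in the same way shows the budgetary impact of different payout unit amounts.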

No. At this point you should have enough of a start to modify the x and y examples to conform to the specific business problem, understand how the calculations performed by f work and then make the appropriate adjustments.

If you have difficulties with specific steps, come back in a new thread with a reprex (see the FAQ).

mlsops · March 1, 2023, 11:03pm

@technocrat thank you. So would you suggest that, for my predictive analytics question, it would be better to predict what the client intake and outcome (employed vs. unemployed at 1, 3, and 12 months) will be, based on the 1 year of data I have available?

technocrat · March 2, 2023, 12:27am

Unless you have better data available, the only alternative is making guesses.
