New to R. Selecting cases with multiple selectors to find breeding patterns in a horse pedigree database

Pedigree_Nerd · May 18, 2021, 5:02am

Hi there

I'm a horse pedigree nerd who likes to find common matings in the pedigrees of successful cutting horses.

I've started to use formulas in spreadsheets for this, but it's becoming quite tedious to come up with a formula for each possible combination in a 5-generation pedigree.

I've been wondering whether I can use R to help me find breeding patterns (matings) that repeat in a sample of successful individuals (I suspect there are several such patterns). The goal is to be able to say that 'this pattern is present in winners of $X million' (earnings are included in the dataset).

The data (ancestors) for each individual are laid out in rows as follows:

Column 1: Horse's name (current performer)

Column 2: Money earned

Column 3: Generation 1 Top (name of sire)

Column 4: Generation 1 Bottom (name of dam)

Column 5: Generation 2 Top (name of paternal grandsire)

Column 6: Generation 2 Top (name of paternal granddam)

Column 7: Generation 2 Bottom (name of maternal grandsire)

Column 8: Generation 2 Bottom (name of maternal granddam)

And so on until the 5th generation.

Each generation has twice as many horses as the previous one.

I’m not interested in finding out the great producing individuals over the last few decades (everyone already knows who they are). I’m more interested in the crosses that tend to produce winners. And by crosses I don’t mean the sire and the dam, but the bloodlines a bit further back in the pedigree.

For example, there are cases in which two successful half brothers are out of different mares that are very closely related but not in an obvious way. They could share some ancestors in the 3rd or 4th generation that are placed similarly but not equally. There could be two individuals that are 3/4 siblings even though they are by and out of different individuals.

I hope this makes sense to a non-horsey person?

From my initial research into R, it seems like selecting cases using multiple selectors may do the job?

Or is there a base function in R or a package that could work better?

Just trying to get any useful intel before I go down a rabbit hole that takes me nowhere.

Cheers

technocrat · May 18, 2021, 5:07am

To move on to the domain-specifics (horse breeding), start the dialogue with a reprex that includes a representative dataset. See the FAQ: How to do a minimal reproducible example (reprex) for beginners.

mara · May 19, 2021, 2:06pm

Can you expand a bit on what you mean by this? This is certainly something you could do, but having an example of your data structure (the kind that's copy-and-pasteable, see the reprex FAQ for details) and knowing a bit more about what your methodology is will help us help you.

I tried to look at an example paper to see what it might be, but no luck. However, if you have access to the journal, you might find this article interesting:
https://www.sciencedirect.com/science/article/abs/pii/S0737080621000150

Pedigree_Nerd · May 24, 2021, 6:52am

Hi Mara (and technocrat)

Thanks for the reply and thanks for taking the time to look up similar research.

My objective is different from the one in that article. That research seeks to identify the most influential ancestors in Brazilian cutting-bred quarter horses.

I'm seeking to go beyond just identifying individuals and find the most influential matings (known in the horse industry as 'nicks') in modern American cutting horses.

Currently, most of this information is limited to identifying the successful products of crossing sire X with the daughters of sire Y (the maternal grandsire). I'm trying to identify influential matings beyond the first two generations in a pedigree.

Below is an example of my data structure using datapasta and reprex. It's only showing the 2-generation pedigrees (the original dataset goes to the 5th generation) of the first 2 horses in the dataset.

Not sure whether this makes it easier to understand?

#>   NAME   Earnings GEN.1.top  GEN.1.Bottom GEN.2.top.1 GEN.2.top.2 GEN.2.Bottom.1
#>   <chr>     <int> <chr>      <chr>        <chr>       <chr>       <chr>         
#> 1 Suepe…   214982 Dual Smar… Ichis My Ch… Dual Rey    The Smart … Cat Ichi      
#> 2 Metal…   177786 Metallic … Dual Rey Mi… High Brow … Chers Shad… Dual Rey      
#> # … with 1 more variable: GEN.2.Bottom.2 <chr>

technocrat · May 24, 2021, 7:41am

A reprex can be simply cut-and-pasted.

dput(mtcars)
#> structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 
#> 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 
#> 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 
#> 19.7, 15, 21.4), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 
#> 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4), 
#>     disp = c(160, 160, 108, 258, 360, 225, 360, 146.7, 140.8, 
#>     167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7, 
#>     71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145, 
#>     301, 121), hp = c(110, 110, 93, 110, 175, 105, 245, 62, 95, 
#>     123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150, 
#>     150, 245, 175, 66, 91, 113, 264, 175, 335, 109), drat = c(3.9, 
#>     3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 
#>     3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76, 
#>     3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
#>     ), wt = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 
#>     3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 
#>     1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14, 
#>     1.513, 3.17, 2.77, 3.57, 2.78), qsec = c(16.46, 17.02, 18.61, 
#>     19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6, 
#>     18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87, 
#>     17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
#>     ), vs = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 
#>     0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), am = c(1, 
#>     1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
#>     0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1), gear = c(4, 4, 4, 3, 
#>     3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 
#>     3, 3, 4, 5, 5, 5, 5, 5, 4), carb = c(4, 4, 1, 1, 2, 1, 4, 
#>     2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 
#>     2, 2, 4, 6, 8, 2)), row.names = c("Mazda RX4", "Mazda RX4 Wag", 
#> "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant", 
#> "Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C", 
#> "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood", 
#> "Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic", 
#> "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin", 
#> "Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2", 
#> "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora", 
#> "Volvo 142E"), class = "data.frame")

It doesn't have to be the complete data; it can be a built-in data set with the same structure or even made-up data. What was posted shows only a collection of variable names, one of type integer and the others of type character.

It appears that what is sought is a model, f, in which y, the response variable (earnings), can be estimated from some combination of x_1 ... x_n, variables that represent pedigree attributes, such as GEN.1.top.

The threshold task will be to encode pedigrees as dummy variables from lists of sires and dams by generation.

For example, consider two sires, Threepenny and Opera, each of which produces in one generation, three sires (A,B,C,D,E,F) and three dams (G,H,I,J,K,L). In a subsequent generation, there may be matings of A-G, B-J, E-K, etc. How will these be encoded in the data?
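
As a rough sketch of what dummy-variable encoding could look like (one possible reading, not necessarily the encoding intended above), base R's `model.matrix()` can turn a column of names into 0/1 indicator columns. The small `ped` data frame and the `mating` column are made up for illustration; the column names `G1T` (sire) and `G2B1` (maternal grandsire) follow the layout described in the first post.

```r
# Made-up rows; names are placeholders, not real horses
ped <- data.frame(
  Earnings = c(100000L, 300000L, 250000L, 150000L),
  G1T      = c("Sire 1", "Sire 2", "Sire 1", "Sire 3"),                 # sire
  G2B1     = c("Grandsire 4", "Grandsire 5", "Grandsire 4", "Grandsire 6"),  # maternal grandsire
  stringsAsFactors = FALSE
)

# One indicator column per distinct sire ("- 1" drops the intercept,
# so every sire gets its own column)
sire_dummies <- model.matrix(~ G1T - 1, data = ped)

# A mating is a combination of two ancestors; interaction() builds one
# factor level per observed combination (drop = TRUE omits unseen ones)
ped$mating <- interaction(ped$G1T, ped$G2B1, drop = TRUE)
mating_dummies <- model.matrix(~ mating - 1, data = ped)

print(colnames(sire_dummies))
print(colSums(mating_dummies))
```

Each row then has exactly one 1 among the sire columns and one 1 among the mating columns, which is the shape a regression of earnings on pedigree attributes would need.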

mara · May 24, 2021, 1:01pm

Oh, interesting. I've never heard the term nicks. I know that for eventing (which was my bailiwick in ye olden days, these days I'm just doing dressage) WBFSH does sire rankings by points from FEI competitions, but folks are also pretty obsessed with the “maternal-grandsire” phenomenon.

As @technocrat mentioned, dput() is more useful than head() because the former lets us copy and paste and play with the data, while the latter is for printing just to see. If you want to look at matings as opposed to individuals, you'll probably want to create a variable that includes the combination of dam and sire (or sire and damsire/maternal grandsire, whichever combination you're interested in) so you can use that as the identifier (which is easy enough through some sort of concatenation of the strings, if your data are all cleaned up).
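
A minimal sketch of that concatenation idea in base R, assuming a sire column `G1T` and a maternal grandsire column `G2B1` as in the layout from the first post. The `ped` data frame, the `cross` column name, and the " x " separator are made up for illustration.

```r
# Made-up rows; names are placeholders, not real horses
ped <- data.frame(
  NAME     = c("Horse 1", "Horse 2", "Horse 3"),
  Earnings = c(100000L, 300000L, 250000L),
  G1T      = c("Sire 1", "Sire 2", "Sire 1"),                  # sire
  G2B1     = c("Grandsire 4", "Grandsire 5", "Grandsire 4"),   # maternal grandsire
  stringsAsFactors = FALSE
)

# Concatenate sire and maternal grandsire into a single identifier
ped$cross <- paste(ped$G1T, ped$G2B1, sep = " x ")

# Count how many horses share each cross
cross_counts <- table(ped$cross)
print(cross_counts)
```

The new column can then be used for grouping or filtering like any other variable. This only works reliably if names are spelled consistently, which is the "cleaned up" caveat above.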

Pedigree_Nerd · May 26, 2021, 6:12am

Ok folks

Here's the reprex of a made-up dataset using dput():

dput(pedigrees)
#> structure(list(NAME = c("Horse 1", "Horse 2", "Horse 3", "Horse 4", 
#> "Horse 5", "Horse 6"), Earnings = c(100000L, 300000L, 250000L, 
#> 150000L, 400000L, 350000L), G1T = c("Sire 1", "Sire 2", "Sire 1", 
#> "Sire 3", "Sire 2", "Sire 4"), G1B = c("Dam 1", "Dam 2", "Dam 3", 
#> "Dam 4", "Dam 5", "Dam 6"), G2T1 = c("Grandsire 1", "Grandsire 2", 
#> "Grandsire 1", "Grandsire 3", "Grandsire 2", "Grandsire 3"), 
#>     G2T2 = c("Grandam 1", "Grandam 2", "Grandam 1", "Grandam 3", 
#>     "Grandam 2", "Grandam 4"), G2B1 = c("Grandsire 4", "Grandsire 5", 
#>     "Grandsire 4", "Grandsire 6", "Grandsire 7", "Grandsire 8"
#>     ), G2B2 = c("Grandam 5", "Grandam 6", "Grandam 7", "Grandam 8", 
#>     "Grandam 9", "Grandam 10"), G3T1 = c("Grt Grandsire 1", "Grt Grandsire 2", 
#>     "Grt Grandsire 1", "Grt Grandsire 3", "Grt Grandsire 2", 
#>     "Grt Grandsire 4"), G3T2 = c("Grt Grandam 1", "Grt Grandam 2", 
#>     "Grt Grandam 1", "Grt Grandam 3", "Grt Grandam 2", "Grt Grandam 4"
#>     ), G3T3 = c("Grt Grandsire 5", "Grt Grandsire 6", "Grt Grandsire 5", 
#>     "Grt Grandsire 7", "Grt Grandsire 6", "Grt Grandsire 8"), 
#>     G3T4 = c("Grt Grandam 5", "Grt Grandam 6", "Grt Grandam 5", 
#>     "Grt Grandam 7", "Grt Grandam 6", "Grt Grandam 8"), G3B1 = c("Grt Grandsire 9", 
#>     "Grt Grandsire 10", "Grt Grandsire 9", "Grt Grandsire 11", 
#>     "Grt Grandsire 12", "Grt Grandsire 12"), G3B2 = c("Grt Grandam 9", 
#>     "Grt Grandam 10", "Grt Grandam 9", "Grt Grandam 11", "Grt Grandam 12", 
#>     "Grt Grandam 12"), G3B3 = c("Grt Grandsire 13", "Grt Grandsire 14", 
#>     "Grt Grandsire 15", "Grt Grandsire 16", "Grt Grandsire 17", 
#>     "Grt Grandsire 18"), G3B4 = c("Grt Grandam 13", "Grt Grandam 14", 
#>     "Grt Grandam 15", "Grt Grandam 16", "Grt Grandam 17", "Grt Grandam 18"
#>     )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
#> ))

Created on 2021-05-26 by the reprex package (v2.0.0)

Hopefully this will make it easier for you to understand how the data is laid out.

What I'd like the program to do is identify the horses with duplicate ancestors in each generation, while being able to filter them by duplicates in the previous generation(s).

If you look at the made-up dataset you can see that some of those six horses are related. 'Horse 1' and 'Horse 3' are 3/4 siblings and 'Horse 2' and 'Horse 5' are 1/2 siblings.

What I'd like to achieve is a program that would process that dataset and its output would be something like:

Pattern G1T: 'Sire 1' found in 'Horse 1' and 'Horse 3' and produced $350,000 in earnings
Pattern G1T: 'Sire 2' found in 'Horse 2' and 'Horse 5' and produced $700,000 in earnings
Pattern G1T+G2B1: 'Sire 1' and 'Grandsire 4' found in 'Horse 1' and 'Horse 3' and produced $350,000 in earnings

The larger the dataset the larger the number of patterns and their complexity.

As I mentioned in my first post, I know I can do that using spreadsheets. I created functions that find duplicates filtered by duplicates in the previous generation. But I have to write a function for each possible combination, and that would take a very long time.
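
One way to avoid writing a separate function per combination is to let R enumerate the column combinations. The sketch below is only an illustration under assumptions: it uses three of the made-up columns (`G1T`, `G2T1`, `G2B1`), and the helper `summarise_pattern()` and the output column names are hypothetical.

```r
# Subset of the made-up data from the dput() output above
ped <- data.frame(
  NAME     = c("Horse 1", "Horse 2", "Horse 3", "Horse 4", "Horse 5", "Horse 6"),
  Earnings = c(100000L, 300000L, 250000L, 150000L, 400000L, 350000L),
  G1T      = c("Sire 1", "Sire 2", "Sire 1", "Sire 3", "Sire 2", "Sire 4"),
  G2T1     = c("Grandsire 1", "Grandsire 2", "Grandsire 1",
               "Grandsire 3", "Grandsire 2", "Grandsire 3"),
  G2B1     = c("Grandsire 4", "Grandsire 5", "Grandsire 4",
               "Grandsire 6", "Grandsire 7", "Grandsire 8"),
  stringsAsFactors = FALSE
)
ancestor_cols <- c("G1T", "G2T1", "G2B1")

# For one set of columns, report ancestor combinations shared by 2+ horses
summarise_pattern <- function(cols, data) {
  key <- do.call(paste, c(data[cols], sep = " + "))   # one key per horse
  groups <- split(seq_len(nrow(data)), key)           # row indices per key
  groups <- groups[lengths(groups) > 1]               # keep repeated keys only
  if (length(groups) == 0) return(NULL)
  data.frame(
    pattern   = paste(cols, collapse = "+"),
    ancestors = names(groups),
    horses    = vapply(groups, function(i) paste(data$NAME[i], collapse = ", "),
                       character(1)),
    earnings  = vapply(groups, function(i) sum(as.numeric(data$Earnings[i])),
                       numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Every combination of 1, 2, ... ancestor columns
combos <- unlist(
  lapply(seq_along(ancestor_cols),
         function(k) combn(ancestor_cols, k, simplify = FALSE)),
  recursive = FALSE
)

patterns <- do.call(rbind, lapply(combos, summarise_pattern, data = ped))
print(patterns)
```

On this made-up data the result includes 'Sire 1' in 'Horse 1' and 'Horse 3' with 350000 in combined earnings, and 'Sire 2' in 'Horse 2' and 'Horse 5' with 700000, in line with the hand-written example above. With all 62 ancestor columns of a 5-generation pedigree there are 2^62 - 1 possible column combinations, so in practice the combination size would need to be capped (for example, pairs and triples only).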

Thanks in advance for your input.

Cheers

technocrat · May 26, 2021, 6:24am

Close. Just cut-and-paste the result of dput. I've closed up the embedded blanks with _ to avoid unnecessary hassle (for presentation, column names are easily changed). I've also edited to make a direct data frame. Consider this a marker.

suppressPackageStartupMessages({
  library(dplyr)
})

dat <- data.frame(NAME = c("Horse_1", "Horse_2", "Horse_3", "Horse_4", 
"Horse_5", "Horse_6"), Earnings = c(100000L, 300000L, 250000L, 
150000L, 400000L, 350000L), G1T = c("Sire_1", "Sire_2", "Sire_1", 
"Sire_3", "Sire_2", "Sire_4"), G1B = c("Dam_1", "Dam_2", "Dam_3", 
"Dam_4", "Dam_5", "Dam_6"), G2T1 = c("Grandsire_1", "Grandsire_2", 
"Grandsire_1", "Grandsire_3", "Grandsire_2", "Grandsire_3"), 
    G2T2 = c("Grandam_1", "Grandam_2", "Grandam_1", "Grandam_3", 
    "Grandam_2", "Grandam_4"), G2B1 = c("Grandsire_4", "Grandsire_5", 
    "Grandsire_4", "Grandsire_6", "Grandsire_7", "Grandsire_8"
    ), G2B2 = c("Grandam_5", "Grandam_6", "Grandam_7", "Grandam_8", 
    "Grandam_9", "Grandam_10"), G3T1 = c("Grt_Grandsire_1", "Grt_Grandsire_2", 
    "Grt_Grandsire_1", "Grt_Grandsire_3", "Grt_Grandsire_2", 
    "Grt_Grandsire_4"), G3T2 = c("Grt_Grandam_1", "Grt_Grandam_2", 
    "Grt_Grandam_1", "Grt_Grandam_3", "Grt_Grandam_2", "Grt_Grandam_4"
    ), G3T3 = c("Grt_Grandsire_5", "Grt_Grandsire_6", "Grt_Grandsire_5", 
    "Grt_Grandsire_7", "Grt_Grandsire_6", "Grt_Grandsire_8"), 
    G3T4 = c("Grt_Grandam_5", "Grt_Grandam_6", "Grt_Grandam_5", 
    "Grt_Grandam_7", "Grt_Grandam_6", "Grt_Grandam_8"), G3B1 = c("Grt_Grandsire_9", 
    "Grt_Grandsire_10", "Grt_Grandsire_9", "Grt_Grandsire_11", 
    "Grt_Grandsire_12", "Grt_Grandsire_12"), G3B2 = c("Grt_Grandam_9", 
    "Grt_Grandam_10", "Grt_Grandam_9", "Grt_Grandam_11", "Grt_Grandam_12", 
    "Grt_Grandam_12"), G3B3 = c("Grt_Grandsire_13", "Grt_Grandsire_14", 
    "Grt_Grandsire_15", "Grt_Grandsire_16", "Grt_Grandsire_17", 
    "Grt_Grandsire_18"), G3B4 = c("Grt_Grandam_13", "Grt_Grandam_14", 
    "Grt_Grandam_15", "Grt_Grandam_16", "Grt_Grandam_17", "Grt_Grandam_18"
    ))

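# Distinct sires in the first generation
sires <- unique(dat$G1T)

# Rows of dat whose sire is the x-th distinct sire
pick_sires <- function(x) dat[dat$G1T == sires[x], ]

# Combined earnings of the first sire's offspring
sum(pick_sires(1)$Earnings)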
#> [1] 350000
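
The same total can be computed for every sire at once rather than one index at a time. A small sketch using base R's `aggregate()`, with only the three columns it needs; the object name `sire_totals` is made up for illustration.

```r
# The NAME, Earnings and G1T columns of the made-up data above
dat <- data.frame(
  NAME     = c("Horse_1", "Horse_2", "Horse_3", "Horse_4", "Horse_5", "Horse_6"),
  Earnings = c(100000L, 300000L, 250000L, 150000L, 400000L, 350000L),
  G1T      = c("Sire_1", "Sire_2", "Sire_1", "Sire_3", "Sire_2", "Sire_4"),
  stringsAsFactors = FALSE
)

# Sum earnings within each sire
sire_totals <- aggregate(Earnings ~ G1T, data = dat, FUN = sum)

# Highest-earning sires first
sire_totals <- sire_totals[order(-sire_totals$Earnings), ]
print(sire_totals)
```

Replacing `G1T` in the formula with another column, or with several columns such as `Earnings ~ G1T + G2B1` on the full data, groups by that ancestor or combination instead.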
