Scientists at research and development organizations like Syngenta are working to accelerate innovation in plant science. Their goal is to deliver consistent and reliable yield to farmers despite ever-changing environments due to variable weather conditions. Plant breeders work to maximize the amount of food we gain from crops by breeding plants with the most resilient, highest-yielding genetics, and then provide the seeds from those efforts to farmers around the world.
We've proven that data-driven strategies can help our industry breed better seeds that require fewer resources and are adaptable to more diverse and variable environments. Developing models and analytical approaches that identify patterns and insights in our experimental data can help scientists more accurately choose seeds that increase the productivity of the crops we plant – and will help address the growing global food demand.
PROBLEM SETTING
Commercial corn is processed into multiple food and industrial products, and it is widely recognized as one of the world’s most important crops. Each year, plant breeders create new corn products, known as experimental hybrids, by crossing two “parents” together. The parents are known as inbreds, and developing them takes up the bulk of a corn breeding program. Most of that effort is spent evaluating each inbred by crossing it with another inbred, called a “tester.”
It is a plant breeder’s job to identify the best parent combinations by creating experimental hybrids and “testing” them in multiple environments to find the hybrids that perform best. Historically, this has been done by trial and error: breeders grow their experimental hybrids in a diverse set of locations, measure their performance, and select the highest-yielding hybrids. Selecting the right parent combinations and testing the resulting hybrids can take many years, and the process is inefficient simply because of the number of potential parent combinations to create and test.
RESEARCH QUESTION
Given historical hybrid (inbred by tester) performance data across years and locations, how can we create a model to predict/impute the performance of the crossing of any two inbred and tester parents?
For example, given 5,000 inbreds (parents), the number of potential crosses is 12,497,500, far more than can be created or tested. Because testing resources are limited, breeders can select only a small subset of all possible inbred combinations, which can lead to lost opportunities.
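The figure above is the number of unordered pairs of distinct parents, 5,000 × 4,999 / 2. As a quick check, the short sketch below reproduces it; the function name is our own and is not part of the challenge materials.

```python
import math


def count_potential_crosses(n_inbreds: int) -> int:
    """Number of unordered pairs of distinct parents: n * (n - 1) / 2."""
    return math.comb(n_inbreds, 2)


n_crosses = count_potential_crosses(5000)
print(f"{n_crosses:,}")  # 12,497,500
```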
This issue is the basis for the 2020 Syngenta Crop Challenge in Analytics. Can an accurate model be constructed to predict the performance of crossing any two inbreds? Such a model would allow breeders to focus on the best possible combinations.
In simpler terms, can we use hybrid data collected from crossing inbreds and testers together to predict the result of cross combinations that have not yet been created and tested? Namely, are we able to construct a recommender system to propose new parent combinations based on the hybrid performance from other parent combinations and attributes they have in common?
Table 1 illustrates the challenge. Each “X” is the set of observed performance data points for hybrids from the corresponding inbred by tester combination. With the information in the table, how can a model be built to predict/impute the mean yield of each missing combination (“?”)?
Table 1. Research question illustration.
Inbred | Tester 1 | Tester 2 | Tester 3 |
Inbred 1 | X | X | X |
Inbred 2 | X | ? | ? |
Inbred 3 | ? | X | X |
Inbred 4 | ? | ? | X |
Inbred 5 | X | X | ? |
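To make the imputation task in Table 1 concrete, here is a minimal sketch of one possible starting point: an additive baseline that predicts a missing cell as the overall mean plus an inbred effect plus a tester effect, each estimated from simple averages of the observed cells. This is only an illustration, not the challenge's method. The yield values are made up and merely follow the X / ? pattern of Table 1, and the function names are our own.

```python
from collections import defaultdict
from statistics import mean

# Made-up yields placed where Table 1 shows an "X" (illustrative only).
observed = {
    ("Inbred 1", "Tester 1"): 110.0,
    ("Inbred 1", "Tester 2"): 100.0,
    ("Inbred 1", "Tester 3"): 105.0,
    ("Inbred 2", "Tester 1"): 120.0,
    ("Inbred 3", "Tester 2"): 90.0,
    ("Inbred 3", "Tester 3"): 95.0,
    ("Inbred 4", "Tester 3"): 100.0,
    ("Inbred 5", "Tester 1"): 115.0,
    ("Inbred 5", "Tester 2"): 105.0,
}


def fit_additive_baseline(cells):
    """Estimate the overall mean and per-parent deviations from it."""
    overall = mean(cells.values())
    by_inbred, by_tester = defaultdict(list), defaultdict(list)
    for (inbred, tester), y in cells.items():
        by_inbred[inbred].append(y)
        by_tester[tester].append(y)
    inbred_effect = {k: mean(v) - overall for k, v in by_inbred.items()}
    tester_effect = {k: mean(v) - overall for k, v in by_tester.items()}
    return overall, inbred_effect, tester_effect


def predict(model, inbred, tester):
    """Overall mean + inbred effect + tester effect (0 for unseen parents)."""
    overall, inbred_effect, tester_effect = model
    return overall + inbred_effect.get(inbred, 0.0) + tester_effect.get(tester, 0.0)


model = fit_additive_baseline(observed)

# The "?" cells are every inbred x tester pair that was not observed.
inbreds = sorted({i for i, _ in observed})
testers = sorted({t for _, t in observed})
missing = [(i, t) for i in inbreds for t in testers if (i, t) not in observed]

for inbred, tester in missing:
    print(inbred, tester, round(predict(model, inbred, tester), 2))
```

Because the observed cells are unbalanced, these simple averages are a rough estimate rather than a least-squares fit; a matrix-factorization or mixed-model approach would be a natural next step.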
OBJECTIVE
The objective is to estimate yield performance of the cross between inbred and tester combinations in a given holdout set. Specifically, we are asking for the mean yield performance of each inbred by tester combination in the holdout set.
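Since the target is a mean per inbred by tester combination, the training records (one per year and location) have to be aggregated at some point. The sketch below shows one way to do that with the standard library, assuming the training data is a CSV file whose columns match the dataset key. The sample rows, ID formats, and function name are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; real IDs and values will differ.
sample_csv = """YEAR,LOCATION,INBRED,INBRED_CLUSTER,TESTER,TESTER_CLUSTER,YIELD
2016,Loc_1,Inbred_1,Cluster_1,Tester_1,Cluster_1,100.0
2016,Loc_2,Inbred_1,Cluster_1,Tester_1,Cluster_1,110.0
2017,Loc_1,Inbred_1,Cluster_1,Tester_2,Cluster_2,90.0
"""


def mean_yield_by_cross(rows):
    """Average YIELD over all years and locations for each (INBRED, TESTER)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of yields, count]
    for row in rows:
        key = (row["INBRED"], row["TESTER"])
        totals[key][0] += float(row["YIELD"])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


cross_means = mean_yield_by_cross(csv.DictReader(io.StringIO(sample_csv)))
for (inbred, tester), avg in cross_means.items():
    print(inbred, tester, avg)
```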
Notes
DOWNLOAD SUBMISSION TEMPLATE
Submissions must be in MS-Word or LaTeX format using the appropriate submission template. You can download the submission template here (.zip).
Additionally, observing the standards for academic publication, entries should include a written report with the following:
The entries will be evaluated based on:
You are provided with the datasets described below.
Key for Datasets:
This table provides the meaning of each variable in the two datasets.
Dataset | Variable | Description |
Training Dataset | YEAR | Year grown |
Training Dataset | LOCATION | ID for each location |
Training Dataset | INBRED | ID for Inbred |
Training Dataset | INBRED_CLUSTER | Cluster association for each inbred, which denotes genetic grouping |
Training Dataset | TESTER | ID for Tester |
Training Dataset | TESTER_CLUSTER | Cluster association for each tester, which denotes genetic grouping |
Training Dataset | YIELD | The performance of the Inbred and Tester combination |
Testing Dataset | INBRED | ID for Inbred |
Testing Dataset | INBRED_CLUSTER | Cluster association for each inbred, which denotes genetic grouping |
Testing Dataset | TESTER | ID for Tester |
Testing Dataset | TESTER_CLUSTER | Cluster association for each tester, which denotes genetic grouping |
Testing Dataset | YIELD | The performance of the Inbred and Tester combination (to be predicted) |
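The cluster columns give every test row some information even when its inbred or tester never appears in the training data. As a hedged illustration of how they might be used, the sketch below fills the test YIELD with the mean training yield of the matching (INBRED_CLUSTER, TESTER_CLUSTER) pair and falls back to the overall training mean when that pair is unseen. The rows, cluster labels, and function names are made up for the example; this is a simple baseline, not a recommended final model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical training rows, reduced to the columns this sketch needs.
train_rows = [
    {"INBRED_CLUSTER": "A", "TESTER_CLUSTER": "X", "YIELD": 100.0},
    {"INBRED_CLUSTER": "A", "TESTER_CLUSTER": "X", "YIELD": 110.0},
    {"INBRED_CLUSTER": "B", "TESTER_CLUSTER": "X", "YIELD": 90.0},
]

# Hypothetical test rows: YIELD is what has to be predicted.
test_rows = [
    {"INBRED": "Inbred_9", "INBRED_CLUSTER": "A", "TESTER": "Tester_1", "TESTER_CLUSTER": "X"},
    {"INBRED": "Inbred_7", "INBRED_CLUSTER": "C", "TESTER": "Tester_4", "TESTER_CLUSTER": "Y"},
]


def fit_cluster_means(rows):
    """Mean training yield per (INBRED_CLUSTER, TESTER_CLUSTER), plus overall mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["INBRED_CLUSTER"], row["TESTER_CLUSTER"])].append(row["YIELD"])
    overall = mean(row["YIELD"] for row in rows)
    return {key: mean(values) for key, values in groups.items()}, overall


def fill_yield(rows, cluster_means, overall):
    """Return copies of the test rows with a predicted YIELD added."""
    filled = []
    for row in rows:
        key = (row["INBRED_CLUSTER"], row["TESTER_CLUSTER"])
        # Fall back to the overall mean for cluster pairs never seen in training.
        filled.append(dict(row, YIELD=cluster_means.get(key, overall)))
    return filled


cluster_means, overall = fit_cluster_means(train_rows)
predictions = fill_yield(test_rows, cluster_means, overall)
for row in predictions:
    print(row["INBRED"], row["TESTER"], row["YIELD"])
```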