# MIS 660 Descriptive and Predictive Analytics GCU

### MIS 660 Full Course Discussions GCU

#### MIS 660 Topic 1 DQ 1

Suppose you wanted to estimate the average household income of all Grand Canyon University (GCU) students. To expedite the process, you only gather household income data from all your friends who major in business at GCU. You then calculate the average income among your friends and report that it represents the average income of all GCU students. Is this a good approach? If not, how would you gather data to derive a better estimate? Explain your answer.

#### MIS 660 Topic 1 DQ 2

Income data typically have some outliers. For example, Tim Cook, CEO of Apple, Inc., had total compensation of about \$400 million in 2011. Suppose you had a data set of incomes in 2011 for all GCU faculty and Tim Cook. Which measure of central tendency would you use when reporting on the incomes in your data set if you do not want outliers to have much effect? Explain your answer.
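The effect of a single extreme income on each measure can be seen in a minimal sketch. The faculty incomes below are made-up values for illustration only; the outlier uses the roughly 400 million figure quoted in the question.

```python
from statistics import mean, median

# Hypothetical faculty incomes in dollars (made-up values, not real data).
faculty_incomes = [62_000, 68_000, 71_000, 75_000, 80_000, 85_000, 90_000]

# Add one extreme outlier, using the approximate figure from the question.
incomes_with_outlier = faculty_incomes + [400_000_000]

mean_without = mean(faculty_incomes)
median_without = median(faculty_incomes)
mean_with = mean(incomes_with_outlier)
median_with = median(incomes_with_outlier)

print(f"Mean:   {mean_without:,.0f} -> {mean_with:,.0f}")
print(f"Median: {median_without:,.0f} -> {median_with:,.0f}")
```

In this sketch the mean jumps by several orders of magnitude while the median moves only slightly, which is the behavior the question asks you to reason about.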

#### MIS 660 Topic 2 DQ 1

Suppose you had daily temperature data indicating the “high” point of each day for 2015. If you want to show how the high differs over time, what are some of the plot types that will allow you to do this? What are some benefits to binning the data into one of 52 weeks and plotting the average high for each week? Would it make sense to do something similar for the four quarters in the year? Why or why not?
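The weekly binning described above can be sketched as follows. The daily highs come from a made-up seasonal curve rather than real 2015 observations, and assigning the 365th day to week 52 is an assumption made so that exactly 52 bins result.

```python
import math
from datetime import date, timedelta
from statistics import mean

def synthetic_daily_highs(year=2015):
    """Return (date, high) pairs from a made-up seasonal curve; not real data."""
    start = date(year, 1, 1)
    days = (date(year + 1, 1, 1) - start).days
    return [
        (start + timedelta(days=i),
         70 + 25 * math.sin(2 * math.pi * (i - 105) / 365))
        for i in range(days)
    ]

def weekly_average_highs(daily):
    """Average the highs in 52 bins of 7 days; the leftover day joins week 52."""
    bins = {}
    for i, (_, high) in enumerate(daily):
        week = min(i // 7, 51) + 1  # weeks numbered 1..52
        bins.setdefault(week, []).append(high)
    return {week: mean(values) for week, values in sorted(bins.items())}

daily = synthetic_daily_highs()
weekly = weekly_average_highs(daily)
print(f"{len(daily)} daily points reduced to {len(weekly)} weekly averages")
```

The same function with a divisor of roughly 91 days would give quarterly bins, at the cost of hiding most of the within-season variation.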

#### MIS 660 Topic 2 DQ 2

Many times, data are missing because of various reasons. This poses some challenges when doing data analysis. For example, suppose you wanted to do some analysis of the yearly incomes of the faculty at GCU. When asked for their incomes, 25% of the faculty did not participate in the survey; therefore, their incomes are missing from the dataset. How would you summarize the income data in this case? Is it appropriate to ignore the missing incomes and summarize the data without them? Should you estimate the missing incomes, perhaps with the overall average, to complete the data set?
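One consequence of filling gaps with the overall average can be illustrated with a short sketch. The incomes are hypothetical, with 2 of 8 responses (25%) missing; `None` marks a non-response.

```python
from statistics import mean, stdev

# Hypothetical survey responses in dollars; None marks a missing income.
responses = [58_000, None, 72_000, 65_000, None, 90_000, 81_000, 77_000]

# Option 1: summarize only the observed values.
observed = [x for x in responses if x is not None]
observed_mean = mean(observed)

# Option 2: mean imputation, replacing each missing value with the observed mean.
imputed = [observed_mean if x is None else x for x in responses]

print(f"Observed: n={len(observed)}, mean={observed_mean:,.0f}, sd={stdev(observed):,.0f}")
print(f"Imputed:  n={len(imputed)}, mean={mean(imputed):,.0f}, sd={stdev(imputed):,.0f}")
```

Mean imputation leaves the mean unchanged but shrinks the standard deviation, so the completed data set looks less variable than the observed one. Neither option corrects for bias if the non-respondents differ systematically from the respondents.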

#### MIS 660 Topic 3 DQ 1

Data summarization is usually not enough when performing analysis. Most of the time, adding context by telling a story about the data is necessary to describe the analysis to others, especially those who are not data-savvy. What are some general guidelines to follow to tell a good data story? What story elements or structure should be used to organize the presentation?

#### MIS 660 Topic 3 DQ 2

Consider your organization, or an organization you are most familiar with. Explain the general process of data aggregation for a typical metric (e.g., sales revenue, cost per unit, etc.) used in the organization. What specific charts are commonly used to visually depict the data? What might be some areas for improvement regarding how the data is visually presented?

#### MIS 660 Topic 4 DQ 1

What are some of the limitations of using Excel for pivot tables/charts? Why does that make software like Tableau more appealing in the workplace?

#### MIS 660 Topic 4 DQ 2

Plotting summarized data will almost always help to convey results more easily. However, there are situations where plotting the summarized data instead of creating a simple table makes data interpretation more difficult. Provide two examples of poor charts/graphs and explain why they are difficult to interpret.

#### MIS 660 Topic 5 DQ 1

Data is useless without the skills to manipulate, summarize, and analyze it. In fact, even after data is summarized into a reporting format such as graphs and tables, it still requires someone to add context and describe the results to fully explain the data. This can be difficult, especially if data is being presented to nontechnical individuals. Describe two techniques that can be used to better describe analysis results to nontechnical individuals.

#### MIS 660 Topic 5 DQ 2

When most people think about data reporting or visualization, they think about a nicely crafted graph that will not be interactive with a user. Some new tools, such as Tableau, can create visualizations that can interact with a user with informative pop-up information, more drill-down information, and the ability to export filtered results. Describe two benefits to having a user interact with a standard report. Are there any drawbacks if the user modifies the report?

#### MIS 660 Topic 6 DQ 1

Summarize key data distribution concepts including probability mass functions (PMF), probability density functions (PDF), and cumulative distribution functions (CDF). Based on your organization or any organization you are most familiar with, provide an example of a PMF, an example of a PDF, and an example of a CDF, based on the type of data used in the organization.  How would you summarize each of these to someone who is not familiar with each of these functions?

#### MIS 660 Topic 6 DQ 2

Suppose you had a six-sided die where each number (1, 2, 3, 4, 5, and 6) has the same probability of showing up (1/6). If the die is rolled an infinite number of times and the number recorded, what will be the average value that shows up? Is the average value one of the actual possibilities (1, 2, 3, 4, 5, or 6)? Why or why not?
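The long-run average of a fair die is its expected value, the probability-weighted sum of the faces. A minimal sketch using exact fractions:

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
probability = Fraction(1, 6)  # each face is equally likely

# Expected value: sum of each outcome times its probability.
expected_value = sum(face * probability for face in faces)

print(f"Expected value: {expected_value} = {float(expected_value)}")
print(f"Is it a possible roll? {expected_value in faces}")
```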

#### MIS 660 Topic 7 DQ 1

Suppose you wanted to understand the relationship between a customer’s yearly income (X) and the number of movies (Y) the customer watched in a year. You then gather data on incomes and the number of movies watched in a year. The range of incomes in your data set is \$5K to \$150K. After fitting a simple linear model and performing all the appropriate diagnostics, the model showed that, on average, for every \$10K in income, the customer watched 1.5 movies in the year. So, for example, if a customer earned \$60K in a year, he or she would be expected to watch nine movies during the year. Now you want to apply this model to your very wealthy friend who will earn \$1 million in the next year. Is this an appropriate application of your model? Why or why not? Provide specific examples to justify your opinion.
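The arithmetic behind the question can be sketched directly. The slope and income range come from the question; a zero intercept is an assumption inferred from the \$60K example, and the function names are illustrative.

```python
MOVIES_PER_10K = 1.5                # slope stated in the question
OBSERVED_RANGE = (5_000, 150_000)   # income range of the data used to fit the model

def predict_movies(income):
    """Predicted movies per year, assuming a zero intercept."""
    return MOVIES_PER_10K * income / 10_000

def is_extrapolation(income):
    """True when the income lies outside the range the model was fitted on."""
    low, high = OBSERVED_RANGE
    return not (low <= income <= high)

for income in (60_000, 1_000_000):
    print(income, predict_movies(income), is_extrapolation(income))
```

For the \$1 million income the model returns 150 movies per year, a prediction for an input far outside anything the data covered.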

#### MIS 660 Topic 7 DQ 2

If you regress daily high temperature (Y) on the amount of ice cream sales (X), you will notice that there is a strong positive correlation between the two. In other words, as daily ice cream sales increase, the daily high temperature increases. This implies that if we knew the amount of ice cream sales on a particular day, we could estimate, with a high level of accuracy, the high temperature on that day. Does this mean that if we wanted to increase the daily temperature, we would need to sell more ice cream? Explain why or why not.

#### MIS 660 Topic 8 DQ 1

Suppose you were asked to investigate which predictors explain the number of minutes that 10- to 18-year-old students spend on Twitter. To do so, you build a linear regression model with Twitter usage (Y) measured as the number of minutes per week. The four predictors you include in the model are Height, Weight, Grade Level, and Age of each student. You build four simple linear regression models with Y regressed separately on each predictor, and each predictor is statistically significant. Then you build a multiple linear regression model with Y regressed on all four predictors, but only one predictor, Age, is statistically significant, and the others are not. What is likely going on among the four predictors? If you include more than one of these predictors in the model, what are some problems that can result?

#### MIS 660 Topic 8 DQ 2

After building a regression model and performing residual diagnostics, you notice that the errors show severe departures from normality and appear to have nonconstant variance. What steps would you take in this case to resolve the errors? If the problems are not corrected after all steps are taken, what does that imply about the modeling approach you are taking? Explain in detail.

### MIS 660 Full Course Assignments GCU

#### MIS 660 Topic 1 Aggregating Data

The purpose of this assignment is to use a spreadsheet to create a visual representation of a data set.

For this assignment, you will use the “Heights” dataset. In the dataset, the heights (in mm) of n = 199 married couples are recorded. The data comes from a random sample from the much larger population of married couples. Complete each of the steps below to create a visual representation of the dataset.

Part 1:

Using Excel functions, calculate the following summary values for each of the three variables:

1. Minimum
2. First quartile
3. Second quartile (Median)
4. Third quartile
5. Maximum
6. Mean
7. Range
8. Sample standard deviation
9. Sample variance
10. Coefficient of variation
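For readers who want to cross-check spreadsheet results, the ten summary values listed above can be sketched with Python's standard library. The heights are hypothetical, and the `inclusive` quantile method is chosen on the assumption that it should correspond to Excel's QUARTILE.INC interpolation.

```python
from statistics import mean, quantiles, stdev, variance

def summarize(values):
    """Return the ten summary values for one variable."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "minimum": min(values),
        "first_quartile": q1,
        "median": q2,
        "third_quartile": q3,
        "maximum": max(values),
        "mean": avg,
        "range": max(values) - min(values),
        "sample_sd": sd,
        "sample_variance": variance(values),
        "cv_percent": 100 * sd / avg,  # coefficient of variation as a percent
    }

# Hypothetical heights in mm (not the "Heights" dataset).
heights_mm = [1590, 1620, 1650, 1680, 1700, 1710, 1750, 1800]
summary = summarize(heights_mm)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```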

Part 2:

Address each of the following questions in a written Word document.

1. On average, are husbands or wives taller? What is the average difference in millimeters between the two genders? Explain your answer.
2. How would you interpret the median heights?
3. Compare the means and the medians for each dataset. What initial conclusions can be made here regarding the “contour” of each dataset?
4. Compare the standard deviation values. Which dataset (husbands or wives) has the most dispersion? What does your conclusion suggest?
5. Given the answers in question 1, compare the variability of heights between husbands and wives. Which partner type is more likely to have extremely tall individuals (outliers)?
6. Interpret the % coefficient of variation.

Part 3:

Your manager has requested some additional information from you regarding the data. Specifically, you have been asked to calculate the differences between “Male Heights” and “Female Heights.” Your manager is only interested in married couples in which the husbands are taller than their wives. Repeat the analyses requested in Part 1 for this new dataset. What conclusions can be drawn here? Include discussion about whether outliers exist in this dataset.

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 2 Data Manipulation

The purpose of this assignment is to use spreadsheet capabilities to perform data manipulation and to explain the process used in the handling of the data.

For this assignment, you will use the “Claims” dataset. In the dataset, the claims data for n = 608 people are recorded. The data derive from a random sample of females diagnosed with ischemic heart disease over 24 months (see Exercise 7.27 in the textbook).

Instead of using urgent care centers, some people rely on the Emergency Room (ER) to address most, if not all, of their medical needs. In fact, someone who has three or more ER visits within 24 months is considered a high ER user. Complete the steps below to execute this assignment.

1. Using the dataset and Excel, create a new column titled “High_ER_User” with “Yes” if three or more ER visits; otherwise “No.”
2. Duration is measured in days, but 30-day intervals are more appropriate for most reporting purposes. Using Excel, create a new column titled “Duration_Months” by converting the duration into 30-day intervals.
3. Many times complications and comorbidities are rare; therefore, these two negative events are summed together. Using Excel, create a new column titled “Comps_Comorbs” by adding complications with comorbidities.
4. Many times age is grouped in 10-year intervals. Using Excel’s VLOOKUP function, create a new column titled “Age_Group” with grouped ages of “21-30 yrs,” “31-40 yrs,” and so on for 10-year intervals. The last age group would be “61-70 yrs.” Use a tab titled “Age_Groups” for this task.
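The four derived columns above can be sketched outside Excel as follows. The input field names (`er_visits`, `duration_days`, `complications`, `comorbidities`, `age`) are hypothetical, since the layout of the “Claims” dataset is not shown here. The age lookup mimics an approximate-match VLOOKUP by finding the largest lower bound that does not exceed the age.

```python
from bisect import bisect_right

# Lookup table playing the role of the "Age_Groups" tab: (lower bound, label).
AGE_GROUPS = [
    (21, "21-30 yrs"), (31, "31-40 yrs"), (41, "41-50 yrs"),
    (51, "51-60 yrs"), (61, "61-70 yrs"),
]
LOWER_BOUNDS = [low for low, _ in AGE_GROUPS]

def age_group(age):
    """Return the 10-year group label for an age between 21 and 70."""
    idx = bisect_right(LOWER_BOUNDS, age) - 1
    if idx < 0 or age > 70:
        raise ValueError(f"age {age} is outside the 21-70 lookup range")
    return AGE_GROUPS[idx][1]

def add_derived_columns(row):
    """Return a copy of one record with the four new columns added."""
    out = dict(row)
    out["High_ER_User"] = "Yes" if row["er_visits"] >= 3 else "No"
    out["Duration_Months"] = row["duration_days"] / 30  # 30-day intervals
    out["Comps_Comorbs"] = row["complications"] + row["comorbidities"]
    out["Age_Group"] = age_group(row["age"])
    return out

# Hypothetical record for illustration.
example = add_derived_columns(
    {"er_visits": 4, "duration_days": 90, "complications": 1,
     "comorbidities": 2, "age": 30}
)
print(example)
```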

Next you will create a pivot table with the data and execute the following (refer to the examples in the resource “Data Manipulation Screenshots”).

1. Use “High_ER_User” as a filter to obtain two filtered views of the pivot table.
2. Summarize the data to get counts of claims, sum of claims and months, and average of procedures, prescribed drugs, ER visits, and complications/comorbidities.
3. Add a calculated field titled “Claims PM” to the pivot table. This calculated field is the sum of claims divided by the sum of duration months and measures the average claim amount per month (PM).
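Because “Claims PM” is defined as a sum divided by a sum, it is not the same as averaging each person's monthly claim amount. A small sketch with two hypothetical records (the field names are illustrative) shows the difference:

```python
from statistics import mean

def claims_per_month(rows):
    """Sum of claims divided by sum of duration months, as the field is defined."""
    total_claims = sum(r["claims"] for r in rows)
    total_months = sum(r["duration_months"] for r in rows)
    return total_claims / total_months

# Hypothetical records: one short, expensive stay and one long, cheap one.
rows = [
    {"claims": 1200.0, "duration_months": 2.0},
    {"claims": 300.0, "duration_months": 10.0},
]

ratio_of_sums = claims_per_month(rows)  # 1500 / 12
mean_of_ratios = mean(r["claims"] / r["duration_months"] for r in rows)  # (600 + 30) / 2

print(ratio_of_sums, mean_of_ratios)
```

The ratio of sums weights each person by the number of months observed, which is usually the intended reading of a per-month cost.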

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 3 Visual Representation of Data

The purpose of this assignment is to use pivot tables and pivot charts to aggregate data and to explain the process used for data aggregation.

For this assignment, you will use the “Claims 2” dataset. Use Excel pivot tables and pivot charts for this exercise.

Part 1:

Create a dashboard describing the data by age group (e.g., 21-30 yrs, 31-40 yrs, 41-50 yrs, 51-60 yrs, and 61-70 yrs). The dashboard should include the graphs and charts listed in the locations described. The dashboard should be a separate tab in Excel that only includes the five items below. A sample layout is provided below the dashboard description.

1. Top Left: Bar graph showing the average number of ER visits for each of the five age groups. Show the actual average values above each bar.
2. Middle Left: Bar graph showing the average number of procedures for each of the five age groups. Show the actual average values above each bar.
3. Bottom Left: Bar graph showing the average claim cost for each of the five age groups. Show the actual average values above each bar.
4. Top Right: Pie chart showing the percent of the total sum of all claim costs for each of the five age groups. Show the actual percent values of each slice.
5. Bottom Right: Line graph showing the percent of each age group that has one or more ER visits. Show the actual percent values of each group. To create this chart, first create a new calculated column, named “Has ER Visit,” that is equal to 1 when the patient has had one or more ER visits; otherwise 0. HINT: The average of a 0-1 column is a percent. Refer to the example in the resource “Visual Representation of Data Screenshot: Preview of the Excel Dashboard.”
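The hint in item 5 can be verified with a minimal sketch; the visit counts are hypothetical.

```python
from statistics import mean

# Hypothetical ER visit counts for eight patients in one age group.
er_visits = [0, 2, 1, 0, 5, 0, 3, 0]

# "Has ER Visit": 1 when the patient has one or more ER visits, otherwise 0.
has_er_visit = [1 if visits >= 1 else 0 for visits in er_visits]

# The average of a 0-1 column is the share of rows equal to 1.
share_with_visit = mean(has_er_visit)
print(f"{share_with_visit:.0%} of patients had at least one ER visit")
```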

Part 2:

Interpret the dashboard and the story it is attempting to tell users by writing a 250-word summary that clearly describes the merits of each of the charts used on the dashboard. For example, discuss why a pie chart might be more appropriate than a bar graph for highlighting the information you want key stakeholders to obtain by studying that content on the dashboard. Include specific discussion about why each specific chart is used to illustrate the data it represents.

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 4 Data Visualization With Tableau

The purpose of this assignment is to use data visualization tools to aggregate and depict data and to interpret the data visualization results.

For this assignment, you will use the “Claims 2” dataset. You will use Tableau to replicate the dashboard you created in the Topic 3 assignment. In addition, you will compare and contrast the Excel and Tableau software.

Part 1:

Create a dashboard describing the data by age group (e.g., 21-30 yrs, 31-40 yrs, 41-50 yrs, 51-60 yrs, and 61-70 yrs). The dashboard should include the graphs and charts listed in the locations described. The dashboard should be submitted as a Tableau file. A sample layout is provided in the resource, “Data Visualization With Tableau Screenshot: Preview of the Tableau Dashboard.”

1. Top Left: Bar graph showing the average number of ER visits for each of the five age groups. Show the actual average values above each bar.
2. Middle Left: Bar graph showing the average number of procedures for each of the five age groups. Show the actual average values above each bar.
3. Bottom Left: Bar graph showing the average claim cost for each of the five age groups. Show the actual average values above each bar.
4. Top Right: Pie chart showing the percent of the total sum of all claim costs for each of the five age groups. Show the actual percent values of each slice.
5. Bottom Right: Line graph showing the percent of each age group that has one or more ER visits. Show the actual percent values of each group. To create this chart, first create a new calculated column, named “Has ER Visit,” that is equal to 1 when the patient has had one or more ER visits; otherwise 0. HINT: The average of a 0-1 column is a percent.

Part 2:

In 250 words, compare and contrast the use of Excel and Tableau in data visualization. Include specific discussion about the following in your summary.

1. Software ease of use.
2. Software visualization capabilities.
3. Software limitations.
4. Discussion of when each of these software programs is most appropriate for use.

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 5 Benchmark – Telling the Analytics Story

The purpose of this assignment is to create a data story and communicate findings to key stakeholders.

Part 1:

For this assignment, you will use the “Arizona Incomes by Zip Code” dataset. You will use Tableau to create a data story that illustrates the median household incomes of Arizona residents. The data provided includes all zip codes in Arizona. Each record is a unique zip code. The data includes the following columns:

1. Zip_Code: An Arizona zip code.
2. Metro_Area: Whether or not the zip code is within the Phoenix-metro area.
3. City: The city name of the zip code.
4. Median_Income: The median household income of each zip code based on 5-year estimates (2010-2014) from the U.S. Census Bureau.

The marketing manager has asked you to analyze income data for Arizona residents so that leaders in his department can determine the company’s advertising strategy. The marketing manager intends to share this data with other decision makers and the marketing department staff so that everyone has a thorough understanding of how the income information can be used to determine specific target markets in upcoming advertising campaigns. Because most of these individuals do not have a strong understanding of analytics, they must be able to gain the information listed below from studying the charts. In the data story, use visualizations, heat maps, boxplots, etc. to describe the following:

1. How do incomes differ across zip codes within the Phoenix-metro area (using geo-mapping)?
2. What is the relative difference in incomes across zip codes within the Phoenix-metro area (use a heat map)?
3. What is the distribution of incomes across Arizona?
4. How do incomes within the Phoenix metro area differ from those outside of that area?

Part 2:

Demonstrate the ability to communicate the analytics story to key stakeholders, including the marketing manager, by creating a 6-10 slide PowerPoint presentation (with speaker notes for each slide). Use the charts generated in Tableau to illustrate the data story as it relates to the income of Arizona residents. The slides and speaker notes should address the following for each chart presented.

1. What information is the chart providing to stakeholders?
2. Why is the information in the chart important to key stakeholders?
3. How can this information be used in making decisions about how marketing dollars can be allocated?

Refer to the resource, “Creating Effective PowerPoint Presentations,” located in the Student Success Center, for additional guidance on completing this assignment in the appropriate style.

While APA format is not required for the body of this assignment, solid academic writing is expected, and documentation of sources should be presented using APA formatting guidelines, which can be found in the APA Style Guide, located in the Student Success Center.

This assignment uses a rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

Benchmark Information

This benchmark assignment assesses the following programmatic competencies:

1.4: Utilize data visualization techniques to communicate findings.

#### MIS 660 Topic 6 Data Distributions

The purpose of this assignment is to apply data distributions to discrete and continuous data and justify the selection of the distributions.

For this assignment, you will use the “Random Variables” dataset. You will use SPSS to analyze the dataset and address the questions presented. Findings should be presented in a Word document along with the SPSS outputs.

Part 1:

Identify if the following random variables are discrete or continuous.

1. Number of defective items in a shipment.
2. Height of males (in mm) who attend Grand Canyon University.
3. Yearly income among all people in the United States.
4. Whether or not a high school graduate is accepted into a college.
5. Time that it takes for a person to run a mile.
6. The number of emergency hospital visits that each person had in the last 12 months.

Part 2:

Let X be a random variable of the outcome after rolling a six-sided die one time that is not fair. In fact, the die is designed to never result in a 1 or 6, while the other outcomes (i.e., 2, 3, 4, and 5) are equally probable.

1. What are the individual probabilities for all possible values of X?
2. What are the cumulative probabilities for all possible values of X?
3. What is [expression missing from source]?
4. What is [expression missing from source]?
5. What is [expression missing from source]?
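The individual and cumulative probabilities for the die described in Part 2 can be tabulated with exact fractions. This sketch covers only the first two items, since the remaining expressions are not legible in the source.

```python
from fractions import Fraction
from itertools import accumulate

outcomes = [1, 2, 3, 4, 5, 6]

# The die never shows 1 or 6; the faces 2, 3, 4, and 5 are equally probable.
pmf = {x: (Fraction(1, 4) if x in (2, 3, 4, 5) else Fraction(0)) for x in outcomes}

# Cumulative probabilities P(X <= x) are running totals of the PMF.
cdf = dict(zip(outcomes, accumulate(pmf[x] for x in outcomes)))

for x in outcomes:
    print(f"x={x}  P(X=x)={pmf[x]}  P(X<=x)={cdf[x]}")
```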

Part 3:

The dataset provided consists of the following random variables:

1. BMI: The body mass index of a random set of people.
2. Distance: The distance (in feet) that a baseball player hit the ball.
3. Height: The height of males (in mm).
4. Income: The income (in dollars) of people in a large company.
5. Pass: The outcome when taking an exam (1=Pass; 0=Fail).
6. Wait Time: The time (in minutes) that it takes when waiting for the train.

Answer each question below. Use SPSS as needed, and include the software outputs as part of the Word document you submit.

1. What is a Q-Q plot?
2. Given a set of realized values of a random variable, how can a Q-Q plot be used to assess the distribution of the random variable?
3. Using histograms and Q-Q plots (except for binomial), match each random variable to one of the following distributions: Binomial (with N=1, P=0.7), Chi-square (with d.f.=20), Exponential, Lognormal, Normal, and Uniform.
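A normal Q-Q plot pairs each sorted observation with the quantile a fitted normal distribution would place at the same rank. The sketch below computes those pairs without plotting them. The wait times are hypothetical, and the plotting positions (i - 0.5) / n are one common convention; SPSS may use a different one by default.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [
        (fitted.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical right-skewed wait times in minutes (not the course dataset).
wait_times = [0.4, 1.1, 1.9, 2.5, 3.8, 5.2, 7.9, 12.6, 19.3, 31.0]
points = normal_qq_points(wait_times)

for theoretical, observed in points:
    print(f"{theoretical:8.2f}  {observed:6.1f}")
```

If the sample were close to normal, the pairs would fall near a straight line; systematic curvature suggests a different distribution.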

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 7 Simple Regression Analysis

The purpose of this assignment is to apply simple regression concepts, interpret simple regression analysis models, and justify business predictions based upon the analysis.

For this assignment, you will use the “Trucks” dataset. You will use SPSS to analyze the dataset and address the questions presented. Findings should be presented in a Word document along with the SPSS outputs.

The business characteristics of n = 250 U.S. trucking and delivery companies for calendar year 2011 were recorded. Among the characteristics studied were the number of drivers and the number of trucks (power units) each company employed.

Part 1:

Given that the data consist of counts and the range of counts is large, a natural log transformation is usually performed to improve the linear model results. Apply a natural log transform to both variables and then plot Y = log(Trucks) vs. X = log(Drivers).

Is there a linear relationship? Justify your answer by providing the SPSS output as an illustration.

Part 2:

Build a simple linear model by regressing Y on X and testing whether or not a relationship exists between the number of drivers and the number of trucks. Address the following questions in your written response:

1. After fitting the model, plot the standardized residuals (on vertical axis) vs. the standardized predictions (on horizontal axis). Is there a pattern? How would you interpret the pattern or lack of pattern?
2. After fitting the model, derive the normal probability plot and interpret what the plot means.
3. What is the coefficient of determination, R², of the model? How would you interpret the R²?
4. What is the estimate of β1? How would you interpret the estimate of β1?
5. Is the estimate of β1 significantly different from 0? Assume an α = 0.01.
6. What is a 95% confidence interval for β1? How would you interpret the 95% confidence interval for β1?
7. If a new trucking and delivery company with 4,900 drivers were to be formed, how many trucks would you expect the company would employ based on the model?
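The mechanics of fitting a line on the log scale and converting a prediction back to a truck count can be sketched as follows. The five companies are hypothetical, not the “Trucks” dataset, and the ordinary least-squares formulas are written out by hand rather than taken from SPSS.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical companies (drivers, trucks); not real data.
drivers = [10, 50, 200, 1000, 5000]
trucks = [8, 45, 150, 900, 4200]

# Fit on the natural-log scale: log(Trucks) = b0 + b1 * log(Drivers).
b0, b1 = fit_line([math.log(d) for d in drivers], [math.log(t) for t in trucks])

def predict_trucks(n_drivers):
    """Predict on the log scale, then exponentiate back to a truck count."""
    return math.exp(b0 + b1 * math.log(n_drivers))

print(f"b0={b0:.4f}, b1={b1:.4f}, trucks for 4,900 drivers: {predict_trucks(4900):.0f}")
```

Because the model is fitted to logs, the slope describes a percentage relationship, and the exponentiated prediction is better read as a typical value than as an exact mean.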

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.

#### MIS 660 Topic 8 Multiple Regression Analysis

The purpose of this assignment is to apply multiple regression concepts, interpret multiple regression analysis models, and justify business predictions based upon the analysis.

For this assignment, you will use the “Strength” dataset. You will use SPSS to analyze the dataset and address the questions presented. Findings should be presented in a Word document along with the SPSS outputs.

The compressive strength (Y) of concrete is influenced by the mixing proportions and by the time that it is allowed to cure, although the exact relationship between the strength and the components is unknown. The provided data includes the results of n = 1030 concrete strength experiments that include the following:

1. Strength (in MPa): The compressive strength of the concrete.
2. Age (in days): The number of days the concrete was allowed to cure.
3. Coarse_Aggregate (in kg/m³): The proportion of coarse aggregate in the mix.
4. Fine_Aggregate (in kg/m³): The proportion of fine aggregate in the mix.
5. Cement (in kg/m³): The proportion of cement in the mix.
6. Slag (in kg/m³): The proportion of furnace slag in the mix.
7. Superplasticizer (in kg/m³): The proportion of plasticizer in the mix.
8. Water (in kg/m³): The proportion of water in the mix.
9. Ash (in kg/m³): The proportion of fly ash in the mix.

Part 1:

Derive various transformations of compressive strength to determine which transformation, if any, results in a variable that most closely mimics a normal distribution. To do this, plot Q-Q plots after each transformation listed below, and decide which one should be used to build a multiple linear model. Explain your answer and provide the SPSS output as an illustration.

1. Strength (no transformation)
2. Square root of Strength
3. Squared Strength
4. (Natural) Log of Strength
5. Reciprocal of Strength

Part 2:

Based on the transformation selected in Part 1, build a multiple linear regression model with all eight predictors.

1. Use t-tests to determine if any of the predictors significantly affect the compressive strength of concrete. Explain why each variable should or should not be included in the model. Assume α = 0.05. Show the appropriate model results to explain your answer.
2. If any predictors from question 1 are found to be not significant, remove them and re-run the model to create a reduced model (RM). Are all the remaining variables still statistically significant? Show the appropriate model results to explain your answer.
3. Based on the RM, should there be concern about multicollinearity among the predictors selected? Show the appropriate model results to explain your answer.
4. After fitting the RM, derive the residual plot (standardized residuals vs. standardized predicted values) and normal probability plot. Interpret each plot.
5. What is the coefficient of determination, R², of the RM? How would you interpret the R²?
6. Based on the RM, if the estimated compressive strength is currently 50 MPa, what would the new estimated compressive strength be after a 10-day increase in curing time? Assume all other predictors are held constant.
7. How would you interpret the intercept (constant) in the RM? Does the interpretation make sense given the data you used to build the RM?
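For the multicollinearity question in item 3, one common diagnostic is the variance inflation factor (VIF). The sketch below covers only the special case of two predictors, where VIF reduces to 1 / (1 - r²) with r the correlation between them; with more predictors, each VIF comes from regressing one predictor on all the others. The values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical mix values in kg/m^3 (not the "Strength" dataset).
water = [160, 175, 185, 190, 200, 210]
superplasticizer = [12, 9, 7, 6, 3, 1]

vif = vif_two_predictors(water, superplasticizer)
print(f"r = {pearson_r(water, superplasticizer):.3f}, VIF = {vif:.1f}")
```

A VIF of 1 means no linear overlap between the predictors. Values above roughly 5 or 10 are often treated as a warning sign, though the cutoff is a rule of thumb rather than a formal test.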

Part 3:

Given the following components and aging time below, what is the estimated compressive strength based on the RM?

1. Age: 50 days
2. Coarse_Aggregate: 900 kg/m³
3. Fine_Aggregate: 600 kg/m³
4. Cement: 300 kg/m³
5. Slag: 200 kg/m³
6. Superplasticizer: 7 kg/m³
7. Water: 190 kg/m³
8. Ash: 70 kg/m³

Part 4:

What is a 95% confidence interval of the estimate in Part 3? How would you interpret the 95% confidence interval? (Hint: Use the SPSS scoring wizard to address this question.)

APA format is not required, but solid academic writing is expected.

This assignment uses a grading rubric. Please review the rubric prior to beginning the assignment to become familiar with the expectations for successful completion.

You are not required to submit this assignment to LopesWrite.