My background includes working on scientific projects as the data guy. In these positions, I was responsible for establishing valid data collection procedures, collecting usable data, and statistically analyzing and presenting the results. In this post, I describe the excitement of being a statistician helping expand the limits of human knowledge, what I learned about applied statistics and data analysis during the first big project in my career, and the challenges along the way!
What is it like to explore the unknown? Whether it’s academic research, a quality improvement initiative, or big data, ordinary people explore the unknown, solve mysteries, and answer questions that science has never answered before. To do this, researchers spend a lot of time reading research papers to develop their knowledge. They must understand the full extent of existing knowledge in their research area. Then, the researchers take an amazing step. They find a research question that will take them beyond the edge of human knowledge, a problem that no one has answered before. This is the transition from understanding existing knowledge to creating new knowledge.
Going beyond the edge of human knowledge sounds like an exciting place to work. And it is! Scientists and analysts do this every single day. However, it doesn’t mean you’ll always hear about our work on the news. The questions and answers are new, but not always newsworthy. So, most of us work in anonymity. Bit by bit, scientists use statistics to push the bubble of human knowledge further and further out as they make discoveries. We record the results in the ever-growing number of journal articles. Researchers study these articles, and they ask additional new questions. In this fashion, we push back the frontier of what is known.
One of the coolest things about the field of statistics is that statistical analysis provides you with a toolkit for exploring the unknown. Christopher Columbus needed a lot of tools to navigate to the New World and make his discoveries. Statistics are the equivalent tools for the scientific and quality improvement explorer because they help you navigate the sea of data that you collect. You’ll be increasingly thankful for these tools when you see a worksheet filled with numbers and you’re responsible for telling everyone what it all means. The excitement of discovery happens after all of the work to set up the experiment, collect the data, verify it, and arrange it in your statistical software. Finally, there is that moment you perform the analysis, which reveals the meaning behind the data. That’s why I love it!
Having valid data is a prerequisite for using all of the fantastic tools that statistical software offers. Because researchers work with questions that science has not answered, it’s not surprising that we run into obstacles to collecting useful data. Often, these are novel problems. Indeed, researchers may never have collected a particular type of data before. Collecting a new data type requires scientists to learn how to collect it accurately and precisely before we can even begin to answer the main research question. On good days, you can happily think of these difficulties as creative challenges. On other days, it felt like I was just solving one problem after another. After all, we’re trying to generate nice and neat data that we can analyze, but reality is messy.
Related post: The Importance of Statistics
Statistics: Where the Rubber Meets the Road!
That interaction between messy reality and neat, usable data is an exciting place. It’s a place that ties scientists’ lofty goals to the nitty-gritty nature of reality. It’s where the rubber meets the road. If you don’t get usable data, you can’t answer your research question despite the slick analytical tools!
Let’s bring it to life on a more personal level. I’ll detail one of those “rubber meets the road” instances that were a part of my early on-the-job education!
My background was first in statistical analysis, and then I transitioned to research. My first big research project was incredibly eye-opening because I was responsible for ensuring that we generated good, clean data, and then analyzing it! It was during this early project when the large amount of effort required to create good data became very apparent to me.
Collecting valid data is a process, not a single measurement event. This process requires that you have standard procedures and measurement instruments that work together in perfect harmony to produce data that you actively verify are correct. If you have a lousy process or an imperfect instrument, your data are flawed. Everything has to be perfect. You must be directly involved in the process’s nitty-gritty details and the measurements to ensure perfection.
My Early On-The-Job Learning
So, back to my first big research project, which assessed physical activity and bone density growth. I was the one full-time person on the data side. I worked with a large team of very talented experts, who each contributed a part of their time to this project. These experts included electrical and mechanical engineers, programmers, electricians, shop technicians, nutritionists, and a bone densitometer operator, among others. Together, we developed and tested hardware, software, assessments, and procedures that would produce an array of different types of data.
We also had a full-time nurse on the project who interacted with the subjects daily. She provided feedback about the suitability of using our devices and survey assessments, administered the surveys, and fit the monitoring equipment on our subjects. Out of this milieu, I had to ensure that we produced a mountain of many different types of trustworthy data. It was quite a balancing act!
Good data are a stern taskmaster. I was more of a numbers person, but just knowing statistics isn’t good enough. No statistical analysis can save you if you have bad data. For the sake of usable data, I learned a lot of new skills and quickly got my hands dirty in the tiny details of data collection.
There is a lot of work to do before you even begin collecting data.
For example, I learned how to do the following:
- Use and extract data from equipment such as force plates and bone densitometers.
- Strip wires, solder, choose the right electrical connector for the application, and understand circuit board basics.
- Understand different programming languages so I could work with the programmer to ensure that data collection was just right.
- Coordinate with a battery company to design a custom dual-voltage battery that fit inside a small space to meet our unique needs.
- Learn the basics of nutrition and how to assess nutritional intake.
- Study how different positioning of measurement devices could affect the results.
What I Learned About Applied Statistics
Whew! Fortunately, I love learning new things!
Along the way, I checked and rechecked pilot data as we changed things to monitor the improvement in data quality. I also wrote standard procedures to ensure consistent data collection.
It was a great learning experience. And these are the critical lessons I learned:
- Collecting good data is a process, not an event.
- You will spend more time determining the best way to collect the data than actually collecting the data.
- You must be determined, adaptable, and willing to learn a lot of new things.
- Don’t assume anything. Check and double-check all of your data streams. Verify everything.
- You can learn statistics in school, but there’s nothing like having a multi-million dollar project on the line to really know statistics inside and out!
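The “check and double-check all of your data streams” lesson lends itself to automation. As a loose, modern illustration (the field names, plausible ranges, and records below are invented for this sketch, not taken from the project), a few lines of code can sanity-check each incoming record before it ever reaches the analysis:

```python
# Hypothetical sanity checks for an incoming stream of activity readings.
# Field names and plausible ranges are invented for illustration only.
def validate_record(record):
    """Return a list of problems found in one data record (empty = clean)."""
    problems = []
    required = ("subject_id", "device_id", "activity_score", "wear_hours")
    for field in required:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    score = record.get("activity_score")
    if score is not None and not (0 <= score <= 20000):
        problems.append(f"activity_score out of range: {score}")
    hours = record.get("wear_hours")
    if hours is not None and not (0 < hours <= 12):
        problems.append(f"wear_hours out of range: {hours}")
    return problems

records = [
    {"subject_id": 1, "device_id": "A", "activity_score": 950, "wear_hours": 12},
    {"subject_id": 2, "device_id": "B", "activity_score": -40, "wear_hours": 12},
    {"subject_id": 3, "device_id": "C", "activity_score": 800, "wear_hours": None},
]
for rec in records:
    for problem in validate_record(rec):
        print(f"subject {rec['subject_id']}: {problem}")
```

Trivial as each check looks, running them routinely is what turns “verify everything” from a slogan into a habit.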
Looking for Data Problems . . . And Finding Them!
I’ll give one example of how verifying seemingly simple data often involves more than meets the eye. This was a case where the rubber met the road, and I smelled smoke!
The diligence required to obtain and validate useful data became apparent to me very early at the biomechanics lab. Imagine a young guy who’s eager not to mess up. There is this nagging fear that many mistakes in research happen when you miss something or do something incorrectly at the outset, and it bites you in the derriere later. You fear that during data analysis you’ll uncover a problem you can’t fix. I didn’t want that to happen on my watch.
I quickly learned that this fear was a well-founded one!
As part of the bone density study, we planned to measure each subject’s activity. Our subjects were to wear activity monitors for 12 hours, on randomly scheduled days, multiple times per year. These activity monitors use accelerometers to measure movement. They are sophisticated enough to distinguish natural human actions from artificial movement types, such as riding in a car. They also are very durable and easy to use — the researcher doesn’t adjust anything. And they have been well validated in the scientific literature using very sophisticated analyses.
In short, no one expected any problems with these trusted devices. I thought this would be a nice, simple place to start verifying measurements before the study to avoid problems later. It was an excellent place to start, but not as simple as expected!
Assessing Activity Data
The activity data don’t translate to an exact picture of the subjects’ actions. However, you can see the scores rise and fall with activity levels, and compare scores to see where each subject’s activity level falls within your sample of activity scores. To ensure that our activity monitors were working correctly, I had pilot subjects wear the devices for a quick measurement system analysis. Sure enough, greater activity produced higher readings, just as expected. So far, so good. I thought I’d move on to creating a standard procedure.
To collect good data, you need standard procedures for setting up and using measurement equipment. So, I wanted to establish these standards for the activity monitors. Participants wear these devices on a belt around their waist. A good practice is to standardize the devices’ position on the waist — readings shouldn’t be higher or lower between subjects because of inconsistent positioning.
There was no literature on the differences in positioning, so I did a pilot study of my own. This time I had the subjects wear multiple devices all around their waist. I wanted to quantify the potential risk involved by seeing how much the readings would vary by position on each participant.
As the data came in, it first appeared that positioning mattered a great deal. The high and low readings were often different by 15%. These differences were more than I expected, but there was no research to compare them to at the time. However, collecting more data provided insight. The positional pattern of high and low values varied from subject to subject. Finally, it became evident that while the pattern was inconsistent between subjects, it was consistent between devices.
In other words, several of our monitors tended to read too low, by varying degrees.
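To make that diagnosis concrete, here is a small sketch of the kind of check that separates a position effect from a device effect. All readings, device names, and thresholds below are made up for illustration; the idea is simply to compute each device’s percent deviation from its subject’s mean and see whether the same devices read low for every subject:

```python
from statistics import mean

# Hypothetical pilot readings: each subject wore the same four monitors,
# with the monitors rotated to different waist positions between subjects.
readings = {
    "subject_1": {"dev_A": 1000, "dev_B": 990,  "dev_C": 870,  "dev_D": 995},
    "subject_2": {"dev_A": 1210, "dev_B": 1190, "dev_C": 1050, "dev_D": 1200},
    "subject_3": {"dev_A": 805,  "dev_B": 800,  "dev_C": 700,  "dev_D": 810},
}

# Percent deviation of each device from its subject's mean reading.
deviations = {dev: [] for dev in next(iter(readings.values()))}
for subject, devs in readings.items():
    subj_mean = mean(devs.values())
    for dev, value in devs.items():
        deviations[dev].append(100 * (value - subj_mean) / subj_mean)

# If a device reads markedly low for *every* subject despite being worn in
# different positions, the bias follows the device, not the position.
for dev, pcts in deviations.items():
    flag = "reads low" if all(p < -5 for p in pcts) else "ok"
    print(f"{dev}: mean deviation {mean(pcts):+.1f}% -> {flag}")
```

In this made-up example, `dev_C` sits roughly 10% below its subject’s mean for all three subjects, which is the signature of a faulty device rather than a positioning effect.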
Finding the Solution for Good Data and Statistics
I contacted the manufacturer and sent the devices back so they could check them out. It turned out the manufacturer had recently switched suppliers for a component, and it was causing problems.
When I re-tested the repaired monitors, the differences between positions on a subject were all less than a few percent, which meant there was no practical difference between positions. All our monitors were now working correctly, and position wasn’t an issue. I established a consistent procedure using standard belts that fit the monitors perfectly and were infinitely adjustable to each subject’s waist. I did this to prevent the monitors from flopping around due to a loose fit.
This experience was both unsettling and positive. It was unnerving because it confirmed my fears: something you miss definitely can come back and bite you later — even when you’re dealing with a straightforward situation. Further, some data problems are subtle. They don’t show up until you check the data in several different ways. It was also a positive experience because it kept me on my toes and ready for bigger challenges in applied statistics that were to come!
Read about another data collection issue I encountered during this project in my post about using control charts with hypothesis tests.