H is for Histogram

I recently wrote a children’s book called D is for Data: The ABCs of Data Analytics. This is the eighth in a series of behind-the-scenes, companion articles that will dive a little deeper into each term. We’ll explore the illustration used to define the term, how the word is used in the data world, and other interesting (to me) trivia.

An illustration of a girl and her histogram

This illustration is based on a picture of my daughter. (More about how I accomplished this when we get to M.) If you look closely, you can see the audience behind her. I liked the idea that she was presenting a paper at some prestigious conference, because she’s a very confident speaker. She has been my biggest cheerleader during the whole process of writing and publishing this book. I hope that her exposure to these terms will give her confidence to take on advanced subjects and maybe follow in her dad’s footsteps.

Histograms have a warm place in my heart as one of the first analytical tools that I used. I started my career as a programmer supporting various business teams. New programmers all had to take turns being on-call, which meant taking calls from people when things weren’t right. One of the skills that I had to develop early on was to quickly assess the severity of an issue. Was the system down for everyone, or was the person on the other end of the phone the only one having a bad day?

I found that by simply grouping the data in the system by day or hour and counting the events or transactions, I could get a very quick idea of whether we were seeing normal usage or if there was a problem. For example, if we normally got 300 orders per hour, but for the last 2 hours we only had 5, then the system was probably broken. Different areas of the system all had different patterns. Payroll entries would be very high on Mondays because the cut-off for payroll was Tuesday. Low counts on a Monday meant trouble, but low counts on a Thursday were normal because the deadline was so far away. Whatever the area, grouping by time and counting the data records showed me not only how the data was distributed, but also holes in the data that might indicate trouble. It helped me to monitor the health of my data and, transitively, the health of the system I was supporting.
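Here is a minimal sketch of that group-by-hour-and-count check in Python. The order timestamps are made up to match the example above (300 orders per hour, then 5 over the last 2 hours), and the function names and the 10% cut-off are assumptions for illustration, not anything from the systems I supported.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_per_hour(timestamps):
    """Group timestamps into hour buckets and count the events in each."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def quiet_hours(counts, hours, expected_per_hour, fraction=0.1):
    """Return the hours whose count is below a fraction of the usual volume.

    The 10% default is an arbitrary, illustrative threshold.
    """
    return [h for h in hours if counts.get(h, 0) < expected_per_hour * fraction]


# Hypothetical data: two normal hours followed by two nearly empty ones.
start = datetime(2024, 1, 1, 9)
hours = [start + timedelta(hours=i) for i in range(4)]
volumes = [300, 300, 3, 2]
orders = [
    hour + timedelta(seconds=n)
    for hour, volume in zip(hours, volumes)
    for n in range(volume)
]

counts = count_per_hour(orders)
suspect = quiet_hours(counts, hours, expected_per_hour=300)

# Print a rough text histogram, one '#' per 10 orders.
for hour in hours:
    print(hour.strftime("%H:%M"), "#" * (counts[hour] // 10), counts[hour])
print("Hours that look broken:", [h.strftime("%H:%M") for h in suspect])
```

In practice the expected volume would depend on the area and the day of the week (a payroll system would need a different baseline for Mondays than for Thursdays), so a single fixed number is only a starting point.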

Histograms are a great way to do data discovery. If you’re working with some new data, do some grouping and counting to get a feel for what you’ve got. Maybe group by geography and see if the data is coming from one primary area, or group by time of day to see if there is more data during the day than at night. Whatever you group by, plotting the result as a histogram is going to quickly show you whether the counts or sums are evenly distributed, or whether there are big gaps. You can also quickly see if there are daily, weekly or seasonal trends. Knowing these macro trends really helps when you dive deeper into other analyses, because you know the big picture of how your data is distributed over different dimensions.
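As a rough illustration of this kind of discovery, the sketch below groups the same records two different ways and draws each result as a text histogram. The records, field names (`region`, `hour`), and helper functions are all hypothetical, chosen only to show the idea.

```python
from collections import Counter


def group_and_count(records, key):
    """Count records per group, where key(record) names the group."""
    return Counter(key(r) for r in records)


def text_histogram(counts, width=40):
    """Render counts as text bars scaled so the largest bar is `width` wide."""
    if not counts:
        return []
    largest = max(counts.values())
    return [
        f"{label:>8} | {'#' * (width * n // largest)} {n}"
        for label, n in sorted(counts.items())
    ]


# Hypothetical records: mostly from one region, mostly during the day.
records = (
    [{"region": "north", "hour": 10}] * 6
    + [{"region": "north", "hour": 23}] * 2
    + [{"region": "south", "hour": 14}] * 2
)

by_region = group_and_count(records, key=lambda r: r["region"])
by_period = group_and_count(
    records, key=lambda r: "day" if 6 <= r["hour"] < 18 else "night"
)

for title, counts in [("By region", by_region), ("By time of day", by_period)]:
    print(title)
    print("\n".join(text_histogram(counts)))
```

Swapping in a different `key` function is all it takes to look at the data along another dimension, which is what makes this a cheap first step with unfamiliar data.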

Another thing to look for in a histogram is data skew. Data is skewed when there is much more data for one category than others. For example, if you are counting customer purchases per state and your home state has 70% of the purchases, then you would say that the data is skewed toward your home state. It is important to understand data skew when doing distributed data processing because you want to spread the workload out evenly over each computer that is solving part of the problem. If you were counting sales in the aforementioned purchase data and distributed the work by state, then one computer would get 70% of the purchases (the ones from your home state). This would lead to the other computers finishing early while you waited for the computer that got your home state sales to finish. It’s best to distribute work evenly, so data skew is usually thought of as a bad thing.
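To make the workload problem concrete, here is a small sketch that compares two ways of splitting 100 purchases across 4 workers: by state, and round-robin by record. The state labels, the worker count, and the assignment scheme are assumptions for illustration; real distributed systems usually assign keys by hashing, but the imbalance shows up the same way.

```python
from collections import Counter


def largest_share(records, key):
    """Fraction of records that fall in the most common group."""
    counts = Counter(key(r) for r in records)
    return max(counts.values()) / len(records)


def loads_by_key(records, key, workers):
    """Send every record with the same key to the same worker."""
    keys = sorted({key(r) for r in records})
    assignment = {k: i % workers for i, k in enumerate(keys)}
    loads = [0] * workers
    for r in records:
        loads[assignment[key(r)]] += 1
    return loads


def loads_round_robin(records, workers):
    """Deal records out one at a time, ignoring their key."""
    loads = [0] * workers
    for i, _ in enumerate(records):
        loads[i % workers] += 1
    return loads


# Hypothetical purchases: 70% from the home state, the rest spread evenly.
purchases = ["HOME"] * 70 + ["ST1"] * 10 + ["ST2"] * 10 + ["ST3"] * 10

share = largest_share(purchases, key=lambda state: state)
by_state = loads_by_key(purchases, key=lambda state: state, workers=4)
even = loads_round_robin(purchases, workers=4)

print(f"Largest state share: {share:.0%}")
print("Records per worker, split by state:   ", by_state)
print("Records per worker, split round-robin:", even)
```

Round-robin works here because the job is a simple count, so each worker's partial counts can be added together at the end. Jobs that need all of one state's records in the same place call for other ways of dealing with the skew.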

Thunder Chicken Studios is a creative firm specializing in strikingly beautiful images and thought-provoking prose.
