Tag Archives: George Somi

WATCH Blue Get the Last Word: The 2015 NCAA Men’s Basketball Championship

Following up on the successful post visualizing live Tweets pertaining to the Final Four matchup between undefeated Kentucky and Wisconsin, this writer decided to take it one step further and to portray the Twitter fandom of Duke Nation and Badger Nation during last night’s National Championship game. If you succumbed to your early bedtime, the blue that fills the interactive map at the end of the animation should amply suggest who the victor was!

Each of the 54,625 randomly mined Tweets had to contain at least one of the following inherently pro-Duke or pro-Wisconsin keywords: “GoWisconsin,” “GoWisconsinBadgers,” “GoBadgers,” “BadgerNation,” “BadgersBelieve,” “FrankTheTank,” “DukeNation,” “OkaforWins,” “GoDuke,” “GoDukeBlueDevils,” “GoBlueDevils,” and “BlueDevilNation.” Those Tweets, which had been collected using the filterStream function of the R statistical computing language, were then filtered down to 2,409 geocoded Tweets (Tweets whose locations could be derived). Finally, each Tweet was categorized as pro-Duke or pro-Wisconsin before being loaded into a web mapping platform called CartoDB. In the end, Duke Nation can lay claim to being more excited on Twitter during last night’s game, broadcasting 1,724 messages versus Badger Nation’s 685 in this random sample of geocoded Tweets – another reason for the Blue Devil haters out there to hate! The final data set can be accessed here: DukeWisLVantData.
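The categorization step can be sketched in base R roughly as follows. This is a minimal illustration and not the script used for the map: the sample tweets, the `classify_tweets` helper, and the "both"/"neither" labels are assumptions made for the example. Only the keyword lists and the two totals come from the post.

```r
# Keyword families as listed in the post.
wisconsin_keys <- c("GoWisconsin", "GoWisconsinBadgers", "GoBadgers",
                    "BadgerNation", "BadgersBelieve", "FrankTheTank")
duke_keys <- c("DukeNation", "OkaforWins", "GoDuke", "GoDukeBlueDevils",
               "GoBlueDevils", "BlueDevilNation")

# Hypothetical helper: label each tweet by the keyword family it mentions.
classify_tweets <- function(text) {
  duke <- grepl(paste(duke_keys, collapse = "|"), text, ignore.case = TRUE)
  wisc <- grepl(paste(wisconsin_keys, collapse = "|"), text, ignore.case = TRUE)
  ifelse(duke & wisc, "both",
         ifelse(duke, "Duke", ifelse(wisc, "Wisconsin", "neither")))
}

# Made-up sample tweets, for illustration only.
sample_tweets <- c("Let's go! #GoDuke", "On Wisconsin #BadgerNation",
                   "#FrankTheTank for three", "what a game")
labels <- classify_tweets(sample_tweets)
print(table(labels))

# Share of the geocoded sample per side, using the counts reported above.
counts <- c(Duke = 1724, Wisconsin = 685)
shares <- round(100 * counts / sum(counts), 1)
print(shares)
```

With the reported counts, Duke accounts for a little under three quarters of the geocoded sample.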

The following excerpts from the Guardian’s minute-by-minute commentary of the match can be a helpful guide to the map. Pause the video at the indicated times to see how the match’s real-time twists and turns had an impact on real-world social media behavior:

9:21 PM ET – Kaminsky hits a long three and then takes a charge. HEY! Stop stealing Duke’s game, thief!

9:43 PM ET – Duke has its biggest lead of the game, 21-16, with a little over 8 minutes left in the half.

9:51 PM ET – Wisconsin 24, Duke 23 / 4:47 left in the half

The Badgers are on a run – 7 unanswered – capped by Kaminsky making an and-1. When Kaminsky finished his play, CBS showed a bar in Madison with tons of people drinking.

10:25 PM ET – The second half is underway. Wisconsin starts with the ball and … takes a 34-31 lead on a corner three by Koenig.

10:28 PM ET – Wisconsin 38, Duke 33 / 18:32 left

Coach K calls a timeout after Dekker scores off a Duke turnover. The Blue Devils look sloppy coming out of the half. How is that possible? They just were coached up by the great Mike Krzyzewski for 15 minutes.

10:40 PM ET – Wisconsin 46, Duke 39 / 13:55 left

Koenig hits another three to give Wisconsin its biggest lead.

And Dekker still hasn’t hit a three in the game. Very bad for Duke. (Good for America?)

10:46 PM ET – Wisconsin 51, Duke 45 / 11:43 left

Grayson Allen is keeping Duke in the game. If he doesn’t go pro, he will enjoy a long college career of being despised by adults all across the country.

10:50 PM ET – Grayson Allen has scored the last 8 points for Duke. Grayson Allen sounds like the name of trust fund son of an attorney. It’s a very Duke name.

11:04 PM ET – We’re tied at 56 with less than 5 minutes left. Yes, this will do nicely. Thank you, sports.

11:13 PM ET – Duke 63, Wisconsin 58 / 2 minutes left

Back-to-back buckets for Okafor after sitting out for six minutes. He was out for so long, he apparently forgot he was having a bad game.

11:22 PM ET – Duke 68, Wisconsin 63 / 14.9 left

After Tyus Jones hit two free throws, Wisconsin had an awful offensive possession, ran down a ton of clock and then threw up an airball from the top of the key. Prepare for another Duke title.

How Wisconsin vs. Kentucky Lit Up Twitter USA

38-1. It’s a record that any team would covet at this point in the college basketball season. Unfortunately, that one small blemish in an otherwise pristine record came at the wrong time for the Kentucky Wildcats. Perhaps equally distressing, this loss not only ended an historic, perfect season prematurely, but it will also linger in the collective memory of many Americans. According to Turner, last night’s broadcast by Turner Sports and CBS Sports garnered the highest cable rating ever for a college basketball game, and it was the highest rated Final Four game since 1993.

For those of you who missed it, the outcome of last night’s memorable Final Four matchup between Kentucky and Wisconsin in the Men’s NCAA Basketball Tournament came down to the wire, with the game tied at 60-60 with 2:25 left in the match. Flash back to the 6:36 mark in the second half, and it would’ve been easy to call the game. Kentucky’s large frontcourt and speedy guards had overcome a 52-44 deficit to build a 60-56 lead against the Badgers, and their second half resurgence meant that they had a 79.2% chance of advancing to the championship game tomorrow. According to Nate Silver’s simulations, if both teams had twelve remaining possessions at a 60-56 scoreline, Kentucky’s winning percentage was as high as 81.9%.

Instead, Wisconsin’s defense held Kentucky scoreless between the 6:36 and 0:56 marks. Now, AP Player of the Year Frank Kaminsky has an opportunity to completely redeem last year’s Final Four loss by playing an in-form Duke team, which handily defeated Michigan State earlier in the day, in tomorrow’s championship game.

Luckily, this writer had the opportunity to animate a map of Tweets correlating to this intense game. This process consisted of three steps:

  • data mining with the R statistical computing language using the filterStream function;
  • exporting, cleaning, and filtering the dataset in Excel; and
  • using the “Torque” and “Intensity” features of CartoDB, a cloud platform for GIS and web mapping tools, to interactively map that dataset.
Wisconsin vs. Kentucky Final Four Viewership Intensity Map

Here are some interesting facts and observations pertaining to the maps and their creation:

  • Each tweet had to include at least one of the following twelve keywords: “Wisconsin,” “WisconsinBadgers,” “Badgers,” “Kentucky,” “KentuckyWildcats,” “Wildcats,” “WisconsinvsKentucky,” “KentuckyvsWisconsin,” “UKvsWIS,” “Wisconsin vs Kentucky,” “FinalFour,” and “NCAATournament.”
  • The TwitteR filterStream search mined a total of 303,403 worldwide tweets from Twitter (130,228 first half tweets and 173,175 second half tweets) during the game. The filterStream function was used throughout the game, but not during halftime. This would explain the blank gap in the map about halfway through the video.
  • Of the 300K+ Tweets that were randomly mined, 12,224 were location-enabled tweets. In other words, the Twitter user had location enabled for Tweets, so those specific Tweets could be mapped with given Longitude and Latitude specifications and creation times. Therefore, the random plots on these maps represent approximately 4% of the data collected.
  • The game was played in a relatively controlled manner. At halftime, both teams were deadlocked at 36 points apiece, and neither team managed to achieve a double-digit lead at any point in the game. Unsurprisingly, the steady generation of tweets in the video matches what was largely a steady, calculated game. In addition, it is important to note that unlike certain sports, such as football or soccer, where scoring often occurs in isolated spurts, basketball features scored baskets throughout a game. Barring a seminal event like a game-winning shot, one would not expect a drastic surge of Tweets at any point during the game.
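The 4% figure and the location filter described in the list above can be checked with a few lines of base R. This is a sketch under stated assumptions: the four-row `tweets` data frame is made up for illustration, and the `Longitude`/`Latitude` column names are borrowed from the plotting code in the El Clásico post rather than from this data set. Only the tweet totals come from the post.

```r
# Made-up rows standing in for parsed tweets; NA means no location attached.
tweets <- data.frame(
  text      = c("Badgers!", "Go Wildcats", "FinalFour", "Kentucky by 4"),
  Longitude = c(-89.4, NA, NA, -84.5),
  Latitude  = c(43.1, NA, NA, 38.0),
  stringsAsFactors = FALSE
)

# Keep only the tweets that carry both coordinates.
geocoded <- tweets[!is.na(tweets$Longitude) & !is.na(tweets$Latitude), ]
print(nrow(geocoded))

# Totals reported in the post.
first_half  <- 130228
second_half <- 173175
total       <- first_half + second_half

# Percentage of all mined tweets that were location-enabled.
geo_share <- round(100 * 12224 / total, 1)
print(c(total = total, geo_share = geo_share))
```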

Mapping El Clásico: A Visualization of the Global Spectacle

This past Sunday’s El Clásico (The Classic) was the latest battle on the pitch between Spain’s, and arguably the world’s, most celebrated footballing titans, Real Madrid Football Club and Football Club Barcelona. The UEFA Champions League Final aside, Real Madrid versus Barcelona is the most watched club game in the world. The March 2014 game attracted 400 million viewers, while Sunday’s match plausibly reached the 500 million mark. beIN SPORTS, a global sports network operated by Qatari Sports Investments, has already released its viewership statistics. The latest El Clásico was viewed by 2.12 million beIN SPORTS viewers, making it the most-viewed game on that network alone. In fact, compared to the teams’ earlier matchup in October, Sunday’s broadcast attracted 16% and 64% increases in English and Spanish viewership, respectively. Those figures had me wondering: what could social media tell us about this game’s global viewership? More importantly for Barca and Madrid fanboys, who was Twitter’s darling: Lionel Messi or Cristiano Ronaldo? Luckily, this writer has some insights to share…

… Before sharing these insights, unless your name is Ann Coulter, you might wonder why El Clásico even matters. Here are three reasons why you should care about this event:


First of all, Real Madrid and Barcelona feature the most talented, expensive, and globally recognizable athletes in the world. Cristiano Ronaldo and Lionel Messi are the players of their generation, global icons, and the collective winners of every FIFA Ballon d’Or since 2008. Ronaldo, the reigning Ballon d’Or winner and the world’s second highest paid athlete, is often considered the best athlete in the sport and Portugal’s best player since the great Eusebio. Many former players and pundits have called Lionel Messi the best player of all-time, supplanting Pelé and his fellow Argentine countryman, Maradona. Messi is La Liga’s all-time leading goalscorer and the winner of the most FIFA Ballon d’Or awards.

Both teams’ rosters also feature the English Premier League’s best players, Gareth Bale (2013) and Luis Suarez (2014), as well as James Rodriguez, World Cup 2014’s top goalscorer, and Brazilian superstar Neymar, who is possibly the best player under the age of 25. Toni Kroos was probably Germany’s best player in the country’s World Cup-winning year, and 2010 World Cup winner Xavi Hernandez is arguably the most important player in Spanish history. Because these two all-star teams are filled with much of the world’s best talent, the overall value of the squads is approximately £1 billion (1.486 billion U.S. dollars). Elite talent also produces exquisite goals like Suarez’s game-winner for Barca.


Second, El Clásico is more than a game – it represents a social history infused with political meaning and identity. Football Club Barcelona earned its slogan “més que un club” (“more than a club”) for its defense and refuge of the once-threatened Catalan language and its cultural resistance to Francisco Franco’s dictatorship and a larger Spanish identity. Therefore, Barcelona versus Real Madrid represented Catalonia’s confrontation with Franco, Castilian authority, and right-wing nationalism. Tim Lewis’s book review of El Clasico: Barcelona v Real Madrid by Richard Fitzpatrick insightfully portrays this historical chapter with an anecdote about the revolutionary Dutch footballer, Johan Cruyff:

In August 1973, the Dutch footballer Johan Cruyff, then at the peak of his considerable powers, signed for Barcelona. He had been pursued by Real Madrid too, but spurned their advances by saying he would never play for a team “associated with Franco.” To cement his hero-rebel status, Cruyff led his new club to a 5-0 away victory against Real Madrid and a few days afterwards, in February 1974, he named his newborn son Jordi. Sant Jordi is the patron saint of Catalonia and it was a pointed move as General Franco had not only banned the Catalan language but also outlawed Catalan names (Jorge being the preferred Spanish iteration of George).

Cruyff formed an immediate bond with Barcelona – he still lives in the city – but his decision also reflected a prevailing wisdom that Real Madrid were the team of the regime. They enjoyed favoured status and preferential treatment from Spanish administrators and referees, at least until General Franco died in his bed in late 1975 – or that was how the story went. Barcelona were oppressed and beaten down.

Therefore, El Clásico represents more than a showdown between Spain’s two largest cities: it represents an historic nationalistic struggle that has manifested itself publicly as recently as the Catalan self-determination referendum last November. Some notable players of Barcelona continue to incorporate the Catalan culture as a part of their identity. For example, in 2012, the club’s renowned midfielder and La Masia graduate, Andrés Iniesta, told France Football magazine, “I was born in La Mancha, but I grew up in Catalonia, and I love living here. I’ve spent more time in Barcelona than in Albacete. I’m Spanish, but I also feel Catalan. I feel the same.” Until 2010, those same cultural divisions had also been cited as a cause for internal discord in the Spanish national team.


Third, viewership of El Clásico represents the growth of the game in the United States. The United States’ matchup against Portugal at last summer’s World Cup attracted more domestic viewership than the NBA finals, the World Series, and the NHL playoffs. Believe it or not, Major League Soccer is the third most attended sports league in America, and its expansion into Orlando and New York represents the growing popularity of the league. Furthermore, in 2013, Messi became the seventh most popular athlete in the United States (Ronaldo was not too far behind at #21).


This writer utilized the TwitteR package of the R statistical computing language to collect several hundred thousand streaming, Roman-character tweets before, during, and after last Sunday’s game. It is important to note that these results are not conclusive scientific findings, but they are fun insights nonetheless.

The pregame and postgame tweets were randomly collected via the filterStream function for a duration of five minutes apiece, beginning ten minutes before kickoff and ending ten minutes after the final whistle. These tweets had to contain the search words “Messi” or “Ronaldo.” Both collections of “Messi” and “Ronaldo” tweets have since been parsed and stored in a data frame. Then, the following script was used to compare the number of tweets mentioning each player:

c(length(grep("Ronaldo", tweets.df$text, ignore.case = TRUE)), length(grep("Messi", tweets.df$text, ignore.case = TRUE)))
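To see how that comparison behaves, here is a small self-contained sketch on made-up tweets. The `tweets.df` contents and the `count_mentions` helper are assumptions for illustration, not the collected data. Note that a tweet naming both players is counted once for each, and that pattern matching would also catch longer words containing a name.

```r
# Made-up tweets for illustration; the real tweets.df is not reproduced here.
tweets.df <- data.frame(
  text = c("Messi is unreal", "RONALDO scores!",
           "messi vs ronaldo tonight", "Clasico time"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of tweets whose text matches a name,
# ignoring case (equivalent to length(grep(...)) in the script above).
count_mentions <- function(name, text) {
  sum(grepl(name, text, ignore.case = TRUE))
}

mentions <- c(Ronaldo = count_mentions("Ronaldo", tweets.df$text),
              Messi   = count_mentions("Messi", tweets.df$text))
print(mentions)
```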

However, those results won’t be disclosed until the end of this article!

This writer also utilized R’s filterStream function to collect in-game tweets during the full course of each half, including stoppage time (~46 minutes in the first half and ~50 minutes in the second half). The search words were “Clasico” and “BarcelonavsRealmadrid,” so the collected tweets had to mention either keyword. Next, this data was transformed into a spatial points data frame. Those points, in turn, were plotted on different basemaps designed by Google. For instance, the following code was used to plot “Clasico” and “BarcelonavsRealmadrid” tweets in the Iberian Peninsula (Fig. 1):

# load ggmap, which provides qmap() and attaches ggplot2
library(ggmap)

# plot the hybrid Google Maps basemap: Spain
map <- qmap('Spain', zoom = 6, maptype = 'hybrid')

# plot the Clasico points on top
map + geom_point(data = Clasico, aes(x = Longitude, y = Latitude), color = "red", size = 3, alpha = 0.5)

Fig. 1: 2nd Half Tweets in the Iberian Peninsula
Fig. 2: Density Surface of 2nd Half Tweet Plot Points in the Iberian Peninsula

On the surface, not many surprises were uncovered while mapping El Clásico tweets in the Iberian Peninsula. Madrid was undoubtedly the most active city on Twitter, and this comes as no shock since the city has nearly double Barcelona’s population. Barcelona, Valencia, Seville, and Málaga – Spain’s second, third, fourth, and sixth largest cities – were also unsurprisingly active on the Twittersphere during game time. However, after examining the density surface of these plot points (Fig. 2) during both halves, one discovers that Lisbon, Portugal contained a greater density of tweets than Barcelona. This is surprising, given that Barcelona hosted the match, that the city’s population is estimated at 1.63 million, that its metropolitan area numbers more than 4.6 million inhabitants, and that the city proper is one of Europe’s most densely populated. By contrast, Lisbon’s greater metropolitan population is estimated at 2.66 million inhabitants, and its population density is roughly half that of Barcelona. Both cities feature a similar combined percentage of inhabitants aged 0-14 and 65+ (Barcelona: 33.1%, Lisbon: 30.3%) – two age groups that are less likely to use social media. Linguistics might explain this phenomenon (perhaps a future topic for research), but one also cannot underestimate the Portuguese temptation to cheer on fellow countrymen Cristiano Ronaldo, Pepe, and Fábio Coentrão – all Real Madrid players. Never underestimate the lure of Ronaldo.

A comparison of the tweets emanating from Europe reveals that Europeans posted more “Clasico” and “BarcelonavsRealmadrid” tweets on the Twittersphere in the second half (Fig. 3, 4). Perhaps the most surprising revelation for the neutral observer is the number of tweets coming from Istanbul, Turkey, which, at ~14.1 million inhabitants, is Europe’s largest city (Fig. 5, 6). Neither Real Madrid nor Barcelona features a Turkish national on its roster, yet La Liga has managed to export its premier matchup to the Istanbul market.

Fig. 3: 1st Half Tweets in Europe
Fig. 4: 2nd Half Tweets in Europe
Fig. 5: Density Surface of 1st Half Tweet Plot Points in Europe
Fig. 6: Density Surface of 2nd Half Tweet Plot Points in Europe

Turkey’s neighboring state, Syria, has dominated global headlines for its ongoing civil war, human displacement and refugee crisis, human rights violations, and the rise of extremist elements like ISIS. Since football is the most popular sport in Syria, how many “Clasico” and “BarcelonavsRealmadrid” tweets came from the heartland of the Arab World?

Fig. 7: 1st Half Tweets in the Middle East
Fig. 8: Tweets in Syria

The answer: not many (Fig. 7). The only plot captured (Fig. 8) was close to regime-held As-Suknah in the Homs Governorate. This is unsurprising for three reasons. First, most Syrians speak Arabic, and “Clasico” and “BarcelonavsRealmadrid” are unlikely terms in Arabic tweets. Second, a 4:00 PM ET kickoff, considered late in Spain (9:00 PM), occurs even later in Syria (10:00 PM). Third, the few Syrians who can temporarily escape the destruction and weariness of war might still have to overcome electricity blackouts and internet censorship.
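The kickoff-time comparison can be reproduced with base R’s time zone handling. This is a rough sketch: the post never states the match date, so 22 March 2015 is an assumption here, and the result depends on the time zone database installed on the machine.

```r
# Assumed kickoff: Sunday 22 March 2015, 4:00 PM Eastern
# (the date is not stated in the post).
kickoff <- as.POSIXct("2015-03-22 16:00:00", tz = "America/New_York")

# The same instant expressed as local clock time in Spain and Syria.
local_times <- c(
  Madrid   = format(kickoff, "%H:%M", tz = "Europe/Madrid"),
  Damascus = format(kickoff, "%H:%M", tz = "Asia/Damascus")
)
print(local_times)
```

On that date the United States had already moved to daylight saving time while Spain and Syria had not, which is why the gap to Madrid is five hours rather than the usual six.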

A visualization of tweets in North and South America reveals that El Clásico was a trending topic in the United States, particularly in major cities on the East and West Coasts (Fig. 9).

Fig. 9: 2nd Half Tweets in North America

Los Angeles and New York can both stake a claim as leading markets for El Clásico (Fig. 10).

Fig. 10: Density Surface of 2nd Half Tweet Plot Points in North America

Football in the United States has a long way to go to match the enthusiasm in Argentina, Brazil, and Uruguay (Fig. 11, 12) – winners of nine World Cup titles and the home nations of Lionel Messi, Javier Mascherano, Luis Suarez, Neymar, Dani Alves, Rafinha, Douglas, Adriano, Marcelo, and Lucas Silva.

Fig. 11: 2nd Half Tweets in South America
Fig. 12: Density Surface of 2nd Half Tweet Plot Points in South America

An observation of the African continent (Fig. 13) and Asia (Fig. 14) also reflects the potential of El Clásico’s growth in other markets. (Again, it is important to consider time zone and linguistic differences, as well as geopolitics and broadcast availability.) For instance, it would seem that China was a missed opportunity compared to Japan, Malaysia, and Saudi Arabia (Fig. 15). With the rise of Chinese investment in the sport and the increasing prevalence of European clubs’ tours in the country (including Real Madrid’s exhibition against 109 Chinese children), one would imagine that El Clásico will only continue to grow in China. It’s important to note that while many Chinese citizens find ways to use Twitter, the social media site was actually blocked in the mainland following the July 2009 riots in the western province of Xinjiang.

Fig. 13: 2nd Half Tweets in Africa
Fig. 14: 2nd Half Tweets in Asia
Fig. 15: Density Surface of 2nd Half Tweet Plot Points in Asia

Finally, in the epic battle between Australia and New Zealand… well, you can decide on the winner (Fig. 16).

Fig. 16: 2nd Half Tweets in Australia and New Zealand


Ronaldo won El Clásico’s individual goal tally 1-0, but Messi’s team won the match. The face of Madrid may be the reigning player of the year, but Pelé and recently featured statistical analyses in The Economist and FiveThirtyEight have anointed Barcelona’s front man as his generation’s best. Surely, Ronaldo and his 34.4 million Twitter followers can mention the Portuguese superstar more in a ten-minute span than Messi, who has yet to create an official Twitter account.

Surprisingly, Lionel Messi still won Sunday’s pregame and postgame Twitter battle.

Pregame: Ronaldo: 835, Messi: 2022

Postgame: Ronaldo: 5414, Messi: 8121
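For a sense of scale, the counts above can be turned into shares with a short base R calculation. This is only arithmetic on the reported numbers; the `counts` matrix layout and the `messi_share` name are choices made for this sketch.

```r
# Mention counts as reported above.
counts <- rbind(Pregame  = c(Ronaldo = 835,  Messi = 2022),
                Postgame = c(Ronaldo = 5414, Messi = 8121))

# Messi's percentage of the combined mentions in each window.
messi_share <- round(100 * counts[, "Messi"] / rowSums(counts), 1)
print(messi_share)
```

Messi took roughly seven in ten mentions before the game and six in ten afterwards, so Ronaldo narrowed the gap without closing it.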

No worries, CR7 fans. Ronaldo might have lost this Twitter battle, but your favorite player did supplant Shakira this past month as the most liked person on all of Facebook. That presumes he also beat Messi.

Welcome to LVantData!

Welcome! LVantData is where data mining and social media analysis meet trending current events. In short, it’s where YOU WVANT DATA!

At times, LVantData will function as a study of the Levant – the area in the eastern Mediterranean encompassing Lebanon and Syria, as well as southern Turkey, Cyprus, Jordan, Israel, and the occupied Palestinian territory. Historically, the diverse people of the Levant have viewed themselves and their societies as byproducts of a crossroads of various rich civilizations and empires, including the Canaanites, Ancient Egyptians, Assyrians, Babylonians, Phoenicians, Greeks, Romans, Umayyads, Abbasids, Fatimids, Ottoman Turks, Europeans, and Arabs. We hope that a better grasp of the politics of the contemporary Levant might enable ordinary people to make better sense of what’s happening in the region.

At other times, LVantData will observe some of the hottest news and sports trends we personally find interesting. Football/soccer fans, this is the place for you!

LVantData is the unlikely brainchild of two Harvard friends and avid football (soccer) fans, George Somi and Christopher Miller, coming from different hemispheres (West vs. East) and from completely different research styles (Qualitative vs. Quantitative). It’s only fitting, then, that this blog is a crossroads in and of itself.

We invite you to participate in this community, and we welcome blog post ideas and contributions. We hope to quantify and qualify events that will allow you to better assess policies advocated by television pundits, politicians, and government officials and to have fun observing the trending phenomena that we’ve found interesting.

While we realize there may be disagreements, we hope to maintain a civil environment. A few rules we will ask you to honor:

  • Please refrain from personal attacks and insults against other users.
  • Please refrain from racist, sexist, obscene, and other hateful and discriminatory language.
  • Please refrain from sectarian attacks, dehumanizing groups of people, calls for violence, and expressions of happiness at the death or suffering of others.
  • Please refrain from posting spam, links to irrelevant commercials, and commercial messages.
  • In short, bullies, trolls, and spammers aren’t welcome.

Now, let’s get started!