Hollywood: a man’s world

Context:

My role is that of a guest lecturer at the University of California, Los Angeles (UCLA) film school. I work at seejane.org, which is the public site of the Geena Davis Institute on Gender in the Media. My target audience is students enrolled in Film Studies.

My goal is to convince new filmmakers that diversity is where the money is. My data on Hollywood is from a survey conducted by the Annenberg School of Journalism for the Geena Davis Institute on Gender in the Media, entitled “Gender Roles & Occupations: A Look at Character Attributes and Job-Related Aspirations in Film and Television”. This study covers 129 family films, 275 prime-time shows and 36 children’s shows from 2006-2011, and evaluates this media on the roles it portrays for males and females.

My data on the population’s roles for males and females is from the US Bureau of Labor Statistics, and I use this to compare Hollywood to reality.

My take on the research conducted by the Institute is that it has proved quite timely: in the years since 2011, when it was conducted, Hollywood has begun a massive movement towards change. My pitch is therefore very positive, arguing that the tide is turning, backed by real box office results from Box Office Mojo.

My pitch deck begins by outlining how, from 2006-2011, there was a clear gender imbalance towards males, as well as a narrow view of what it means to be a male (slides 2-4). My visualisations highlight the key indicators of the lack of diversity in casts (on a gender axis) and the narrow range of professions in which the majority of men are portrayed.

In slide 5, I posit the reason for this imbalance: the huge number of male directors dominating Hollywood. My visualisation highlights the 100% statistic and shows some familiar faces.

In slide 6 I show who is underrepresented: women, and certain types of men. My visualisation in slide 7 creates a metric to show the imbalance, i.e. the variance between the population’s proportion of roles and Hollywood’s, so under-representation sits on the left and over-representation on the right.
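The slide 7 metric can be sketched in a few lines of Python. The shares below are hypothetical placeholders for illustration only, not figures from the study:

```python
# A sketch of the slide 7 metric: Hollywood's share of a profession minus the
# population's share (US BLS). Negative gaps plot on the left (under-
# representation), positive gaps on the right (over-representation).
# These shares are hypothetical placeholders, not figures from the study.
population_share = {"Lawyers": 0.33, "Doctors": 0.34, "CEOs": 0.73}  # male share in workforce
hollywood_share  = {"Lawyers": 0.75, "Doctors": 0.80, "CEOs": 0.95}  # male share on screen

gap = {role: hollywood_share[role] - population_share[role] for role in population_share}
for role, g in sorted(gap.items(), key=lambda kv: kv[1]):
    print(f"{role:8s} {g:+.2f}")
```

Sorting by the gap gives the left-to-right ordering used on the slide.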

In slides 8 and 9 I give evidence that since 2011 this has actually started to change, with some serious box office success for films with diverse casts.

Slide 10 is my key takeaway for students: that magic happens through embracing diversity and uncertainty, rather than a staid, formulaic film.

 

The presentation:

 

 

Slide 1: Title Slide

 

Slide 2: Establishing the basis for the argument (Observation 1)

Slide 3: Establishing the basis for the argument (Observation 2)

Slide 4: Establishing the basis for the argument (Observation 3)

Slide 5: Surmising the underlying reason

Slide 6: Expressing dissatisfaction with Hollywood, which leaves large groups underrepresented

Slide 7: Expressing the opportunity in the imbalance

Slide 8: Evidence of a shift in the balance

Slide 9: Final argument

Slide 10: Conclusion and key takeaway

A News Agent and a Publican walk into a bar

 

A News Agent and a Publican walk into a bar….

Australia’s 3,800 News Agencies have been suffering greatly in recent times, and their future continues to look bleak.

As our members know too well, turnover is forecast to continue declining 3% annually, driven largely by the consumption of free digital media, but also by the decline in instant lottery sales, which represent one quarter of News Agencies’ $2bn turnover.

The other enormous problem our members face is that news agencies’ exclusive right to sell scratchies in Australia expired on 31 March 2018 (after a five-year extension), and supermarkets are desperate to move in on instant lottery sales.

News agencies still have not diversified and seem at a loss to resolve this problem.

For inspiration, News Agencies, the National Association and the media companies that rely on our member network should look at gaming machines.

Gaming machines have been the source of the decline in instant lottery turnover, as gamblers have turned to them in droves in pubs and clubs since deregulation in the early nineties.

Gaming machines are now widely distributed (except in WA), present in 3,000 licensed pubs and clubs (75% of the total). There are almost 200k machines in these venues, with half of them located in NSW (Figure 1).

 

Figure 1  The proliferation of Gaming Machines in Clubs and Pubs in 2015/2016

Consequently, gaming machine turnover has increased more than sixfold, from $23bn in 1990/91 to $143bn in 2015/16 (Figure 2).

Figure 2 Explosion in turnover of Gaming Machines in Clubs and Pubs

For pubs across Australia, this has meant a reversal in fortunes for the publican. Each machine is estimated to make over $100k annually for the pub owner.

If News Agencies were given the licence to have poker machines on premises, they could double State revenue, assuming another 200k machines could be installed across their base of 3,800 outlets, providing a future revenue stream that will support their business well beyond what instant lotteries and newspapers could do.
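As a rough back-of-envelope check using only the figures quoted in this newsletter ($100k per machine per year, another 200k machines, 3,800 outlets); whether doubling the machine base would in fact double State tax revenue depends on tax rates not given here:

```python
# Back-of-envelope check using the newsletter's own figures.
extra_machines = 200_000         # additional machines assumed installable
revenue_per_machine = 100_000    # estimated annual revenue per machine ($)
outlets = 3_800                  # newsagency outlets

extra_revenue = extra_machines * revenue_per_machine
print(f"Extra annual revenue: ${extra_revenue / 1e9:.0f}bn")   # $20bn
print(f"Machines per outlet: {extra_machines // outlets}")     # ~52
```

Roughly 52 machines per outlet would be needed to absorb the assumed 200k new machines, which itself is a useful sanity check on the proposal.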

Figure 3 Newsagency of the future

The National Association of News Agencies is polling our members to assess the level of support for this future direction of News Agencies. Once we have gathered the facts, we will lobby both media companies and Government as your representative body for the legislative change to make this happen.

Please complete the survey HERE to give your view on this vital issue by 15 June 2018.

 

 

 

Context:

My target audience is News Agent National Association members, and the medium is an online newsletter to members. The message is to lobby for gaming machine licenses, and the goal is a call to action to members to give their feedback on this proposal (via a survey). My role is the advocate for this idea.

Please note this is not my personal view; it’s just where my thoughts ended up!

My data is from the Australian Gambling Statistics 1990–91 to 2015–16, 33rd edition, a survey conducted annually by the Queensland Government. The data comes in Excel format, ready to use. I augmented this with data on the number of gaming machines in each state from the Australian Government Productivity Commission’s 2010 Inquiry into Gambling.

Some definitions:
Gaming machines: All jurisdictions except Western Australia have a state-wide gaming machine (poker machine) network operating in clubs and/or hotels (WA only has machines in the Crown Casino, 1,750 of them). The data reported under this heading do not include gaming machine data from casinos. Gaming machines accurately record the amount of wagers played on the machines, so turnover is an actual figure for each jurisdiction. In most jurisdictions operators must return at least 85 per cent of wagers to players as winnings, either in cash or a mixture of cash and product.
Instant lottery: Commonly known as ‘scratchies’, where a player scratches a coating off the ticket to identify whether the ticket is a winner. Prizes in the instant lottery are paid on a set return to player and are based on the number of tickets in a set, the cost to purchase the tickets, and a set percentage retained by the operator for costs.
Expenditure (gross profit): These figures relate to the net amount lost or, in other words, the amount wagered less the amount won, by people who gamble.  Conversely, by definition, it is the gross profit (or gross winnings) due to the operators of each particular form of gambling.

Using think-cell for the corporate audience

Sometimes, you have an extremely corporate audience (think News Corp, where I worked for more than four years).

This audience has seen it all before: every tool, every business idea and every design fad. They do not want razzle dazzle; they want accountability and reproducibility.

They want transparent and honest presentation of your research,  assumptions, workings, plans and conclusions, to ensure stakeholders can critique and ultimately buy into your work.

For these types of audiences, I use think-cell. It is the secret weapon behind the professional charts of consultancy firms, and I am sharing it with you! It is an Excel and PowerPoint plug-in and costs about $300 a year, although you can get a 28-day free trial when you sign up: https://server.think-cell.com/portal/en/trial.srf.

Think-cell can produce waterfall charts in a flash, build gorgeous work breakdown structures using Gantt charts, and calculate and display compound annual growth rates in a few clicks.

For anyone who has tried to do these things in Excel, you are going to enjoy seeing how easy this is.
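The compound annual growth rate that think-cell computes in a few clicks is also easy to verify by hand. A minimal sketch, using the gaming machine turnover figures quoted in the newsletter above ($23bn in 1990/91 to $143bn in 2015/16, i.e. 25 years):

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1

# Gaming machine turnover: $23bn (1990/91) to $143bn (2015/16), 25 years.
rate = cagr(23, 143, 25)
print(f"CAGR: {rate:.1%}")   # about 7.6% per year
```

Verifying the figure yourself is exactly the kind of accountability a corporate audience expects.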

Waterfall Charts

Waterfall charts are often used to show contributions to movements in profit, revenue or expenditure from one period to the next (https://en.wikipedia.org/wiki/Waterfall_chart).
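The mechanics behind a waterfall chart are simple: each bar “floats”, spanning from the running total before its movement to the running total after it. A sketch with purely hypothetical movements:

```python
# Each waterfall bar spans [running total before, running total after] its
# movement. The movements below are hypothetical, purely to illustrate.
movements = [
    ("Households",  4.2),
    ("Government",  1.1),
    ("Investment", -2.3),
    ("Net exports", 0.8),
]

running = 0.0
for name, m in movements:
    bottom, top = sorted((running, running + m))  # negative movements hang down
    print(f"{name:12s} bar from {bottom:5.1f} to {top:5.1f}")
    running += m
print(f"{'Total':12s} closing bar at {running:5.1f}")
```

This is all think-cell is doing under the hood; it just saves you building the floating bars by hand in Excel.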

The ABS National Expenditure data

For my data, I am using the Australian Bureau of Statistics National Accounts for 2016/17 and 2015/16 to illustrate movements in national expenditure:

http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/5206.0Dec%202017?OpenDocument#Time

I selected this data because it is publicly available and appropriate for use of a waterfall chart.

My first step is to ensure I understand how the columns work, and what is a subtotal.

There is a tiny bit of cleaning required.

It is common for ABS data to compare the current period to the same period last year, i.e. Dec 2017 to Dec 2016, to account for seasonal variation between quarters.

So I only included December quarters. I then worked out the movement between each December quarter in each category.
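The movement calculation itself is a simple like-for-like subtraction. A sketch with hypothetical December-quarter figures (the real figures come from the ABS spreadsheet):

```python
# Movement in each expenditure category between December quarters, matching
# the like-for-like comparison described above. Figures are hypothetical ($m).
dec_2016 = {"Households": 1020, "Government": 310, "Investment": 415}
dec_2017 = {"Households": 1068, "Government": 322, "Investment": 398}

movement = {cat: dec_2017[cat] - dec_2016[cat] for cat in dec_2016}
print(movement)   # {'Households': 48, 'Government': 12, 'Investment': -17}
```

These movements become the floating bars of the waterfall chart.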

Think-cell viz tool

Think-cell is an Office plug-in and therefore has its own menu in Excel (and PowerPoint), as shown in Figure 3 below. I selected the waterfall option.

Think-cell has a guide to creating each of its charts, including waterfall charts: https://www.think-cell.com/en/support/manual/waterfall.shtml

Firstly, you need to lay the data out as shown in the guide and select it with your mouse (Table 1). Note the empty row between header and data, which is required; also note the “e” for the end column.

Table 1 Excel Data Table

Then you select the chart you want: waterfall (Figure 3).

Figure 3 Selecting the right chart

You then move into PowerPoint to paste the chart; this is a bug, as you used to be able to paste directly within Excel. Figure 4 shows how Table 1 renders, straight out of the box.

Figure 4 Waterfall Chart

The last step is customising it to make it easier to read for the audience.

Figure 5 shows the options available on right-click, which make it easy to add more or less detail and focus the chart.

Figure 5 Right Click

Almost every aspect is configurable if you click on it. You just need to zoom in so you can differentiate the details.

Figure 6 shows the result of about five minutes of finessing within PowerPoint.

Figure 6 The power of think-cell

This is the kind of thing corporate executives love to see. The colours are consistent. Key variances are highlighted.

Except for that little yellow 18, which is an error in movement (see how accountable think-cell is!).

I just love think-cell. Of my three blogs, this covers the one tool I have used extensively before, but I just had to share it with my classmates.

This tool is built on Microsoft Office and distills years of consulting experience into its left and right mouse buttons. The only downside: it doesn’t work with Google Docs, and probably not on a Mac either.

 

Down the Conversion Funnel using rawgraphs.io

Feedback on my first DVN blog led me down the conversion funnel

My first blog and in-class presentation used Tableau to explore conversion rates: https://15-6762.ca.uts.edu.au/using-tableau-to-explore-funnel-conversion-rates/

My feedback suggested funnel charts and Sankey diagrams, and the free rawgraphs.io site.

Sankey diagrams, and the more recent alluvial charts, are attention-grabbing flow diagrams.

These diagram types essentially map the changes between a number of histograms showing the same data split in different ways, as shown by the example alluvial diagram created by Cory Brunson in Diagram 1.

Diagram 1 An alluvial diagram using the Titanic data set: http://corybrunson.github.io/ggalluvial/articles/ggalluvial.html

Data cleaning was an iterative and educational process

Once I began experimenting with the tool,  I questioned my decision to use Alluvial charts almost immediately!

The data I had was conversion data but was not at all in the right format.

Thankfully the rawgraph tool was fast, so I could play with the data.

Table 1 The original data

After cleaning, the data ended up like Table 2 below: completely aggregated, without any of the daily detail.

Table 2 Final data format

For the 55,361 attempted logins, I ensured the categorical variables all added to this number by creating new variables.

I defined the Source variable using the original Mobile, Desktop and Tablet values, then added Login, Redemption and Sign Up.

For the Destination variable I created the Login, Redemption, Sign up and Cancel Login, Cancel Redemption and Cancel Sign up values.

Lastly for the device variable, I had the original mobile, desktop and tablet and then created a Lost variable.
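The Source/Destination/Device construction above amounts to a long-format flow table, which is the shape rawgraphs’ alluvial layout expects: one row per flow with a count. The counts below are hypothetical, chosen only so they sum to the 55,361 attempted logins:

```python
# Long-format flow table of the kind rawgraphs' alluvial layout expects:
# one row per (source, destination, device) flow with a count.
# Counts are hypothetical; the real data summed to 55,361 attempted logins.
flows = [
    ("Mobile",  "Login",        "mobile",  20000),
    ("Mobile",  "Cancel Login", "Lost",     8000),
    ("Desktop", "Login",        "desktop", 15000),
    ("Desktop", "Cancel Login", "Lost",     5361),
    ("Tablet",  "Login",        "tablet",   5000),
    ("Tablet",  "Cancel Login", "Lost",     2000),
]

total = sum(count for *_, count in flows)
print(f"Total attempts accounted for: {total}")
```

Checking that every flow is accounted for is exactly the constraint Sankey-like charts impose: all volume must sum across each column.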

rawgraphs was super easy to use

rawgraphs was an excellent tool for quickly learning about the charts and what data each of them requires.

The website has a comprehensive library of guides.

I used “How to make an alluvial diagram” https://rawgraphs.io/learning/how-to-make-an-alluvial-diagram/

There are four steps:

  1. Load your data
  2. Choose the layout
  3. Map your dimensions
  4. Customize

It was iterative: cycling through steps 1-4 and reformatting the data based on what I learned, but luckily the site was very responsive.

Step 1 Load your data

I actually just pasted mine in, easy!

Figure 1 Step 1

Step 2 Choose a Chart

rawgraphs has 21 chart templates and the ability to create a custom chart.

Figure 2 Step 2

Step 3 Map your dimensions

As shown in Figure 3, the dimensions on the left are parsed into types (number and strings) to create your graph.

Figure 3 Step 3

 

Step 4 The final step, customise your Visualisation

In the final step, there is only a limited set of customisation options, and not enough for my liking!

Figure 4 Step 4

 

I played around with the order of the steps to make the chart more meaningful, and ended up with Figure 5 below.

Figure 5 The Final Alluvial Chart

My findings

My alluvial chart explains how the traffic moves from mobile, desktop and tablet (bottom-left column) into the different steps in the funnel (middle column). The right column shows what proportion of this traffic is lost.

Honestly, this does not make intuitive sense to me, so there must be more work to be done in order to have more columns and splits.

Essentially, these types of diagrams best represent a snapshot of a segmented population. Working through this type of chart (Sankey, alluvial and parallel coordinates) made me realise the shortcomings of the data for this purpose.

If each error landing page had been tagged, then I would not have required a LOST category. So this data was not ideal for use with this type of chart, but I learned a lot!

I guess these charts are most useful for temporal comparisons, or when you have no idea which parts of the website get used most.

In conclusion, I think the rawgraphs.io site is easy to use and great for learning about different charts, what they can and cannot do, and what data formats they need. But the Sankey-like charts did not work for the data I had, and I needed to do a lot more rework to make them meaningful.

 

 

 

 

 

Read on if you want more information on Sankey and alluvial diagrams (not part of the assignment)

“Sankey Diagrams are attention grabbing flowcharts that help in quick visualisation of the distribution and losses of material and energy in a process. The width of the lines used in drawing the flowchart is proportional to the quantum of material or energy.”
(source: http://www.sankeydiagrams.com)

As material or volume flows from one step to the next,  all volume must be accounted for, including wastage, new inputs and growth via processing.

“Alluvial diagrams are a type of flow diagram originally developed to represent changes in network structure over time. In allusion to both their visual appearance and their emphasis on flow, alluvial diagrams are named after alluvial fans that are naturally formed by the soil deposited from streaming water.” https://en.wikipedia.org/wiki/Alluvial_diagram

Cory Brunson also offers some definitions:

  • An axis is a dimension (variable) along which the data are vertically grouped at a fixed horizontal position. The diagram above uses three categorical axes: Class, Sex, and Age.
  • The groups at each axis are depicted as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.
  • Horizontal (x-) splines called alluvia span the width of the diagram. In this diagram, each alluvium corresponds to a fixed value of each axis variable, indicated by its vertical position at the axis, as well as of the Survived variable, indicated by its fill color.
  • The segments of the alluvia between pairs of adjacent axes are flows.
  • The alluvia intersect the strata at lodes. The lodes are not visualized in the above diagram, but they can be inferred as filled rectangles extending the flows through the strata at each end of the diagram or connecting the flows on either side of the center stratum.

 

 

 

Accountability and Ethics in Data Science: Professional Ethics in contemporary Data Science practice

Executive Summary

This paper will discuss accountability, ethics and professionalism in data science (DS) practice, considering the demands and challenges practitioners face. Dramatic increases in the volume of data captured from people and things, and in the ability to process it, place Data Scientists in high demand. Business executives hold high hopes for the new and exciting opportunities DS can bring to their business, and hype and mysticism abound. Meanwhile, the public is increasingly wary of trusting businesses with their personal data, and governments are implementing new regulation to protect public interests. We ask whether some form of professional ethics can protect data scientists from unrealistic employer expectations and far-reaching public accountabilities.

Demand for Data Science

Demand for DS skills is off the charts, as Data Scientists have the potential to unlock the promise of Big Data and Artificial Intelligence.

As much of our lives are conducted online, and everyday objects are connected to the internet, the “era of Big Data has begun.”(boyd & Crawford 2012). Advancements in computing power, and cheap cloud services mean that vast amounts of digital data are tracked, stored and shared for analysis (boyd & Crawford 2012), and there is a process of “datafication” as this analysis feeds back into people’s lives (Beer 2017).

Concurrently, Artificial Intelligence (AI) is gaining traction through successful use of statistical machine learning and deep learning neural networks for image recognition, natural language processing, and games and dialogue question and answer (Elish & boyd 2017).  AI now permeates every aspect of our lives in chatbots, robotics, search and recommendation services, automated voice assistants and self-driving cars.

Data is the new oil, and Google, Amazon, Facebook and Apple (GAFA) control vast amounts of it. Combined with their network power, this results in supernormal profits: US$25bn net profit between them in the first quarter of 2017 alone (The Economist 2017). Tesla, which made 20,000 self-driving cars in this period, is worth more than GM, which sold 2.5m (The Economist 2017).

Furthermore, traditional industries such as government, education, healthcare, financial services, insurance and retail, and functions such as accounting, marketing, commercial analysis and research, which have long used statistical modelling and analysis in decision making, are harnessing the power of Big Data and AI, which supplements or replaces “complex decision support in professional settings” (Elish & boyd 2017).

All these factors drive incredible demand from organisations, resulting in a shortage of Data Scientists.

Demand for Accountability

With this incredible appetite for and supply of personal data, individuals, governments and regulators are increasingly concerned about threats to competition (globally), personal privacy and discrimination, as DS, algorithms and big data are neither objective nor neutral (Beer 2017; Goodman & Flaxman 2016). They must be understood as socio-technical concepts (Elish & boyd 2017), and their limitations and shortcomings well understood and mitigated.

To begin with, the process of summarising humans into zeros and ones removes context; therefore, contrary to popular mythology about Big Data, the larger the data set, the harder it is to know what you are measuring (Theresa Anderson n.d.; Elish & boyd 2017). Rather, the DS practitioner has to decide what is observed, recorded and included in the model, how the results are interpreted, and how to describe their limitations (Elish & boyd 2017; Theresa Anderson n.d.).

All too often, limitations in the data mean that “cultural biases and unsound logics get reinforced and scaled by systems in which spectacle is prioritised over careful consideration” (Elish & boyd 2017).

In addition, profiling is inherently discriminatory, as algorithms sort, order, prioritise and allocate resources in ways that can “create, maintain or cement norms and notions of abnormality” (Beer 2017; Goodman & Flaxman 2016). Statistical machine learning scales normative logic (Elish & boyd 2017): biased data in means biased data out, even if protected measures are excluded but correlated ones are included. Systems are not optimised to be unbiased; rather, the objective is better average accuracy than the benchmark (Merity 2016).

Lastly, algorithms by their statistical nature are risk-averse, and focus where they have a greater degree of confidence (Elish & boyd 2017; Theresa Anderson n.d.; Goodman & Flaxman 2016), exacerbating the under-representation of minorities that exists in unbalanced training data (Merity 2016).

In response, the European Union announced an overhaul of its data protection regime, from a Directive to the far-reaching General Data Protection Regulation. Taking effect in 2018, this regulation protects the rights of individuals, including citizens’ right to be forgotten and to have their data stored securely, but also the right to an explanation of algorithmic decisions that significantly affect an individual (Goodman & Flaxman 2016). The regulation prohibits decisions made entirely by automated profiling and processing, and will impose significant fines for non-compliance.

Ethical Challenges and Opportunities for DS Practitioners

DS practitioners must overcome many challenges to meet these demands for accountability and profit. It all boils down to ethics. Data scientists must identify and weigh up the potential consequences of their actions for all stakeholders, and evaluate their possible courses of action against their view of ethics or right conduct (Floridi & Taddeo 2016).

Algorithms are machine learning, not magic (Merity 2016), but the media and senior executives seem to have blind faith, and regularly use “magic” and “AI” in the same sentence (Elish & boyd 2017).

In order to earn the trust of businesses and act ethically towards the public, practitioners must close the expectation gap generated by recent successful (but highly controlled) “experiments-as-performances” by being very clear about the limitations of their DS practices. Otherwise DS will be seen as snake oil and collapse under the weight of the hype and these unmet expectations (Elish & boyd 2017), or breach regulatory requirements and lose public trust trying to meet them.

The accountability challenge is compounded in multi-agent, distributed global data supply chains, where accountability and control are hard to assign and assert (Leonelli 2016): the data may not be “cooked with care”, and the provenance of, and assumptions within, the data are unknown (Elish & boyd 2017; Theresa Anderson n.d.).

Furthermore, cutting-edge DS is not a science in the traditional sense (Elish & boyd 2017), where hypotheses are stated and tested using the scientific method. Often, it really is a black box (Winner 1993), where the workings of the machine are unknown, and hacks and shortcuts are made to improve performance without really knowing why they work (Sutskever, Vinyals & Le 2014).

This makes the challenge of making the algorithmic process and results explainable to a human almost impossible in some networks (Beer 2017).

Lastly, the social and technical infrastructure grows quickly around algorithms once they are out in the wild. With algorithms powering self-driving cars and air traffic collision avoidance systems, ignoring the socio-technical context can have catastrophic results. The Überlingen crash in 2002 occurred because there was limited training on what controllers should do when they disagreed with the algorithm (Ally Batley 2017; Wikipedia n.d.). Data scientists have limited time and influence to optimise the socio-technical setting before order and inertia set in, but the good news is that the time is now, while the technology is new (Winner 1980).

Indeed, the opportunities to use DS and AI for the betterment of society are vast. If data scientists embrace the uncertainty and the humanity in the data, they can make space for human creative intelligence, whilst at the same time respecting those who contributed the data, and hopefully create some real magic (Theresa Anderson n.d.).

 

 

Professions and Ethics

So how can DS practitioners equip themselves to take on these challenges and opportunities ethically?

Historically, many other professions have formed professional bodies to provide support outside of the influence of the professional’s employer. The members sign codes of ethics and professional conduct, in vocations as diverse as designers, doctors and accountants (The Academy of design professionals 2012; Australian Medical Association 2006; CAANZ n.d.).

Should DS practitioners follow this trend?

What is a profession?

“A profession is a disciplined group of individuals who adhere to ethical standards and who hold themselves out as, and are accepted by the public as possessing special knowledge and skills in a widely recognised body of learning derived from research, education and training at a high level, and who are prepared to apply this knowledge and exercise these skills in the interest of others. It is inherent in the definition of a profession that a code of ethics governs the activities of each profession.” (Professions Australia n.d.)

The central component in every definition of a profession is ethics and altruism (Professions Australia n.d.), therefore it is worth exploring professional membership further as a tool for data science practitioners.

Current state of DS compared to accounting profession

Let us compare where the nascent DS practice is today with the chartered accountant (CA) profession. The first CA membership body was formed in 1854 in Scotland (Wikipedia 2017a), long after double-entry accounting was invented in the 13th century (Wikipedia 2017b). Modern data science began in the mid-twentieth century (Foote 2016), and there is as yet no professional membership body.

The current CA membership growth rate is unknown, but DS practitioner growth is impressive. In 2016, there were 2.1M licensed chartered accountants[1] (Codd 2017). IBM predicts there will be 2.7M data scientists by 2020, a 15% annual growth rate (Columbus n.d.; IBM Analytics 2017).

The standard of education is very high in both professions, but for different reasons. Chartered Accountants face strenuous postgraduate exams to apply for membership, and requirements for continuing professional education (CAANZ n.d.).

DS entry levels are high too, but enforced by competitive forces only. Right now, 39% of DS job openings require a Master’s or PhD (IBM Analytics 2017), but this may change over time as more and more data scientists are educated outside of universities.

The CA code of ethics is very stringent, requiring high standards of ethical behaviour and outlining rules, and membership can be revoked if the rules are broken (CAANZ n.d.). CAs must treat each other respectfully, and act ethically and in accordance with the code towards their clients and the public.

Lastly, like accounting, DS is all about numbers and seems a quantitative, objective science. Yet there is compelling research indicating both are more like social sciences, and benefit from reflexive research practices (boyd & Crawford 2012; Elish & boyd 2017; Chua 1986, 1988; Gaffikin 2011). Also like accountants (Gallhofer, Haslam & Yonekura 2013), DS practitioners could suffer criticism for being long on practice and short on theory.

Therefore, DS should look hard at the experience of accountants and determine if, and when becoming a profession might work for them.

For and Against DS becoming a profession

It is conceivable that individually, DS practitioners could be ethical in their conduct, without the large cost in time and money of professional membership.

Data scientists are very open about their techniques, code and results accuracy, and welcome suggestions and feedback. They use open source software packages, share their code on sites like GitHub and BitBucket, contribute answers on Stack Overflow, blog about their learnings and present and attend Meet Ups.  It’s all very collegiate, and competitive forces drive continuous improvement.

But despite all this online activity, it is not clear whether they behave ethically. They do not readily share data as it is often proprietary and confidential, nor do they share the substantive results and interpretation. This means it is difficult to peer review or reproduce their results, and be transparent about their DS practices to ascertain if they are ethical or not.

A professional body may seem like a lot of obligations and rules, but by proclaiming an ethical stance it could provide data scientists with some protection and greater access to data.

From the public’s point of view, a profession is meant to be an indicator of trust and expertise (Professional Standards Councils n.d.). Unlike other professions, the public would rarely directly employ the services of a data scientist, but they do give consent for data scientists to collect their data (“oil”).

Becoming a profession could earn public trust and personal data (Accenture n.d.). It could also help pool resources, allow practitioners to pursue initiatives that are altruistic and socially preferable (Floridi & Taddeo 2016), make for good leaders who can navigate conflict and ambiguity (Accenture n.d.), and deliver good financial results (Kiel 2015).

With the growing regulatory focus on data and data security, it is foreseeable that Chief Data Officers and Chief Information Security Officers may soon be subject to individual fines and jail-time penalties, as Chief Executive and Chief Financial Officers are with regard to Sarbanes-Oxley Act compliance (Wikipedia 2017c). Professional membership can provide the training and support needed to keep practitioners up to date, in compliance and out of jail.

Lastly, right now, the demand for DS skills far outweighs supply. Therefore, despite the significant concentration of DS employers (in GAFA), the bargaining power of some individual data scientists is relatively high. However, they have no real influence over how their work is used: their only option in a disagreement is to resign. Over the medium term, supply will catch up with demand, and then even the threat of resignation will become worthless.

In summary

Steering the course of DS practice towards ethical outcomes is easiest at the outset (Winner 1980), but it is highly unlikely that DS practitioners will stand up to their employers and voluntarily band together to create a professional membership body in the immediate future.

Professional ethics can protect data scientists from unrealistic employer expectations and far-reaching public accountabilities, but the organisational effort may come too late.

Regulatory pressure that counters the power of GAFA may create the force for change. More likely, though, professional indemnity insurers and legal liability cases will eventually force sole traders and small-to-medium businesses to band together as a professional body, to shoulder the responsibility of public accountability and earn the right to their data.


Bibliography

Accenture n.d., ‘Data Ethics Point of view’, www.accenture.com, viewed 12 November 2017, <https://www.accenture.com/t00010101T000000Z__w__/au-en/_acnmedia/PDF-22/Accenture-Data-Ethics-POV-WEB.pdf#zoom=50>.

Batley, A. 2017, Air Crash Investigation – DHL Mid Air COLLISION – Crash in Überlingen, viewed 20 November 2017, <https://www.youtube.com/watch?v=yQ0yBFoO2V4>.

Australian Medical Association 2006, ‘AMA Code of Ethics – 2004. Editorially Revised 2006’, Australian Medical Association, viewed 20 November 2017, <https://ama.com.au/tas/ama-code-ethics-2004-editorially-revised-2006>.

Beer, D. 2017, ‘The social power of algorithms’, Information, Communication & Society, vol. 20, no. 1, pp. 1–13.

boyd,  danah & Crawford, K. 2012, ‘Critical Questions for Big Data’, Information, Communication & Society, vol. 15, no. 5, pp. 662–79.

CAANZ n.d., ‘Codes and Standards | Member Obligations’, CAANZ, Text, viewed 20 November 2017, <http://www.charteredaccountantsanz.com/member-services/member-obligations/codes-and-standards>.

Chua, W.F. 1988, ‘Interpretive Sociology and Management Accounting Research- a critical review’, Accounting, Auditing and Accountability Journal, vol. 1, no. 2, pp. 59–79.

Chua, W.F. 1986, ‘Radical Developments in Accounting Thought’, The Accounting Review, vol. LXI, no. 4, pp. 601–33.

Codd, A. 2017, ‘How many Chartered accountants are in the world?’, quora.com, viewed 20 November 2017, <https://www.quora.com/How-many-Chartered-accountants-are-in-the-world>.

Columbus, L. n.d., ‘IBM Predicts Demand For Data Scientists Will Soar 28% By 2020’, Forbes, viewed 20 November 2017, <https://www.forbes.com/sites/louiscolumbus/2017/05/13/ibm-predicts-demand-for-data-scientists-will-soar-28-by-2020/>.

Data Science Association n.d., ‘Data Science Association Code of Conduct’, Data Science Association, viewed 13 November 2017, <http://www.datascienceassn.org/code-of-conduct.html>.

Elish, M.C. & boyd,  danah 2017, Situating Methods in the Magic of Big Data and Artificial Intelligence, SSRN Scholarly Paper, Social Science Research Network, Rochester, NY, viewed 19 November 2017, <https://papers.ssrn.com/abstract=3040201>.

Floridi, L. & Taddeo, M. 2016, ‘What is data ethics?’, Phi.Trans.R.Soc.A, no. 374:20160360.

Foote, K. 2016, ‘A Brief History of Data Science’, DATAVERSITY, viewed 21 November 2017, <http://www.dataversity.net/brief-history-data-science/>.

Gaffikin, M. 2011, ‘What is (Accounting) history?’, Accounting History, vol. 16, no. 3, pp. 235–51.

Gallhofer, S., Haslam, J. & Yonekura, A. 2013, ‘Further critical reflections on a contribution to the methodological issues debate in accounting’, Critical Perspectives on Accounting, vol. 24, no. 3, pp. 191–206.

Goodman, B. & Flaxman, S. 2016, ‘European Union regulations on algorithmic decision-making and a ‘right to explanation’’, arXiv:1606.08813 [cs, stat], viewed 13 November 2017, <http://arxiv.org/abs/1606.08813>.

IBM Analytics 2017, ‘The Quant Crunch’, IBM, viewed 20 November 2017, <https://www.ibm.com/analytics/us/en/technology/data-science/quant-crunch.html>.

Kiel, F. 2015, ‘Measuring the Return on Character’, Harvard Business Review, viewed 13 November 2017, <https://hbr.org/2015/04/measuring-the-return-on-character>.

Leonelli, S. 2016, ‘Locating ethics in data science: responsibility and accountability in global and distributed knowledge production systems’, Phil. Trans. R. Soc. A, vol. 374, no. 2083, p. 20160122.

Merity, S. 2016, ‘It’s ML, not magic: machine learning can be prejudiced’, Smerity.com, viewed 19 November 2017, <https://smerity.com/articles/2016/algorithms_can_be_prejudiced.html>.

Professional Standards Councils n.d., What is a profession? | Professional Standards Councils, viewed 19 November 2017, <https://www.psc.gov.au/what-is-a-profession>.

Professions Australia n.d., What is a profession?, viewed 21 November 2017, <http://www.professions.com.au/about-us/what-is-a-professional>.

Sutskever, I., Vinyals, O. & Le, Q.V. 2014, ‘Sequence to Sequence Learning with Neural Networks’, arXiv:1409.3215 [cs], viewed 4 November 2017, <http://arxiv.org/abs/1409.3215>.

The Academy of design professionals 2012, ‘The Academy of Design Professionals – Code of Professional Conduct’, designproacademy.org, viewed 13 November 2017, <http://designproacademy.org/code-of-professional-conduct.html>.

The Economist 2017, ‘The world’s most valuable resource is no longer oil, but data’, The Economist, 6 May, viewed 19 November 2017, <https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource>.

Anderson, T. n.d., Managing the Unimaginable, viewed 19 November 2017, <https://www.youtube.com/watch?v=YEPPW09qpfQ&feature=youtu.be>.

Wikipedia 2017a, ‘Chartered accountant’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=Chartered_accountant&oldid=810642744>.

Wikipedia 2017b, ‘History of accounting’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=History_of_accounting&oldid=810643659>.

Wikipedia 2017c, ‘Sarbanes–Oxley Act’, Wikipedia, viewed 21 November 2017, <https://en.wikipedia.org/w/index.php?title=Sarbanes%E2%80%93Oxley_Act&oldid=808445664>.

Wikipedia n.d., ‘Überlingen mid-air collision’, Wikipedia, viewed 20 November 2017, <https://en.wikipedia.org/wiki/%C3%9Cberlingen_mid-air_collision>.

Winner, L. 1980, ‘Do Artifacts Have Politics?’, Daedalus, vol. 109, no. 1, pp. 121–36.

Winner, L. 1993, ‘Upon Opening the Black Box and Finding It Empty: Social Constructivism and the Philosophy of Technology’, Science, Technology, & Human Values, vol. 18, no. 3, pp. 362–78.

[1] not including unlicensed practitioners such as bookkeepers, or Certified Practicing Accountants

Anthropomorphising the algorithm

Leading on from my last blog post’s conclusion that holding algorithms accountable is a bit of a daft idea, I want to thank Richard Nota for this wonderful comment on the article from The Conversation that Andrew Waites posted in our Slack channel:

Richard Nota

The ethics is about the people that oversee the design and programming of the algorithms.

Machine learning algorithms work blindly towards the mathematical objective set by their designers. It is vital that this task include the need to behave ethically.

A good start would be for people to stop anthropomorphising robots and artificial intelligence.

Anthropomorphising…. I had to google to see if that is even a word (it is).
But that is exactly what I believe needs to happen: stop anthropomorphising algorithms.
As Theresa puts it, they are part of the infrastructure, and once let loose into the wild, they can prove extremely inflexible if they are not created with care and managed appropriately.
It’s up to the humans to manage the ethical implications of the algorithms in their systems.
Anyway, the article was written by Lachlan McCalman, who works at Data61, and he makes some very good arguments.
He points out that making the smallest possible mistake does not mean NO mistakes.
Lachlan describes four errors and how an algorithm can be designed to adjust for them.
1. Different people, different mistakes
There can actually be quite large mistakes for different subgroups that offset each other. In particular for minorities: because there are few examples, getting their predictions wrong doesn’t penalise the overall results much.
I know about this already thanks to my favourite, Jeff Larson at ProPublica, and the offsetting errors in the recidivism prediction algorithm: false negatives and positives for white and black males. I’m sure you can work out who got the false negatives (incorrectly predicted will not reoffend) vs false positives (incorrectly predicted will reoffend).
Lachlan suggests that to fix this, the algorithm would need to be changed to care equally about accuracy for each subgroup.
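To make that concrete, the first step is measuring error separately for each subgroup, not just overall. A minimal sketch in Python (the toy data and function name are mine, not Lachlan’s implementation):

```python
from collections import defaultdict

def per_group_accuracy(records):
    """Accuracy overall and per subgroup.

    records: list of (group, y_true, y_pred) tuples.
    Returns (overall_accuracy, {group: accuracy}).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    overall = sum(correct.values()) / sum(total.values())
    by_group = {g: correct[g] / total[g] for g in total}
    return overall, by_group

# Toy illustration: 90 majority-group records (9 in 10 correct)
# and 10 minority-group records (only half correct).
records = ([("majority", 1, 1)] * 81 + [("majority", 1, 0)] * 9
           + [("minority", 1, 1)] * 5 + [("minority", 1, 0)] * 5)
overall, by_group = per_group_accuracy(records)
# overall is 0.86, which looks fine, but minority accuracy is only 0.5
```

The overall number hides exactly the offsetting mistakes Lachlan describes; only the per-group breakdown exposes them.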
2. The algorithm isn’t sure
Of course, it’s just a guess, and there are varying degrees of uncertainty.
Lachlan suggests the algorithm could allow for giving the benefit of the doubt where there is uncertainty.
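A tiny sketch of what “benefit of the doubt” could look like in code (the threshold and band values are hypothetical, not from the article):

```python
def decide(score, threshold=0.5, band=0.1):
    """Classify, but give the benefit of the doubt near the threshold.

    Scores within +/- band of the threshold are not auto-decided;
    they could be referred to a human or default to the favourable outcome.
    """
    if abs(score - threshold) < band:
        return "refer to human"
    return "positive" if score >= threshold else "negative"

decisions = [decide(s) for s in (0.95, 0.55, 0.12)]
# → ['positive', 'refer to human', 'negative']
```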
3. Historical bias
This one is huge. Of course patterns of bias become entrenched if the algorithm is fed biased history.
So changing the algorithm (positive discrimination, perhaps) to counter this bias would be required.
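One concrete version of that adjustment is reweighting training examples so that each group contributes equal total weight, rather than letting the biased history dominate. A sketch with hypothetical data:

```python
from collections import Counter

def balancing_weights(groups):
    """Per-example weights so every group contributes equal total weight.

    groups: list of group labels, one per training example.
    """
    counts = Counter(groups)
    n_groups = len(counts)
    n_total = len(groups)
    # each group's examples share 1/n_groups of the total weight
    return [n_total / (n_groups * counts[g]) for g in groups]

weights = balancing_weights(["a"] * 8 + ["b"] * 2)
# group "a" examples each get 10/(2*8) = 0.625,
# group "b" examples each get 10/(2*2) = 2.5
```

Most learning libraries accept per-example weights like these at training time, which is one simple lever for countering historical bias.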
4. Conflicting priorities
Trade-offs need to be made when there are limited resources.
Judgement is required; there is no simple answer here.
In conclusion, Lachlan proposes that there needs to be an “ethics engineer” who explicitly obtains ethical requirements from stakeholders, converts them into a mathematical objective, and then monitors the algorithm’s ability to meet that objective in production.

About algorithms being black boxes

For 36111 Philosophies of Data Science Practices’ first assignment, I am exploring the emerging practice of holding algorithms accountable.

Often, people refer to algorithms as black boxes.

There are three different definitions of a black box, according to Merriam-Webster:

Definition of black box

1: a usually complicated electronic device whose internal mechanism is usually hidden from or mysterious to the user; broadly: anything that has mysterious or unknown internal functions or mechanisms
2: a crashworthy device in aircraft for recording cockpit conversations and flight data
3: a device in an automobile that records information (such as speed, temperature, or gasoline efficiency) which can be used to monitor vehicle performance or determine a cause in the event of an accident

 

Usually, when people refer to algorithms as black boxes, they mean a type 1 black box. So what does that imply about how we interact with these black boxes? It’s something mysterious that munges inputs and turns them into instructions you blindly follow?

If you treat algorithms like this, you may end up opening up a type 2 black box.

Let me explain what I mean with an example, courtesy of Air Crash Investigations tv series (see the episode perhaps illegally uploaded to YouTube here).

In 2002, two planes collided mid-air over Überlingen in Germany, tragically killing everyone on board, mostly children. Afterwards, the devastated air traffic controller was murdered in his front garden by a grief-stricken father who had lost his entire family in the crash (Wikipedia n.d.). Absolutely awful.

One of the contributing factors to this disaster was confusion in the human/computer interaction in the use of the Traffic Alert and Collision Avoidance System (TCAS) (see Kuchar and Drumm for how it works). TCAS is basically a system of sensors and algorithms that alerts and advises pilots on what action to take to avoid collisions. In this incident, the instructions from TCAS and from the air traffic controller conflicted. One pilot followed TCAS, the other air traffic control, so both descended, ultimately ending in tragedy.

The TCAS software itself did not fail, but as there was no international protocol on what to do in these circumstances, the overall system failed. The supporting infrastructure was not there: the human-computer interaction had not been adequately considered, nor had training. A previous incident in Japan (Wikipedia n.d.) had been reported to the International Civil Aviation Organization but no action had been taken. (If that crash had occurred, 677 people would have died, and it would have been the largest toll ever.)

So my work is going to consider not just countering machine bias in the algorithm itself, but also the context in which it is used, and whether that context is appropriate.

At the end of the day, holding an algorithm accountable is actually a ludicrous concept. It can only be the humans who are accountable.

On countering machine bias

ProPublica has a whole section dedicated to this topic. So glad to see this, and it appears they have covered insurance companies charging higher premiums in minority neighbourhoods, which I always suspected was happening. Can’t wait to read that!

This is a topic for another blog post!

GANs, glorious GANs!

GANs, which use supervised-learning machinery and game theory to tackle unsupervised problems, are just so darn elegant. The Grace Kelly of deep learning.

Pitting the generator and the discriminator against each other (where the generator tries to fool the discriminator into classifying its output as a real sample) is genius in its simplicity.
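That adversarial setup can be written down in a few lines. Here is a sketch of the standard minimax losses in their log-loss form, assuming we already have the discriminator’s probability scores on real and generated samples (the function names are mine):

```python
import math

def discriminator_loss(d_real, d_fake):
    """D wants d_real -> 1 and d_fake -> 0 (scores are probabilities)."""
    return -(sum(math.log(p) for p in d_real) / len(d_real)
             + sum(math.log(1 - p) for p in d_fake) / len(d_fake))

def generator_loss(d_fake):
    """Non-saturating form: G wants the discriminator fooled, d_fake -> 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# If the generator fools the discriminator completely (d_fake near 1),
# the generator loss goes to zero while the discriminator loss blows up.
```

Training alternates gradient steps on these two losses, which is the whole “pitting them against each other” in equation form.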

This report gives a very good definition of them in Section 2, and creates a multi-task deep convolutional GAN to classify emotions from audio.

Or you can watch Ian Goodfellow describe his creation here.

 

Data Science Ethics: my initial thoughts

I had two main thoughts about this: self regulation by the data science profession, and data literacy.

The promise of big data and artificial intelligence is at an all-time high, but by no means at its peak. The availability of data to mine is growing exponentially. And yet the data science community is still relatively small (compared with, say, accountants or bankers) and focused on scientific techniques.

Data science is making immense changes to the way people live, that will impact generations to come.

Reading these articles made me wonder, are data scientists proactively managing the ethical ramifications of the data they create, the algorithms they build, and the decisions made on the basis of their work?

This is a pivotal time in the evolution of data science ethics.

Data scientists must establish strong ethical foundations for their profession, to ensure data science is used to make the world a better place, and to avoid the profession being over-regulated by government if they don’t do their part voluntarily.

As I explain in a past blog post, even Facebook is recognising that they are not just a technology tool, but make a real impact on the world: https://15-6762.ca.uts.edu.au/according-to-mark-zuckerberg-facebook-is-not-a-media-company/

Is now a good time for the profession to become a self regulating membership body?

Will auditors soon start to audit machine learning algorithms? (They should!)

I came across this code of conduct http://www.datascienceassn.org/code-of-conduct.html

Data literacy is also an interesting counterpoint to all of this.

I don’t think it will be long before the general populace revolts against organisations that are careless with their data, and against opaque algorithms determining their fate in a way no one can explain. People don’t have blind faith anymore.

The University of Washington is now offering a course, “Calling Bullshit”, to improve the quality of science: http://callingbullshit.org/syllabus.html

 

In the mid nineties, I read Wild Swans, an autobiographical story about three generations of Chinese women (the last being the author Jung Chang) spanning about 100 years. If you want the abridged version, you can read it here in Wikipedia https://en.wikipedia.org/wiki/Wild_Swans.

After reading what they endured on the losing side of a war, and then under Communist rule, I’m certain those three daughters of China would warn us to guard our personal information closely, and watch how it’s being used against us. Random pieces of data given away here and there could become information weapons in the wrong hands, and not just against us but against our descendants.

This is just one of the many sources of a general feeling of foreboding that I have about my personal data.

The other forces that make me think a slow train wreck is coming:

  • Ease of dissemination of “information” due to social media
  • Growing ease of storage
  • Inability to destroy your own data; it’s immutable
  • Diminishing interpretability of results

 

Below are some notes from the articles

 

Privacy, anonymity, transparency, trust and responsibility concerns span data collection, curation, analysis and use.

What is data ethics? http://rsta.royalsocietypublishing.org/content/374/2083/20160360

Floridi and Taddeo describe three axes of data science ethics: the ethics of data, of algorithms, and of practices.

Data ethics concerns the generation, recording, curation, processing, dissemination, sharing and use of data.

The ethics of algorithms and the ethics of practices concern what is done with the data.

Regarding the algorithms, auditing the outcomes against a gold standard is essential, to ensure sensible and ethical results.

Creating a professional code of conduct would help ensure ethical practices.

3 Key Ethics Principles for Big Data and Data Science

Jay Taylor

Collect minimal data and aggregate it

Identify and scrub sensitive data

Have a crisis management plan in place in case your insight backfires

Above all, teach ethics!
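The first two of those principles can be sketched in a few lines of Python (the field names and helpers are hypothetical, just to show the idea of scrubbing identifiers and keeping only aggregates):

```python
SENSITIVE_FIELDS = {"name", "email", "address"}   # hypothetical schema

def scrub(record):
    """Drop direct identifiers before storage or analysis."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

def aggregate(records, key):
    """Keep counts by group rather than individual rows."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return counts

records = [
    {"name": "A", "email": "a@x.com", "postcode": "2000"},
    {"name": "B", "email": "b@x.com", "postcode": "2000"},
]
clean = [scrub(r) for r in records]
by_postcode = aggregate(clean, "postcode")   # {'2000': 2}
```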


Using Tree Based Gradient Boosting Models to classify terrorism events as Suicide Attacks


Tracy Keys

13 June 2017

Background

My team, Gonzo, used the Global Terrorism Database (GTD) to explore whether distinct features of terrorism events could predict the ABC’s online reaction to them. We did this by scraping the ABC’s Twitter feed and Google Search results, and then building generalised linear models and ElasticNet regularised models.

Our research illustrated the dramatic increase in terrorism events in recent years; as shown below (Figure 1), the absolute number and proportion of suicide attacks is also rising. Most of these attacks were bombings or explosions (Figure 2). I wanted to explore these suicide attacks further, and identify the characteristics in the GTD most influential in classifying an event as a suicide attack.

 

Figure 1 Terrorist Attacks during 2005-2015

 

Figure 2 Terrorist Attack Types during 2005-2015

My aims in this blog are, firstly, to deepen our team’s understanding of how we can use the database itself, and secondly, to use a new statistical method, a tree-based gradient boosting classifier, to answer my new research question: how well can gradient boosting models classify terrorism events as suicide attacks?

Data Preparation

For the gbm package, I had to change the data import by converting all the logical variables to factors, and make sure there were no NAs.

The package also limits the number of levels a factor can have, so the research focused on the Middle East and North Africa, and South Asia regions of the GTD.

In addition, I filtered the city variable to only those cities that had experienced a suicide attack; this kept my city and group name levels below 1024.

gbm also requires a binary outcome variable, so I translated my target “suicide” into “outcome_binary”.

After initial data exploration, reading the GTD codebook, and finding extreme correlation between my outcome and some variables, I removed three variables from the data: weapsubtype1_txt = “Suicide…”, nkillterr and terrorist_killed.
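My actual workflow was in R with gbm, but the preparation logic can be sketched in Python (the column names mirror the GTD; the helper itself is mine, a hedged illustration rather than my real code):

```python
# Variables dropped because they leak the outcome (per the prep above)
LEAKY_COLUMNS = {"nkillterr", "terrorist_killed"}

def prepare(rows, suicide_cities):
    """Filter to cities that have seen a suicide attack (keeps factor
    levels under gbm's 1024 cap), drop leaky columns, and encode the
    target as 0/1."""
    out = []
    for row in rows:
        if row["city"] not in suicide_cities:
            continue
        clean = {k: v for k, v in row.items() if k not in LEAKY_COLUMNS}
        clean["outcome_binary"] = 1 if row["suicide"] else 0
        out.append(clean)
    return out

rows = [
    {"city": "Baghdad", "suicide": True, "nkillterr": 1},
    {"city": "Smallville", "suicide": False, "nkillterr": 0},
]
prepared = prepare(rows, suicide_cities={"Baghdad"})
# one row kept, leaky column gone, target encoded as 1
```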


The gbm model

I split my data 70/30 into training and testing sets. My best cross-validated gbm model is shown below:

 

gbm_fit = gbm(outcome_binary ~ ., distribution = "bernoulli", data = training,
              cv.folds = 10, verbose = "CV", n.trees = 100, interaction.depth = 3)

Model Evaluation

As this is a classification model with a binary outcome, I evaluated it by calculating the confusion matrix shown below.

 

                          Reference
                   Suicide=No   Suicide=Yes
Prediction
  Suicide=No             8252             6
  Suicide=Yes            1059           620

Table 1 Confusion Matrix

Due to the high number of false positives (1059 of the 1679 positive predictions), the precision of the model is only 37%, but accuracy is high (89%) thanks to the many correct negative predictions. Given the sparsity of the response variable, this is a common result. This is shown graphically in the ROC chart (Figure 3).
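These figures can be reproduced directly from the counts in Table 1:

```python
# Counts from the confusion matrix (Table 1)
tp, fp = 620, 1059   # predicted Suicide=Yes
fn, tn = 6, 8252     # predicted Suicide=No

precision   = tp / (tp + fp)                   # ≈ 0.37
accuracy    = (tp + tn) / (tp + fp + fn + tn)  # ≈ 0.89
sensitivity = tp / (tp + fn)                   # ≈ 0.99 (recall)
specificity = tn / (tn + fp)                   # ≈ 0.886
```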

The Area Under the Curve (AUC) was 98.33%, which is very high (100% is perfect). This is illustrated by the very small gap between the training and testing Receiver Operating Characteristic (ROC) curves in Figure 3.

I did have some faith in the result, however, as I had already removed the three variables that were highly correlated with the suicide variable.

 

Figure 3 ROC Chart

The sensitivity score is 99%, and specificity 88.6%. I used Stephan’s model evaluation, but I have to say, something looks odd with the charts (Figure 4).

Figure 4 Sensitivity Specificity Chart

 

Model Findings

The model calculated the probability threshold for classifying a suicide attack as 6.35%. The gbm summary showed that three variables accounted for virtually 100% of the relative influence: nperps, weapsubtype1_txt and city.

 

##                                     var   rel.inf
## nperps                           nperps 47.774687
## weapsubtype1_txt       weapsubtype1_txt 42.885061
## city                               city  9.340252

I analysed these variables further to see why they were so influential.

The majority of terrorist attacks, and in particular suicide attacks, were perpetrated by a single attacker. This does not mean they were acting alone, but in the vast majority of cases only one person carried out the attack (Figure 5).

 

Figure 5 Suicide attacks by number of attackers (nperps)

 

Figure 6  Weapon (sub) type (weapsubtype1_txt)

Vehicles were used in the majority of suicide attacks (but not of attacks overall) (Figure 6). Given that most suicide attacks are bomb/explosive attacks (Figure 2), this finding makes sense.

 

Lastly, the third most influential variable, city, is illustrated in Figure 7. Baghdad has withstood the greatest number of terrorist attacks over the ten-year period, including suicide attacks. Baghdad has suffered many devastating car-bomb suicide attacks in this time, killing hundreds of people.

Figure 7 Cities that withstood terrorist attacks (city)

Figure 8 Groups perpetrating terrorist attacks (gname)

Islamic State (ISIL) has been the perpetrator of the majority of suicide attacks. Boko Haram, an active terrorist group that also perpetrates suicide attacks, is not included as it operates in sub-Saharan Africa.

I can conclude that the three most important variables from my model stand up to the scrutiny of further data analysis.

Recommendations for enhancing the model

I tried to get glmnet working with ridge and lasso penalties, to deal with the sparseness of my response variable, but the model would run overnight and then fail. Getting this working would definitely improve the model.

Building an elastic net regularised model and upsampling the minority class would also improve it.

I would also like to overlay rates of growth in suicide in these communities, to see if there is a relationship with the increasing number of suicide attacks.

Conclusions

In Western society, suicide is a very significant contributor to early deaths. Given this, and my findings that suicide attacks, vehicles and a single attacker can be used so effectively by terrorist organisations, I fear suicide attacks will only become more frequent.