Driving Change: Analyzing Efficiency in NYC Taxis

Explore our innovative research on the normalcy of taxi trips in New York City, utilizing over a decade of data to unearth inefficiencies, detect fraud, and redefine the future of urban transportation.


Prompt

hey, need help making this project on google drawings, here is all the info i have please guide me step by step, i already have the page setup correctly as shown: Final Poster Presentation [Peer-graded] Overview Each team creates a single poster for the whole team. Each team member separately prepares and creates a 3-minute video presentation (i.e., one presentation per learner). Thus, every team member should be very familiar with the team’s project. Each team member should plan their own presentation separately, and team members should not share presentation scripts. Your video should show your poster with voice narration (e.g., as pdf on your computer screen via screen capture, say using MonoSnap, native screen recording software on your OS). It is up to you whether to show your face. You should be able to create this recording quickly with little effort – no need to do any special video or audio editing. You may zoom into and out of the poster as you present, so the viewer can more easily see the poster content. Demo: optional but encouraged. Demo time counts towards presentation time. Upload your video as an unlisted YouTube video (NOT “private” or “public”). Unlisted videos can be viewed by anyone (in this case, peer-graders who grade your presentation) with the link to your unlisted video. Submit the URL (web link) of your own unlisted YouTube video via Canvas. Your graders will use this URL to view your video. To double-check that your URL works, visit that URL using a separate web browser that has been fully logged out of Google services (e.g., all cache cleared, use “Incognito” mode in Chrome, etc.) Set the title of your YouTube video to teamXXXposter-YY, where XXX is the team number (e.g., 001 for team 1), and YY is the student's last name (e.g., smith). It is OK if multiple members have the same last name, because videos are uniquely identified by their URLs. IMPORTANT: you need a Google Account to upload a video to YouTube.
To access YouTube, depending on your geographic location, you may need to use VPN (e.g., Georgia Tech’s VPN). Uploading a large video file can take a lot of time; VPN can further slow that down. Make sure you finish creating and uploading your video early, so you have ample time to verify that your submission is successful. Each learner will grade several other video presentations by learners from other teams. Peer grading is NOT anonymous. That is, a presenter knows who the graders are, and a grader knows who the presenters are. If a grader does not finish all the assigned peer grading, that grader may NOT receive all or part of the grader’s own final poster presentation grade (i.e., up to 7.5% of final course grade), since the peer grading is an integral part of the project presentation. We will compute a student's final poster presentation score as follows. For each rubric item, we take the average of the two highest scores. Then, we sum up these "averages" as the student's final score. This formula should heavily suppress "outlier" scores. After the peer grading ends, as additional safeguard, we will go through everyone's scores. For example, suppose for "What is the problem(no jargon)?", the student receives 4, 2, 3; and for "Why is it important and why should we care?", the student receives 4, 4, 5; the final score is computed as: final score = sum(avg(4, 3) + avg(4, 5) + ... ) where "4, 3" are the two highest scores from "4, 2, 3"; and "4, 5" are the two highest from "4, 4, 5". Why unlisted YouTube video? In previous semesters, students submitted video files via Canvas. That did not work well, creating significant challenges for both our students and the teaching team. A common problem was that graders could not view a video (while the video creator could), causing significant confusion for everyone involved, and overhead in fixing the problems. Some students also had difficulty uploading their video to Canvas or downloading them for grading. 
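The scoring rule described above (for each rubric item, average the two highest peer scores, then sum those averages) can be sketched in a few lines of Python. The function name `final_score` and the dictionary layout are hypothetical and chosen only for illustration; the teaching team's actual grading script is not shown in these instructions.

```python
def final_score(scores_by_item):
    """Sum, over rubric items, the average of the two highest peer scores.

    scores_by_item maps a rubric item name to the list of scores it received.
    Assumption: an item with a single score is averaged over that one score,
    and an item with no scores contributes nothing.
    """
    total = 0.0
    for scores in scores_by_item.values():
        top_two = sorted(scores, reverse=True)[:2]  # two highest scores
        if top_two:
            total += sum(top_two) / len(top_two)
    return total


# The worked example from the instructions: 4, 2, 3 and 4, 4, 5.
example = {
    "What is the problem (no jargon)?": [4, 2, 3],
    "Why is it important and why should we care?": [4, 4, 5],
}
print(final_score(example))  # avg(4, 3) + avg(5, 4) = 3.5 + 4.5 = 8.0
```

Dropping the lowest score per item is what suppresses a single outlier grade: the 2 in the first item has no effect on the result.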
Why peer grading is not anonymous? We want students to learn and practice delivering constructive criticism, for any concerns and weaknesses identified. People rarely like to hear about negative comments, even if they are accurate and helpful. Giving negative news is always hard, but that is part of life! This means we should carefully phrase our comments as constructive criticism. For example, instead of saying “too much text and not enough figures”, you could say "Fig 1 to 3 are important figures in this project; currently they are not easy to see (images are too small; text is not legible). Suggest reducing the amount of text, e.g., into succinct, bullet points to create space for the figures". Similarly, avoid "I don't think that the visualization is anything new or how it is helpful," which is highly subjective. Instead, justify your comments; if the presenter did not clarify the novelty or significance of an approach (it is probably new, but just that the presenter did not point it out), you could say "it's unclear from the presentation and poster whether the proposed visualization is an improvement over the state of the art (it seems to be a standard design); more clarification is needed." There are pros and cons for both anonymous and open review. It is still an open research problem. For example, one potential benefit for open review is reviewers could be more tactful and constructive, where anonymous reviewers could be more critical (sometimes not in a good way) and may do less work than they should. Poster Design Design your team’s poster *well before* the submission deadline, to avoid last-minute rush. The poster must be in portrait orientation, 30 inches wide and 40 inches tall. We suggest using 18pt font size and larger. A deck of PowerPoint slides is not acceptable as a poster. See the illustration below for what is allowed and what is not. Grading Your poster presentation should cover the following parts (point distribution shown on the left). 
Thus, the grading is about both your presentation delivery (e.g., what you say, where you direct the audience’s attention), and the poster content. If you overrun, besides losing points for the rubric item “5% Finished on time?”, you may lose additional points for the required content that you have not covered within the time limit. Imagine you are delivering a presentation in person and are allotted 3 minutes: once that time is up, you would need to stop and would not be able to present additional content (thus, that content will not be graded). 10% Motivation/Introduction: 5% What is the problem (no jargon)? 5% Why is it important and why should we care? 20% Your approaches (algorithm and interactive visualization): 5% What are they? 5% How do they work? 5% Why do you think they can effectively solve your problem (i.e., what is the intuition behind your approaches)? 5% What is new in your approaches? 10% Data: 5% How did you get it? (Download? Scrape?) 5% What are its characteristics (e.g., size on disk, # of records, temporal or not, etc.) 25% Experiments and results: 5% How did you evaluate your approaches? 10% What are the results? 10% How do your methods compare to other methods? 10% Presentation delivery: 5% Finished on time? 5% Spoke clearly and at a good pace? 25% Poster Design: 5% Layout/organization (Clear headings? Easy to follow?) 5% Use of text (Succinct or verbose?) 5% Use of graphics (Are they relevant? Do they help you better understand the project's approaches and ideas?) 5% Legibility (Are the text and figures too small?) 5% Grammar and spelling Possible software to create posters Figma -- free for students (Polo highly recommends) PowerPoint/Word (save as pdf) -- GT's Office365 PowerPoint supports collaboration.
Apple Pages (FREE) supports real-time collaboration (via iCloud and desktop software) Inkscape (free, cross platform) Polo uses Affinity Designer (Mac and Windows) Google Drawings (File > Page Setup to set document size) draw.io (File > Page Setup to set document size) Example poster design The following posters were for research projects conducted at the Polo Club of Data Science, and were not for projects from this class. They do not strictly follow the format described in our grading rubric. Apolo graph exploration Insider trading pattern discovery Comment spam detection Where to print posters on Atlanta campus? (Not applicable this semester) Paper and clay http://studentcenter.gatech.edu/seedo/paperandclay/Pages/default.aspx Poster printing is available for free at the GVU, but you have to physically go to the machine, log in, and upload your pdf http://gvu.gatech.edu/wiki/index.php/Poster_Printing_FAQ Poster printing is also available at the library http://librarycommons.gatech.edu/lwc/multimedia.php use all this information and idk, tell me what you need, do u need data or something else Efficiency Audit on the Taxis of New York City Saif Ashfaq, Cameron Erdman, Cole Sheridan, Omar Alhabbal 1 INTRODUCTION The taxi market has been historically difficult to model and predict, making it difficult for customers to determine what a normal taxi trip should be. As a result, irregularities in taxi trips have led to increasing accusations of fraud in the taxi industry. However, without the concept of a normal taxi trip, it is near-impossible to determine what is and isn’t fraud. A person who wants to take a taxi wants to know how much their trip will cost, how long it will take, and if the taxi driver will take the best possible path. To alleviate this issue, we plan to utilize taxi trip data from New York City over the past 13 years to attempt to define what normal should be.
2 PROBLEM DEFINITION Define normal taxi trips and determine the probable causes of inefficiency in taxi trips using data from taxi trips in New York City from 2011-2024. 3 LITERATURE SURVEY When discussing anomaly detection for taxi rides, fraud is always a concern as to why a trip might have taken longer or cost more than usual. A paper by Al-Sudani et al. lists abnormalities such as charging excessive fees, meter tampering, and taking longer routes. Previous works in this field have provided us some insights that we planned to use with our initial proposal. Prior work in the field of AI comes from Jindal et al., who propose a system to effectively utilize carpools to maximize carpool efficiency. They utilize a trained neural network model to predict taxi trip time with an R^2 of 0.75, after which they utilize reinforcement learning to learn an effective carpooling strategy. This method assumes the baseline for taxi trips is constant and pays no attention to the impact of fraud or traffic. Similar work is done in Ozeki et al., wherein they propose a framework for ride share demand forecasting. They recognize the growing demand for taxis and related services while also recognizing the limited research in the space. They propose a framework in which a graph is built to represent physical locations; this graph is then fed into autoencoders, which feed into region classifiers and demand predictors. They found this framework to be marginally better than existing methods but believe it to be far more comprehensive and robust. However, they also pay no attention to traffic, fraud, or related deviations from normality. As stated by Li et al., utilizing Dijkstra’s method could confine our generated routes to always stay on the real road network. Theoretically, if we could obtain real-time data on traffic conditions throughout our dataset, we could include these parameters as weights in the calculation.
For us, this made Dijkstra’s seem more optimal than utilizing an AI. When it comes to understanding the normal of taxi trip data, there are a plethora of studies that use complex models to perform data interpolation and compute other advanced metrics. In a paper by Ravish et al, over a dozen algorithms are compared and contrasted in order to understand the practicalities for each model in the context of monitoring traffic data. This study becomes useful because it highlights the need to purposefully pick out models that match the data and the purpose of the study. Cai et al previously did work on determining regular operational distances for taxi drivers in Beijing. This study sought to determine the effective operational distance for taxi drivers without a dispatch hub. The results of this study would be more comparable to Uber in the US rather than the taxi companies we study, though it does help us determine if there are any differences between taxis and Ubers. However, as Harding et al have stated in their analysis of the market for taxi journeys, there is a large variance of information available for the taxi market, and it is difficult to demonstrate the changes the market sees as a result. For customers, they state “If passengers do not have the ability to judge quality in relation to price before the ride, then they should be protected by minimum standards and price controls which maintain a stable pricing structure and service standards. The pricing structure should allow for some level of differentiation. However, this is restricted as regulators typically have limited resources to check and enforce the service standards." They propose a solution to this dilemma in Billhardt et al; in which they address many of the inefficiencies in current taxi dispatching strategies by proposing theoretical solutions to reward drivers and riders for accepting risk and discount those who do not. 
However, they quickly acknowledge the lack of previous analytical work to test these on. They resort to doing simple experiments and have to make many assumptions, which they admit inhibits the usefulness of their findings. Aside from creating the model, one must be able to draw conclusions from their algorithms. A study by Bakhshi et al. describes many ways to detect fraud. This study is very relevant because it uses DBSCAN, which we plan to use, and it also uses labeled data. Utilizing such a study gives our project a guide for how to pick our data and design our model. 4 PROPOSED METHOD 4.1 Intuition As mentioned in the literature survey, current models have an issue with the availability of taxi data and a lack of a defined normal for the data. This issue is why we see problems with data validation, and why our original proposed methods had to be changed. However, in determining this, we can say there is novelty in itself in the attempt to define a normal for taxi trip data beyond the basic analysis and in determining likely causes for abnormalities in the dataset. Our study will tackle this problem on two fronts: 1) using anomaly detection, via DBSCAN, to determine abnormalities within taxi trips, and 2) observing how the time of day affects the "normal" for taxi trips. Studies in this field often use their algorithms to predict data points, rather than as a study on taxi trip behaviors. 4.2 Approaches Initially, Dijkstra’s algorithm was to be used on an edge-node graph of New York City streets to calculate the most efficient route between each start and end point while adhering to the streets of New York City. However, the sheer number of streets made Dijkstra’s computationally inefficient, and the lack of a determinate start and end point meant only an estimate of the shortest distance could be obtained, which varied in accuracy drastically based on the pickup and drop-off zones’ sizes.
This same issue made attempts to use the Google Maps API also inefficient. This limitation is reflected in the literature survey, as multiple articles note the lack of ground truth in taxi trip data. What this leaves us with is a data clustering problem in order to determine abnormalities in taxi trip data, and to evaluate the potential causes. Our current approach utilizes DBSCAN as well as Z-scores to filter outliers in the data. We will be examining the data by its respective pickup and drop-off zone across three time groups: rush hours (7am to 9am and 3pm to 8pm), mid-day (9am to 3pm), and night shift (8pm to 7am). We may additionally group the data by its corresponding borough for each zone should we find looking at routes and time groups to be too specific to find discrepancies in the data. The first step we conducted was data cleaning. This includes filtering out any taxi trips that originate or terminate outside of NYC. Additionally, we removed any trips with seemingly incorrect data, such as trips with 0 distance traveled or start/end locations being in the same place. The data was centered and scaled to reduce the computational cost of the DBSCAN algorithm. To understand what data to use in our algorithm, we have been testing various variables using 3 types of correlation tests: Pearson, Kendall, and Spearman. These tests give us a good idea as to how our different variables interact with each other, and if they have relevance in our model. Having 3 different correlation tests allows us to understand our data in multiple manners, as the Pearson test looks for linear relationships, while the Kendall and Spearman tests are rank-based and capture monotonic relationships between variables. Variables with high coefficients are added to the model for further research. Initial clustering using DBSCAN used fare amount and trip distance, as these two variables shared the highest correlation.
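As a minimal sketch of the time grouping and the zero-distance cleaning step described above: the helper names `time_group` and `group_shares`, the `(pickup_hour, trip_distance)` tuple layout, and the half-open hour boundaries (e.g. 9:00 counts as mid-day, 20:00 as night shift) are all assumptions made for illustration, not the team's actual code.

```python
from collections import Counter


def time_group(hour):
    """Map a pickup hour (0-23) to one of the three time groups.

    Assumption: intervals are half-open, so 7 <= hour < 9 and
    15 <= hour < 20 are rush hours, 9 <= hour < 15 is mid-day,
    and everything else is night shift.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if 7 <= hour < 9 or 15 <= hour < 20:
        return "rush hours"
    if 9 <= hour < 15:
        return "mid-day"
    return "night shift"


def group_shares(trips):
    """Return the fraction of valid trips falling in each time group.

    trips is an iterable of (pickup_hour, trip_distance) tuples.
    Trips with a distance of 0 or less are dropped as incorrect data.
    """
    counts = Counter(time_group(hour) for hour, dist in trips if dist > 0)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()} if total else {}


# Tiny made-up sample, not real TLC data.
sample = [(8, 1.2), (10, 3.0), (16, 0.0), (22, 5.1), (17, 2.4)]
print(group_shares(sample))
```

Shares computed this way on the full dataset could then be used to draw a subset with the same distribution across time groups.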
Clusters were based on where points had at least 1/5 of the total number of trips following the same route within a Z-score (deviation from the mean) of 0.75 (to account for the cluster expanding as more points are added). For the visualization, we had to source external geometric location data and project it to latitude/longitude coordinates in order to construct polygon representations of our taxi zones. Doing so allows us to map them onto the surrounding New York City region correctly. From there we convert our data to a geometric data frame and use it in a choropleth map to build the framework of our visualization. The aspects that follow, such as interactive hover statistics, color scales, and time dimensions, work as they would in any plot. The final part of this implementation is the user interface that utilizes the algorithm and visualization. Our plan is to take advantage of the visual nature of clusters in unison with the previously mentioned visualization of NYC. The hope is that users will be able to interact with the model and get a visualization of how taxi trips change during different times of the day. The clustering will additionally highlight any anomalies that occur for the user to see. 5 EVALUATION Our dataset consists of Yellow Taxi trip data from 2011 to 2024 for the month of January. Due to memory constraints, we use a subset of this dataset, consisting of the first approximately 1.5 million trips each year within the borders of New York City. To prevent a skew towards any time group, the initial dataset was analyzed to find the distribution of data between time groups, and the subset was collected using the same distribution (42.07% from rush hours, 29.84% from mid-day, and 28.09% from night shift). 5.1 Questions We Intend To Answer • Q: What variables in our dataset contribute most toward abnormal values? • Q: What is normal for taxi trips? What is abnormal? • Q: How does this vary with time of day?
• Q: How has the normal taxi trip changed over time? 5.2 Experiments and Observations Initial plans regarding the data were determined to be too ambitious for the dataset that we have. Plans were based on the dataset prior to the year 2011, when taxi trips recorded their precise pickup and drop-off latitude and longitude rather than a zone indicator for each. Additionally, based on the dataset’s use of GPS data, we had assumed that the route would be provided. Due to the new format for the dataset introduced in 2011, we chose to omit the 2009 and 2010 data from our project. We are currently working to determine the best way to categorize data for determining the normal. Current investigations categorize data by route (start and end zone), though we do not wish to make definitive claims to answer research questions at this time. Figure 1: Fare amount vs. trip distance in 2012. By normalizing the data, we noticed that the DBSCAN algorithm performed significantly better than without normalization. The variables with the highest correlation, across all three tests, were trip_distance to trip_time, trip_distance to fare_amount, and toll_amount to trip_distance. These results are what one would expect when it comes to variables correlating with one another. We expect that all of these variables will contribute towards defining what a "normal" taxi trip means. Some variables, such as passenger count, were initially expected to have high correlation with tip amount or trip time, but actually ended up having low correlation with almost every other variable. On the other hand, toll amount has a high degree of influence over other variables that are also tightly coupled, such as trip distance and trip time. One key observation on taxi trip behavior is that many start and end locations ended up within the same zone (zones are geographic areas that taxis use to generalize parts of the city).
For example, there is a high volume of taxi trips going from zone 4, which is a residential area, to zone 141, which contains many points of interest (hospitals, cultural spots, restaurants). While this observation will not have an impact on our algorithm, it helps explain why multiple people might opt to take a taxi over their own vehicle or the subway. It was observed that the split among times in the day for taxi trips is as follows: • Rush Hour: 42.0700% • Mid Day: 29.8441% • Night Shift: 28.0859% Figure 2: DBSCAN test for zone 10, mid-day, 2024. As shown in Figure 1, the relationship between fare amount and trip_distance is fairly linear and it is easy to pick out abnormalities. For instance, it is hard to imagine why a trip with a distance of 1 mile would cost $100. While this is an extreme, it does highlight the kind of trips that we are looking out for. Introducing multiple variables will highlight abnormal trips along multiple dimensions, with the help of unsupervised learning. Figure 2 displays a test implementation of DBSCAN with Z-scores, using a small subset of our data from 2024. This test was performed prior to finalizing the parameters for our model. While this does not extract any meaningful results, the hope is that clustering will highlight patterns within the data as we configure the parameters we put in, the number of data points we use, and other metrics. 6 CONCLUSIONS AND DISCUSSION Our visualization gives a brief overview of the demographics of our dataset for each year. When the user hovers over a zone on the map, additional information about said zone is displayed as a tooltip, with further clarifying graphs appearing below the map. See Figure 3 for a quick visualization of average taxi trip time for all zones for 2024. While the draft only depicts 2024, the final visualization will allow the user to: - Select the time of day (morning, evening, night shift, rush hours). - Select the year to visualize.
- Obtain statistics for trips between a selected start and end zone. Figure 3: Visualization of taxi trips in NYC REFERENCES Al-Sudani, Zainab S., and Musaab Riyadh. "Detecting Fraudulent Taxi Drivers: Overview." Al-Sadiq International Conference on Communication and Information Technology (2023): 115-120. Bakhshi, Kosar, Behnam Bahrak and Hamid Mahini. "Fraud Detection System in Online Ride-Hailing Services." 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS). Tehran, Iran, 2021. Billhardt, Holger, Alberto Fernández, Sascha Ossowski, Javier Palanca and Javier Bajo. "Taxi dispatching strategies with compensations." Expert Systems with Applications (2019): 173-182. Cai, Hua, et al. "Understanding Taxi Travel Patterns." Physica A 457.C (2016): 590-597. Zhu, Chenguang and Balaji Prabhakar. "Reducing Inefficiencies in Taxi Systems." 56th IEEE Conference on Decision and Control (CDC), 2017. Chou, Ka Seng, et al. "Taxi Demand and Fare Prediction with Hybrid Models: Enhancing Efficiency and User Experience in City Transportation." Applied Sciences 13.18 (2023): 10192. Department of Transportation (DOT). Traffic Volume Counts. 27 May 2022. NYC Open Data. 25 February 2025. <https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts/btm5-ppia/about_data>. Freudenberg Sealing Technologies. Street Network of New York in GraphML. 2017. Freudenberg Sealing Technologies. 25 February 2025. <https://www.kaggle.com/datasets/crailtap/street-network-of-new-york-in-graphml>. Harding, Simon, Milind Kandlikar and Sumeet Gulati. "Taxi apps, regulations, and the market for taxi journeys." Transportation Research. Part A, Policy and Practice 88 (2016): 15-25. Li, Qingquan, et al. "A Hybrid Link-Node Approach for Finding Shortest Paths in Road Networks with Turn Restrictions." Transactions in GIS 19.6 (2015): 915-929. NYC Taxi and Limousine Commission. TLC Trip Record Data. 25 February 2025.
<https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>. NYPD. Motor Vehicle Collisions - Crashes. 24 February 2025. NYC Open Data. 25 February 2025. <https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data>. Ozeki, Ren, et al. "One Model Fits All: Cross-Region Taxi-Demand Forecasting." Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems. ACM, 2023. Ravish, Roopa and Shanta Ranga Swamy. "Intelligent Traffic Management: A Review of Challenges, Solutions, and Future Perspective." Transport and Telecommunication 22.2 (2021): 163-182. Jindal, Ishan, Zhiwei Qin, Xuewen Chen, Matthew Nokleby and Jieping Ye. "Optimizing Taxi Carpool Policies via Reinforcement Learning and Spatio-Temporal Mining." 2018. 25 February 2025. <https://arxiv.org/abs/1811.04345>. Zhou, Xun, et al. "Optimizing Taxi Driver Profit Efficiency: A Spatial Network-Based Markov Decision Process Approach." IEEE Transactions on Big Data 6.1 (2020): 145-158.

Image Details

Aspect Ratio: 3:4