
Kaggle Challenge (LIVE)

1400 ratings | 52079 views
Join me as I attempt a Kaggle challenge live! In this stream, I'm going to be attempting the NYC Taxi Duration prediction challenge. I'll be using a combination of Pandas, Matplotlib, and XGBoost as Python libraries to help me understand and analyze the taxi dataset that Kaggle provides. The goal will be to build a predictive model for taxi trip duration. I'll also be using Google Colab as my Jupyter notebook. Get hype!

Code for this video: https://github.com/llSourcell/kaggle_challenge

Please Subscribe! And like. And comment. That's what keeps me going.

Want more education? Connect with me here:
Twitter: https://twitter.com/sirajraval
Instagram: https://www.instagram.com/sirajraval
Facebook: https://www.facebook.com/sirajology

This video is a part of my Machine Learning Journey course: https://github.com/llSourcell/Machine_Learning_Journey

More Learning Resources:
https://www.kaggle.com/kanncaa1/machine-learning-tutorial-for-beginners
https://www.kaggle.com/rtatman/beginner-s-tutorial-python
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

Join us in the Wizards Slack channel: http://wizards.herokuapp.com/
Sign up for the next course at The School of AI: https://www.theschool.ai
And please support me on Patreon: https://www.patreon.com/user?u=3191693
Sign up for my newsletter for exciting updates in the field of AI: https://goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Text Comments (108)
Deepen Kshetri (16 days ago)
Name error: name xgb_train is not defined
Alexander Trists (1 month ago)
+Siraj Raval you didn't actually run your final code. I changed the underscore to a period and still got errors. What's the fix?
Chelsea Yang (1 month ago)
Hey Alexander, do you have these files (fastest_routes_train_part_1.csv, fastest_routes_train_part_2.csv, and fastest_routes_test.csv) to run this model? The link to these datasets is no longer valid. If you have them, could you please send them to me? Thank you soooo much!
Maxim Shalankin (3 months ago)
Great work! Hello from Russia
Jhonatan da silva (3 months ago)
Hey! I've also created a tutorial on the 8 Major Steps to create your model on a Kaggle Kernel, including the training part! https://www.youtube.com/watch?v=AXcTm4gFerE How to deal with large image datasets https://www.youtube.com/watch?v=myYMrZXpn6U I also have a FREE course on how to create a CNN to the MNIST competition https://jhonatandasilva.com/university/10-steps-cnn , in the end I create the submission file that you can create yourself and submit to Kaggle!
Mister Seitan (4 months ago)
yo you have utorrent
Sanyam Lakhanpal (5 months ago)
Is it recommended to use Google Colab? I faced many issues with it. It disconnects my session at times when I give it heavy work. Has anyone else faced the same problem? The other way is to use AWS, but they charge for it; I need something cheap and good! Please tell me about this!
boy boy (6 months ago)
hey siraj dataset of fastest routes can't be found no more can you help
EL Mesaoudi Zakariae (6 months ago)
Wow ! What a fraud !
Aqil Khan (7 months ago)
you awesome
Ankit Khandelwal (7 months ago)
I have a friend with the surname Raval. Is Siraj of Indian origin?
Rohan K (30 days ago)
Nope it's chinese actually.
atul rao (7 months ago)
Very Informative video. Please upload more like this :)
lit ghost (7 months ago)
how much does docker cost for 1 user?
Being A Pro (7 months ago)
I couldn't find your video on XGBoost, Can you please do one?
Lion King (7 months ago)
Hi I need some help , Can someone let me know the software and equipment he is using to be able to share his screen and also crop his background out. I would like to try this out for something. Thank you in advance.
Phoenix Simurg (8 months ago)
Hi, I have a problem reading train.csv from Google Drive. "Runtime died!" pop-ups appear and the kernel restarts (all variables lost) while reading train.csv (~5 GB). No problem with test.csv, which is relatively small. Any help will be appreciated. Thanks
train_downloaded=drive.CreateFile({'id':'1rJjSqP-c2uptNbtmRuf2m6cMT3uvV3Pp'})
train_downloaded.GetContentFile('train.csv')
test_downloaded=drive.CreateFile({'id':'1Bwg1CEfE3bPEiNnpo_M6j6rms7Im2Mav'})
test_downloaded.GetContentFile('test.csv')
df_test=pd.read_csv('test.csv')
df_train=pd.read_csv('train.csv')
Phoenix Simurg (8 months ago)
By the way where is the trip_duration feature? Do we have it in datasets?
Sarab A (8 months ago)
Vishal Borana (8 months ago)
You could have completed the video, Siraj; it's actually difficult for beginners to figure out the next part. I got an error when I trained the XGB model using xgb.train('default',df_train). How am I supposed to solve this?
Vasil Yordanov (8 months ago)
what a waste of time ...
AmèNii AbdelHamiid (8 months ago)
Great work thank you!! Is it possible that we use PHP instead of python?
Tim Boland (8 months ago)
I'm a big fan and I love you. I've watched all your videos and will continue to do so, but if you already have the code typed out, it's better to just go over it with us instead of typing it out. You'll be able to show us a lot more in a lot less time.
upendra kumar Devisetty (8 months ago)
Hi Siraj, why did you split the train data into train and test since you have test data available separately?
Michael Sinclair (8 months ago)
Great video, although it would be good to see how accurate your predictions were and what your Kaggle score was. Around 28 minutes you plot the pickups as a scatter plot. Both plots are the same training data and not train / test as you were aiming for. All in all it was a Good video
shanto mathew (8 months ago)
where did the trip_duration column come from?
siddhesh tiwari (8 months ago)
Not satisfactory, could you do more of these with more emphasis on feature engineering. Deep learning is great but still in many industries people need key drivers from models and little bit of naive explanation to non technical people , can you do some videos on those too. I know AI is your forte but this could be helpful to people in general
hassan ahmed (8 months ago)
Bro genuinely, thanks for the video💐
Florian Makaïa (8 months ago)
Hi Siraj I Really like your video and the Code you've made for this one but it would be better if you will add more comments I guess Thanks
Ajay Prajapati (8 months ago)
You are doing a great job, man. It's been only one week since I started exploring ML and AI stuff, and your videos show exactly what is required and what we have to do. You don't stop at giving us hands-on experience with various new tools and technology; your new big thing, 'School of AI', is seriously going to help us a lot in the long term, and free of cost too. Thank you so much for the videos and cool stuff. Hats off to your great work!!
Emir Nurmatbekov (8 months ago)
Just so you know how wide you are spreading the knowledge. Hi from Kyrgyz Republic, Central Asia ^ ^
Fingerdraw (8 months ago)
keep getting this error TypeError: invalid cache item: DataFrame ----> 5 model = xgb.train('default', df_train)
Sachin Kamath (8 months ago)
I left at 16:48. Need to remember this.
Thiago Leite (8 months ago)
Hi Siraj, in second part please enable YouTube's automatic caption. Thanks.
Hadrhune0 (8 months ago)
Kill the chat, especially when questions are not related to the activity your managing. It always derails your train of thought.
J J (8 months ago)
Hi Siraj, Just want to let you know that what you are doing is awesome. Sharing this much content, approaches, and concepts (even how to use them) in this short amount of time, is wonderful! I’m currently making similar projects with related problems (for the pharmaceutical industry). Everyday I’m struggling with new constraints and I always love to learn more about how to deal with them. You helped me a lot. Hit me up, if you want to share some thoughts on how to address different field problems with multiple features (such as text and number problems combined). Keep up to good work! Greetings from Amsterdam!
fgfanta (8 months ago)
I am glad you covered gradient boosting, so many winning Kaggle competitions used it, and still it gets so little coverage on-line, compared to Deep Learning. Consider looking into CatBoost as well, it automatically encodes categorical features, one of the complicated things to do when using boosted trees.
shanto mathew (8 months ago)
can we download the data directly into the google drive from kaggle? if yes how?
Gowri Shankar (8 months ago)
Does Colab provide sufficient RAM? This is my third attempt to load large (and diverse) datasets. It fails after some time. Below is my script. I repeatedly get a "Kernel Restarting" error during the pd.read_csv call.
!pwd
!mkdir .kaggle
!touch .kaggle/kaggle.json
!echo '{"username":"gowrishankarin","key":"my key"}' > .kaggle/kaggle.json
!kaggle competitions download -c new-york-city-taxi-fare-prediction
train = pd.read_csv(
    '.kaggle/competitions/new-york-city-taxi-fare-prediction/train.csv.zip',
    compression = 'zip',
    engine='c'
)
train.head()
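Editor's note: one possible workaround for the "Kernel Restarting" reports in this thread is to avoid loading the whole CSV in one go. The sketch below assumes the restarts are caused by running out of RAM, which the thread does not confirm. It reads the file in chunks using the `chunksize` option of `pd.read_csv` and downcasts `float64` columns to `float32`. The helper name `load_csv_in_chunks`, the `DEMO_CSV` sample, and its column names are illustrative only.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for a large Kaggle CSV (made-up rows, for illustration only).
DEMO_CSV = """pickup_longitude,pickup_latitude,passenger_count
-73.98,40.75,1
-73.97,40.76,2
-73.99,40.74,1
-73.96,40.77,3
-73.95,40.78,1
"""


def load_csv_in_chunks(source, chunksize=100_000, max_rows=None):
    """Read a CSV piece by piece, shrinking float columns as we go.

    `source` is a path or file-like object. `max_rows` optionally caps how
    many rows are kept, which is handy for quick exploration.
    """
    pieces = []
    rows_kept = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # float64 -> float32 halves the memory used by these columns.
        float_cols = chunk.select_dtypes(include="float64").columns
        chunk = chunk.astype({col: np.float32 for col in float_cols})
        if max_rows is not None:
            chunk = chunk.iloc[: max_rows - rows_kept]
        pieces.append(chunk)
        rows_kept += len(chunk)
        if max_rows is not None and rows_kept >= max_rows:
            break
    return pd.concat(pieces, ignore_index=True)


if __name__ == "__main__":
    df = load_csv_in_chunks(io.StringIO(DEMO_CSV), chunksize=2)
    print(df.dtypes)
    print(df.memory_usage(deep=True))
```

With a real file, `source` would be the CSV path; starting with a small `max_rows` is a cheap way to check that the rest of the notebook works before loading everything.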
Amine Yaiche (8 months ago)
You stopped the video at the most important part !!
Durlov (11 days ago)
Yeah, I know. He should have rapped for much longer. I couldn't get my jam on :(
Jim Arnold (8 months ago)
Siraj you are awesome! I've been learning a ton from your videos and I'm going to start doing your weekly challenges. Thank you for being a great teacher and sharing your knowledge.
College Guide (8 months ago)
Hit likes for siraj!!! please keep posting such kaggle challenges even more..
Siraj Raval (8 months ago)
Daniel Okey-Okoro (8 months ago)
Steven Roberts (8 months ago)
hey Siraj, do as many of these as possible
Siraj Raval (8 months ago)
will do
Abhishek Kumar (8 months ago)
Thank u for this awesome video
manohar manu (8 months ago)
Its a good start to navigate through these kind of stuff too, keep going !!
MrDonald911 (8 months ago)
Please do a video about OCR on scanned documents ! <3
keghn feem (8 months ago)
Great stuff. But for people need to get to the meat and potatoes right away: https://www.kaggle.com/competitions
Amit Prasad (8 months ago)
So much to learn. It's good, it's bad. It's very difficult to cope. I started ML a year ago, spent hours and hours every single day, and I still know nothing.
Karnage Ghosh (7 months ago)
implement what you already know....you will realize you actually do know something
Marcos Pereira (8 months ago)
Amit Prasad Stop watching start doing!
Y J (8 months ago)
Man the things you teach us would cost thousands of dollars for us to get elsewhere
Patricia Inés Basualdo (8 months ago)
Thank you so much! I learn a lot from you!!
i3cons (8 months ago)
Brad Ezard (8 months ago)
"No strings attached!" - except that you will normally only get access to ~500MB of VRAM on the K80, which makes it nearly impossible to work on....
Ashwin Kolhatkar (8 months ago)
Yeah!!!! These livestreams are super helpful! Do keep 'em coming!
Anirudh Reddy (8 months ago)
For loading data in Colab, you could've done this: !wget <<download link that you can get by right clicking on download button on kaggle page>> and then run !ls, and that's it, you'll see the file.
Usman Ahmed (8 months ago)
Siraj, what is your testing error? Why did you use the default settings? You needed to optimize them for a better score!
Siraj Raval (8 months ago)
yes you're right will get better
FuZZbaLLbee (8 months ago)
This video also talks about XGBoost https://youtu.be/6tQhoUuQrOw
Vishal Gahlot (8 months ago)
At 28:42 you used the train data for both the train and test scatter plots; no test data is used, and your reaction is like "omg what the heck did we get" hahaha XD Kudos :P
Tushar Seth (1 month ago)
You can plot with the test dataset and observe that it's quite similar. That's why Siraj would have missed it.
Shravankumar Shetty (8 months ago)
Hey Siraj, while plotting the scatter plot from matplotlib.pyplot, you plotted the same plots for train and test data, I mean you used the same df_train dataframe to plot both the plots, hence you saw both the plots to be same :D It's not that both the train and test datasets have similarities. https://drive.google.com/open?id=1YnPWzFKW3-fbcegSlbFw9Jl_ny7ugIOS
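Editor's note: several comments in this thread point out that both panels were drawn from `df_train`. A minimal sketch of the intended train/test comparison follows. It assumes `df_test` has the same `pickup_longitude` and `pickup_latitude` columns as `df_train`, and it uses small random frames as stand-ins for the real data. The names `plot_pickups` and `fake_pickups` are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_pickups(df_train, df_test, n=10_000):
    """Scatter the first n pickup points of each frame in its own panel."""
    fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
    ax[0].scatter(df_train["pickup_longitude"].values[:n],
                  df_train["pickup_latitude"].values[:n],
                  color="blue", s=1, label="train", alpha=0.1)
    # The right-hand panel reads from df_test, not df_train.
    ax[1].scatter(df_test["pickup_longitude"].values[:n],
                  df_test["pickup_latitude"].values[:n],
                  color="green", s=1, label="test", alpha=0.1)
    ax[0].set_title("train")
    ax[1].set_title("test")
    ax[0].set_xlabel("longitude")
    ax[1].set_xlabel("longitude")
    ax[0].set_ylabel("latitude")
    return fig, ax


def fake_pickups(n_rows, seed):
    """Random points in a rough NYC-sized box (made-up data, not the dataset)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "pickup_longitude": rng.uniform(-74.03, -73.75, n_rows),
        "pickup_latitude": rng.uniform(40.63, 40.85, n_rows),
    })


if __name__ == "__main__":
    fig, ax = plot_pickups(fake_pickups(500, seed=0), fake_pickups(300, seed=1))
    fig.savefig("pickups_train_vs_test.png")
```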
Kevin Nelson (8 months ago)
Logarithms help negotiate the dichotomy between the finite span of possible values below the mean and the infinite span above it for real positive sample sets.
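Editor's note: to make the point about logarithms concrete, trip durations are bounded below by zero but can have a very long right tail, and the code transcribed in this thread uses `np.log(trip_duration + 1)` as the training target. The sketch below uses made-up durations to show how that transform compresses the tail, using the equivalent NumPy functions `np.log1p` and `np.expm1`. The `rmsle` helper assumes the competition is scored on root mean squared logarithmic error, which is my understanding and not something stated in the thread.

```python
import numpy as np

# Made-up trip durations in seconds: four ordinary trips and one 24-hour outlier.
durations = np.array([60.0, 300.0, 600.0, 900.0, 86_400.0])

log_durations = np.log1p(durations)   # same as np.log(durations + 1)
recovered = np.expm1(log_durations)   # inverse transform, back to seconds

# How far the largest value sits from the median, before and after the transform.
raw_ratio = durations.max() / np.median(durations)
log_ratio = log_durations.max() / np.median(log_durations)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two positive arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


if __name__ == "__main__":
    print("max/median on raw scale:", raw_ratio)
    print("max/median on log scale:", log_ratio)
    print("RMSLE of a prediction that is 10% too high:", rmsle(durations, durations * 1.1))
```

A model trained on the log target predicts log durations, so its outputs need `np.expm1` before they are reported in seconds.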
Kevin Nelson (8 months ago)
Thank you indeed.
Saloni Agarwal (8 months ago)
The code for this video is not yet available. Can you please upload it.
ForevermoreFree (8 months ago)
Saw the same thing. I have typed it out in a prior post for convenience. :)
sum guy (8 months ago)
thank you
ForevermoreFree (8 months ago)
#Thanks Siraj, I followed your code till the end. The code you have just written isn't on your github, so here it is for everyone. Keep up the good work!
#Code transcribed from Siraj Raval's Kaggle Challenge (LIVE) uploaded 2018.07.27
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
#data dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=[16,10]
import seaborn as sns
from sklearn.model_selection import train_test_split
import xgboost as xgb
%matplotlib inline
plt.rcParams['axes.unicode_minus']=False
#---
#auth
auth.authenticate_user()
gauth=GoogleAuth()
gauth.credentials=GoogleCredentials.get_application_default()
drive=GoogleDrive(gauth)
#---
train_id='your_train_id'
test_id='your_test_id'
train_downloaded=drive.CreateFile({'id':train_id})
train_downloaded.GetContentFile('train.csv')
test_downloaded=drive.CreateFile({'id':test_id})
test_downloaded.GetContentFile('test.csv')
df_train=pd.read_csv('train.csv')
df_test=pd.read_csv('test.csv')
#---
#datapreprocessing
##How long is the average trip
df_train['log_trip_duration']=np.log(df_train['trip_duration'].values+1)
plt.hist(df_train['log_trip_duration'].values, bins=100)
plt.xlabel('log(trip_duration)')
plt.ylabel('number of training records')
plt.show
#---
N=10000
city_long_border=(-75,-75)
city_lat_border=(40, 40)
fig, ax=plt.subplots(ncols=2, sharex=True, sharey=True)
ax[0].scatter(df_train['pickup_longitude'].values[:N], df_train['pickup_latitude'].values[:N], color='blue', s=1, label='train', alpha=.1)
ax[1].scatter(df_train['pickup_longitude'].values[:N], df_train['pickup_latitude'].values[:N], color='green', s=1, label='train', alpha=.1)
plt.show()
#---
#train model
feature_names=list(df_train.columns)
y = np.log(df_train['trip_duration'].values+1)
Xtr, Xtx, ytr, yv=train_test_split(df_train[feature_names].values, y, test_size=.2, random_state=1987)
model=xgb_train('default', df_train)
Sean Fernandez (3 months ago)
Any chance you got the model=xgb_train('default', df_train) line working? I'm getting a TypeError: invalid cache item: DataFrame
Varun Kuntal (8 months ago)
Could you please live stream 4am or 5am IST or even night before 12. It will just be convenient for Indian viewers. Thanks for the content. PS You are great sir
Siraj Raval (8 months ago)
will do
rough rough (8 months ago)
Varun Kuntal the majority of viewers might not be in India
Ashutosh Patel (8 months ago)
GREAT WORK... please do more challenges like this..... Thank you
Siraj Raval (8 months ago)
thanks will do
Hazzehh (8 months ago)
can you actually finish next time, what a waste of 40 minutes
Siraj Raval (8 months ago)
yes i will next time
Kumar Rohit (8 months ago)
rough rough you have a point there
Alejandro SI (8 months ago)
What follow is your part. It's a competition.
rough rough (8 months ago)
Kumar Rohit famous artists like Picasso said that in order to be a great artist, you start by copying other artists. So from a beginner's perspective, they can learn much faster and comprehend better by copying the code and looking at how other people train and test on the data, and over time this exposure will lead them to read the documentation.
Kumar Rohit (8 months ago)
if you want everything spoon-fed.. it will do you more harm than good.. just saying.
William John (8 months ago)
Siraj, please do a tutorial on Nmap and Metasploit
Conrad Thiele (8 months ago)
Logarithms: instead of multiplying the values, you add their logs. That helps, like you said, with data distributions where the range between min and max is humongous, e.g. min 0.000001, max 10kkk (Siraj's bitcoin account).
Alejandro SI (8 months ago)
João Marcos Cardoso (8 months ago)
At 30:00 I believe you plotted the same dataset twice
Siraj Raval (8 months ago)
yes indeed thanks
Vaibhav Vats (8 months ago)
You stopped when the most interesting part began 😔 But thanks, at least I now know how to approach such problems.
Siraj Raval (8 months ago)
yes i wont stop next time
justin (8 months ago)
Great video. Please do some more of this
Chris Ray (8 months ago)
This was a waste of time. You bailed before explaining how to use xgboost. Honestly, it seems like you don’t actually understand the material.
Eliad Levy (8 months ago)
Siraj cheer up, it is well known that live demos are hard to do, and yet you chose to get out of your comfort zone and tried it anyway. I appreciate your time and effort, as well as your intentions of doing a followup video. I'm sure you'll learn from mistakes and get better with time. Keep it up! :)
Siraj Raval (8 months ago)
i didn't plan out my time well enough, i'll get better at live streaming and make a dedicated video about xgboost as well
Rahul Ahuja (8 months ago)
Can you do a tutorial on Vowpal Wabbit? https://www.analyticsvidhya.com/blog/2018/01/online-learning-guide-text-classification-vowpal-wabbit-vw/
manan mongia (8 months ago)
The Colab kernel dies whenever I try to import the dataset with train_downloaded = drive.CreateFile({'id': '1_n_ipWxmpTNJHdeIpn-0F6cGrofFJkQc'}). Any solution for this?
username (8 months ago)
where is the rest ? the most significant part where you train and test the model ?
Biranchi Narayan Nayak (8 months ago)
We can also upload and download files directly from Colab as below: from google.colab import files uploaded = files.upload() files.download( "submission.csv" )
Florian Makaïa (8 months ago)
Siraj Raval (8 months ago)
very good
Ali Ganbarov (8 months ago)
omg watched 40 mins just to see you stop at the most interesting part.. Is there a second part where you will explain training and testing?
Yassine Lazaar (4 months ago)
+Siraj Raval Hey Siraj...What's the follow up on this? Thanks for the course btw :)
Guilhem Forey (8 months ago)
Looking forward to it, I second that! Thank you Siraj
Siraj Raval (8 months ago)
noted, i'll live stream again this week and focus on the most interesting parts at the start. thanks!
Vikas Sawant (8 months ago)
Awesome ........
hussien albared (8 months ago)
Thank you for your great effort keep nice work
Siraj Raval (8 months ago)
