天天艹综合,午夜一区二区三区

Since many potential pandas users have some familiarity withSQL, this page is meant to provide some examples of howvarious SQL operations would be performed using pandas.

If you’re new to pandas, you might want to first read through 10 Minutes to pandasto familiarize yourself with the library.

As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pdIn [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll readthe data into a DataFrame called tips and assume we have a database table of the same name andstructure.

In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'In [4]: tips = pd.read_csv(url)In [5]: tips.head()Out[5]:    total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

SELECT?

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a *to select all columns):

SELECT total_bill, tip, smoker, timeFROM tipsLIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)Out[6]:    total_bill   tip smoker    time0       16.99  1.01     No  Dinner1       10.34  1.66     No  Dinner2       21.01  3.50     No  Dinner3       23.68  3.31     No  Dinner4       24.59  3.61     No  Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s*).

WHERE?

Filtering in SQL is done via a WHERE clause.

SELECT *FROM tipsWHERE time = 'Dinner'LIMIT 5;

DataFrames can be filtered in multiple ways; the most intuitive of which is usingboolean indexing.

In [7]: tips[tips['time'] == 'Dinner'].head(5)Out[7]:    total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

The above statement is simply passing a Series of True/False objects to the DataFrame,returning all rows with True.

In [8]: is_dinner = tips['time'] == 'Dinner'In [9]: is_dinner.value_counts()Out[9]: True     176False     68Name: time, dtype: int64In [10]: tips[is_dinner].head(5)Out[10]:    total_bill   tip     sex smoker  day    time  size0       16.99  1.01  Female     No  Sun  Dinner     21       10.34  1.66    Male     No  Sun  Dinner     32       21.01  3.50    Male     No  Sun  Dinner     33       23.68  3.31    Male     No  Sun  Dinner     24       24.59  3.61  Female     No  Sun  Dinner     4

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and &(AND).

-- tips of more than $5.00 at Dinner mealsSELECT *FROM tipsWHERE time = 'Dinner' AND tip > 5.00;

# tips of more than $5.00 at Dinner mealsIn [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]Out[11]:      total_bill    tip     sex smoker  day    time  size23        39.42   7.58    Male     No  Sat  Dinner     444        30.40   5.60    Male     No  Sun  Dinner     447        32.40   6.00    Male     No  Sun  Dinner     452        34.81   5.20  Female     No  Sun  Dinner     459        48.27   6.73    Male     No  Sat  Dinner     4116       29.93   5.07    Male     No  Sun  Dinner     4155       29.85   5.14  Female     No  Sun  Dinner     5170       50.81  10.00    Male    Yes  Sat  Dinner     3172        7.25   5.15    Male    Yes  Sun  Dinner     2181       23.33   5.65    Male    Yes  Sun  Dinner     2183       23.17   6.50    Male    Yes  Sun  Dinner     4211       25.89   5.16    Male    Yes  Sat  Dinner     4212       48.33   9.00    Male     No  Sat  Dinner     4214       28.17   6.50  Female    Yes  Sat  Dinner     3239       29.03   5.92    Male     No  Sat  Dinner     3

-- tips by parties of at least 5 diners OR bill total was more than $45SELECT *FROM tipsWHERE size >= 5 OR total_bill > 45;

# tips by parties of at least 5 diners OR bill total was more than $45In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]Out[12]:      total_bill    tip     sex smoker   day    time  size59        48.27   6.73    Male     No   Sat  Dinner     4125       29.80   4.20  Female     No  Thur   Lunch     6141       34.30   6.70    Male     No  Thur   Lunch     6142       41.19   5.00    Male     No  Thur   Lunch     5143       27.05   5.00  Female     No  Thur   Lunch     6155       29.85   5.14  Female     No   Sun  Dinner     5156       48.17   5.00    Male     No   Sun  Dinner     6170       50.81  10.00    Male    Yes   Sat  Dinner     3182       45.35   3.50    Male    Yes   Sun  Dinner     3185       20.69   5.00    Male     No   Sun  Dinner     5187       30.46   2.00    Male    Yes   Sun  Dinner     5212       48.33   9.00    Male     No   Sat  Dinner     4216       28.15   3.00    Male    Yes   Sat  Dinner     5

NULL checking is done using the notna() and isna()methods.

In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],   ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})   ....: In [14]: frameOut[14]:   col1 col20    A    F1    B  NaN2  NaN    G3    C    H4    D    I

Assume we have a table of the same structure as our DataFrame above. We can see only the recordswhere col2 IS NULL with the following query:

SELECT *FROM frameWHERE col2 IS NULL;

In [15]: frame[frame['col2'].isna()]Out[15]:   col1 col21    B  NaN

Getting items where col1 IS NOT NULL can be done with notna().

SELECT *FROM frameWHERE col1 IS NOT NULL;

In [16]: frame[frame['col1'].notna()]Out[16]:   col1 col20    A    F1    B  NaN3    C    H4    D    I

GROUP BY?

In pandas, SQL’s GROUP BY operations are performed using the similarly namedgroupby() method. groupby() typically refers to aprocess where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.

A common SQL operation would be getting the count of records in each group throughout a dataset.For instance, a query getting us the number of tips left by sex:

SELECT sex, count(*)FROM tipsGROUP BY sex;/*Female     87Male      157*/

The pandas equivalent would be:

In [17]: tips.groupby('sex').size()Out[17]: sexFemale     87Male      157dtype: int64

Notice that in the pandas code we used size() and notcount(). This is becausecount() applies the function to each column, returningthe number of not null records within each.

In [18]: tips.groupby('sex').count()Out[18]:         total_bill  tip  smoker  day  time  sizesex                                             Female          87   87      87   87    87    87Male           157  157     157  157   157   157

Alternatively, we could have applied the count() methodto an individual column:

In [19]: tips.groupby('sex')['total_bill'].count()Out[19]: sexFemale     87Male      157Name: total_bill, dtype: int64

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amountdiffers by day of the week - agg() allows you to pass a dictionaryto your grouped DataFrame, indicating which functions to apply to specific columns.

SELECT day, AVG(tip), COUNT(*)FROM tipsGROUP BY day;/*Fri   2.734737   19Sat   2.993103   87Sun   3.255132   76Thur  2.771452   62*/

In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})Out[20]:            tip  dayday                Fri   2.734737   19Sat   2.993103   87Sun   3.255132   76Thur  2.771452   62

Grouping by more than one column is done by passing a list of columns to thegroupby() method.

SELECT smoker, day, COUNT(*), AVG(tip)FROM tipsGROUP BY smoker, day;/*smoker dayNo     Fri      4  2.812500       Sat     45  3.102889       Sun     57  3.167895       Thur    45  2.673778Yes    Fri     15  2.714000       Sat     42  2.875476       Sun     19  3.516842       Thur    17  3.030000*/

In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})Out[21]:               tip                       size      meansmoker day                 No     Fri    4.0  2.812500       Sat   45.0  3.102889       Sun   57.0  3.167895       Thur  45.0  2.673778Yes    Fri   15.0  2.714000       Sat   42.0  2.875476       Sun   19.0  3.516842       Thur  17.0  3.030000

JOIN?

JOINs can be performed with join() or merge(). By default,join() will join the DataFrames on their indices. Each method hasparameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or thecolumns to join on (column names or indices).

In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],   ....:                     'value': np.random.randn(4)})   ....: In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],   ....:                     'value': np.random.randn(4)})   ....:

Assume we have two database tables of the same name and structure as our DataFrames.

Now let’s go over the various types of JOINs.

INNER JOIN?

SELECT *FROM df1INNER JOIN df2  ON df1.key = df2.key;

# merge performs an INNER JOIN by defaultIn [24]: pd.merge(df1, df2, on='key')Out[24]:   key   value_x   value_y0   B -0.318214  0.5435811   D  2.169960 -0.4260672   D  2.169960  1.138079

merge() also offers parameters for cases when you’d like to join one DataFrame’scolumn with another DataFrame’s index.

In [25]: indexed_df2 = df2.set_index('key')In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)Out[26]:   key   value_x   value_y1   B -0.318214  0.5435813   D  2.169960 -0.4260673   D  2.169960  1.138079

LEFT OUTER JOIN?

-- show all records from df1SELECT *FROM df1LEFT OUTER JOIN df2  ON df1.key = df2.key;

# show all records from df1In [27]: pd.merge(df1, df2, on='key', how='left')Out[27]:   key   value_x   value_y0   A  0.116174       NaN1   B -0.318214  0.5435812   C  0.285261       NaN3   D  2.169960 -0.4260674   D  2.169960  1.138079

RIGHT JOIN?

-- show all records from df2SELECT *FROM df1RIGHT OUTER JOIN df2  ON df1.key = df2.key;

# show all records from df2In [28]: pd.merge(df1, df2, on='key', how='right')Out[28]:   key   value_x   value_y0   B -0.318214  0.5435811   D  2.169960 -0.4260672   D  2.169960  1.1380793   E       NaN  0.086073

FULL JOIN?

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not thejoined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).

-- show all records from both tablesSELECT *FROM df1FULL OUTER JOIN df2  ON df1.key = df2.key;

# show all records from both framesIn [29]: pd.merge(df1, df2, on='key', how='outer')Out[29]:   key   value_x   value_y0   A  0.116174       NaN1   B -0.318214  0.5435812   C  0.285261       NaN3   D  2.169960 -0.4260674   D  2.169960  1.1380795   E       NaN  0.086073

UNION?

UNION ALL can be performed using concat().

In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],   ....:                     'rank': range(1, 4)})   ....: In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],   ....:                     'rank': [1, 4, 5]})   ....:

SELECT city, rankFROM df1UNION ALLSELECT city, rankFROM df2;/*         city  rank      Chicago     1San Francisco     2New York City     3      Chicago     1       Boston     4  Los Angeles     5*/

In [32]: pd.concat([df1, df2])Out[32]:             city  rank0        Chicago     11  San Francisco     22  New York City     30        Chicago     11         Boston     42    Los Angeles     5

SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.

SELECT city, rankFROM df1UNIONSELECT city, rankFROM df2;-- notice that there is only one Chicago record this time/*         city  rank      Chicago     1San Francisco     2New York City     3       Boston     4  Los Angeles     5*/

In pandas, you can use concat() in conjunction withdrop_duplicates().

In [33]: pd.concat([df1, df2]).drop_duplicates()Out[33]:             city  rank0        Chicago     11  San Francisco     22  New York City     31         Boston     42    Los Angeles     5

Pandas equivalents for some SQL analytic and aggregate functions?

Top N rows with offset?

-- MySQLSELECT * FROM tipsORDER BY tip DESCLIMIT 10 OFFSET 5;

In [34]: tips.nlargest(10+5, columns='tip').tail(10)Out[34]:      total_bill   tip     sex smoker   day    time  size183       23.17  6.50    Male    Yes   Sun  Dinner     4214       28.17  6.50  Female    Yes   Sat  Dinner     347        32.40  6.00    Male     No   Sun  Dinner     4239       29.03  5.92    Male     No   Sat  Dinner     388        24.71  5.85    Male     No  Thur   Lunch     2181       23.33  5.65    Male    Yes   Sun  Dinner     244        30.40  5.60    Male     No   Sun  Dinner     452        34.81  5.20  Female     No   Sun  Dinner     485        34.83  5.17  Female     No  Thur   Lunch     4211       25.89  5.16    Male    Yes   Sat  Dinner     4

Top N rows per group?

-- Oracle's ROW_NUMBER() analytic functionSELECT * FROM (  SELECT    t.*,    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn  FROM tips t)WHERE rn < 3ORDER BY day, rn;

In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)   ....:                     .groupby(['day'])   ....:                     .cumcount() + 1)   ....:      .query('rn < 3')   ....:      .sort_values(['day','rn'])   ....: )   ....: Out[35]:      total_bill    tip     sex smoker   day    time  size  rn95        40.17   4.73    Male    Yes   Fri  Dinner     4   190        28.97   3.00    Male    Yes   Fri  Dinner     2   2170       50.81  10.00    Male    Yes   Sat  Dinner     3   1212       48.33   9.00    Male     No   Sat  Dinner     4   2156       48.17   5.00    Male     No   Sun  Dinner     6   1182       45.35   3.50    Male    Yes   Sun  Dinner     3   2197       43.11   5.00  Female    Yes  Thur   Lunch     4   1142       41.19   5.00    Male     No  Thur   Lunch     5   2

the same using rank(method=’first’) function

In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']   ....:                      .rank(method='first', ascending=False))   ....:      .query('rnk < 3')   ....:      .sort_values(['day','rnk'])   ....: )   ....: Out[36]:      total_bill    tip     sex smoker   day    time  size  rnk95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.090        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0212       48.33   9.00    Male     No   Sat  Dinner     4  2.0156       48.17   5.00    Male     No   Sun  Dinner     6  1.0182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0142       41.19   5.00    Male     No  Thur   Lunch     5  2.0

-- Oracle's RANK() analytic functionSELECT * FROM (  SELECT    t.*,    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk  FROM tips t  WHERE tip < 2)WHERE rnk < 3ORDER BY sex, rnk;

Let’s find tips with (rank < 3) per gender group for (tips < 2).Notice that when using rank(method='min') functionrnk_min remains the same for the same tip(as Oracle’s RANK() function)

In [37]: (tips[tips['tip'] < 2]   ....:      .assign(rnk_min=tips.groupby(['sex'])['tip']   ....:                          .rank(method='min'))   ....:      .query('rnk_min < 3')   ....:      .sort_values(['sex','rnk_min'])   ....: )   ....: Out[37]:      total_bill   tip     sex smoker  day    time  size  rnk_min67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.092         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0111        7.25  1.00  Female     No  Sat  Dinner     1      1.0236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0

UPDATE?

UPDATE tipsSET tip = tip*2WHERE tip < 2;

In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE?

DELETE FROM tipsWHERE tip > 9;

In pandas we select the rows that should remain, instead of deleting them

In [39]: tips = tips.loc[tips['tip'] <= 9]

本站僅提供存儲服務，所有內容均由用戶發(fā)布，如發(fā)現(xiàn)有害或侵權內容，請點擊舉報。

九色国产,午夜在线视频,新黄色网址,九九色综合,天天做夜夜做久久做狠狠,天天躁夜夜躁狠狠躁2021a,久久不卡一区二区三区