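Repost notice: This article is original content from Dengta Big Data (灯塔大数据). Individuals are welcome to share it on WeChat Moments; other organizations reposting it should note at the top of the article: "Reposted from: Dengta Big Data". In recent years Python has attracted a great deal of attention in the data science community, so here we list the top Python libraries for data scientists and engineers.
Because all of the libraries mentioned here are open source, we have also noted each library's number of commits, number of contributors, and other metrics, which can serve as a rough indication of each library's popularity.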
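Scikits are add-on packages of the SciPy Stack designed for specific functionality such as image processing and machine learning. For the latter, the most prominent is scikit-learn. The package is built on top of SciPy and makes heavy use of its math operations.
scikit-learn exposes a concise and consistent interface to common machine learning algorithms, which makes it straightforward to bring machine learning into production. The library combines high-quality code with good documentation, is easy to use and performs well, and is the de facto industry standard for machine learning with Python.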
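Deep Learning: Keras / TensorFlow / Theano
For deep learning, one of the most prominent and convenient Python libraries is Keras, which can run on top of either TensorFlow or Theano. Let's look at some details of each.
First, let's talk about Theano. Theano is a Python package that defines multi-dimensional arrays similar to NumPy's, along with math operations and expressions. The library is compiled, which lets it run efficiently on all architectures. It was originally developed by the Machine Learning group at Université de Montréal, primarily to meet the needs of machine learning.
Note that Theano integrates tightly with NumPy at the level of its low-level operations. The library also optimizes the use of the GPU and CPU, which speeds up data-intensive computation.
Efficiency and stability tweaks allow more precise results even for very small values; for example, log(1+x) gives accurate results even when x is tiny.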
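TensorFlow comes from developers at Google. It is an open-source library for data flow graph computation, designed specifically for machine learning. It was built to meet Google's demanding requirements for training neural networks and is the successor to DistBelief, a machine learning system based on neural networks. However, TensorFlow is not restricted to scientific use within Google; it is general enough to support many real-world applications.
The key feature of TensorFlow is its multi-layered system of nodes, which enables fast training of artificial neural networks on large datasets. This powers Google's voice recognition and image recognition.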
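Finally, let's look at Keras. It is an open-source library, written in Python, for building neural networks through a high-level interface. It is simple to understand and highly extensible. It uses Theano or TensorFlow as its backend, and Microsoft has now integrated CNTK (the Microsoft Cognitive Toolkit) as a new backend.
Its minimalist design aims at fast and easy experimentation through building compact systems.
Keras is extremely easy to get started with and supports quick prototyping. It is written entirely in Python and is high-level by nature. It is highly modular and extensible. Despite its simplicity, ease of use, and high-level orientation, Keras is still deep and powerful enough for serious modeling.
The general idea of Keras is built on neural network layers, with everything else constructed around them. Data is prepared as tensors; the first layer takes the input tensors, the last layer produces the output, and the model is built in between.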
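Natural Language Processing
The name of this suite of libraries is the Natural Language Toolkit (NLTK); as the name suggests, it is used for common tasks in symbolic and statistical natural language processing. NLTK was intended to facilitate teaching and research in NLP and related fields (linguistics, cognitive science, artificial intelligence, and so on), and it is currently receiving a great deal of attention.
NLTK supports many operations, such as text tagging, classification and tokenizing, named entity recognition, building corpus trees (which reveal inter- and intra-sentence dependencies), stemming, and semantic reasoning. These building blocks can be combined into complex research systems for different tasks, such as sentiment analysis and automatic summarization.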
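Gensim is an open-source library for Python that implements tools for vector space modeling and topic modeling. The library is designed to work efficiently with large texts, not only with content that fits in memory. It achieves its efficiency through extensive use of NumPy data structures and SciPy operations. It is both efficient and easy to use.
Gensim is intended for use with raw, unstructured digital text. It implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec, and doc2vec, to help examine recurring patterns of words in a set of documents (usually called a corpus). All of these algorithms are unsupervised: they need no manual annotation, and the only input is the corpus.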
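Data Mining and Statistics
Scrapy is a library for building crawling programs (also known as spider bots) that retrieve structured data, such as contact information or URLs, from the web. It is open source and written in Python. As its name suggests, it was originally designed strictly for scraping, but it has since grown into a full-fledged framework that can gather data from APIs and act as a general-purpose crawler.
The library follows the well-known Don't Repeat Yourself principle in its interface design: it encourages users to write general, reusable code, which makes it possible to build and scale large crawlers.
The architecture of Scrapy is built around the Spider class, which encapsulates the set of instructions the crawler follows.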
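statsmodels is a library for Python that, as you might guess from the name, lets users explore data by estimating various statistical models and by performing statistical tests and analyses.
Among its many useful features are descriptive statistics and result statistics, obtained through linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators.
The library also provides extensive plotting functions designed specifically for statistical analysis and tuned for good performance with large sets of statistical data.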
Conclusion
The libraries in this list are regarded by many data scientists and engineers as the very best, and they are well worth getting to know. Detailed statistics on the GitHub activity of these libraries are given here:
Top 15 Python Libraries for Data Science in 2017
Python has gained a lot of traction in the data science industry in recent years, so I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.
And since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.
scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems. The library combines quality code and good documentation with ease of use and high performance, and it is the de facto industry standard for machine learning with Python.
Firstly, let’s talk about Theano.
Theano is a Python package that defines multi-dimensional arrays similar to NumPy's, along with math operations and expressions. The library is compiled, which lets it run efficiently on all architectures. Originally developed by the Machine Learning group at Université de Montréal, it is primarily used for the needs of machine learning.
The important thing to note is that Theano integrates tightly with NumPy at the low level of its operations. The library also optimizes the use of the GPU and CPU, making data-intensive computation even faster.
Efficiency and stability tweaks allow for much more precise results even with very small values; for example, computing log(1+x) gives accurate results for even the smallest values of x.
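To see why the log(1+x) case matters, here is a small sketch that uses only Python's standard math module. It does not use Theano, so it shows the underlying numerical issue rather than Theano's own handling of it, and the value of x is an arbitrary choice for illustration.

```python
import math

# Assumption for illustration: an x far below double-precision
# machine epsilon (about 2.2e-16).
x = 1e-18

# Naive form: 1 + x rounds to exactly 1.0, so the logarithm collapses to 0.0.
naive = math.log(1 + x)

# Stable form: log1p evaluates log(1 + x) without ever forming 1 + x.
stable = math.log1p(x)

print("naive :", naive)
print("stable:", stable)
```

The naive form loses the entire value of x to rounding, while the stable form returns a result very close to x itself, which is the mathematically expected answer for tiny x.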
TensorFlow comes from developers at Google. It is an open-source library for data flow graph computations, tuned for machine learning. It was designed to meet the demanding requirements of Google's environment for training neural networks and is the successor to DistBelief, a machine learning system based on neural networks. However, TensorFlow isn't strictly for scientific use within Google; it is general enough to be used in a variety of real-world applications.
The key feature of TensorFlow is its multi-layered system of nodes, which enables quick training of artificial neural networks on large datasets. This powers Google's voice recognition and object identification from pictures.
Keras takes a minimalistic approach to design, aimed at fast and easy experimentation through the building of compact systems.
Keras is really easy to get started with and to keep using for quick prototyping. It is written in pure Python and is high-level by nature. It is highly modular and extensible. Notwithstanding its ease, simplicity, and high-level orientation, Keras is still deep and powerful enough for serious modeling.
The general idea of Keras is based on layers, and everything else is built around them. Data is prepared in tensors; the first layer is responsible for the input of tensors, the last layer is responsible for the output, and the model is built in between.
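The layer idea can be sketched in a few lines of plain Python. This is a toy illustration, not the Keras API: the Dense and Sequential classes below are hypothetical stand-ins written for this example, and the weights are made-up numbers.

```python
from typing import Callable, List

Vector = List[float]


def relu(value: float) -> float:
    """Rectified linear activation."""
    return max(0.0, value)


class Dense:
    """Toy fully connected layer: activation(weights @ x + biases)."""

    def __init__(self, weights: List[Vector], biases: Vector,
                 activation: Callable[[float], float] = lambda v: v):
        self.weights = weights
        self.biases = biases
        self.activation = activation

    def __call__(self, x: Vector) -> Vector:
        return [
            self.activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weights, self.biases)
        ]


class Sequential:
    """Toy model: passes the input through each layer in order."""

    def __init__(self, layers: List[Dense]):
        self.layers = list(layers)

    def predict(self, x: Vector) -> Vector:
        for layer in self.layers:
            x = layer(x)
        return x


# Hypothetical two-layer model: the first layer receives the input,
# the last layer produces the output.
model = Sequential([
    Dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # 2 inputs -> 2 units
    Dense([[1.0, 2.0]], [0.1]),                          # 2 units -> 1 output
])

output = model.predict([2.0, 1.0])
print(output)
```

Everything in the model is a layer, and the model itself is just the ordered list of layers, which is the design the paragraph above describes.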
NLTK supports many operations, such as text tagging, classification, tokenizing, named entity recognition, building corpus trees that reveal inter- and intra-sentence dependencies, stemming, and semantic reasoning. These building blocks can be combined into complex research systems for different tasks, for example sentiment analysis and automatic summarization.
Gensim is intended for use with raw and unstructured digital texts. It implements algorithms such as hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA), as well as tf-idf, random projections, word2vec, and doc2vec, which facilitate examining texts for recurring patterns of words in a set of documents (often referred to as a corpus). All of the algorithms are unsupervised: they need no manual annotation, and the only input is the corpus.
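As a rough illustration of one of these techniques, the sketch below computes tf-idf weights by hand with the standard library. It is not Gensim's implementation or API; it uses one common weighting variant (term frequency times the natural log of inverse document frequency), and the three-document corpus is invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List

# Made-up corpus for illustration; each string is one document.
corpus = [
    "human machine interface",
    "machine learning survey",
    "graph minors survey",
]

tokenized: List[List[str]] = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc: List[str]) -> Dict[str, float]:
    """Weight each term by (relative frequency) * ln(n_docs / doc_freq)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tfidf(doc) for doc in tokenized]

for doc, doc_weights in zip(corpus, weights):
    print(doc, "->", doc_weights)
```

Terms that occur in only one document (such as "human") end up with a higher weight than terms shared across documents (such as "machine"), which is how tf-idf surfaces the words that distinguish a document within the corpus.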
Scrapy is open source and written in Python. It was originally designed strictly for scraping, as its name indicates, but it has evolved into a full-fledged framework with the ability to gather data from APIs and act as a general-purpose crawler.
The library follows the well-known Don't Repeat Yourself principle in its interface design: it prompts its users to write general, reusable code, which makes it easier to build and scale large crawlers.
The architecture of Scrapy is built around the Spider class, which encapsulates the set of instructions that the crawler follows.
Among its many useful features are descriptive and result statistics via linear regression models, generalized linear models, discrete choice models, robust linear models, time series analysis models, and various estimators.
The library also provides extensive plotting functions that are designed specifically for use in statistical analysis and tweaked for good performance with large sets of statistical data.
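To make "result statistics" from a linear regression model concrete, here is a hand-rolled ordinary least squares fit for a single predictor, written with the standard library only. It is a sketch of the kind of quantities a regression model reports (slope, intercept, R-squared), not the library's own API, and the data points are made up.

```python
from statistics import mean

# Made-up observations for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(xs), mean(ys)

# Closed-form OLS estimates for the model y = intercept + slope * x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# R-squared: the share of the variance in y explained by the fitted line.
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
```

A statistics library adds much more on top of this (standard errors, confidence intervals, hypothesis tests), but these three numbers are the core of what a fitted linear model summarizes.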