R Programming for beginners

File > New File > R Script opens a new script
Comments start with #

Create a value/vector

1+1 => [1] 2
x = 1 + 1; x => [1] 2
x = 1 + 1 => prints nothing (assignment only)
grades = c(100,90,85,95); grades => [1] 100  90  85  95
grades + 5 => [1] 105  95  90 100
  - function: class(x) => data type of x (numeric/character)
  - y = c(5,10,"15",20): y is no longer numeric even though only one element is a character; the whole vector becomes chr: [1] "5" "10" "15" "20"
  - function: character to numeric: y = as.numeric(y) => [1]  5 10 15 20
  - function: numeric to character: as.character(y)
  - Trouble 1: re-running the script prints earlier results you no longer need; solution: comment out the line that prints the variable with #
  - Trouble 2: select lines in the script => run only what you want


Functions table() & summarize(): calculate the frequency of the values in a vector/matrix
Print some positions of V:
V[1,2] is wrong; correct it to V[c(1,2)]
V[-2] is OK (it prints every position except position 2), but V[-1,-2] is not; write V[c(-1,-2)]
Sequence function: 1:10; seq(1,10); seq(to = 10, from = 1); seq(10,1); seq(from = 1, to = 10, by = 2) (by must be negative when counting down, e.g. seq(10, 1, by = -2))
Repeat function: rep("HI",10)
  
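NA means a value is missing (not available)
Function table() summarizes the frequencies in a large set of values; as.data.frame(table()) turns the result into a data frame; cbind.data.frame() adds columns to it, e.g. an Order column
Sorting function: order() gives ascending order, order(-x) descending order
  - Remember 1: Variable[1], not Variable[0], prints the first element
  - Remember 2: A variable's name must be written without spaces
  - Remember 3: Unordered factors cannot be compared with < or > (the result is NA with a warning)
  - Remember 4: Use Vector[c(1,2,3...)] to get values from a vector, but matrix$... to get a column of a data frame. Watch out!
  - Remember 5: For a vector, just call mean(Vector) directly!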

Matrix (data frame):


Function data.frame(v1,v2): if v1 is longer than v2, v2 is repeated to fill the space (the length of v1 must be a multiple of the length of v2, otherwise R reports an error).
Add another column to the matrix: function cbind(data.frame(v1,v2), v3)
1. First build a vector that meets the requirement, then add it as a column!
Function colnames(M)/rownames(M) = c("","") changes the names
Function M[A,B]: row first, column second; shows the value at that position. A or B can be left empty to show the full column/row; B can be 1:2 or c(1,2), both OK!
Function View(matrix) -> shows the table
Functions length(), nrow(), ncol(), head(), tail(), names()    # head() shows the first rows, tail() the last rows
Sum function: sum(data[,5]); sum(data$population)
  - Not countif or count: the function is nrow()! X[X$region==1,] returns all information on the states located in region 1 (region == 1); wrap it in nrow() to count them.
Add a condition and show only one column: X[X$region==1, ".." or 1]
  - Function data[data$murder==max(data$murder), "state"] returns the state with the largest murder value
  
T/F vector: data$murder == max(data$murder) gives TRUE/FALSE for each row, while data[data$murder==max(data$murder), ] returns the matching rows
Function ifelse(): ifelse(data$illiteracy<1, "low","high")
Function names(table) extracts the labels/headers of a table; as.numeric(table) extracts its counts

FOR LOOP + extract a column + build the result as a column (filled in element by element)


Create a new column “Age.Range”:
age = hw1$Age
Age.Range = character(length(age))   # empty result vector, one slot per row
for (i in 1:length(age)) {
  if (age[i] < 20) {Age.Range[i] = "Below 20 years"}
  else if (age[i] < 25) {Age.Range[i] = "[20-25) years"}
  else if (age[i] < 30) {Age.Range[i] = "[25-30) years"}
  else if (age[i] >= 30) {Age.Range[i] = "Above 30 years"}   # >= so that age 30 is not skipped
}
Age.Range

Import an outside data source

Change/get the working directory
Functions getwd(); setwd("D:/AW/MSBA&Programming合集/R")
Open a file and save it to a variable: data = read.csv("statedata.csv")
{% endcodeblock %}
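
The vector notes above can be tried out in one short script. This is only a minimal sketch; the `colors` vector and the variable names are made up for illustration.

```r
# Mixing numbers and a string coerces the whole vector to character
y <- c(5, 10, "15", 20)
class(y)                          # "character"
y <- as.numeric(y)                # convert back to numeric
class(y)                          # "numeric"

grades <- c(100, 90, 85, 95)
grades[c(1, 2)]                   # positions 1 and 2
grades[-2]                        # everything except position 2
grades[c(-1, -2)]                 # everything except positions 1 and 2

# Sequences and repetition
seq(from = 1, to = 10, by = 2)    # counts up: 1 3 5 7 9
seq(10, 1, by = -3)               # counts down: 10 7 4 1
rep("HI", 3)

# Frequency table, turned into a data frame and sorted by count (descending)
colors <- c("red", "blue", "red", "green", "red", "blue")
freq <- as.data.frame(table(colors))
freq[order(-freq$Freq), ]
```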

CODES

{% codeblock lang:R %}
x = 1+1
#x
 
grades = c(100,90, 85, 95)
# grades
classes = c("DS", "ss")
# classes
class(classes)
grades + 5
y = c(5,10,"15",20)   # one string turns the whole vector into character
y = as.numeric(y)
#y
#y[1]
#grades[c(1,2,4)]
#grades[c(-3,-4)]
# classes (length 2) is recycled to match y (length 4)
Metrix1 = cbind(data.frame(y,classes),grades)
#View(Metrix1)
rownames(Metrix1) = c("stu1","stu2","stu3","stu4")
View(Metrix1)
nrow(Metrix1)
2:10
seq(1,10)
seq(to = 10, from = 1)
seq(10,1) 
seq(from = 1, to = 10, by = 2)
rep("HI",10)
data= read.csv("statedata.csv")
View(data)
tail(data)
sum(data[,5])
sum(data$population)
nrow(data[data$region== 1 ,])
data[data$illiteracy<0.7, "state"]
# combine conditions with | (or) and & (and), not the words "or"/"and"
#data[data$murder==max(data$murder) | data$murder==min(data$murder), "state" ]
data[data$murder==max(data$murder) | data$murder==min(data$murder) ,1]
data$murder==max(data$murder)   # TRUE/FALSE for each row
data$illiteracyCategory = ifelse(data$illiteracy<1, "low","high")
data[data$illiteracyCategory=="low",]
{% endcodeblock %}
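
Since statedata.csv is not included here, the row-filtering calls can be tried on a small stand-in data frame. The values below are invented purely for illustration; only the column names follow the notes.

```r
# Hypothetical stand-in for statedata.csv (all values are made up)
data <- data.frame(
  state      = c("A", "B", "C", "D"),
  region     = c(1, 2, 1, 3),
  murder     = c(4.2, 11.1, 1.4, 7.8),
  illiteracy = c(0.6, 2.1, 0.5, 1.3),
  stringsAsFactors = FALSE
)

# Logical (T/F) vector: one TRUE/FALSE per row
is_max <- data$murder == max(data$murder)

# Use it as a row filter; the second index picks the column
data[is_max, "state"]

# Combine conditions with | (or) and & (and)
extremes <- data[data$murder == max(data$murder) |
                 data$murder == min(data$murder), "state"]

# Count the matching rows with nrow()
n_region1 <- nrow(data[data$region == 1, ])

# ifelse() is vectorized, so it fills a whole column at once
data$illiteracyCategory <- ifelse(data$illiteracy < 1, "low", "high")
```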

Conditional loops in R: a stupid mistake

Age.Range = if(hw1$Age < 20){“Below 20 years”} else{
if(25> hw1$Age >= 20){“[20-25) years”}
else{
if(30 > hw1$Age >= 25){“[25-30) years”}
else{“Above 30 years”}}}
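Wrong! Why???

Because elseif should be written as else if.
The same goes for the last branch: a bare else catches everything that is left, so it is clearer to spell out the condition with else if.
Two more problems: if() only tests a single value, so it cannot be applied to the whole hw1$Age column at once (loop over the elements instead), and R has no chained comparisons, so 25 > x >= 20 must be written as x >= 20 & x < 25.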
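
For a whole column, a vectorized function avoids the loop altogether. The sketch below uses cut() on a made-up age vector standing in for hw1$Age (the homework data is not shown in these notes); the labels are the ones used above.

```r
# Hypothetical ages standing in for hw1$Age
age <- c(18, 20, 24, 25, 29, 30, 41)

# right = FALSE makes each interval closed on the left: [20, 25), [25, 30), ...
Age.Range <- cut(
  age,
  breaks = c(-Inf, 20, 25, 30, Inf),
  labels = c("Below 20 years", "[20-25) years", "[25-30) years", "Above 30 years"),
  right  = FALSE
)
as.character(Age.Range)
```

cut() returns a factor, so as.character() is used when plain strings are wanted.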

R Analysis of the Most Popular Clothes in the Stylenanda Shop

Beginning

Today we are going to walk through a simple data analysis to find the most popular clothes of the famous brand “Stylenanda”. (Data from http://en.stylenanda.com/)

Step 1: Scrape the info

All the information we need is the name, price, sales and comments of all the clothes.

library(rvest)
library(xml2)
as <- seq(60, 420, 60)
for (s in as) {
  # page URL kept exactly as it appears in the post
  stylenanda <- read_html("Stylenandas=60&q=%C1%AC%D2%C2%C8%B9&sort=s&style=g&from=sn_1_brand-qp&active=2&industryCatId=50025135&type=pc#J_Filter", encoding="GBK")
  dress_name <- html_nodes(stylenanda, '.productTitle a') %>% html_text
  dress_price <- html_nodes(stylenanda, '.productPrice em') %>% html_text
  sale_number <- html_nodes(stylenanda, '.productStatus em') %>% html_text
  the_comment <- html_nodes(stylenanda, '.productStatus a') %>% html_text
}
dress <- data.frame(dressname = dress_name, dressprice = dress_price, salenumber = sale_number, thecomment = the_comment)
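
In a loop like the one above, each pass replaces the vectors from the previous page, so only the last page would survive. One possible way to keep every page is to collect one data frame per page in a list and bind them at the end. The sketch below assumes rvest and xml2 are installed and uses two tiny inline HTML strings as hypothetical stand-ins for downloaded pages; only the CSS selectors come from the post.

```r
library(xml2)    # read_html()
library(rvest)   # html_nodes(), html_text(), %>%

# Hypothetical stand-ins for two downloaded result pages
pages <- c(
  '<div class="productTitle"><a>Dress A</a></div><div class="productPrice"><em>100</em></div>',
  '<div class="productTitle"><a>Dress B</a></div><div class="productPrice"><em>200</em></div>'
)

results <- list()
for (i in seq_along(pages)) {
  page <- read_html(pages[i])
  # one small data frame per page, stored in its own list slot
  results[[i]] <- data.frame(
    dressname  = page %>% html_nodes(".productTitle a") %>% html_text(),
    dressprice = page %>% html_nodes(".productPrice em") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# stack all pages into one data frame
dress <- do.call(rbind, results)
```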

Step 2: Integrate and Standardize our data

> library(plyr)
> # Unify the variable names
> dress1 <- rename(dress1, c(dress_name="dressname",dress_price="dressprice",sale_number="salenumber",the_comment="thecomment"))
> Modadress <- rbind(dress,dress1)

This is the output of Modadress$dressprice, the column holding the dress prices:

[1] ￥399.00 ￥299.50 ￥749.00 ￥699.00 ￥112.00 ￥306.00 ￥298.00 ￥649.00
  [9] ￥299.00 ￥279.00 ￥249.00 ￥299.50 ￥374.50 ￥324.50 ￥274.50 ￥324.50
 [17] ￥349.50 ￥199.00 ￥349.50 ￥699.00 ￥279.00 ￥349.00 ￥749.00 ￥199.00
 [25] ￥299.50 ￥324.50 ￥349.50 ￥349.50 ￥349.50 ￥324.50 ￥324.50 ￥299.0

The number of dress sales:

[1] 1198笔 1394笔 0笔    0笔    1笔    0笔    1934笔 64笔   836笔  1848笔 1203笔
 [12] 566笔  573笔  1380笔 1808笔 609笔  1776笔 1621笔 1162笔 348笔  470笔  397笔 
 [23] 446笔  364笔  468笔  579笔  666笔  362笔  776笔  770笔  617笔  611笔  791笔 
 [34] 154笔  781笔  275笔  1118笔 688笔  916笔  772笔  569笔  55笔   114笔  927笔 
 [45] 471笔  750笔  1117笔 734笔  798笔  208笔  620笔  492笔  298笔  160笔  846笔 
 [56] 902笔  109笔  345笔  802笔  101笔  688笔  912笔  774笔  569笔  55笔   115笔

Then we need to strip the unnecessary characters (the currency symbol and the unit):

Modadress$dressprice<-gsub("￥","",Modadress$dressprice)
Modadress$salenumber<-gsub("笔","",Modadress$salenumber)

Check their types:

> class(Modadress$salenumber)
[1] "factor"
> class(Modadress$thecomment)
[1] "factor"
> class(Modadress$dressprice)
[1] "character"

Change strings into number format:


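> # as.character() first: as.numeric() on a factor would return the level codes
> Modadress$dressprice<-as.numeric(as.character(Modadress$dressprice))
> Modadress$salenumber<-as.numeric(as.character(Modadress$salenumber))
> Modadress$thecomment<-as.numeric(as.character(Modadress$thecomment))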
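
The reason for going through character first can be seen on a few sample values. This is a small sketch; the three sales strings are taken from the output shown earlier, and the variable names are illustrative.

```r
# A few values in the same format as the scraped sales column
sales <- factor(c("1198笔", "0笔", "64笔"))

cleaned <- gsub("笔", "", sales)   # gsub() returns a character vector
as.numeric(cleaned)                 # 1198 0 64

# Pitfall: as.numeric() on a factor returns the internal level codes
f <- factor(c("1198", "0", "64"))
as.numeric(f)                       # level codes, not the values
as.numeric(as.character(f))         # 1198 0 64
```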

Step 3: Summarize and Visualize data

Summarize the data by price level:

library(dplyr)
group_by(Modadress,priceleve)%>%summarise(price=mean(dressprice,na.rm=T),number=mean(salenumber,na.rm=T),comment=mean(thecomment,na.rm=T))
# A tibble: 2 x 4
  priceleve    price   number   comment
     <fctr>    <dbl>    <dbl>     <dbl>
1      high 722.5484 117.6774  39.90323
2    normal 316.2674 703.0562 422.12360

Visualize the data in a chart:

library(ggplot2)
ggplot(Modadress,aes(x=dressprice,y=salenumber,color=thecomment))+geom_point()+scale_x_continuous(expand=c(0,0),breaks=seq(100,1100,100))

In the chart, many clothes fall into the 200-400 RMB range, fewer cost more than 650 RMB, and none cost 400-650 RMB.

Then, to figure out the sales of the clothes in each range, we first recode the price and add a “priceleve” (price level) variable.

Modadress<-within(Modadress,{
priceleve<-rep(NA_character_,length(dressprice))
priceleve[dressprice<450]<-"normal"
priceleve[dressprice>500]<-"high"
})

Calculate the sales of the clothes in each range.

total<-tapply(Modadress$salenumber,Modadress$priceleve,sum)
> total
  high normal 
  3648  62572
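
The recode-then-sum pattern can be checked on a handful of rows. The sketch below uses made-up prices and sales purely to show the mechanics; the `toy` data frame and its `level` column are hypothetical names.

```r
# Made-up prices and sales, only to show the mechanics
toy <- data.frame(
  dressprice = c(199, 299.5, 749, 399, 699),
  salenumber = c(1621, 1394, 0, 1198, 348)
)

toy <- within(toy, {
  level <- rep(NA_character_, length(dressprice))  # start with every row unassigned
  level[dressprice < 450] <- "normal"
  level[dressprice > 500] <- "high"
})

# Sum of sales within each price level
total <- tapply(toy$salenumber, toy$level, sum)
total
```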

The difference in sales between the two levels is obvious. Now we draw a chart to show it clearly.

z<-c("high","normal")
chart<-data.frame(z,total)
wtp<-ggplot(chart,aes(x=z,y=total))+geom_bar(stat="identity",fill="lightblue",width=0.5)+geom_text(aes(label=total),vjust=-0.2,colour="black",size=8)
wtp+xlab("price")+ylab("sales")

Step 4: Get the data we want and rule out the rest

First we need to specify what “popular clothes” means. The standards I refer to are comment volume and sales volume.

According to the figure above, the lighter the blue of a dot, the larger the comment volume.

Set 1500 comments as the dividing line and choose the best-selling products.

filter(Modadress,thecomment>1500)%>%arrange(desc(salenumber))

Output:

  dressname                               dressprice salenumber thecomment priceleve
1 Self-Tie Collar Button-Down Shirt              279       1848       1557    normal
2 Check Pattern Buttoned Back Shirt              199       1621       1822    normal
3 Pre-Damaged Knit Cardigan                      249       1203       1647    normal
4 Extended Sleeve Lettering Print Hoodie         399       1198       2059    normal
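
The same filter-then-sort step can also be written in base R, which may help in seeing what filter() and arrange(desc()) do. In the sketch below, three rows are taken from the output above and one low-comment row is invented so that the filter has something to drop.

```r
items <- data.frame(
  dressname  = c("Check Pattern Buttoned Back Shirt",
                 "Self-Tie Collar Button-Down Shirt",
                 "Hypothetical low-comment item",      # invented row
                 "Pre-Damaged Knit Cardigan"),
  salenumber = c(1621, 1848, 900, 1203),
  thecomment = c(1822, 1557, 120, 1647),
  stringsAsFactors = FALSE
)

# keep rows above the comment threshold (the role of filter())
popular <- items[items$thecomment > 1500, ]

# sort by sales, largest first (the role of arrange(desc()))
popular <- popular[order(popular$salenumber, decreasing = TRUE), ]
popular$dressname
```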

Thus we get the top 4 most popular clothes:

Self-Tie Collar Button-Down Shirt
Check Pattern Buttoned Back Shirt
Pre-Damaged Knit Cardigan
Extended Sleeve Lettering Print Hoodie

R - catch and analyze info on Website